I wrote:
--------------------------
We have twin SS1000s running Solaris 2.4 (Generic_101945-27) with NFS drives
crossmounted between them.
Recently they have been going through episodes of very high load, with the
kernel eating up most of the CPU cycles (70-95%).
The machines usually recover, but they did crash a couple of times in the
last week.
From /var/adm/messages:
Jul 24 19:37:04 eesun2 unix: NOTICE: ufs_bmap: realloccg failed
Jul 24 19:37:05 eesun2 unix: NOTICE: ufs_bmap: realloccg failed
Jul 24 19:37:14 eesun2 last message repeated 67 times
...
and
Jul 24 21:19:41 eesun2 unix: NFS write error on host eesun1: error 49.
Jul 24 21:19:41 eesun2 unix: (file handle:
Jul 24 21:19:41 eesun2 unix: 800096
Jul 24 21:19:41 eesun2 unix: 2
Jul 24 21:19:41 eesun2 unix: a0000
Jul 24 21:19:41 eesun2 unix: 307ef
Jul 24 21:19:41 eesun2 unix: 2da6d634
Jul 24 21:19:41 eesun2 unix: a0000
Jul 24 21:19:41 eesun2 unix: 2
Jul 24 21:19:41 eesun2 unix: 6e008176
Jul 24 21:19:41 eesun2 unix: )
The other twin exhibits similar error messages [...]
Would someone please explain to me what "ufs_bmap: realloccg failed"
and "NFS write error on host eesun1: error 49" mean, and where such
messages are documented?
----------------------------------
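To get a feel for how often each message shows up, the syslog excerpt above can be tallied with a few lines of Python. This is only a rough sketch: the function name tally_messages and the regular expressions are my own assumptions about the line layout seen in the excerpt, not anything Sun ships, and it folds "last message repeated N times" lines back into the preceding message.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpt: "Mon DD HH:MM:SS host message..."
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+[\d:]{8})\s+(\S+)\s+(.*)$")
REPEAT_RE = re.compile(r"^last message repeated (\d+) times?$")


def tally_messages(lines):
    """Count messages per (host, text), expanding syslog repeat lines."""
    counts = Counter()
    last_key = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip anything that does not look like a syslog line
        _, host, rest = m.groups()
        rep = REPEAT_RE.match(rest)
        if rep:
            # Credit the repeats to the message that preceded this line.
            if last_key is not None:
                counts[last_key] += int(rep.group(1))
            continue
        last_key = (host, rest)
        counts[last_key] += 1
    return counts


sample = [
    "Jul 24 19:37:04 eesun2 unix: NOTICE: ufs_bmap: realloccg failed",
    "Jul 24 19:37:05 eesun2 unix: NOTICE: ufs_bmap: realloccg failed",
    "Jul 24 19:37:14 eesun2 last message repeated 67 times",
]
counts = tally_messages(sample)
for (host, text), n in counts.most_common():
    print(n, host, text)
```

On a live machine the same function could be fed open("/var/adm/messages") instead of the sample list.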
Solution (mostly)
========
Casper H.S. Dik - Network Security Engineer (Casper.Dik@Holland.Sun.COM)
wrote:
>Sun broke 101945-27; ask them kindly for 101945-32 which fixes the NFS prob.
It's not brokenness specific to 101945-27. It's generic Solaris 2.4
quota brokenness. 101945-32 indeed fixes the quota problem.
...
Then again
...
1186805 other [] too many write error and EDQUOT messages from nfs to syslog
1175931 nfs loops on async write errors
(the first fix does away with the messages, the second fix makes sure
that NFS doesn't keep on retrying to write-behind in case of filesystem
full/over quota errors)
The university I used to work for had a lot of problems with this.
Patch 101945-32 fixed it for them
....
Davin Milun, milun@cs.Buffalo.EDU, contributed that:
....
Actually, even running T101945-33 from Sun, we have still had one ufs_bmap
hard hang on our NFS server (the only time there was even slightly
significant load since we installed -33). The patch helped *a lot*, but is
still not perfect.
....
Hans van Staveren, sater@cs.vu.nl, wrote:
>cfs@cssun.mathcs.emory.edu (Charles Stephens) writes:
>> 101945-29 is now on sunsolve1.sun.com - released last Friday (8/4) it
>> appears.
>I would beg SunService for -33. It is a dream come true. NFS hangs
>have gone away and we can now use quota without fear! -33 seems very
>stable.
We have a T version of -33. It does *not* solve the machine hangs when
users go over quota, or the file system fills up.
....
--------
This roughly summarizes our experience: T101945-33 from Sun tech support
does indeed completely remove the "NFS write error 49" messages, and syslogd
no longer gets frantic trying to log a constant flow of errors.
The "ufs_bmap: realloccg failed" still visits us from time to time and
the kernel does go hyperactive when quotas are exceeded across NFS, but the
machines are stable.
As Davin Milun said, the patch is not perfect but it does help *a lot*.
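For the record, the "error 49" in the NFS message lines up with the EDQUOT mentioned in the bug synopsis above: on Solaris, errno 49 is EDQUOT (disc quota exceeded). The sketch below decodes such a line. The small table is hand-copied from the Solaris errno values as I understand them and is an assumption, as are the names SOLARIS_ERRNO and explain_nfs_error; Python's own errno module would give the values of whatever system it runs on, which may differ.

```python
import re

# Hand-written subset of Solaris errno values (assumption: taken from
# <sys/errno.h> on Solaris; other systems number EDQUOT differently).
SOLARIS_ERRNO = {
    28: ("ENOSPC", "No space left on device"),
    49: ("EDQUOT", "Disc quota exceeded"),
}

NFS_ERR_RE = re.compile(r"NFS write error on host ([^\s:]+): error (\d+)")


def explain_nfs_error(line):
    """Return (host, code, name, description) for an NFS write error line."""
    m = NFS_ERR_RE.search(line)
    if not m:
        return None
    host, code = m.group(1), int(m.group(2))
    name, desc = SOLARIS_ERRNO.get(code, ("unknown", "not in this table"))
    return (host, code, name, desc)


sample_line = "Jul 24 21:19:41 eesun2 unix: NFS write error on host eesun1: error 49."
print(explain_nfs_error(sample_line))
```

Codes not listed in the table come back as "unknown" rather than being guessed.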
My deep thanks go to those on this group who have shared their
expertise and experience:
Peter Bunclark, psb@lyra.csx.cam.ac.uk
Casper H.S. Dik, Casper.Dik@Holland.Sun.COM
Chris Peck, chris@zork.cc.binghamton.edu
Davin Milun, milun@cs.Buffalo.EDU
Hans van Staveren, sater@cs.vu.nl
Kurt Bertelsen, kurtbert@cray.com
--
alex khalil                  sysadmin
iskandar@tamu.edu            Electrical Engineering Dpt.
office: Zachry 30G           Texas A&M University
voice: (409)845-7530         College Station TX 77843-3128
fax: (409)845-1556