A few days back I asked:
> We are having problems with some processes going to disk wait state
> (ps -aux shows status D) on our SUNs (4/280, 4/330 or 3/50's). The
> processes most frequently affected are the nfs daemons and other daemon
> processes. When a process is in the disk wait state, it cannot be
> killed either. The only solution seems to be to reboot the system
> to make it usable. Is there a better solution than this? It is
> difficult to convince the users on the system that it has to be rebooted
> in the middle of the day because some process is hung.
and I received a few pointers suggesting that this bug has been reported
before and that Sun has a patch tape for it.
Here is a brief summary:
The bug IDs are 1017518 and 1017893. The following description of them
was provided by Dennis Michael <dennis@jessica.Stanford.EDU>:
Occasionally, on NFS server machines, the nfsd daemons have been
reported to get into a disk wait ("DW") state, as shown in a
listing of "ps aux". This condition causes all client requests
to the server to fail. Problem descriptions reported in Sun
bugIds 1017518 and 1017893 identify at least two distinct
causes of this problem, described below:
Case 1017518:
On the server system, processes go into DW state
and don't return. This problem is related to VM
and may happen even in non-NFS instances. The
core dump will show _sleep, _cv_wait, _page_cv_wait,
and _page_wait at the top of the stack trace. Basically
the process is blocked waiting for the keep count on the
page it wants to go to zero (meaning that it is available)
but somehow it didn't get decremented correctly and will
never go to zero.
Case 1017893:
This is a server problem similar to the client problem
in bugId 1018954. The process is blocked waiting for an
mbuf structure to be released back to NFS, but it is
never being released. The core dump for this problem
shows the hung process with a stack trace of _svc_sendreply,
_svckudp_send(0x7hexdigits,0x7hexdigits) + 2C, _sleep.
The routine svckudp_send is trying to send a reply to the
client, but is blocked waiting for the mbuf structure
pointed to by the first 0x7hexdigits argument above.
Actually, the first 0x7hexdigits argument to svckudp_send
is a SVCXPRT pointer, not an mbuf. However, it's possible
to derive the mbuf's address given this argument.
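The two stack-trace signatures above can be told apart mechanically. What follows is only a rough sketch in Python, assuming the trace has been saved as plain text with one frame per line. The function name classify_trace and the SIGNATURES table are hypothetical and are not part of any Sun tool.

```python
import re

# Hypothetical lookup table built from the function names quoted in the
# two bug descriptions; it is not an official Sun diagnostic.
SIGNATURES = {
    "1017518": ("_sleep", "_cv_wait", "_page_cv_wait", "_page_wait"),
    "1017893": ("_svc_sendreply", "_svckudp_send", "_sleep"),
}


def classify_trace(trace_text):
    """Return the bug IDs whose signature functions all appear in the trace."""
    # Pull out symbols such as _page_wait, ignoring arguments and offsets.
    symbols = set(re.findall(r"\b_\w+", trace_text))
    return [
        bug_id
        for bug_id, functions in SIGNATURES.items()
        if all(name in symbols for name in functions)
    ]
```

An empty result means the trace matches neither description, in which case the hang may have a different cause.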
There currently are two patches available for this case:
1) an adb patch which sets nfsreadmap to 0:
        # adb -w /vmunix -
        nfsreadmap?W 0
        $q
This eliminates most of the code that increments and
decrements the keep count.
2) The included patched ufs_bmap.o files, which fix a bug in
bmap() where "softlocked" pages were never released after a
failure to extend the original block.
It may not be necessary to apply both patches. It is recommended
that the ufs_bmap.o patch be tried first, and the adb patch added
only if the problem persists.
Sun has a patch tape called "nfsd_dw_hanging". I have requested it,
and hopefully the problems will disappear once I install the patches.
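To check whether any daemons are still stuck after patching, one could scan the "ps aux" listing for a state that begins with D. This is a minimal sketch in Python, not a tested tool. It assumes the listing has a header row containing PID, STAT and COMMAND, and that no column before COMMAND contains embedded spaces. The function name disk_wait_processes is hypothetical.

```python
def disk_wait_processes(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'."""
    lines = ps_output.strip().splitlines()
    if not lines:
        return []
    # Locate the columns by name so the sketch does not depend on their order.
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")
    hung = []
    for line in lines[1:]:
        # Split at most cmd_col times so the command keeps its own spaces.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # short or malformed row; skip it
        if fields[stat_col].startswith("D"):
            hung.append((int(fields[pid_col]), fields[cmd_col]))
    return hung
```

The function takes the listing as a string, so it can be fed a saved copy of the output or the result of running ps from a script. A process that shows up on several checks a few minutes apart is a likely candidate for the hang described here, since a healthy process normally leaves disk wait quickly.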
Thanks to:
Richard Elling <relling@eng.auburn.edu>
Chris Barry <cbarry@BBN.COM>
rackow@antares.mcs.anl.gov
Dennis Michael <dennis@jessica.Stanford.EDU>
Rob ten Kroode <roberto@cwi.nl>
halstern@Sun.COM (Hal Stern - Consultant)
for some useful pointers.
ram
--------
Janakiram Cherala Internet: ram@cs.orst.edu
Sun System Administrator
Computer Science Department UUCP : hplabs!hp-pcd!orstcs!ram
Oregon State University, Corvallis, OR 97330 (503) 737-3273