original message:
>> Help.
>>
>> We have 2 4/490's running 4.1PSR_A. One (gracie) is a fileserver, the
>> other (george) is the main machine.
>>
>> George's kernel has the PMEG patch, but this problem appeared before
>> the patch went in.
>>
>> George also runs the Sybase SQL server 4.0.1.
>>
>> All user home directories are nfs mounted from gracie to george across
>> an FDDI cable, and we are running Sunlink FDDI 1.0.
>>
>> Occasionally, the following happens:
>>
>> george % ps arlx
>> F UID PID PPID CP PRI NI SZ RSS WCHAN STAT TT TIME COMMAND
>> 80003 0 0 0 0 -25 0 0 0 runout D ? 3:50 swapper
>> 80003 0 2 0 0 -24 0 0 0 child D ? 0:14 pagedaemon
>> 80000 0 122 1 11 -15 0 24 0 Sysbase DW ? 44:30 update
>> 4019566 510 503 1 -5 0 0 0 Z ? 0:00 <defunct>
>> 204080019623 3847 3844 82 45 0 160 528 R ? 0:15 twm
>> 200884019561 2266 1154 37 -5 0 0 0 Z r3 0:00 <defunct>
>> 200080019618 3790 1500 0 -25 0 96 96 kernelma D r6 0:01 tar xvf /dev
>> 200000019606 3982 3628 14 28 0 208 504 R se 0:00 ps alrx
>> 200080018473 3972 3971 0 -25 0 440 752 kernelma D sf 0:00 ld -dc -dp -
>>
>> and eventually more and more processes end up as D or DW waiting at
>> kernelmap, while the cpu remains ~98% idle.
>>
>> netstat -m shows no unusual statistics, but a vmstat on gracie (the
>> fileserver) shows very little disk activity.
>>
>> I can sometimes "fix" the problem by bringing down the FDDI link, then
>> bringing it up ~30 seconds later (causing NFS timeouts).
>>
>> I, and our local Sun on-site support guy, have no new ideas on how to
>> fix this - I hope that someone out there can point me in the right
>> direction.
"Fuat C. Baran" <fuat@cunixf.cc.columbia.edu>,
and Ian Angles <ia@st-andrews.ac.uk>
suggested installing patch 100077, which fixes nfsd and biod hangs.
halstern@Sun.COM (Hal Stern - Consultant)
suggested looking at mbuf usage, and installing the mbuf workaround
patch. This patch is mutually exclusive with the PMEG patch (binary
patches only, of course).
kevin@Corp.Sun.COM (Kevin Sheehan {Consulting Poster Child})
mentioned the rpc.lockd problem. We use aliases in the /etc/exports
file shorter than 20 chars to avoid this problem.
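For anyone who wants to check their own exports file against that
20-character rule of thumb, here is a rough Python sketch. It assumes
the usual SunOS 4.x layout of "directory -option,option" with
colon-separated host lists after access=, root= and rw=. The function
name long_export_names and the sample host names are made up for
illustration, and the limit is taken only from the remark above, not
from any Sun documentation.

```python
# Assumption: names must be "shorter than 20 chars", i.e. at most 19.
MAX_NAME_LEN = 19

# Assumption: these are the export options that carry host lists.
HOST_LIST_OPTIONS = ("access", "root", "rw")


def long_export_names(exports_text, max_len=MAX_NAME_LEN):
    """Return (line_number, name) pairs for host names longer than max_len."""
    offenders = []
    for lineno, raw in enumerate(exports_text.splitlines(), start=1):
        # Drop comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split(None, 1)
        if len(fields) < 2:
            continue  # directory exported with no options
        options = fields[1].lstrip("-")
        for option in options.split(","):
            key, sep, value = option.partition("=")
            if not sep or key.strip() not in HOST_LIST_OPTIONS:
                continue  # flag-style option such as "ro"
            for name in value.split(":"):
                name = name.strip()
                if len(name) > max_len:
                    offenders.append((lineno, name))
    return offenders


if __name__ == "__main__":
    # Hypothetical sample; substitute the contents of your own file.
    sample = (
        "/home -access=george:a-very-long-client-name.example.com\n"
        "/usr/local -ro\n"
    )
    for lineno, name in long_export_names(sample):
        print("line %d: %s (%d chars)" % (lineno, name, len(name)))
```

Reading the file as text and passing it in keeps the check separate
from wherever the exports file actually lives on a given system.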
era@niwot.scd.ucar.EDU (Ed Arnold) also sent a response.
Thanks to all who sent ideas. No one seems to have a clear idea on
exactly what is going on here. I've logged an official call to Sun,
so now it's just wait and see.
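
While waiting, it may help to record how many processes are stuck and
on which wait channel each time the hang shows up. The Python sketch
below tallies D and DW processes by WCHAN from saved "ps alx" output.
Because wide F and UID values run together in the listing above, it
does not count columns from the left; instead it finds the TIME field
and reads STAT and WCHAN relative to it. That layout is an assumption
based only on the listing in the original message, and the function
name tally_wait_channels is made up for illustration.

```python
import re
from collections import Counter

# TIME looks like "3:50" or "44:30" in the listing.
TIME_RE = re.compile(r"^\d+:\d\d$")


def tally_wait_channels(ps_text):
    """Count processes in state D or DW, grouped by their WCHAN column."""
    counts = Counter()
    for line in ps_text.splitlines():
        tokens = line.split()
        # Locate the first token that looks like a TIME value.
        idx = next((i for i, t in enumerate(tokens) if TIME_RE.match(t)), None)
        if idx is None or idx < 3:
            continue  # header line or something unparseable
        stat = tokens[idx - 2]   # STAT sits two fields before TIME
        wchan = tokens[idx - 3]  # WCHAN, when present, sits before STAT
        if not stat.startswith("D"):
            continue
        if wchan.lstrip("-").isdigit():
            wchan = "(none)"     # numeric field here means WCHAN was blank
        counts[wchan] += 1
    return counts


if __name__ == "__main__":
    sample = """\
80003 0 0 0 0 -25 0 0 0 runout D ? 3:50 swapper
200080019618 3790 1500 0 -25 0 96 96 kernelma D r6 0:01 tar xvf /dev
200080018473 3972 3971 0 -25 0 440 752 kernelma D sf 0:00 ld -dc -dp -
"""
    for wchan, n in tally_wait_channels(sample).most_common():
        print(wchan, n)
```

Run against a snapshot taken every few minutes, a growing count for
one channel would give support something more concrete than a single
listing.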