Here is a late summary of the replies I received. I haven't been
able to test any of the suggestions (some had already been done), as the
machine hasn't had the problem at all recently. This may have been
because a different kernel was built, but I'm not sure (extra ethernet
interfaces were added to the machine).
A couple of replies mention patches of various kinds. I have a lot of
kernel patches installed, but I just ftp'ed the index and found that
some have later version numbers --- is this for "fixing the fixes" or
for other Sun machines?
Peter Glassenbury Computer Science dept.
firstname.lastname@example.org University of Canterbury
Me> THE PROBLEM
Me> Every few days (and sometimes 2 or 3 times the same day) the load
Me> on a machine (mainly one machine, but it has been others) goes through the
Me> roof. Things keep getting slower and slower until there is no response
Me> for anyone and an L1-A has to be used to restart the machine. (Around a
Me> load of 80 or 90.) One time we got in early as the load was climbing
Me> (about 8), su'ed to root, nice'd to -10, and had a look. "ps" and "top" showed
Me> next to no user processes around and nothing that was using any large
Me> amounts of resources. A "pstat -T" showed files, inodes, processes, and
Me> swap tables all less than 50% utilised. A "vmstat 1" showed no paging,
Me> (in fact no disk activity at all), next to no user cpu, but in the
Me> 90-100% range for system cpu.
Me> A "netstat 1" showed the ethernet was still going and that there were
Me> quite a few collisions but still a normal amount of traffic.
> From: reardon@sws.SINet.SLB.COM (Bob Reardon Maildrop 3F - x4794)
> Sparc1 systems had a serious memory management problem - I think
> Sparc1+ also had it. As I remember, it affected machines with more
> than some minimum amount of memory (16MB?). It was known as
> the 'pmeg' problem. Have you checked whether this is what you are seeing?
I couldn't see this patch in the Index, but I am certain we had this
problem quite a while ago and it was fixed then. I believe it panicked
with a pmeg message (is that right?) -- our machines don't panic; they just
keep getting more loaded down.
> From: stern@sunne.East.Sun.COM (Hal Stern - NE Area Systems Engineer)
> this sounds like a bug in 4.1.1 that is caused by inode
> and page cache thrashing. what was your page attach (at in
> vmstat output) rate during this time? if it was > 20 or
> so, you're probably being stung by the ufs_inactive() bug.
> try patch 100259.
> without more details (what OS, what patches, what kind of client
> load, etc) it's hard to say more. do you have the NFS jumbo
> patch installed?
This sounded promising -- our page attach rate fluctuated between 40 and 170
(never below 40), but our normal heavy running shows the "at" field at
between 20 and 70 or 80.
I do have 100259-01 installed (the Index for 4.1.1 at princeton.edu
doesn't list it, though).
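For anyone wanting to watch for this, here is a minimal sketch of the
check. The column position of "at" (7th, after r b w avm fre re) is an
assumption about the SunOS 4.x "vmstat 1" layout, and the 20/s threshold
is just Hal's figure; adjust both to taste.

```shell
# Sketch: flag vmstat samples whose page-attach ("at") rate is high.
# Assumes "at" is the 7th column and two header lines precede the data.
flag_high_at() {
  awk -v limit="${1:-20}" \
    'NR > 2 && $7 > limit { printf "high attach rate: %s\n", $7 }'
}

# Live usage would be something like:  vmstat 1 | flag_high_at 20
```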
We are running SunOS 4.1.1 with a heavy client load of X terminals that boot
from that machine, plus a couple of ELCs that boot and swap to it.
We have the following patches
100075-06 100179-01 100211-02 100243-01 100265-01
100126-05 100185-01 100216-01 100250-01 100273-01
100149-03 100188-01 100225-02 100254-01 100293-01
100159-01 100192-01 100228-02 100255-01 100343-01
100173-03 100198-01 100232-01 100259-01 100376-01
100174-01 100199-01 100233-01 100262-01
100173-07 is the NFS Jumbo Patch -- ours is -03.
> Date: Wed, 15 Apr 92 9:47:19 EDT
> From: etnibsd!vsh@uunet.UU.NET (Steve Harris)
> We've had problems involving sendmail and automount. Not clear to us what
> the exact problem is, but here is a synopsis:
> client sends mail to main server
> server sendmail attempts to examine .forward file in
> recipient's home directory
> has to automount recipient's machine
> machine does not respond (who knows why)
> automount blocks, sendmail blocks, retries take place
> server slows down, other sendmails not serviced, more retries
> -> thermonuclear chain reaction -> meltdown
> The clue is to look for several sendmail daemons running. The solution
> is to reboot the recipient's machine.
> Good luck, please summarize.
We didn't have any process taking up time.
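Steve's check for piled-up sendmail daemons is easy to script. A sketch
(the ps flags and the exact sendmail process names are assumptions; the
[s] in the pattern just keeps the grep itself out of the count):

```shell
# Sketch: count sendmail processes in ps-style output on stdin.
# A steadily growing count would point at the automount/.forward
# pileup described above.
count_sendmails() {
  grep -c '[s]endmail'
}

# Live usage would be something like:  ps axw | count_sendmails
```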
> From: Rick Dipper <email@example.com>
> sounds like an ethernet problem to me.
> We often have "broadcast storms" where the ethernet load goes
> through the roof & slows all our machines down.
> You need to contact your network guru; our problems are usually
> caused by students being "clever" with PC's
Unfortunately the gurus around here are us :-)
The ethernet load kept going but it wasn't extremely high -- it stayed
about normal. (Not all machines slowed -- only the one server and all the
clients served off it.)
> From: phillips@athena.Qualcomm.COM (Marc Phillips)
> You should definitely look into Sun patch number: 100330-05
> This is a jumbo patch which deals with many processes getting
> stuck in disk wait states (causing load to skyrocket), and it deals
> with other kernel problems.
We didn't have any disk-wait problems (I couldn't see any like that),
but we DON'T have that patch installed --- I'll get it and install it.
> From: firstname.lastname@example.org (Oran Davis)
> We have seen this happen with clients that lose their server windows.
> Even mwm can hang without dying and hog the CPU while listening for
> input from its server.
> You can scan for these runaway processes by asking everyone to log out. Any
> process still running is the culprit. Beats L1-A.
We didn't have any process taking up time.
> From: email@example.com Raymond Ballisti (Ray),
> I remember having had a similar problem years ago with a 3/50 and the old
> system 3.5. If you did not solve the problem in between (which I hope
> for you), try to look at the syslog daemon (and use ps with the x option
> to see it).
> I remember that it took most of the cpu time. I do not remember the
> (I should look in my old logs ..) but I think it was an entry in
It wasn't taking up time -- we do have the X terminals (Sun 3/50s running
xkernel and SunOS 4.0.3 kernels) running syslogd -- I'll look into
that side of it to check that they are not doing something stupid and
loading down a server with messages over the ethernet.
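One quick way to check that would be to tally syslog lines by
originating host. A sketch, assuming the usual "Mon DD HH:MM:SS host
..." syslog line layout (so the host is field 4):

```shell
# Sketch: count syslog messages per originating host, most talkative
# first, to see whether the X terminals are flooding the server.
per_host() {
  awk '{ count[$4]++ } END { for (h in count) print count[h], h }' | sort -rn
}

# Live usage would be something like:  per_host < /var/adm/messages
```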
> From: shipley@tfs.COM (Pete Shipley)
> From: Patrick Lynn Shopbell <firstname.lastname@example.org>
> Here's a similar problem that causes this effect - has
> to do with NFS clients getting stuck with stale file handles.
> I don't know if this is your problem, but perhaps worth a look.
> What follows are a couple of messages posted here a while back.
> Good luck!
> > Posted-Date: Fri, 13 Dec 91 16:59:44 CST
> > Subject: Re: Nfs daemons go comatose
> > I recently wrote:
> > > Our 4/490 running 4.1.1 (with various patches) has been experiencing some
> > > seemingly random NFS hangups recently. The symptoms are always the
> > > same: all exported file systems disappear from remote mounts, although
> > > the server still thinks they are being exported; all attempts by clients
> > > to either mount or access already mounted file systems result in "NFS getattr
> > > failed for server: RPC: Timed out"; the load average on the server becomes
> > > artificially high, not reflective of the actual job load - as if the kernel
> > > was hung trying to handle some kind of internal event. Ps -aux reveals
> > > all nfsd processes in the same state:
> > >
> > > > root 107 0.0 0.0 40 0 ? D 17:19 0:00 (nfsd)
> > > > root 104 0.0 0.0 40 0 ? D 17:19 0:00 (nfsd)
> > > > root 112 0.0 0.0 40 0 ? D 17:19 0:00 (nfsd)
> > > Once this situation occurs, a reboot is necessary to restore normal NFS
> > > functionality.
> > >
> > > Consulting a list of patches I got from this mailing list recently, I
> > > found the following:
> > >
> > > > 100040-01 NFS daemons can hang in a DW wait state.
> > >
> > >
> > > which sounds like what's happening on my machine. I'm currently looking
> > > for this patch, but haven't been able to find it on any of my usual sources.
> > > Has anyone out there installed it ? It occurs to me that it might be
> > > included in a more recent `combo' patch (i.e. NFS Jumbo Patch, 5 Nov 91),
> > > but there's no way to tell since all fixed problems are listed in the
> > > README by bugid and not cross-referenced by patch number. None of the
> > > bug descriptions seem to match.
> > >
> > > Or maybe the above patch doesn't even pertain here. If anyone listening
> > > has any insight into this problem, I'd be pleased to hear from you.
> > The source of this problem turned out to be an NFS client repeatedly getting
> > hung trying to write on a stale file handle. There's no indication of
> > a problem with the server's disk in the messages file, but NFS seems to gag
> > when this client writes to certain files on it (probably would happen with
> > other clients, too, but only this particular client mounts it read/write).
> > Occasionally after a crash we find a large amount of stuff on the disk
> > inexplicably trashed, and it's always the same stuff. I'm tempted to try
> > installing the patch(es) suggested by some of those who responded to my
> > plea for help, but with the release of 4.1.2 only a few weeks away and a
> > bunch of other stuff on my list to be done, I think I'm just going to wait
> > till I install 4.1.2 and see if the problem still exists. Meantime, we'll
> > just avoid the situation that creates this problem.
> > In the latest instance of the problem described above, rebooting the client,
> > then the server, cleared it. The server has been running for 3 days now,
> > and all 8 nfsd's are still viable.
> > Sid Shapiro @ ingres.com had good advice which enabled me to figure out where
> > the problem was originating, although we were unable to kill the hung program
> > on the client as he suggested, and hence had to reboot it. Here's his
> > response:
> > Date: Wed, 11 Dec 91 09:16:06 -0800
> > From: email@example.com
> > there are two reasons that I am familiar with for nfs daemons going
> > into D state. One is that a local disk has "gone away". The other is
> > that there is a run-away process on a client that has a bad NFS file
> > handle.
> > You can tell the difference by looking at nfsstat. Running nfsstat a
> > few times - if the number of attr and lookup and read calls goes up
> > very quickly, then some client is banging on a probably bad NFS file
> > handle. You can use etherfind to see which client it is - look at
> > the client, find the process in "D" state there and kill it.
> > If it is a local disk on the server that has gone south you ought to
> > see messages about it on the console or in the messages file, and nfsstat
> > won't show dramatic increases.
> > Good luck.
> > --
> > Sid Shapiro -- Ingres, an ASK Company
> > firstname.lastname@example.org (415)748-3470
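Sid's nfsstat test can be semi-automated. The sketch below pulls a
named call counter out of nfsstat-style output; the header-line-of-names
followed by a line-of-counts layout is an assumption about the SunOS
4.x format, so treat it as a starting point only.

```shell
# Sketch: extract a named NFS call counter (e.g. getattr) from
# nfsstat-style output on stdin.  Assumes a line of call names is
# followed by a line of counts in the same column order.
nfs_count() {
  awk -v name="$1" '
    found { print $col; exit }
    { for (i = 1; i <= NF; i++) if ($i == name) { col = i; found = 1 } }'
}
```

Run it twice a few seconds apart (e.g. `nfsstat -s | nfs_count getattr`);
per Sid's note, a rapidly climbing getattr/lookup count points at a
client banging on a bad file handle.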
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:41 CDT