SUMMARY: Extremely high load averages!

From: Systems Administrator (sysadmin@astrosun.tn.cornell.edu)
Date: Thu Mar 07 1996 - 14:14:53 CST


Hi.

Thanks to all for the responses. Although I haven't solved the problem
yet (since the condition hasn't happened recently) I got a lot of good
info.

My original post:
I have been experiencing extremely high load averages on one of my
servers.

The server is a sparcserver 20 running SunOS 4.1.3_U1. It is the file,
print, email and NFS server for approx 30 or so Sun workstations.

Every once in a while I will get a message from the users telling me that
things are EXTREMELY slow. I'll do an 'uptime' on the server and get
the following:

3:57pm up 96 days, 43 mins, 3 users, load average: 12.06, 10.09, 8.27

What is the load that is being averaged? What could cause this? What
should I be looking for and how?

The normal average is:
11:20am up 98 days, 20:06, 1 user, load average: 0.21, 0.09, 0.01

I have been checking the running processes at the time when things are
'normal' and comparing it with the times when there are high averages.
Nothing looks out of the ordinary. Top shows nothing strange. I just
learned about checking the ethernet packets to see if anything is going
on there.

Is there anything else I can do? We ARE seeing a degradation of the
network speed.
------------------------------------------------------------------------

To summarize:
There is a lot of good info included below. What I am doing is writing
a script that will run several programs and log the results over time so
that I can compare and see where my problem lies.

The commands I am including in this script are:
ps; vmstat; iostat; nfswatch (see the included message for its
location); nfsstat; uptime; netstat -i; etherfind.

Some of this may be overkill, but I want to make sure I get all of the
info I need.
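
For what it's worth, here is a minimal sketch of what I have in mind;
the log file name, sample counts and the 5-minute interval are just
placeholders, not recommendations:

    #!/bin/sh
    # Rough sketch of the logging script described above.
    LOG=/var/tmp/perflog.`date +%y%m%d`

    while :; do
        echo "==== `date` ====" >> $LOG
        uptime                  >> $LOG
        ps aux                  >> $LOG    # full process list
        vmstat 5 3              >> $LOG    # run queue, paging, cpu
        iostat 5 3              >> $LOG    # disk and tty activity
        nfsstat -s              >> $LOG    # NFS server call counts
        netstat -i              >> $LOG    # packets, errors, collisions
        sleep 300                          # one sample every 5 minutes
    done

nfswatch and etherfind are interactive, so I'll run those by hand when
the load spikes rather than from the script.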

Hope this helps. Thanks to the following:
------------------------------------------------------------------------
From: Mitch Patenaude <mrp@hilbert>
Load average is a poor measure of system load. It is the average number
of processes waiting to run when a time-slice becomes available. (The
three numbers you saw are averaged over 1, 5 and 15 minutes respectively,
so you have an idea whether this is a short-term or long-term
phenomenon.) Although load averages that are consistently very high
indicate that the system is overworked, low load averages don't
necessarily correspond with quick system performance, and high load
averages don't always indicate slow system performance.

A much better way to check system performance is using the 'vmstat(1m)'
command. The command

% vmstat 5

will produce a set of system performance statistics at 5-second
intervals until interrupted. The first line is always garbage and
should be ignored. In subsequent lines, check the columns labeled
fre (free memory), po (pages written out), us (user cpu usage, as % of
total), sy (system cpu usage, as % of total) and id (idle cpu time,
as % of total).

Things to look for are a large number of pages being written out (po), a
small value for fre (free memory), and high values of system cpu usage
with no idle time. These are indicators that you need more physical
memory, since the cpu is spending a large portion of its time writing
out and reading in pages from swap. Page-outs are a better measure of
thrashing (heavy swap usage) than page-ins, because page-ins are also
caused by new processes being forked.
If fre is reasonable and po is small, but user cpu usage is high, then
you have enough memory but the system is being overworked
computationally. Also check the iostat command to see if disk/serial
i/o is causing problems.
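
A quick sketch of how I plan to watch for the condition he describes;
the field numbers ($5 = fre, $9 = po) and the free-memory threshold are
my own assumptions based on the usual SunOS 4.x vmstat layout, so check
them against the header line before trusting this:

    vmstat 5 | awk '
        NR <= 3 { next }         # skip the two header lines and the bogus first sample
        $9 > 0 || $5 < 1000 {    # page-outs happening, or free list low
            print "possible memory shortage: fre=" $5 " po=" $9
        }'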
------------------------------------------------------------------------
From: Michael Blandford <mikey@lanl.gov>
The amount of NFS traffic seems to really add to the 'uptime'
value, but even top won't show it because it isn't CPU-bound.
------------------------------------------------------------------------
From: ukcphmr@ukpmr.cs.philips.nl (Mike R. Phillips 3788)
I expect your problem is a rogue NFS client reading and writing
large amounts of data. Get a copy of 'nfswatch' off the network;
it will allow you to see which NFS clients are hitting which
file systems on your server. 'nfswatch' has dug me out of a number
of situations similar to yours.
------------------------------------------------------------------------
From: John Cheshire <john@stellarperformance.co.uk>
Do you have jumbo kernel patch 101508 installed? This patch fixes
kernel stability problems, but it has introduced a performance problem
that can take machines to high cpu load when not a great deal is
happening. Sun has released a 101508-11T, as far as I know. The T means
that it is a temporary patch. Otherwise, I have fixed a similar problem
by taking the vm_hat.c from a 4.1.4 patch and compiling it into the
SunOS 4.1.3_U1 kernel.
------------------------------------------------------------------------
From: Casper Dik <casper@holland.Sun.COM>
NFS servers run a number of NFS daemons. Typically such high load
averages mean that someone is doing something bad to the server; this
can be any of:
        - running find (build a fast-find database instead)
        - writing a large file over NFS.

Get NFSwatch: harbor.ecn.purdue.edu:/pub/davy/nfswatch4.3.tar.gz
and see which client is bombarding you.
------------------------------------------------------------------------
From: Peter Hesse <hessep@gb.swissbank.com>
The load average is the number of processes in the run queue averaged
over 1, 5, 15 minutes. As you still average 8 over 15 minutes, this is
not a transient; something is hitting the system hard every so often.

You might consider watching system performance with vmstat 1 (1st
column is the number of processes in the run queue) or top -s 1.

Vmstat helps spot things that happen periodically; that might lead you
to a cron job or something sleeping in a loop that bursts into life and
spawns loads of kiddies.

Top shows the "last pid:". Watch this to see if something is spawning
processes quickly or if a similar set of processes appear; grep them
from a ps aux to find the parent.

I have enclosed a sed script you might find useful; follow the
instructions in it running ps during a quiescent period and again when
loading is high. It should yield output something like:

         F UID PID PPID CP PRI NI SZ RSS WCHAN STAT TT TIME COMMAND
< a0088001 751 104 103 0+80 272 select - S q0 0:10 cmdtool -rv
> a0088000 751 104 103 0+80 0 select - IW q0 0:10 cmdtool -rv
< 20008001 751 1749 1738 0+28 140 Sysbase - S q1 0:00 less perf.sed
> 20008000 751 1749 1738 0+28 0 Sysbase - IW q1 0:00 less perf.sed
< 20088001 751 8298 1 0+60 792 select - S co 1:38 mailtool -rv
> 20088001 751 8298 1 0+60 2180 select - S co 1:40 mailtool -rv
< 20008001 751 16147 1 0+92 256 select - S co 0:30 sv_xv_sel_svc
> 20008000 751 16147 1 0+92 0 select - IW co 0:30 sv_xv_sel_svc
< a0088001 751 29142 29141 0+80 284 select - S pd 0:05 cmdtool rsh xxxxxxx
> a0088000 751 29142 29141 0+80 0 select - IW pd 0:05 cmdtool rsh yyyyyy

It strips out relatively uninteresting processes and shows processes
which differ between the 2 ps runs, sorted so you can easily spot:

   1. If something is eating your CPU. (Interesting)
   2. New processes. (Perhaps interesting)
   3. Processes which died. (Boring.)

It should work equally well if you run the reference ps under heavy
load and the comparison ps when the load eases.
#
# To use this, create a reference list of processes at a time when
# performance is good, but as soon before a time of poor performance as
# possible:
#
# ps axlww > /tmp/wph.ps1
#
# At a time of poor performance, create a new process list and find
# the differences between it and the reference list:
#
# ps axlww > /tmp/wph.ps2
# diff /tmp/wph* | sed -f ~hessep/sysadm/perf.sed | sort +3n
#----------------------------------------------------------------------

# Reinstate the header removed by diff.
# Note '\' to escape leading blanks which would otherwise be stripped.

1i\
\ F UID PID PPID SZ RSS WCHAN..... STAT TT TIME COMMAND

# Remove uninteresting diff output: Separators, line numbers.

/^---/d
/^[0-9]/d

# Remove uninteresting processes (we hope).
# csh, nfsd...

/(.*)/d
/in./d
/rpc./d
/mount/d
/update/d
/portmap/d
/sleep/d
/cron/d
/swapper/d
/xterm/d
/pagedaemon/d
/ypserv/d
/ypbind/d
/sendmail/d
/top/d
/snmpd/d
/vmstat/d
/ vi /d
/ ps /d

# Discard CP PRI NI.

s/^\(.............................\)........./\1+/

# Separate the flag field, F, UID and PID for the sort. UID is 1-5 chars.
# ps aligns 1-4 char UIDs but offsets 5 char. UID can abut 5 char PIDs.

s/^\(..........\)\( *[0-9][0-9][0-9][0-9][0-9]\) /\1 \2 /
t wchan
s/^\(..........\)\( *[0-9][0-9]*\) /\1 \2 /

:wchan

# Guarantee a WCHAN field - it can be blank. Append "-" to it, just in case.

s/^\(...............................................\) /\1-/
------------------------------------------------------------------------
From: mshon@sunrock.East.Sun.COM (Michael J. Shon {*Prof Services} Sun Rochester)
The load average is simply the number of runnable jobs in the queue.
These are jobs that are not waiting for anything besides a chance
at the CPU. The third number (8.27) is the 15-minute average, so it
shows that the system has been Very busy more or less continuously
for at least the last 15 minutes.

Your server has a lot of different things to do as a "file, print,
email and NFS server", so it may be hard to figure out what the
unusual load is from.

However, my guess is that it is just a result of a hefty NFS load.
All of those nfsd processes have work to do and they all want to
do it right now. This sort of thing would be seen during a major
software build; it might run for 30 minutes or more and keep
a lot of nfsd processes hopping. Nothing would look unusual in the
process list.
Check for a rapid rise in the rate at which CPU time is added to the
nfsd's. That will pin it on NFS.
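
One rough way to check that, assuming BSD-style 'ps aux' output where
the accumulated TIME is the 10th column (adjust the field number if
yours differs):

    ps aux | grep '[n]fsd' | awk '{print $2, $10}' | sort -n > /tmp/nfsd.before
    sleep 60
    ps aux | grep '[n]fsd' | awk '{print $2, $10}' | sort -n > /tmp/nfsd.after
    diff /tmp/nfsd.before /tmp/nfsd.after

If the TIME column jumps noticeably for most of the nfsd's over that
minute, the load is coming from NFS.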

If the users tell you that it seems slow, you should look at your
network traffic levels and see if there is any bandwidth left.
You might need to add more ethernet interfaces and move some users
to another ethernet segment in order to improve things.
[This assumes that your NFS server can support even more NFS traffic
and that it is the wire itself that is limiting you.]
------------------------------------------------------------------------
From: bfr7kq6@is000913.BELL-ATL.COM (Boss)
The load average is not really an average at all. It is a count of the
average number of jobs waiting to be processed. So in the above example,
you have an average of 12 processes waiting to be serviced by your
CPU, and the users say it's slow because their jobs are queued up as
well. Running top might not show your problems because top just shows
the top CPU-usage jobs...they are not always your problem.

Counter-example: I had an X Windows program blow up and pop up hundreds
of xmessage boxes on my screen....well after a while the system couldn't
handle the requests and stopped doing them....my load average was
300+! I had 300 xmessage windows waiting to pop up on my display...
however I could type at the console and do a ps just fine.
I'd do the following: get a ps -fe | sort during this time...this will
sort processes by user name...look for a ppid spawning a bunch of junk.
Do you run some sort of hog network management stuff? Is someone
developing on the machine and blowing up their job?
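
One quick way to spot a parent spawning lots of children; this assumes
the SysV-style 'ps -ef' with the PPID in column 3 (with the BSD ps on
SunOS 4 use 'ps axl', where PPID is column 4):

    ps -ef | awk 'NR > 1 { print $3 }' | sort | uniq -c | sort -rn | head

A parent PID with an unusually large count at the top of that list is
worth a closer look in the full ps listing.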

A slow machine w/ no apparent top usage process usually points to a
paging or swapping problem...use commands like swap -s, vmstat -S,
and sar -r to monitor your virtual memory and swap space. Look at
the man pages to see how to use these commands....look for
blocked processes...that sort of thing.
------------------------------------------------------------------------
From: manjeet@iglou.com (Manjeet Rekhi)
Check if some remote system is NFS-mounted and is not responding.
Automounter is (in)famous for such occurrences.
------------------------------------------------------------------------
From: Steve Phelps <steve@epic.co.uk>
Disconnect it from the network (for a few seconds) when the load average
is high and see if the load drops drastically. This way you know whether
the problem is caused by client activity or by something running locally
on the server.

Run

        host% vmstat 1

and see what the values of 'sys' and 'usr' are when things start slowing
down. If 'sys' is higher than 'usr' when it is slow, but 'sys' drops
rapidly when you unplug the network connection, then it is probably NFS
activity that is the problem - though it is probably a bit early to
conclude that.

The best thing is to try the above (especially vmstat) and repost to the
list.
------------------------------------------------------------------------
From: vahsenr@ce.philips.nl (Vahsen Rob)
This is not so extreme. If the machine is your NFS server and you still
get reasonably good performance on logins to it, you're OK.
You could decrease your nfsd's, but that will decrease the total
throughput of the whole NFS domain.
You could also improve the performance by installing a Presto board.....

The load on our NFS server (SS 10/40, SunOS 4.1.3_U1) is:

  5:22pm up 3 days, 22:30, 2 users, load average: 22.54, 20.07, 19.71

It can go up to 26 or so easily.......
------------------------------------------------------------------------
From: "Michael J. Freeman" <freeman@kutztown.edu>
The load averages in uptime show the average number of jobs in the run
queue for the last 1, 5 and 15 minute time intervals. It's hard to say
exactly what is causing the load to be so high, but you may want to
check processes which are getting an extreme amount of CPU time when
this happens. Those would most likely be the source of your problems.
------------------------------------------------------------------------
From: Alexander Finkel <afinkel@pfn.com>
The only idea that comes to mind is memory or NFS loading. How much
physical memory does the SS20 have and how much swap is configured?

Check nfsstat and iostat output to see if there is a lot of NFS and/or
disk activity. 30 NFS clients may be overloading the SS20 if they are
all doing NFS activity at the same time. Also, if the system is doing a
lot of swapping and trying to handle a lot of NFS at the same time, you
might see the slowdown and the high load averages.

I recommend you get a copy of "Sun Performance and Tuning" by Adrian
Cockcroft. I have found it very helpful in the past. In the book, on
page 97, the load average is defined as "...the sum of the run queue
length and the number of jobs currently running on CPUs."
------------------------------------------------------------------------
From: Torsten Metzner <tom@plato.uni-paderborn.de>
(1) The SS20 is your email and nfs server. Now suppose at every
    workstation someone reads their mail. This could dramatically
    increase the load, especially if there is heavy nfs activity on
    the server's disks.
    How many nfsd daemons are running on your machine?
    How many file systems are exported from your machine?
    How many users are working on the 30 machines?
    Are all 30 clients on the same physical net, and what is the
    configuration of the clients (local / and /usr filesystems, any
    diskless clients)?
    How many collisions are on your network? Try: netstat -i
    (one way to compute the rate is sketched after this list).
    If the collision rate (#Collis/Opkts) is higher than 10% there
    is too much traffic or your server is too slow.
    Other people say that a collision rate > 5% is already critical,
    but I think < 10% is OK. That's only my opinion, and it's
    subjective.

    We had only an SS10 as our email and nfs server, and during
    working hours we sometimes had the same problems. OK, our SS10
    was an email server for ~80 machines {;-) But it was an
    nfs server for only 20 machines!
 
(2) If there were some sendmail problems, the load could become high,
    because then sendmail tries to deliver the queued mail during a
    short period of time. What is the configuration of your sendmail,
    e.g. the values for:

# load average at which we just queue messages
#O QueueLA=8

# load average at which we refuse connections
#O RefuseLA=12
   
(3) Possibly there are some hardware problems.
    How many errors are on your network? Try: netstat -i
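
The collision-rate check he mentions in (1) can be computed straight
from netstat -i. The field numbers ($7 = Opkts, $9 = Collis) match the
usual SunOS 4.x layout, but check them against the header line on your
own system first:

    netstat -i | awk 'NR > 1 && $7 > 0 {
        printf "%-8s %d/%d = %.1f%%\n", $1, $9, $7, $9 * 100 / $7
    }'

Anything consistently over his 5-10% range on the server's interface
points at the wire rather than the machine.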
------------------------------------------------------------------------
From: kwthomas@wizard.nssl.uoknor.edu (Kevin W. Thomas)
High loads on a server for Solaris 1.x, and probably Solaris 2.x are not
unusual. Just about all the load is due to NFS activity. I've seen
load averages above 25 on my 4/670 model/41 with 4.1.3. The client
workstations seem to be reasonably fast even with high server loads.

Probable culprits include:

o Users compressing or uncompressing big files.
o Large core dumps.
o Someone running "find".

I've used the "etherfind" command to find the offending workstation(s)
when the load is high over an extended period.

You might increase the number of "nfsd" processes.

You might also want to look at Hal Stern's book "NFS and NIS".
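
A note for my own reference: on a stock SunOS 4.1.x system I believe the
nfsd count is normally set in /etc/rc.local (a line along the lines of
'nfsd 8 &'), and raising the number there plus restarting the daemons is
all that is needed. Something like the following, done at a quiet time
since the clients will see a brief NFS hiccup:

    # Sketch only - edit /etc/rc.local so the change survives a reboot,
    # then restart the daemons by hand with a higher count.
    kill `ps ax | grep '[n]fsd' | awk '{ print $1 }'`
    /usr/etc/nfsd 16 &

(16 is just an example; Hal Stern's book discusses how to pick a
sensible number.)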
------------------------------------------------------------------------
From: The Friendly Sysadm <lindqt@space.se>
The first thing I would check is whether there are any mail exploders
hiding behind this.
Email can very quickly sink a machine if there is a lot of email
to be sent at the same time.
------------------------------------------------------------------------
From: Glenn.Satchell@uniq.com.au (Glenn Satchell - Uniq Professional
Services)
I'd bet that it's the NFS daemons all running that are pushing up the
load average. Also, since the nfsd's run with kernel priority they will
run ahead of _any_ user processes and there's no way to change that.
This is why you'll sometimes hear the recommendation that you should
dedicate your NFS server and not allow users to log into it to run
things.
------------------------------------------------------------------------
From: Rachel Polanskis <rachel@juno.virago.org.au>
Are you running a WWW server?
If so, is it CERN?

CERN has a spiral-of-death bug where it will keep spawning off children
until it eats all your memory and then slows to a halt.
This is a gradual process.

It would also be advisable to see how many hung processes are around.

I had a perl script which was broken, and one day I looked and there
were 5 or 6 hung processes attached to it.

These should have died several days before. They pushed the load avg
up a lot, but the bottleneck was cleared as soon as I killed them off.

Try looking over your process table for lost children ;)
------------------------------------------------------------------------
From: Kevin.Sheehan@uniq.com.au (Kevin Sheehan {Consulting Poster
Child})
It's probably the fact that 4.x is single-threaded in the kernel. You
might want to think about moving to 5.5...

                l & h,
                kev

PS vmstat 5 would be instructive - what is the paging (sr) rate?
------------------------------------------------------------------------
From: james mularadelis <jamesm@matrix.newpaltz.edu>
Did you check to see what systems are accessing this workstation?

I've had mail come in at peak times and hammer the system. It seems some
mailing lists decide to send stuff to me all at once.

Also, check and see what's going on with the systems that feed off this
workstation. Are there many logins at once? You might be getting
peak loads when several dozen people log in at the same time or within
a very short interval of each other.

We get this problem between classes when several dozen people log in
quickly to check their mail and then log off.
------------------------------------------------------------------------
From: roland@netcom.com (Paul Roland)
Vic,
        Good, you should check process use, but 'network' slowdown can
also be caused by bottlenecks in disks on the server (try iostat with an
interval). Also, swapping can be a problem (perfmeter and/or pstat -T
can help).

        You should check packets, but as a basic check, see if your HUB
(are you using 10-Base-T?) is showing lots of collisions or heavy use.
You may have a broadcasting problem, or a bad port throwing out lots of
trash.
------------------------------------------------------------------------
From: "John A. Murphy" <jam@philabs.research.philips.com>
If you're NOT running NIS there is a bug related to the domain name.
Try varying the domainname from "noname" to "":

domainname ""

or if it is that already, try

domainname "nobody"
------------------------------------------------------------------------
Thanks again!

-- 
***************************************************************
                      Systems Administrator
                      ---------------------
                   Space Sciences Building CRSR
   Mail all system related problems to one of the following:
sysadmin@astrosun.tn.cornell.edu   root@astrosun.tn.cornell.edu          
sysadmin@spacenet.tn.cornell.edu   root@spacenet.tn.cornell.edu
                              or see 
Vic Germani in room 402         germani@astrosun.tn.cornell.edu
***************************************************************


