SUMMARY: NFS File copies mysteriously slow down

From: Granzow, Doug (NCI) <granzowd_at_mail.nih.gov>
Date: Mon Feb 11 2002 - 14:54:13 EST
Thanks to Don Mies and Jay Lessert for their suggestions.  I have narrowed
down the cause of the slow down but I haven't entirely solved the problem.
I'm still hoping someone might have some good information about NFS to help
me out.  See below for a further description of my problem.

Don reminded me of the auto-negotiation problem that sometimes occurs on
network interfaces (especially with Suns connected to Cisco equipment).  I
checked all of the NICs and switch ports, and everything was configured for
1000 Mbps, full duplex.
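
(For anyone who wants to verify this directly, the driver can be queried
with ndd.  This is only a sketch for a "ge" interface; the device name and
parameter names vary by driver, so check the driver's man page:)

# Point ndd at the right interface instance, then read back what the
# link actually negotiated.  (Parameter names differ between the hme,
# qfe, ge, and ce drivers.)
ndd -set /dev/ge instance 0
ndd -get /dev/ge link_speed    # negotiated speed
ndd -get /dev/ge link_mode     # 1 = full duplex, 0 = half on most drivers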

Jay wrote me with this:

>You don't give us any details about the file systems on the destination.
>
>If the destination file system is generic, default Solaris ufs with no
>NVRAM cache used in the disk controllers, the operation you describe
>will depend on the data:
>
>    - If small numbers of very large files: very fast.
>
>    - If large numbers of very small files: very, very slow.
>
>In this case the culprit is synchronous file creation, and you can't
>even fill a 10BaseT pipe.
>
>One solution is fastfs (http://www.science.uva.nl/pub/solaris/fastfs.c.gz),
>which would be the very fastest, but requires some knowledge and care on
>the part of the admin.  You would run in fast mode only for the duration
>of the copy.
>
>Another solution is to turn on logging in the destination vfstab.
>
>If, on the other hand, you're already running VxFS and the controller
>has 256MB of NVRAM, then I'm out of ideas.  I assume you've checked out
>the patch situation on both boxes already.

Good information, but as it turns out my destination filesystem is VxFS and
the dual controllers both have 256MB of NVRAM.
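
(For anyone on plain ufs, the logging change Jay mentions is just a mount
option in the /etc/vfstab entry for the destination filesystem.  The device
names below are made up:

# /etc/vfstab -- the last field is the mount options; "logging" enables ufs logging
/dev/dsk/c1t0d0s6  /dev/rdsk/c1t0d0s6  /export/data  ufs  2  yes  logging

fastfs, as I read the source, is toggled per filesystem -- "fastfs
/export/data fast" before the copy and "fastfs /export/data slow" after --
but neither applies to VxFS in my case.)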

A key point in my original message was at the end, where iostat -x showed
that the nfs "device" was 100% busy.  (The source is an nfs-mounted volume.)
I checked the NFS server and it was running with "nfsd -a 64".  I restarted
nfsd with "nfsd -a 128" and saw an immediate, significant speed increase.  I
also saw iostat -x start to show the destination drives pegged at 100% busy,
which is what I would expect (writes should be slower than reads).

But, after a while, iostat showed nfs hitting 100%b again, and the speed of
copies dropped significantly again.  I'm now running "nfsd -a 256" and I'm
wondering what this should be set to.  I found some Sun documentation that
suggested using 16 threads per 10 Mbps of bandwidth, which would be "nfsd -a
1600" for a gigabit network.  That's a big jump from the original 64 though.


So the questions I have at this point are:

- Is there a downside to a high number of NFS threads?  (Memory usage, etc.)
Would "nfsd -a 1600" be reasonable or is that too high?  (It is an E420R
with 4 CPUs and 4 GB memory)

- Is there a way to tell how many NFS threads are in use?  (A possible rough
check is sketched after this list.)

- If 128 threads were not enough, why did it take over an hour for NFS to hit
100% busy?  (The nfs server is currently being used *only* for this copy
operation.)  Is something else going on here?  Maybe idle threads are not
being released quickly enough?  Is there a timeout parameter that can be
tuned for this?

- Anyone know of some good up-to-date information on NFS performance tuning?
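
(On the thread-count question, the closest thing I've found is to count the
LWPs in the nfsd process on the server -- I'm assuming each kernel service
thread shows up as an LWP, and that ps on your release supports the nlwp
keyword -- plus nfsstat for the call rates:)

# On the NFS server: approximate the number of service threads by the
# LWP count of the nfsd process (assumption: one LWP per thread).
ps -o nlwp= -p `pgrep nfsd`
# Server-side RPC and NFS operation counters, to watch the rate of work:
nfsstat -s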

I'll re-summarize.  Thanks. :)
Doug




-----------------
Original message:

I am trying to copy a large amount of data (about 500 GB) from a Solaris 7
server ('servera') with 3 A1000s to a Solaris 8 server ('serverb') with a
Compaq storage array.  I am doing this by NFS mounting the file systems from
servera on serverb over a gigabit ethernet link -- both servers plugged into
the same switch.
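
(The mounts themselves are ordinary NFSv3 mounts; the paths and options
below are illustrative rather than the exact ones I used:)

# On serverb: mount a filesystem exported by servera across the gigabit
# link.  32 KB rsize/wsize is the usual NFSv3 default on Solaris.
mount -F nfs -o vers=3,rsize=32768,wsize=32768 servera:/export/data1 /mnt/data1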

I start up several cpio commands to copy the files from the nfs mounted
filesystems to their destination on the Compaq array.  Initially "netstat -i
60" shows over 300,000 packets per minute going across the wire (there is no
other network activity, just the nfs traffic and my ssh sessions to run the
cpio commands).  I left it to run overnight, and this morning "netstat -i
60" is showing 30,000 - 40,000 packets per minute -- a 90% decrease.  (Also,
iostat -x 15 on servera showed anywhere from 6,000 - 9,000 kr/s yesterday,
and this morning shows about 600 - 700 kr/s.)  None of the cpio commands I
started have finished.  They have not stalled either, but they are now
running very slowly.
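
(Each copy is the usual find piped into cpio in pass-through mode, roughly
like this -- the paths are placeholders:)

# On serverb: copy one NFS-mounted tree onto the Compaq array, creating
# directories as needed and preserving modification times.
cd /mnt/data1
find . -depth -print | cpio -pdm /array/data1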

Both servers have load averages below 0.10 (both have 4 CPUs), and top shows
CPU as > 90% idle.  iostat -x on both servers shows %b < 10 for all devices
*except* for the following on serverb:

                  extended device statistics
device       r/s    w/s   kr/s   kw/s wait actv  svc_t  %w  %b
nfs1         0.0    0.0    0.0    0.0  0.0  0.0    0.0   0   0
nfs2         0.9    0.0   23.5    0.0  0.0  0.9  942.1   0  46
nfs3         2.2    0.0   67.6    0.0  0.0  1.4  619.1   0  64
nfs4        18.7    0.0  559.9    0.0  8.6 12.4 1128.2  94 100

So nfs appears to be holding things up (100%b, svc_t over 1000, wait is
8.6), but why?  nfs does not appear to be fully utilizing CPU, disk, or
network, so what is slowing it down?  Is there anything I can do to get this
back up to the speeds I was seeing when it started?
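
(One more data point that might be useful in diagnosing this: the
client-side view of each mount, which includes retransmission counts and
smoothed round-trip times, e.g.:)

# On serverb (the NFS client): per-mount options and timing statistics,
# plus the overall client RPC counters (retrans, badxid, timeouts).
nfsstat -m
nfsstat -c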

Will summarize, thanks in advance.

Doug
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers