SUMMARY: NFS write performance too slow

From: Phil Blanchfield (phil@dgbt.doc.ca)
Date: Fri Nov 20 1992 - 06:34:22 CST


Fellow Sun Managers:

        Last week I posted to this list regarding slow NFS write performance
        on SUN systems configured as NFS servers. Here is a snippet to
        jog your memory (the complete request is included with the responses
        below):

        --------------------------------------------------------
        I have been measuring NFS write performance and I have found that
        we rarely get better than 100k bytes/second from client to server.
        Using rcp I get more than 500kB/s, and ftp is often better than 200kB/s.

        As I understand it the NFS protocol requires that NFS writes be
        synchronous. The server has to write the data block to the physical
        disk before it can acknowledge. This understandably will hinder NFS
        write performance; however, 100kB/s still seems much too slow.

Q1. -> So where is the bottle-neck?
        --------------------------------------------------------

Thank You:

From: blymn@awadi.com.AU (Brett Lymn)
From: pla_jfi@pki-nbg.philips.de (Karl-Jose Filler)
From: "(Cengiz Oezcan-Barlach)" <coez@tet.uni-hannover.dbp.de>
From: Charles A Finnell <finnell@portia.mitre.org>
From: scs@lokkur.dexter.mi.us (Steve Simmons)
From: rasesh@il.us.swissbank.com (Rasesh K. Trivedi)
From: poffen@sj.ate.slb.com (Russ Poffenberger)
From: dan@bellcore.com (Daniel Strick)
From: srm@shasta.gvg.tek.com (Steve Maraglia)
From: baumeist@vsun04.ag01.Kodak.com (Hans Baumeister)
From: shj@ultra.com (Steve Jay {Ultra Unix SW Mgr})
From: David Wheable <djw@advanced-robotics-research-centre.salford.ac.uk>
From: todd@flex.eng.mcmaster.ca (Todd Pfaff)
From: ups!upstage!glenn@fourx.Aus.Sun.COM (Glenn Satchell)
From: Alan J. Rothstein <merccap!alan@uunet.UU.NET>
From: jk@leo.tools.de (Juergen Keil)
From: c3314jcl@mercury.nwac.sea06.navy.mil (Johnson Lew)
From: Vesa Halkka <vhalkka@cc.helsinki.fi>
From: Tom Leach <leach@OCE.ORST.EDU>
From: pbg@cs.brown.edu (Peter Galvin)
From: ray@isor.vuw.ac.nz
From: Dave Capshaw <capshaw@asc.slb.com>
From: ups!kevin@fourx.Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})
From: stern@sunne.East.Sun.COM (Hal Stern - NE Area Systems Engineer)

        I received 24 replies altogether. Many people requested a summary.
        Several people suggested Hal Stern's book "Managing NFS & NIS".
        One respondent thought that the question was inappropriate for this
        list. I see his point but since there were quite a few requests for
        a summary I decided to go ahead with it.
        
        Most people suggested that the poor performance is due to:

        1. Synchronous writes of data and inode information to disk at the
           server end. (Resulting in a minimum of 2 disk head seeks per
           transfer)

        2. A fairly small maximum NFS block size in the protocol specification
           (currently 8k bytes). (Resulting in an increase in the number of
           NFS writes for large files)

        3. The NFS client blocks after each NFS write and awaits acknowledgement
           from the server. This causes all of the delays to add up in sequence.

        One respondent "shj@ultra.com (Steve Jay {Ultra Unix SW Mgr})"
        suggested that I test synchronous writes with a test program. I
        took his advice, wrote one (code included below) and found that
        this represents the greatest performance loss. The server
        updates the inode synchronously after each (8k) data block write,
        so there is a lot of disk head movement. Other respondents pointed
        out that with larger files even more seeking is required,
        because the file's blocks are reached through single, double and
        triple indirect blocks (as many as 4 seeks/NFS operation).

        For example, with a 5 MB file, an access time of 10ms, and a single
        inode update per write, this results in:

        (5MB/8KB writes) * 2 (seeks/write) * 10ms = 640 * 20ms = 12.8 seconds

        of pure head movement; that alone limits the transfer to roughly
        409k bytes/second (5MB / 12.8s) before any data is actually written!

        Asynchronous writes transfer blocks of data to the buffer cache, where
        they are later sorted to minimize head movement before being transferred
        to disk. Asynchronous operations also allow a process to overlap
        several I/O operations with each other and with processing, which
        increases the total throughput.
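
        As a rough illustration of the difference (this sketch is mine, not
        one of the original tests; the file name, block count and timing
        method are arbitrary), the program below writes the same 5MB to a
        file twice: once with O_SYNC forcing a flush on every 8k write, and
        once with ordinary buffered writes followed by a single fsync() at
        the end. The data reaches the disk in both cases, but in the second
        case the inode update and the associated seeks are paid for once
        instead of 640 times. (The tsync.c program further down makes a
        similar O_SYNC comparison; this one adds the single-fsync case.)
----------------------
/*
 * osync_vs_fsync.c -- illustrative only: compare per-write O_SYNC with
 * write-behind plus one fsync().  Build with "cc osync_vs_fsync.c".
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define BLOCK   8192            /* mimic the 8k NFS write size   */
#define NBLOCKS 640             /* 640 * 8k = 5MB                */

static double
now(void)
{
        struct timeval tv;

        gettimeofday(&tv, NULL);
        return tv.tv_sec + tv.tv_usec / 1e6;
}

static void
run(const char *path, int use_osync)
{
        static char buf[BLOCK];         /* zero filled, contents don't matter */
        double t0, t1;
        int fd, i;

        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC |
                        (use_osync ? O_SYNC : 0), 0644);
        if (fd == -1) { perror(path); exit(1); }

        t0 = now();
        for (i = 0; i < NBLOCKS; i++)
                if (write(fd, buf, BLOCK) != BLOCK) { perror("write"); exit(1); }
        if (!use_osync)
                fsync(fd);              /* one flush instead of 640 */
        t1 = now();
        close(fd);

        printf("%-28s %6.2f s  %6.0f kB/s\n",
               use_osync ? "O_SYNC on every 8k write:" : "write-behind + one fsync:",
               t1 - t0, (double)NBLOCKS * BLOCK / 1024.0 / (t1 - t0));
}

int
main(int argc, char **argv)
{
        const char *path = (argc > 1) ? argv[1] : "testfile";

        run(path, 1);
        run(path, 0);
        unlink(path);                   /* remove the scratch file */
        return 0;
}
----------------------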

        Write performance would improve if blocks larger than 8K could be
        used with NFS. This is evident in the data which I collected from my
        synchronous write tests (below). However, on a busy network where
        there are lots of collisions, too large a block would degrade
        performance, since NFS uses UDP and would have to retransmit the
        entire block on any failure.
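
        To put a number on the retransmission cost: as the tcpdump trace
        further down shows, each 8k NFS write travels as a single UDP
        datagram of roughly 8.3k (8192 bytes of data plus RPC/NFS headers),
        which IP splits into six Ethernet frames (five fragments of 1480
        bytes plus a final 936-byte fragment). Losing any one of those six
        frames forces the whole 8k datagram to be resent; a larger NFS block
        size would mean proportionally more fragments per datagram, and so a
        higher cost for every lost frame.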

Possible solutions:

        1. Get a Prestoserve board. This is a battery backed-up memory board
           which is written to instead of the disk. It acts as a sort of
           buffer cache for NFS writes. It improves NFS writes dramatically.
           It is available from SUN Microsystems or Legato.

        2. Get eNFS from Interstream. This is a "software only" solution which
           improves NFS write performance by 2-5 times. People claim that it
           is easier to manage, etc. (see the eNFS announcement included below
           for more info)

        3. Use a system other than SUN as an NFS server.
           Auspex servers seem to be good performers with or without their NFS
           accelerator board (SPIII).
           Silicon Graphics (SGI) NFS servers do asynchronous writes to disk
           so their read and write speeds are the same. (see "SGI" stats below)

        4. Hack your SUN kernel with "adb" to force it into doing asynchronous
           (dangerous) writes. Here is a previous sun-managers posting to
           explain how...
        ---------------------------------------------------------------------
        If you have a sun4c and SunOS 4.1.1, you can use the following sh
        script/patch to turn on asynchronous NFS writes on the server for
        *all* NFS clients on the running kernel; rfs_write+258 should contain
        the value 0x96102004 in the original kernel. (I told you above why
        this is a bad idea; NFS write performance is a lot better, though):
         
                #!/bin/sh
                adb -w -k /vmunix /dev/mem <<E_O_F
                rfs_write+258/W 96102000
                E_O_F
        --
        Juergen Keil jk@tools.de ...!{uunet,mcsun}!unido!tools!jk
        --------------------------------------------------------------------
           I tried this on a 670 (sun4m) running SunOS 4.1.2 and my "dd"
           command completed in less than 7 seconds, or about 760kB/s.

           NOTE: Use this at your own peril. If your server crashes and data
                 is lost this can have adverse effects on NFS clients.
                 It is especially bad if your clients swap to a file on the
                 server!

Sources of Information:

        - Managing NFS and NIS by Hal Stern
          ISBN 0-937175-75-7
          O'Reilly & Associates Inc
          103 Morris St. Suite A
          Sebastopol, CA 95472

        - November 92 Sun World article "NFS - the seven year itch"
          by Hal Stern
          Provides insight into some of the current problems and suggests
          some possible solutions.

        - A Fast File System for UNIX(tm)
          by Kirk McKusick, William Joy, Samuel Leffler, Robert Fabry
          Available on ftp.uu.net as:
          /systems/unix/bsd-sources/share/doc/smm/14.fastfs/*

        - The NFS Protocol Specification V3 by SUN Microsystems
          Available on ftp.uu.net as /networking/ip/nfs/NFS.spec.Z

        - Both Auspex and Legato have published papers on this topic
          as well.

        - McVoy's paper in Usenix (Proceedings?) Jan 91 on the SunOS filesystem

Tests:

The following is a simple program which may be used to test synchronous write
speed against asynchronous write speed. The buffer size may be varied to
test that effect as well.
Use "cc -DSYNC ..." to write a 5 Meg file with 1 Meg buffering.
Use "cc -DSYNC -DBUFFER_SIZE=8192 ..." to emulate the "dd" command in my
original request.
----------------------
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>     /* exit() */
#include <unistd.h>     /* write() */

#define MEG (1024 * 1024)
#ifndef BUFFER_SIZE
#define BUFFER_SIZE MEG
#endif
#define FILE_SIZE (5*MEG)

/* static so a 1 Meg buffer doesn't land on the stack; it is also zero
 * filled, like the output of /dev/zero */
static char buffer[BUFFER_SIZE];

int
main(int argc, char **argv)
{
int i,fd;

        if (argc < 2) exit(1);

#ifdef SYNC
        fprintf(stderr,"Opening \"%s\" in synchronous mode buffer size = %d\n",
                                argv[1], BUFFER_SIZE);
        if((fd = open(argv[1],O_RDWR|O_SYNC|O_CREAT,0644)) == -1)
#else /* ASYNC */
        fprintf(stderr,"Opening \"%s\" in asynchronous mode buffer size = %d\n",
                                argv[1], BUFFER_SIZE);
        if((fd = open(argv[1],O_RDWR|O_CREAT,0644)) == -1)
#endif
        {
          fprintf(stderr,"Can't open file\n");
          exit(1);
        }

        /*
         * write in blocks just like:
         * "dd if=/dev/zero of=file bs=1024k count=5"
         */

        for(i=0; i < FILE_SIZE; i+=BUFFER_SIZE)
        {
          if(write(fd,buffer,BUFFER_SIZE) != BUFFER_SIZE)
          {
            fprintf(stderr,"Error writing file\n");
            exit(1);
          }
        }
        exit(0);
}
----------------------

Output from the above program on a 2-processor 670MP with 128MB of memory,
writing to a 2.2GB SCSI disk (Wren IX/ST42100N, 5MB/sec):

% cc tsync.c -DSYNC -DBUFFER_SIZE=8192 -o tsync
% time tsync 5-meg-file
Opening "5-meg-file" in synchronous mode buffer size = 8192
0.340u 1.530s 0:23.78 7.8% 0+92k 1+1282io 0pf+0w
5MB/23.78s = 220kB/s

Vary the buffer size to see the effect:

% cc tsync.c -DSYNC -DBUFFER_SIZE=8192 -o tsync --> 220kB/s
% cc tsync.c -DSYNC -DBUFFER_SIZE=16384 -o tsync --> 294kB/s
% cc tsync.c -DSYNC -DBUFFER_SIZE=32768 -o tsync --> 321kB/s
% cc tsync.c -DSYNC -DBUFFER_SIZE=65536 -o tsync --> 371kB/s
% cc tsync.c -DSYNC -DBUFFER_SIZE=1048576 -o tsync --> 391kB/s

Best possible speed (async mode & proper buffer size):

% cc tsync.c -DBUFFER_SIZE=8192 -o tsync --> 1927kB/s

My original request:

        I have been measuring NFS write performance and I have found that
        we rarely get better than 100k bytes/second from client to server.
        Using rcp I get more than 500kB/s, and ftp is often better than 200kB/s.

        As I understand it the NFS protocol requires that NFS writes be
        synchronous. The server has to write the data block to the physical
        disk before it can acknowledge. This understandably will hinder NFS
        write performance; however, 100kB/s still seems much too slow.

Q1. -> So where is the bottle-neck?

        I have tested our drives by writing huge (100MB) files to them and
        get impressive results: 700-900kB/s for SCSI drives and up to 3500kB/s
        for IPI drives on a 670MP system, so it's not the drives.

        I know that Ethernet is not the bottle-neck either because it is
        capable of just over 1.2MB/sec, which is over 10 times the NFS
        throughput that I am getting. At one point I thought that we
        might be getting too many collisions on our network so I isolated
        a SS2 and a 670MP server on their own network (using a thicknet
        fanout box) and still got the same lousy 100k write performance.
        So the problem is not the Ethernet hardware.

        I have used "spray" to measure the end to end UDP performance and
        this too does not seem to be a problem, throughput is in the 800-900k
        range.

        I thought that perhaps the Jumbo NFS patch (100173-08) would fix
        this but after applying it to both the client and server ends
        the 100kB/s or less write performance persists.

        I know about the Prestoserve NFS accelerator product and the kernel
        mods which make the NFS server do asynchronous NFS writes
        (at your own peril). I don't think that it would do any good to
        increase the number of daemons at either end because this only helps
        whenever you have simultaneous NFS access.

Q2. -> Is this just the "normal" NFS write speed? If so then can someone
        explain why it is so terribly slow?

        I would appreciate if people would try the "dd" script below
        before sending me your solution (you might be surprised).

        The Tests
        ---------

        I have tried both the following "dd" script and a home-spun C
        program and get the same NFS write performance on the listed
        platforms.

        The command & results
        ---------------------

        time dd if=/dev/zero of=5-meg-file bs=1024k count=5
        5+0 records in
        5+0 records out
        0.010u 2.030s 0:55.96 3.6% 0+570k 0+642io 0pf+0w  (55.96 seconds = 94kB/s)

        Notes: The output file "5-meg-file" is an NFS mounted file.

        Client      Server         SUN-OS version
        ------      ------         --------------

        SS2  48MB   670MP 128MB    4.1.2 (Both ends) (With & without patch)
        SS1+ 32MB   630MP  64MB    4.1.2 (Both ends)
        SS1  24MB   IPC    32MB    4.1.1 (Both ends)
        SS2  64MB   SS2    64MB    4.1.2 (Both ends)

        For the most part these are "out-of-the-box" systems with MAXUSERS
        changed to 180(MP) and 225(SS2) in most cases.

The responses: (My comments are marked with "-->")

-----------------------------------------------
From: blymn@awadi.com.AU (Brett Lymn)
-----------------------------------------------

>Q1. -> So where is the bottle-neck?
>

I think it comes from the NFS spec, which says that any NFS write is a
write to disk on the server.

>Q2. -> Is this just the "normal" NFS write speed? If so then can someone
> explain why it is so terribly slow?
>
> I would appreciate if people would try the "dd" script below
> before sending me your solution (you might be surprised).
>

        Client       Server          SUN-OS version
        ------       ------          --------------
        IPC 24Meg    670MP 128Meg    4.1.2            ~82K/sec

Have you seen the book "Managing NFS and NIS" from O'Reilly? It is
written by Hal Stern and has a large section on tuning NFS.

--> Yes, I just bought it, an excellent book, highly recommended.

-----------------------------------------------
From: pla_jfi@pki-nbg.philips.de (Karl-Jose Filler)
-----------------------------------------------

Some opinions and/or experiences on your questions:

Measurements:
SS10 (4.1.3, 32 MB) -> SS2 (64 MB, Seagate Elite II, SCSI)        35 sec == 143 KB/sec
SS10 -> 4/690 (128 MB) + Prestoserve + NC400 boards + IPI disks    9 sec == 555 kB/sec
The machines are running light to medium load (most people went
to lunch, so this is nearly the maximum for our environment).

Opinion on the bottleneck:
NFS is waiting for the disk drive to complete its operation, so
the time adds up as
        local processing +
        transfer through the net +
        server processing +
        transfer to disk +
        server response +
        local processing
in contrast to a locally cached file system.

-----------------------------------------------
From: "(Cengiz Oezcan-Barlach)" <coez@tet.uni-hannover.dbp.de>
-----------------------------------------------

Hi Phil,

between a SS1 server and a SS1 client I got the same performance
with your dd script. Please summarize...

-----------------------------------------------
From: Charles A Finnell <finnell@portia.mitre.org>
-----------------------------------------------

Phil,

Here are my results on a diskless SS2 w/ 32 Mbytes RAM, using NFS mounts from an
SS 4/370 fileserver:

% time dd if=/dev/zero of=5-meg-file bs=1024k count=5
5+0 records in
5+0 records out
0.0u 1.8s 1:29 2% 0+1108k 2+640io 15pf+0w

I'd like to hear of any software changes that can improve this speed, too!

-----------------------------------------------
From: scs@lokkur.dexter.mi.us (Steve Simmons)
-----------------------------------------------

Intending no offense, this doesn't seem appropriate for Sun-Managers.
It's not a high-priority rapid-turnaround trouble report/response.

Now to answer your question -- both Legato and Auspex have published
a variety of white papers on the topic. They'd be a good place to
start. Call reps of either company for more details.

-----------------------------------------------
From: rasesh@il.us.swissbank.com (Rasesh K. Trivedi)
-----------------------------------------------

Besides some of the things you mentioned, the bottleneck is the NFS protocol
itself: since writes are synchronous, the client has to wait until an
acknowledgement is received for the previous write operation before it can
proceed with the next write request. Also, the NFS block size is limited to 8k,
limiting the size of each write operation to 8k. Some discussions suggest that
the 8k blocksize limit should be lifted and made negotiable in future
enhancements to NFS.

-----------------------------------------------
From: poffen@sj.ate.slb.com (Russ Poffenberger)
-----------------------------------------------

Well, it is common knowledge that Suns do not make a great file server. Your
test on a 4.1.3 SunOS client (SS-IPX) writing to an Auspex fileserver yields

time dd if=/dev/zero of=5-meg-file bs=1024k count=5
5+0 records in
5+0 records out
0.0u 2.0s 0:29 7% 0+279k 1+641io 1pf+0w

which is about twice as fast as your benchmark. A Prestoserve (in your case),
or an SPIII (the Auspex equivalent of Prestoserve) will help a lot. I do not have
an SPIII in my Auspex, nor is the filesystem asynchronous; however, the
filesystem is striped across 2 disks to avoid disk bottlenecks. I cannot
guarantee how quiet the ethernet was at the time I ran this test, so there
is likely other traffic as well. Running this on a fast disk (HP C3010 2GB
5400 RPM SCSI-II) on the Auspex with the filesystem marked async yields:

time dd if=/dev/zero of=5-meg-file bs=1024k count=5
5+0 records in
5+0 records out
0.0u 2.0s 0:09 21% 0+278k 0+641io 0pf+0w

which is 6 times faster than your benchmark. You would see this kind of
performance for ALL filesystems on an Auspex with a write cached SPIII.

If you want a good file server, look into an Auspex: very reliable, better
performance than Sun, and the service is outstanding.

Auspex can be reached at (408)492-0900.

-----------------------------------------------
From: dan@bellcore.com (Daniel Strick)
-----------------------------------------------

This is an old problem with Sun's NFS implementation. You can
"fix" it with hardware or with software. The hardware solution
is "Prestoserve". I think you can buy it from Sun. Check with
your sun salesperson. The software solution is "eNFS" from
Interstream. (phone number: (800)677-7876.)

I use the software solution. It is cheaper and much easier to
manage.

-----------------------------------------------
From: dan@bellcore.com (Daniel Strick)
-----------------------------------------------

(Oh, I forgot to explain the problem: the "synchronous" nature of
NFS writes requires that not only the data block but also the
inode block and any indirect blocks be rewritten before the server
can acknowledge the NFS client's request. The sun NFS server
implementation stupidly rewrites the inode and indirect blocks
once for each NFS request. The Interstream version of the
NFS server attempts (with considerable success) to collect
NFS write requests into groups and only does one inode and
indirect block update per group.)
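
--> For illustration only (this is my sketch of the idea at user level, not
--> the eNFS code, which does its grouping inside the kernel NFS server):
--> grouping just means paying the metadata flush once per batch of writes
--> instead of once per write. BLOCK, GROUP and write_grouped() below are
--> made-up names.
----------------------
#include <unistd.h>

#define BLOCK 8192              /* one NFS-sized write                  */
#define GROUP 8                 /* flush once per 8 * 8k = 64k of data  */

/* fd must be open *without* O_SYNC; buf holds BLOCK bytes */
void
write_grouped(int fd, const char *buf, int nblocks)
{
        int i;

        for (i = 0; i < nblocks; i++) {
                (void) write(fd, buf, BLOCK);   /* error checks omitted */
                if ((i + 1) % GROUP == 0)
                        fsync(fd);      /* one inode/indirect update per group */
        }
        fsync(fd);                      /* flush the tail of the last group */
}
----------------------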

-----------------------------------------------
From: srm@shasta.gvg.tek.com (Steve Maraglia)
-----------------------------------------------

I believe you've answered your own question.

        The server has to write the data block to the physical
        disk before it can acknowledge the udp packet to the client.
        
Here are my results from your dd test.
        
Client      Server      OS Version
-------------------------------------------
SS2 48Mb    SS2 32Mb    4.1.2 (Both ends, no NFS patches applied)

time dd if=/dev/zero of=5-meg-file bs=1024k count=5
5+0 records in
5+0 records out

real 0m55.65s
user 0m0.01s
sys 0m1.81s

Results were 95K per second!

I would be interested in the SUMMARY of all the replies.

-----------------------------------------------
From: baumeist@vsun04.ag01.Kodak.com (Hans Baumeister)
-----------------------------------------------

Phil,

I come up with:
0.0u 1.2s 0:05 22% 0+1116k 3+124io 1pf+0w

If I'm reading it right, we're doing 5 Megs in 5 seconds real-time
or 1 Meg/sec....

not to gloat, but... :-)

If you need to know any system setup parameters, please let me know!

Regards,

-----------------------------------------------
From: shj@ultra.com (Steve Jay {Ultra Unix SW Mgr})
-----------------------------------------------

>Q1. -> So where is the bottle-neck?

The bottleneck is the combination of synchronous writes and the 8k block
size used by NFS. Each 8k written from the client to the server is
flushed to disk before the next 8k is sent.

To see what effect this has on disk throughput, try a little test
program which writes 8k blocks, optionally with an O_SYNC in the
open flags. On my IPX, when I write 8k blocks to a file opened
without O_SYNC, I get well over 1 Mbyte/sec. But, if I add O_SYNC
to the open call, it slows down to about 130 Kbyte/sec. So, even
on writes to local disk, the combination of 8k blocks and synchronous
writes slows things down to the 100K range.

Answering 2 obvious questions:

1. No, there is no way to make the block size larger than 8k.

2. Unlike several other vendors (such as SGI), Sun has refused to
    implement an "async write" option to NFS. However, a Prestoserve
    board on the server will accomplish the same thing.

-----------------------------------------------
From: David Wheable <djw@advanced-robotics-research-centre.salford.ac.uk>
-----------------------------------------------

You're probably going to get lots of replies like this!

I don't know the problem, but I just tried your script on my setup: a 670MP, 64Mb,
4 x 1.3Gb SCSI disks and a Prestoserve card. Timed on an IPC.

I did three runs ...
0.0u 3.5s 0:11 31% 0+1140k 2+642io 2pf+0w
0.0u 3.5s 0:10 34% 0+1112k 3+642io 3pf+0w
0.0u 3.5s 0:10 33% 0+1112k 0+641io 0pf+0w

I make that 465k, 512k and 512k

Running the same on the 670 gets me...
0.5u 1.2s 0:02 67% 0+1012k 0+104io 0pf+0w

which is about 2.5M a second. I tried this because the prestoserve should
accelerate local file systems as well as NFS.

Don't know if this helps at all.

-----------------------------------------------
From: todd@flex.eng.mcmaster.ca (Todd Pfaff)
-----------------------------------------------

Phil,

I tried your dd test between various clients and servers:

client> cd /server-nfs-file-system
client> time dd if=/dev/zero of=5-meg-file bs=1024k count=5
5+0 records in
5+0 records out
        
client              server              time       KB/s
--------------------------------------------------------------------
670MP/2 96MB        SS2 32MB            1:10.49      74
                    470 64MB            1:06.00      79
                    SGI Crimson 80MB    0:09.28     565
SS2 32MB            670MP/2 96MB        1:19.28      66
                    470 64MB            1:08.58      75
                    SGI Crimson 80MB    0:07.98     657
SGI Crimson 80MB    670MP/2 96MB        0:59.83      88
                    470 64MB            1:55.91      45
                    SS2 32MB            1:00.88      86

All server filesystems are SCSI disk partitions.

As a matter of comparison, on a local filesystem on the 670MP the
write takes 0:05.69 seconds (~900KB/s).

On a local filesystem on the SGI Crimson the write takes 0:00.48.

This is very interesting. Why do you suppose the SGI Crimson is an
order of magnitude faster? Could it have something to do with the
fact that this test is only writing zeros?

--> No, it is likely due to more memory bandwidth on the Crimson. Both
--> systems have much more than 5MB so on a local disk the writes would
--> go straight to cache. Your 670 may have been busy at the time too.
--> I get almost 2MB/sec to a 2.2GB SCSI (ST42100) on our 670, but
--> again that is going to be straight to cache.

Please summarize these results to sun-managers so we can see if anyone
else has any insight into why there is such a difference.

-----------------------------------------------
From: ups!upstage!glenn@fourx.Aus.Sun.COM (Glenn Satchell)
-----------------------------------------------

Hi Phil,

I don't have a network here to test your script on, but I'd like to
make three suggestions:

1. Get a copy of Hal Stern's NFS & NIS - there is a very good section
on performance tuning and theory of nfs to help you on your way.

> For the most part these are "out-of-the-box" systems with MAXUSERS
> changed to 180(MP) and 225(SS2) in most cases.

2. Why are these values so large? Even the busiest servers I have seen
get by with a MAXUSERS of 64. This allows NPROC, the number of
processes, to be over 1000. By making the value of maxusers so large
you *reduce* the amount of kernel memory available for disk buffers,
etc. This is something that can hurt nfs performance.

--> Good point, but since I have 128MB it wouldn't make much of a difference.
--> Besides, I was measuring NFS write performance, which doesn't use the
--> buffer cache on either system.

3. You don't quote any figures from nfsstat; it would be useful to run
nfsstat on each system before and after the 5-meg-file test to see the
difference. That will tell you how many collisions there were, etc. Are there
any links in the filesystem? Those can cause extra nfs lookups too.
Hal's book explains how to analyse the numbers you get.

--> I should have mentioned this. Collisions were the first thing that
--> I looked for but they were less than 5%. I also isolated a server
--> and workstation on a network of their own and still got the same results
--> with zero collisions. Symbolic links are not a problem in this particular
--> case.

Good luck with your problem.

regards,

-----------------------------------------------
From: Alan J. Rothstein <merccap!alan@uunet.UU.NET>
-----------------------------------------------

I don't know where the bottleneck is, but there is a company that
provides a solution called eNFS that is supposed to correct just what
you have found. What follows is an old SunFlash that describes the
product.

----------------------------------------------------------------------------
                                                        The Florida SunFlash

           Third Party: New Version Of eNFS Introduced

SunFLASH Vol 34 #20 October 1991
----------------------------------------------------------------------------
This is a third party product announcement from
bjd@interstream.com (Bruce J. DaCosta) -johnj
----------------------------------------------------------------------------

(PITTSBURGH, PA) INTERSTREAM, Inc. introduced the latest version of its eNFS
product, which boosts NFS write performance 2 to 5 times. "This new version
of the software includes our new feature, eNFS/Display, which allows the user
to see how eNFS improves their NFS server," states Bruce J. DaCosta, president
of INTERSTREAM.

The graphical display consists of five panels, which can give information
about the write mix performed by their NFS server. The graphs display
information at user selected time intervals, giving the user a dynamic view
of the system. One panel displays the write efficiency, which gives
a quantitative comparison of eNFS to normal NFS. Another panel shows the
mixture of writes on the server. Sequential, random, diskless client and
total writes are displayed.

INTERSTREAM also announced an aggressive pricing schedule for cross-mounted
NFS systems. "Our former pricing schedule did not accommodate cross-mounted
systems and their need for more than one copy of eNFS. The new schedule
takes this need into account," says President Bruce J. DaCosta.
The new pricing ranges from $995.00 to $350.00 for desktop servers
and $1995 to $1,200 for traditional file servers.

eNFS dynamically loads on Sun Microsystems workstations in 10 minutes,
comes with a 30-day money-back guarantee and one year maintenance agreement.

For more information, contact INTERSTREAM at 1-800-677-7876.

INTERSTREAM, Inc.
1501 Reedsdale Street
Pittsburgh, PA 15233
(412) 323-8000
(412) 323-1930 FAX
bjd@interstream.com

NFS is a registered trademark of Sun Microsystems, Inc.
The name eNFS is exclusively licensed to INTERSTREAM, Inc.
INTERSTREAM is a licensee of Sun's ONC/NFS trademark.
-----------------------------------------------
From: jk@leo.tools.de (Juergen Keil)
-----------------------------------------------

> synchronous. The server has to write the data block to the physical
> disk before it can acknowledge. This understandably will hinder NFS
> write performance, however, 100k still seems much too slow.

Not only the data block, but also the inode and all the indirect blocks
used for the file! So you have 2+ synchronous writes for each NFS write
request! You may have to seek a bit to move the head from the data area
to the inode area of a cylinder group!

For your test I get:

   Client     Server     SUN-OS version               Time
   ------     ------     --------------               ----
   SS2 16MB   SS2 32MB   4.1.1 (Both ends)            64.9 sec = 93.2 Kbytes/sec
   SS2 16MB   SS2 32MB   4.1.1 (Server with patch)     7.9 sec = 648 Kbytes/sec

-----------------------------------------------
From: c3314jcl@mercury.nwac.sea06.navy.mil (Johnson Lew)
-----------------------------------------------

This is an interesting problem. Can you post your solutions?
Thanks in advance.

-----------------------------------------------
From: Vesa Halkka <vhalkka@cc.helsinki.fi>
-----------------------------------------------

I recorded some NFS conversation between an SS1 (serifos, light load) and
an SS2 (klaava, load about 4, with mthreads, which builds news threads,
running).

I used your dd line to a fast SCSI disk on klaava.
The dd line, when run locally, took about 2.5 seconds (2.5Mb/s),
which is about the write speed you can get from a
Seagate Elite2 disk.

0.1 seconds to start it:

00:20:54.599166 serifos.d72e2 > klaava.nfs: 120 getattr
00:20:54.634661 klaava.nfs > serifos.d72e2: reply ok 96
00:20:54.636711 serifos.d72e3 > klaava.nfs: 136 lookup
00:20:54.647411 klaava.nfs > serifos.d72e3: reply ok 28
00:20:54.649224 serifos.d72e4 > klaava.nfs: 136 lookup
00:20:54.651338 klaava.nfs > serifos.d72e4: reply ok 28
00:20:54.653891 serifos.d72e5 > klaava.nfs: 168 create
00:20:54.698776 klaava.nfs > serifos.d72e5: reply ok 128
00:20:54.701175 serifos.d72e7 > klaava.nfs: 120 getattr
00:20:54.702822 klaava.nfs > serifos.d72e7: reply ok 96

NFS writes in 8k blocks:

00:20:55.143215 serifos.d72e8 > klaava.nfs: 1472 write (frag
bdb5:1480@0+)
00:20:55.144350 serifos > klaava: (frag bdb5:1480@1480+)
00:20:55.145716 serifos > klaava: (frag bdb5:1480@2960+)
00:20:55.146923 serifos > klaava: (frag bdb5:1480@4440+)
00:20:55.148254 serifos > klaava: (frag bdb5:1480@5920+)
00:20:55.149627 serifos > klaava: (frag bdb5:936@7400)

00:20:55.150701 serifos.d72e9 > klaava.nfs: 1472 write (frag
bdb6:1480@0+)
00:20:55.151868 serifos > klaava: (frag bdb6:1480@1480+)
00:20:55.153092 serifos > klaava: (frag bdb6:1480@2960+)
00:20:55.154425 serifos > klaava: (frag bdb6:1480@4440+)
00:20:55.155671 serifos > klaava: (frag bdb6:1480@5920+)
00:20:55.156452 serifos > klaava: (frag bdb6:936@7400)

which goes pretty well, 8k in 0.006 seconds, you can't do
much faster..

BUT.. after some time it stops and waits for acknowledgement:

00:20:55.445051 klaava.nfs > serifos.d72ee: reply ok 96
00:20:55.453404 serifos.d72f3 > klaava.nfs: 1472 write (frag bdc0:1480@0+)
00:20:55.454685 serifos > klaava: (frag bdc0:1480@1480+)
00:20:55.455866 serifos > klaava: (frag bdc0:1480@2960+)
00:20:55.457136 serifos > klaava: (frag bdc0:1480@4440+)
00:20:55.458401 serifos > klaava: (frag bdc0:1480@5920+)
00:20:55.459145 serifos > klaava: (frag bdc0:936@7400)

00:20:55.478350 klaava.nfs > serifos.d72ef: reply ok 96
00:20:55.511798 klaava.nfs > serifos.d72f0: reply ok 96
00:20:55.533803 klaava.nfs > serifos.d72f1: reply ok 96
00:20:55.557383 klaava.nfs > serifos.d72f2: reply ok 96
00:20:55.589449 klaava.nfs > serifos.d72f3: reply ok 96

00:20:55.608254 serifos.d72f4 > klaava.nfs: 1472 write (frag
bdc1:1480@0+)
00:20:55.609558 serifos > klaava: (frag bdc1:1480@1480+)
00:20:55.610877 serifos > klaava: (frag bdc1:1480@2960+)
00:20:55.612008 serifos > klaava: (frag bdc1:1480@4440+)

But you knew this.. I will try to get a whole file
transfer to a tcpdump file. But not 5megs in this
format ;-)

If it contains anything interesting I'll send you a mail.

-----------------------------------------------
From: Tom Leach <leach@OCE.ORST.EDU>
-----------------------------------------------

Phil
I don't know if it accounts for the slow write rates, but NFS packets seem
to be quite a bit smaller than they could be. Of a possible MTU of 1500
bytes for TCP/IP, NFS looks like it's using packet sizes of 64-80 bytes.
I think that the TCP/IP header is approx 64 bytes, so that would immediately
decrease any ethernet performance by a significant amount.
(This is all off the top of my head, so the MTU and header size are approximate.
The NFS size is what I get from my (97% NFS traffic) net with traffic.)

-----------------------------------------------
From: pbg@cs.brown.edu (Peter Galvin)
-----------------------------------------------

As you say, rcp is allowed to write to cache and then to disk
(asynchronously), while NFS isn't (on Suns at least....SGI runs in
"dangerous mode" and does asynchronous writes even over NFS).

With a 500kb/sec rate to memory, I'm not surprised at all at 100kb/sec
to disk. Remember it has to update the inode information after each
write so there is at least some seeking after every block (or cluster
of blocks, really). An empty disk might do better because it could
find larger clusters to write to. (See McVoy's paper in Usenix Jan 91
for info on the Sun file system.)

A solution would be to use a prestoserve board in the server to cache
the writes in static memory.

-----------------------------------------------
From: ray@isor.vuw.ac.nz
-----------------------------------------------

Here are some results I get for the command:
time dd if=/dev/zero of=5-meg-file bs=1024k count=5

1. SS2 -> local 1.3GB 0.0u 1.3s 0:03.68 36.6% 0+553k 9+230io 2pf+0w
        ditto 0.0u 1.3s 0:03.32 41.5% 0+523k 0+190io 0pf+0w

2. SS2 -> SS2 1.3GB 0.0u 1.7s 0:42.34 4.1% 0+554k 4+641io 1pf+0w
        ditto 0.0u 1.8s 0:42.82 4.2% 0+526k 0+641io 0pf+0w

3. SS2 -> local 1.3GB (Presto) 0.0u 1.3s 0:02.51 52.9% 0+525k 4+111io 0pf+0w
        ditto 0.0u 1.3s 0:02.31 59.7% 0+522k 0+108io 0pf+0w

4. SS2 -> SS2 1.3GB (Presto) 0.0u 1.7s 0:10.48 16.2% 0+555k 2+641io 0pf+0w
        ditto 0.0u 1.7s 0:09.37 19.3% 0+533k 0+641io 0pf+0w

5. SS1 -> local 669MB 0.0u 3.0s 0:06.38 47.6% 0+543k 0+99io 3pf+0w
        ditto 0.0u 3.0s 0:05.17 59.5% 0+547k 0+98io 0pf+0w

6. SS1 -> SS2 1.3GB 0.0u 3.5s 0:42.46 8.4% 0+547k 0+641io 7pf+0w
        ditto 0.0u 3.5s 0:43.16 8.2% 0+545k 0+641io 26pf+0w

7. SS1 -> SS2 1.3GB (Presto) 0.0u 3.8s 0:11.12 34.4% 0+547k 1+641io 15pf+0w
        ditto 0.0u 3.6s 0:10.32 35.6% 0+550k 9+641io 21pf+0w

8. SS2 -> local 207MB 0.0u 1.2s 0:05.86 22.0% 0+549k 15+123io 2pf+0w
        ditto 0.0u 1.3s 0:05.77 24.2% 0+517k 0+113io 0pf+0w

9. SS1 -> SS2 207MB 0.0u 3.5s 1:26.17 4.1% 0+543k 0+641io 17pf+0w
        ditto 0.0u 3.6s 1:27.10 4.1% 0+550k 0+641io 17pf+0w

10. SS2 -> SS2 207MB 0.0u 1.6s 1:21.50 2.0% 0+553k 2+641io 0pf+0w
        ditto 0.0u 1.7s 1:22.09 2.1% 0+518k 1+641io 0pf+0w

11. IPC -> local 207MB 0.0u 2.6s 0:07.60 34.7% 0+531k 0+139io 2pf+0w
        ditto 0.0u 2.6s 0:06.28 42.8% 0+536k 0+145io 0pf+0w

I think it is fairly obvious from these numbers that it is the receiving
system that is the bottleneck, and this is caused by the number of IOs
that are occurring. Presumably the NFS traffic is split into smaller
8Kbyte blocks, each of which requires an I/O.

I would like to see a summary of your findings.

-----------------------------------------------
From: Dave Capshaw <capshaw@asc.slb.com>
-----------------------------------------------

I think that you will find that NFS write performance is approximated by the
product of pipeline size and inverse transaction time. For example, a 32KB
pipeline (a 4-biod client with an 8KB wsize) and a 200 ms operation time gives
160 KB/s. Devices like the Prestoserve work by reducing the operation time. Our
Auspex file server configured with the asynchronous write option gives write
rates like 900 KB/s, which is closer to the limit imposed by Ethernet.
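
--> To spell his model out:
-->
-->     write rate ~= (outstanding requests * wsize) / NFS write time
-->                 = (4 biods * 8KB) / 200ms = 32KB / 0.2s = 160KB/s
-->
--> so the rate goes up either by keeping more writes outstanding (a deeper
--> pipeline) or by shortening the per-write time, which is what Prestoserve,
--> eNFS and asynchronous writes all do.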

-----------------------------------------------
From: ups!kevin@fourx.Aus.Sun.COM (Kevin Sheehan {Consulting Poster Child})
-----------------------------------------------

rcp, ftp and the like don't have to wait for the whole round trip. Read
the Legato papers on this problem; it's exactly what they studied in
creating the Prestoserve board.

-----------------------------------------------
From: stern@sunne.East.Sun.COM (Hal Stern - NE Area Systems Engineer)
-----------------------------------------------

(a) if you're writing large files, note that you may do 2-4 writes
        per NFS operation:
        1 8k data buffer
        1 inode update (length/mod time)
        1 indirect block update
        1 double indirect block update (for really large files)

        since you're doing 100M files, each write after the
        first 2M or so kicks off 4 synchronous disk operations.
        that's most of the problem: that's about the difference
        you see between rcp and NFS.

(b) presto DOES NOT DO ASYNC writes. the writes are still synchronous
        and "safe". they're just done to battery-backed memory,
        which gets dumped to disk when it fills up. the data is
        committed to a stable storage device before the RPC reply
        is sent, which makes the writes sync. but because the
        inode, direct block and indirect block can be cached,
        you do serious write elimination - only one 64k write
        leaves the cache, instead of 8 * 4 writes = 32 writes
        for the same amount of data w/o presto

--hal

-- 
Phil Blanchfield
The Communications Research Centre 3701 Carling Avenue, Ottawa Ontario CANADA
Internet: phil@dgbt.doc.ca


