SUMMARY II: NFS Timeout

From: Lau, Victoria H (vlau@msmail2.hac.com)
Date: Wed Jun 04 1997 - 15:20:35 CDT


After I posted a summary on NFS Timeout, I received three more
responses that may be helpful to those of us who are dealing with
the NFS problem. I'm posting the new responses here and attaching
the old summary below.

================
From David Robson:
I HAVE EXACTLY THE SAME PROBLEM! 8-0

I'm disappointed that no one has a fix, but at least I'm not alone! :-)

The difference is I have HP-UX clients rather than AIX, but I also have an
Ultra.
Ours is a 2200 running 2.5.1. I also run Online: DiskSuite on an SSA in RAID 5.

One of the problems that we worked out with Sun is that because the clients
use NFS V2, each mount uses up to 5 threads! That's why you have to run
nfsd with more than 16 threads! Of course they're usually not all hit at the
same time, but when the network is busy it can cause problems - or so I
thought. But I've noticed these errors logged in the middle of the night when
only backups are running...
We have a Cisco 3200 sitting between the 10 Mbit clients and the 100 Mbit link
to the Ultra. I have a suspicion that it may be part of the problem. I'm
chasing my vendor to tell me if a software upgrade for the Cisco is required.
(Sun did hint that they had a similar problem in "the lab".)

Hope to hear good news about this soon!

PS Why can't the other vendors run NFS V3 and save a _lot_ of traffic! :-(
================
From Mark A. Baldwin:
Is it possible that you have Solstice DiskSuite installed on this machine?
If so, you may want to look into patch 104172-05. This is actually
a 2.4 patch, but we had the same problem and Sun advised us to apply this
patch. It seems to be working.

[No, I'm not running Solstice.]
================
From Larry Williamson:
I get this same problem quite frequently. It always happens between
only two of our machines, a Sparc 10 and an E3000.

These machines are both connected to the same 100BaseT hub. The Sparc
10 has an hme card (forget just which one, but I can look).

The Sparc 10 is our NIS+ server.

It has not happened now for more than a week. I don't remember for
sure whether it is always the Sparc 10 or the E3000 (but I think it is
the E3000) that is "not responding", but I am quite certain it is always
the same one.

When it happens, I find the only thing that works is to reboot the
machine that is 'not responding'. This is why I think it is the
E3000, because the sparc 10 was just this week rebooted for the first
time in about 80 days.

I have only one other machine on this same 100BaseT hub (a sparc 20),
it is not an nfs server, but uses the nfs services of both the sparc
10 and the E3000. It never complains about the nfs server not
responding. Neither do any of the other couple of dozen sparc machines
here (although they are all on switching 10BaseT hubs).

I cannot offer any solution yet, but next time it happens I will pay
more attention. Maybe a real solution can be found.
================
Original summary:

It has been a month since I posted the following "NFS Timeout" question.
The problem still exists today, but intermittently. I had both network
and hardware personnel involved, checking all hardware and taking
some traffic-makers off the net. That helped some, but the problem
always comes back to haunt me after a few days or a week.

I did not mention in my original post that the file copies were
between a local file system and an AIX (4.2) NFS-mounted file
system. This AIX NFS-mounted file system is the home file
system for all users in the project. Since these two OSs run different
versions of nfs (Solaris v3, AIX v2), rlogin and cp from/to these
systems react differently because they do not use the same protocols.

Based on all the responses, I added/changed the following files and
patches:

/etc/auto_direct (clients):
/sun_local -rw,intr stfsun1:/sun_local

/etc/init.d/inetinit (both server and clients):
#raise the maximum number of pending tcp connection requests
#(the listen backlog) from 32 to 1024
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max 1024
#raise the tcp send/receive buffer high-water marks and the
#maximum congestion window
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_cwnd_max 65534
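
To confirm that the ndd settings above actually took effect after a
reboot, something like the following sketch may help. It only reads
the four parameters named above. The function name and the NDD
override variable are my own, added for illustration, and the syntax
assumes a POSIX-style shell such as ksh rather than the old /bin/sh.

```shell
#!/bin/sh
# Print each tunable set in inetinit together with its current value.
# NDD can be overridden so the function can be tried without Solaris.
NDD=${NDD:-/usr/sbin/ndd}

show_tcp_tunables() {
    for param in tcp_conn_req_max tcp_xmit_hiwat tcp_recv_hiwat tcp_cwnd_max; do
        # "ndd <driver> <parameter>" reads a value without changing it.
        value=$("$NDD" /dev/tcp "$param") || return 1
        echo "$param = $value"
    done
}
```

Run show_tcp_tunables as root on both the server and the clients and
compare the output with the values in inetinit.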

Added the following patches:
- 2.5.1 recommended (includes 103582-10 tcp)
- 103903-02 (le-for sun4m only)
- 104166-01 (nfs)
- 104212-03 (hme-for sun4u only)
- 104672-02 (nfs)

Credits:
=======
I sincerely thank the following Sun-Managers for helping me with the
problem, especially Justin Young, who continuously supported me with
new ideas:
- Peter Marelas
- Glenn Satchell
- Stuart Little
- Justin Young
- D. Ellen March
- Marc S. Gibian
- Wendy Mullett
- Karl E. Vogel
- Marcos Campos de
- G. Bhaskar

Original Question:
=================
We have an Ultra II (stfsun1) running Solaris 2.5.1, serving as our
nfs file server, exporting a file system to all hosts. The entry
in the dfstab is:

share -F nfs -o anon=0 /sun_local

On the Sun clients, also running Solaris 2.5.1, we have automounted
this file system as follows:

/etc/auto_master: /- /etc/auto_direct -intr
/etc/auto_direct: /sun_local -rw,intr,timeo=20 stfsun1:/sun_local

We have no problems accessing stfsun1 from all the clients (rlogin,
rsh, etc.). But, whenever we copy files to/from /sun_local
on the clients, the following messages appear both on the server
and on the client:

NFS server stfsun1 not responding still trying
NFS server stfsun1 ok

This goes on for a long time, slowing the copying process from seconds
to even hours. Where do I start troubleshooting? If this is a hardware
issue, why don't I see the above message when I rlogin to this server
from the same client? I have no problem rlogging in to the server
from the client and editing files all day, as long as I don't copy files
from or to /sun_local on the client.
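
One possible starting point for troubleshooting is to see when the
"not responding" messages cluster, for example whether they line up
with backups or busy hours. The sketch below tallies them per hour of
day. It assumes the usual syslog line layout ("Mon DD HH:MM:SS host
..."), so the time is the third field. The function name is made up
for illustration.

```shell
#!/bin/sh
# Count "not responding" messages per hour of day from syslog-style input.
# Assumes the time of day is the third whitespace-separated field.
nfs_timeouts_by_hour() {
    awk '/NFS server .* not responding/ {
             split($3, t, ":")      # t[1] is the hour
             count[t[1]]++
         }
         END { for (h in count) print h, count[h] }' | sort
}
```

Typical use on a Solaris client would be:
nfs_timeouts_by_hour < /var/adm/messages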

Responses:
=========
I'd say it's using TCP for NFS which is a connection-orientated transport.
The default timeo for TCP is 100 tenths of a second, and 11 for UDP.
You're setting it to 20 for TCP, which is 1/5th of the default.
I would remove "timeo=20".
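
Since timeo is given in tenths of a second, a small helper makes the
numbers in this reply easier to compare. This is only an illustrative
sketch. The function name is invented, and the 100 and 11 figures are
the defaults quoted in the reply, not values I have checked.

```shell
#!/bin/sh
# Convert an NFS timeo value (tenths of a second) to seconds.
timeo_seconds() {
    echo "$(( $1 / 10 )).$(( $1 % 10 ))"
}
```

With it, timeo_seconds 20 gives 2.0 seconds, against 10.0 for the
quoted TCP default of 100 and 1.1 for the quoted UDP default of 11.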
================
It can be a hardware problem because rlogin will typically use very
small packets while NFS uses large packets. Maybe you have a hub or
slow clients that can't keep up with the fast Ultra?
================
As a matter of interest, have you ensured that
stfsun1 doesn't try to mount /sun_local? I don't know
what 2.5.1 does since I've not tried it, but mounting
server-exported filesystems on the server at the same mount
point used to cause problems on 2.3/2.4.

If you haven't then one fix is to add
+/etc/auto_null in /etc/auto_master before
anything else then an entry of the form

/sun_local -null
in auto_null.

Either that or always mount exported filesystems somewhere
different to the exported path.

I also assume that stfsun1 was restarted, or at least
/etc/rc3.d/S??nfs.server ran if this was the first entry in
dfstab for stfsun1.

How many clients do you have? If a lot then change /etc/rc3.d/S15nfs.server
to start more nfsds. Look for /usr/lib/nfs/nfsd -a <number> and change the
number to say, 2 * number of active clients.
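
As a rough sketch of that rule of thumb, the function below doubles
the number of active clients and never goes below a floor. The floor
of 16 is an assumption (taken here as the stock value in the startup
script), and the function name is invented for illustration.

```shell
#!/bin/sh
# Suggest an nfsd thread count: twice the number of active clients,
# but never below a floor (default 16, an assumed stock value).
nfsd_threads() {
    clients=$1
    floor=${2:-16}
    n=$(( clients * 2 ))
    if [ "$n" -lt "$floor" ]; then
        n=$floor
    fi
    echo "$n"
}
```

For 30 active clients, echo "/usr/lib/nfs/nfsd -a $(nfsd_threads 30)"
prints the line to put in the startup script.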
================
#1) Make sure you've patched your server with the tcp patch and hme patches
from sunsolve. In addition, do the recommended kernel patches, etc.
tcp patch 103582-10
hme patch 104212-03
nfs patch(es) 103600-12,104672-02,104166-01

#2) Add the following lines to /etc/rc2.d/S69inet
#raise the maximum number of pending tcp connection requests
#(the listen backlog) from 32 to 1024
/usr/sbin/ndd -set /dev/tcp tcp_conn_req_max 1024
#raise the tcp send/receive buffer high-water marks and the
#maximum congestion window
/usr/sbin/ndd -set /dev/tcp tcp_xmit_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_recv_hiwat 65535
/usr/sbin/ndd -set /dev/tcp tcp_cwnd_max 65534
#Ignore those people who tell you to modify your /etc/system
#changes to the kernel are dangerous and you pretty much get the
#same results from /etc/rc2.d
#The only difference is that /etc/system changes work in single user
#mode. If it means that much, change /etc/rc1.d, too.

[IBM suggested to me that we should downgrade Solaris
system nfs version from v3 to v2.--vicky lau]

However, the Sun server is perfectly capable of auto-switching to nfs v2,
although nfs v2 is not very efficient.

Solaris is even less efficient when it has to switch between nfs v2 and 3.

Gigabit products won't be out till later this year.

If your company can afford it, you might consider SONET (ATM-622).

That's about the only thing I can think to throw at it. That way the
packets will get there quicker and *hopefully* won't have time to timeout.

Solaris 2.6 *should* fix any Solaris related network problems.

I just reproduced your error. Not on a subnet though.

SUNW,hme0: late collision ???

What's the deal with your CPU? I had about 20 engineering students who
all decided to do their analysis at the same time. They *choked* my
Ultra Enterprise 2. Oh well, I told them that they should have bought a
3000.

My load average was above 5 at one point and the idle on the CPU was 0.0%.

Now I'm getting collisions, etc.

That's an easy explanation. The CPU was so busy with everything else
that the scheduler didn't have time for the network.

The fact that you even have a queue column suggests a problem. However,
I'm not convinced that throwing higher bandwidth at it is the proper
solution.
================
Check the number of nfs daemons running on your server to serve the
clients. If it is insufficient, please change the value and restart the
daemon, e.g. nfsd XX, where XX stands for the number of daemons.
================
SRDB ID: 11153

SYNOPSIS: NFS mounts hang, get SERVER NOT RESPONDING

DETAIL DESCRIPTION:

Hosts remotely mount other hosts, often times via routers, bridges with
straight NFS mounts or with automount. I tried installing various kernel
and NFS Jumbo patches with no success.

Any process accessing those remote hosts just hangs forever or gets
the message:
NFS SERVER <servername> NOT RESPONDING
...sometimes followed by NFS SERVER <servername> OK

showmount -e <servername> listed the remote exports file as I expected, so
I know that I can talk to the server's NFS process.

SOLUTION SUMMARY:

NFS SERVER NOT RESPONDING means that many things could be at fault:
1. The server is down or unable to respond (e.g. too busy)
2. The network is not reliable
3. Software or firmware problems on any component in the network, possibly
including the NFS client and/or NFS server.

For case 1, lighten the load of the server, or migrate files to a less busy
server.

For cases 2 and 3, we recommend the following workaround of
changing mount options, as in the following examples:

/usr/etc/mount -orsize=1024,wsize=1024,timeo=15 server:/disk /mnt (SunOS)

/usr/sbin/mount -F nfs -o rsize=1024,wsize=1024,timeo=15 server:/disk /mnt
(Solaris)

The 1024 read and write packet size allows NFS requests/responses to squeeze
inside a single network (e.g. ethernet) packet instead of the default 8k size.
This helps eliminate fragmentation across a bridge or router, as well as
UDP packet reassembly, although the actual NFS performance is somewhat slower.
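
The arithmetic behind this can be sketched as follows. An NFS transfer
over UDP travels as one datagram, which IP splits into fragments of at
most MTU minus 20 bytes (the IP header). The 128-byte RPC/NFS header
overhead used below is an illustrative guess, not a measured figure,
and the function name is invented.

```shell
#!/bin/sh
# Estimate how many IP fragments one NFS-over-UDP transfer needs.
# Arguments: data size in bytes, MTU (default 1500), and an assumed
# RPC/NFS header overhead in bytes (default 128, illustrative only).
nfs_udp_fragments() {
    data=$1
    mtu=${2:-1500}
    overhead=${3:-128}
    per_fragment=$(( mtu - 20 ))          # room left after a 20-byte IP header
    datagram=$(( data + overhead + 8 ))   # plus the 8-byte UDP header
    # Round up to a whole number of fragments.
    echo $(( (datagram + per_fragment - 1) / per_fragment ))
}
```

Under these assumptions an 8192-byte transfer needs 6 fragments on
ethernet, and losing any one of them means the whole request is
retransmitted, while a 1024-byte transfer fits in a single packet.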

Increasing the initial request timeout from 7 to 15 (units are tenths of
seconds) often helps in congested networks.

Please note that we also recommend installation of any NFS and Kernel
jumbo patches.

PRODUCT AREA: Gen. Network
PRODUCT: NFS
SUNOS RELEASE: any
HARDWARE: any

Thank you, everyone.

Vicky Lau


This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:11:56 CDT