To sun managers:
I have finally been able to get together the summary on
this question. Many thanks to the responders.
------------------------------------------------------------------------------
Original problem is:
Subject: Re: NFS error messages - HELP if possible
Dear Sun Managers,
I have a weird NFS problem, as described below:
============================================================================
Setup of the machines involved:
1-SPARCsystem 670 running Solaris 2.4, operating as Oracle7 DataBase server
1-SS20 running SunOS 4.1.4, operating as application server,
exporting binary and executable code for the application to other
machines listed here
1-SS20 running SunOS 4.1.4, operating as home directory server,
exporting users' home directories to all the other machines listed here
6-SS20's running SunOS 4.1.4, operating as compute engines for users
to log in and execute programs, using executables mounted from the
application server listed above, using home directory mounted from the
home directory server listed above.
All 9 machines above have two TCP/IP interfaces, a public one on le0 and a
private one on le1. The mount points are specified in /etc/fstab files as
being from the private (le1) network's machine name (that is, SunMachine_1
is the name on public network, and SunMachine_a is the name of the same machine
on the private network).
The intention of this was to reduce network traffic on the public net and
make transmission of data between the above machines on the private network
faster and more reliable.
My problem is, recently in the /var/adm/messages file of some of these servers,
messages similar to the following have appeared:
SunMachine_6 vmunix: NFS write error 70 on host SunMachine_d fh 410 1
a0000 38556 5515f908 a0000 2 2a884d54
SunMachine_1 vmunix: NFS write error 60 on host SunMachine_d fh 408 1
a0000 12176 2e70bd2e a0000 2 7d4bff42
Batch programs running on these machines fail and files that need to be
produced do not get created, either correctly or in time for the batch process
that needs them.
Has anyone seen similar messages, and can you advise as to the cause or where
else I might look to solve this problem?
-------------------------------------------------------------------------------
RESPONSES and SUGGESTIONS FOLLOW:
Thanks to the following for their responses:
bismark@alta.jpl.nasa.gov
ebumfr@ebu.ericsson.se
jayl@lattice.com
rachel@juno.virago.org.au
sys013@abdn.ac.uk
glenn@uniq.com.au
===============================================================================
Check the SUN SYSADMIN FAQ.
For 4.1.X, 60 is "connection timed out".
The "timed out" pretty much means what it says. Either the server
is extremely heavily loaded, or you have more general network problems.
That server is not responding or your net may have too much traffic.
For 4.1.X, 70 is "stale NFS file handle".
The server's file system is not currently mounted.
The "stale NFS file handle" is caused by the following sequence:
- file/directory on the server is opened on the client
- file/directory is deleted on the server, by the server or by another client
- file/directory is accessed on the client
This is almost certainly an operational/application/user problem,
not an NFS problem.
These messages can arise if you are using quotas and a user tries to
write over quota.
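As a rough illustration of the decoding described above, here is a small Python sketch that pulls the error number and server name out of a logged message and looks the number up in a table. The table holds only the two SunOS 4.1.x values quoted in this summary; the function and table names are made up for the example, and other systems number these errors differently.

```python
import re

# SunOS 4.1.x error numbers quoted in this summary. Other systems use
# different numbers, so this table is not portable.
SUNOS4_ERRNO = {
    60: "ETIMEDOUT: connection timed out",
    70: "ESTALE: stale NFS file handle",
}

MSG = re.compile(r"NFS write error (\d+) on host (\S+)")

def decode_nfs_error(line):
    """Return (server, errno, meaning) for a logged line, or None if no match."""
    m = MSG.search(line)
    if m is None:
        return None
    code = int(m.group(1))
    return m.group(2), code, SUNOS4_ERRNO.get(code, "not in table")

print(decode_nfs_error(
    "SunMachine_6 vmunix: NFS write error 70 on host SunMachine_d fh 410 1"))
```

Run over /var/adm/messages line by line, something like this would show at a glance which server each client is complaining about and whether the complaints are timeouts or stale handles.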
===============================================================================
NFS write error 70 has usually meant that our remote disk was full when a
write was attempted.
===============================================================================
relevant header file: /usr/include/sys/errno.h
Do a 'netstat -i' on the client and server. Does either show significant
error counts? There should be almost none.
Do a 'nfsstat -c' on the client. "retrans" and "timeout" should
both be very low, < .001%.
Do a ping -s. Any packet loss? Variation in ping time?
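To make the nfsstat check above concrete, here is a hedged Python sketch that turns a header line and a line of counters into retransmission and timeout percentages. It assumes the client RPC section prints a row of counter names (including "calls", "retrans" and "timeout") followed by a row of values; that layout and the numbers below are assumptions for illustration, not output from a real machine.

```python
def rpc_rates(header_line, value_line):
    """Pair counter names with values; return retrans/timeout as % of calls."""
    counters = dict(zip(header_line.split(), map(int, value_line.split())))
    calls = counters["calls"]
    if calls == 0:
        return {"retrans": 0.0, "timeout": 0.0}
    return {name: 100.0 * counters[name] / calls
            for name in ("retrans", "timeout")}

# Made-up numbers for illustration only.
header = "calls badcalls retrans badxid timeout wait newcred timers"
values = "2000000 0 15 0 15 0 0 40"

rates = rpc_rates(header, values)
for name, pct in rates.items():
    # 0.001% is the threshold suggested above.
    print(name, pct, "ok" if pct < 0.001 else "high")
```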
===============================================================================
NFS write error on host variable: No space left on device.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
This console message indicates that an NFS-mounted partition has
filled up and cannot accept writing of new data. Unfortunately,
software that attempts to overwrite existing files will usually
zero out all data in these files. This is particularly
destructive on NFS-mounted /home partitions.
Find the user or process that is filling up the filesystem, and
get the out-of-control process stopped as soon as you can. Then
delete files as necessary to create more space on the filesystem
(large core files are good candidates for deletion). Have users
write any modified files to local disk if possible. If this error
occurs often, redistribute directories to ease demand on this
partition.
For more information on disk usage, see the System Administration
Guide, Volume II. If you are using the AnswerBook, "managing
disk use" is a good search string.
NFS write failed for server variable: RPC: Timed out
- - - - - - - - - - - - - - - - - - - - - - - - - - -
This error can occur when a file system is soft-mounted, and
server or network response time lags. Any data written to the
server during this period could be corrupted.
If you intend to write on a filesystem, never specify the soft
mount option. Use the default hard mount for all the filesystems
that are mounted read-write.
For more information, see the chapter on NFS troubleshooting in
the NFS Administration Guide.
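One way to audit for the soft-mount problem described above is to scan the fstab entries for NFS mounts that are both writable and soft. The Python sketch below assumes the usual whitespace-separated fstab fields (device, mount point, type, options, ...); the sample entries and paths are invented for the example.

```python
def soft_rw_mounts(fstab_text):
    """Return mount points of NFS entries that are soft-mounted and writable."""
    risky = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        fields = line.split()
        if len(fields) < 4 or fields[2] != "nfs":
            continue
        options = fields[3].split(",")
        # No "ro" means read-write; no "soft" means the default hard mount.
        if "soft" in options and "ro" not in options:
            risky.append(fields[1])
    return risky

# Hypothetical entries; the exported paths are made up.
sample = """\
SunMachine_d:/export/home /home nfs rw,soft,bg 0 0
SunMachine_b:/export/apps /apps nfs ro,soft 0 0
SunMachine_c:/export/data /data nfs rw,hard,intr 0 0
"""
print(soft_rw_mounts(sample))
```

Only the first sample entry is reported: the second is soft but read-only, and the third is writable but hard-mounted.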
=========================================================================
The manual page intro(2) lists all the errors:
ETIMEDOUT 60 Connection timed out
A connect request or an NFS request failed because the
party to which the request was made did not properly
respond after a period of time. (The timeout period is
dependent on the communication protocol.)
ESTALE 70 Stale NFS file handle
An NFS client referenced a file that it had opened but
that had since been deleted.
and showfh(8) can convert the file handle (fh) from 8 sets of numbers
to a file name for you:
showfh - print full pathname of file from the NFS file handle
SYNOPSIS
/usr/etc/showfh server_name num1 num2 ... num8
this is the filehandle: 408 1 a0000 12176 2e70bd2e a0000 2 7d4bff42
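As a sketch of how the two pieces fit together, the Python below takes a logged message (which wraps onto a second line in the messages file), extracts the server name and the eight file handle numbers, and builds the showfh argument list per the synopsis above. The function name is made up for the example, and it only builds the command; it does not run it.

```python
import re

# "on host <server> fh" followed by exactly eight hex fields.
FH = re.compile(r"on host (\S+) fh((?:\s+[0-9a-fA-F]+){8})")

def showfh_argv(message):
    """Build the showfh argument list from a logged message, or None."""
    # The message wraps in the log, so collapse line breaks first.
    m = FH.search(" ".join(message.split()))
    if m is None:
        return None
    return ["/usr/etc/showfh", m.group(1)] + m.group(2).split()

logged = """SunMachine_1 vmunix: NFS write error 60 on host SunMachine_d fh 408 1
a0000 12176 2e70bd2e a0000 2 7d4bff42"""
print(" ".join(showfh_argv(logged)))
```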