SUMMARY of responses to:
OS 4.0.3 diskless client, NFS errors, hangs with large programs
Problem:
Between zero and five instances of the following message occur during boot:
NFS write error 13 on host utig fh 304 1 80000 46e84 1edd0000
80000 46e84 1edd0000
and the client will hang when a large program is run.
Sunview clears the background, but makes no windows.
Small programs run just fine, including NFS, YP, net stuff.
No network or nfsstat recorded errors.
This system has worked fine until a power failure Sunday last.
It has not been reconfigured since March0. I did boot another
kernel (the Xkernel) about 3 weeks ago, using the same ip number
but a different name, but it has been rebooted several times since
then without any problem.
Other diskless nodes w/ same server and configs still work fine.
The server was rebooted, which changed nothing.
Further information:
The "mon" utility was run on the client while either sunview
or vi on a 1/2-megabyte file was started. The "hang" is caused by a very high
rate (>4000 pages/second) of paging, even though there is some free memory.
This paging rate continues *even though vi is idle and waiting for an
interactive command*. System cpu is higher than 50% during all this.
Vi will eventually exit, but sunview will never succeed.
Mouse response is very sluggish (you bet!).
Etherfind -r -host client on another (fast) system shows no apparent
problems during the booting.
If savecore is attempted during rc.local, it fails with Sysmap error.
If L1-A, then g0, is done, the panic can't write the dump to the swap space
either; it also fails with error 13.
All these things point to a write-locked swap space for the client.
Permissions look ok on the server and on the client. exportfs looks fine.
Comments and possible solutions suggested:
0) use man 2 intro to find out the English for error 13. [yup]
1) make sure root and rw access are permitted for the client in the server
file /etc/exports. Run exportfs -av to make sure. Especially if this client's
swap is not where most of your clients have swap, check this.
[It looked fine, just like the other 4 clients who continued to work fine.]
2) make sure client's /tmp is mode 777 or 1777 [ok]
3) rm /etc/xtab and exportfs -va on server. [see below]
4) this problem showed up in 4.0.1 for someone, but I don't
know how/if it was ever resolved.
5) 4.1 is purported to have new utility "showfh" to print info about
the NFS file handle (which is what the string of hex digits above is).
There is also an rpc.showfhd, but at least one person has had trouble making
it work.
6) see /usr/share/sys/nfs/nfs.h for some idea of what is in the file
handle.
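For item 0, the meaning of error 13 can also be looked up from a script rather than the man page. This is a minimal sketch in Python, not something from the original responses; it assumes a Unix-like host where errno 13 carries its traditional meaning, and the helper name describe_errno is made up for illustration.

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, message text) for a numeric errno value."""
    # errno.errorcode maps numbers to names such as 'EACCES'.
    name = errno.errorcode.get(code, "unknown")
    # os.strerror returns the host's message text for that number.
    return name, os.strerror(code)


name, message = describe_errno(13)
print(name, "-", message)
```

On the systems discussed here, 13 is EACCES (permission denied), which fits the write-locked swap theory above.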
auspex!guy@uunet.UU.NET (Guy Harris) writes:
< Dunno if you know this already, but cracking that as a 4.x-style file
< handle says that the file in question is probably on "/dev/xy0e" (0304
< is major 3, minor 4, or "/dev/xy0e"), and either inode 70 (0x46) or
< perhaps 290436 (0x46e84) - I think it's 70 (the 304 and 1 are 8 bytes of
< file system ID, the 1 just being the index of UFS in the VFS table, the
< next 0008 being the length of the UFS file ID, of 8 bytes, the 00000046
< being the inode, and the 6e841edd being the "generation number" in the
< inode).
< What's odd is that the second file handle in there, which is the file
< handle of the "export point", is identical; it's normally a directory.
< I think when exporting the swap file to a diskless client, the file
< itself is the "export point". I wouldn't be surprised to find that the
< swap files for the clients are on "/dev/xy0e", and that inode 70 (or
< whatever) is the client's swap file.
In the other examples of similar NFS error 13 problems I was sent, the two
file handles were in fact different.
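The arithmetic in the file handle discussion above can be checked with a short script. This is only a sketch in Python under stated assumptions: it treats the two wrapped console lines as one message, uses the traditional 8-bit major / 8-bit minor split of the device number, and takes no position on where the inode sits inside the handle (it just converts both candidates). The function names are hypothetical.

```python
def parse_fh_words(message):
    """Return the hex words that follow 'fh' in the console message."""
    fields = message.split()
    start = fields.index("fh") + 1
    return [int(word, 16) for word in fields[start:]]


def split_dev(dev):
    """Split a 16-bit device number into (major, minor), 8 bits each."""
    return (dev >> 8) & 0xFF, dev & 0xFF


# The message from the problem report, with its two lines joined (assumption).
MESSAGE = ("NFS write error 13 on host utig fh 304 1 80000 46e84 1edd0000 "
           "80000 46e84 1edd0000")

words = parse_fh_words(MESSAGE)
major, minor = split_dev(words[0])

# Words 2-4 belong to the file, words 5-7 to the export point.
file_part, export_part = words[2:5], words[5:8]

print("device major/minor:", major, minor)
print("inode candidates:", 0x46, 0x46E84)
print("file and export parts identical:", file_part == export_part)
```

The comparison on the last line is the quick way to tell this case (swap file exported directly) from the other reported cases, where the two parts differed.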
What fixed my problem:
In short, Pure Fantastic Magic.
Having beat my head against this problem for a couple of days,
I started to change many things at once.
0) wait a day. It is now Wednesday, not Sunday, Monday or Tuesday. :)
1) edit /etc/exports and remove some old references to hosts no longer
in hosts file. Nothing to do with this client.
2) rm /etc/xtab, exportfs -av
Thanks to Tad Guy <tadguy@abcfd01.larc.nasa.gov>
3) edit /etc/rmtab and clear out some old garbage which accumulated when
we used full domain names for hostnames (eg foo.ig.utexas.edu instead of
the short name foo). I point out that these extra names exist for all of
our diskless clients.
Reboot client - and it works. It still works. I don't know why.
I have restored references to old hosts in /etc/exports, and it
still boots. I can't restore rmtab to its former state.
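For anyone wanting to look for the same kind of rmtab clutter described in step 3, the following Python sketch separates entries by host name form. It assumes the usual rmtab layout of one "hostname:directory" entry per line; the sample entries (host bar and all paths) are invented for illustration, and the sketch only classifies lines in memory rather than editing /etc/rmtab.

```python
def split_rmtab(lines):
    """Separate 'host:directory' lines into short-name and dotted-name lists."""
    short, qualified = [], []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # skip blank lines
        host = entry.split(":", 1)[0]
        # A dot in the host part suggests a full domain name
        # (it would also match a numeric address, so review by hand).
        if "." in host:
            qualified.append(entry)
        else:
            short.append(entry)
    return short, qualified


# Hypothetical sample data, not taken from the real server.
sample = [
    "foo:/export/swap/foo",
    "foo.ig.utexas.edu:/export/swap/foo",
    "bar:/export/root/bar",
]
keep, stale = split_rmtab(sample)
print("short names:", keep)
print("full domain names:", stale)
```

Whether removing the dotted entries matters is unknown; as noted, the rmtab change could not be undone to test it.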
Conclusion: Most likely, item 2 is the relevant change.
Contact me by mail if you want the raw comments.
Thanks to all who responded.
mw