Sun Managers,
Two weeks ago I had a question about the following problem:
-- Begin INCLUDED --
The setup is as follows:
1. Server 'sunsvr' (SunOS 4.1.1, no patches) exports the home directories
(/export/home) and the mailbox directory (/var/spool/mail).
2. Client 'hpclnt' (HP-UX 8.02 on an HP9000/847) mounts sunsvr:/export/home
and sunsvr:/var/spool/mail.
This setup has been in operation for a couple of months with no unsolvable
problems.
Suddenly users complained about hanging processes on 'hpclnt', especially
ksh and mailx. These processes refuse to be killed by kill -9.
A reboot is the only way to get rid of these hanging processes.
A subsequent analysis showed that the processes (on hpclnt) are blocked
waiting for a lock on files such as /var/spool/mail/$USER or
~/.sh_history on sunsvr.
A network trace with etherfind revealed the following events:
1. hpclnt sends rpc-call nlockmgr proc 7 to sunsvr to request a lock.
2. sunsvr grants the lock by sending an rpc-call nlockmgr proc 12 to hpclnt
(udp port 1035).
3. hpclnt replies with an ICMP error message "Bad port 1035".
rpcinfo -p on hpclnt shows that rpc.lockd is indeed listening on port
1032 and not 1035.
...
-- End INCLUDED --
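
The port mismatch described in the included message can also be checked mechanically. The following is a minimal Python sketch, not part of the original analysis: it assumes `rpcinfo -p` prints rows of the form "program vers proto port [service]", and the helper names `registered_ports` and `is_stale` as well as the sample text are hypothetical. Only the port numbers 1032 and 1035 come from the report above.

```python
NLOCKMGR_PROG = 100021  # RPC program number under which rpc.lockd registers


def registered_ports(rpcinfo_text, program=NLOCKMGR_PROG, proto="udp"):
    """Return the set of ports listed for one program/protocol pair."""
    ports = set()
    for line in rpcinfo_text.splitlines():
        fields = line.split()
        # Data rows look like: program vers proto port [service]
        if len(fields) < 4 or not fields[0].isdigit():
            continue  # skip the header line and blank lines
        if int(fields[0]) == program and fields[2] == proto:
            ports.add(int(fields[3]))
    return ports


def is_stale(rpcinfo_text, observed_port):
    """True if the port seen in the network trace is not registered."""
    return observed_port not in registered_ports(rpcinfo_text)


# Hypothetical sample, shaped like `rpcinfo -p` output on hpclnt.
SAMPLE = """\
   program vers proto   port  service
    100000    2   udp    111  portmapper
    100021    1   udp   1032  nlockmgr
    100021    3   udp   1032  nlockmgr
"""

print(registered_ports(SAMPLE))  # {1032}
print(is_stale(SAMPLE, 1035))    # True
```

In practice the text would come from running rpcinfo -p on the client, and the observed port from the network trace.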
Many thanks for quick replies to:
casper@fwi.uva.nl
mdl@cypress.com
mondics@tartan.com
root@toy.rad.msu.edu
djc@xanadu.acuson.com
derek@ncc.nexus.ca
miker@sbcoc.com
geertj@philica@unido.uucp
The outcome was the following:
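The problem is caused by a general 'feature' of RPC: a caller is not required
to ask the portmapper before every RPC call and may keep reusing a port it
looked up earlier. This is reasonable for performance, but it causes problems
when the RPC client outlives its RPC server. That is exactly what happens
when an NFS client is rebooted: the server's lockd keeps sending to the port
it learned earlier (1035 in the trace), while the client's restarted lockd
listens on a different one (1032).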
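
The failure mode can be reproduced in a toy model. The sketch below is an illustration only and assumes nothing about the real lockd or portmapper code: the classes `Portmapper`, `ClientHost` and `CachingCaller` are hypothetical, and no network traffic is involved. The caller asks the portmapper once, caches the answer, and keeps using it after the other side has restarted on a new port.

```python
NLOCKMGR = 100021  # RPC program number of the lock manager


class Portmapper:
    """Toy stand-in for a portmapper: program number -> current port."""

    def __init__(self):
        self.table = {}

    def register(self, program, port):
        self.table[program] = port

    def lookup(self, program):
        return self.table[program]


class ClientHost:
    """Toy NFS client whose lockd may come up on a different port per boot."""

    def __init__(self):
        self.portmapper = Portmapper()
        self.listening = set()

    def boot(self, lockd_port):
        # A reboot discards the old socket and registers the new port.
        self.listening = {lockd_port}
        self.portmapper.register(NLOCKMGR, lockd_port)

    def deliver(self, port):
        """Return True if a datagram sent to `port` reaches lockd."""
        return port in self.listening


class CachingCaller:
    """RPC caller that queries the portmapper once and reuses the answer."""

    def __init__(self, target):
        self.target = target
        self.cached_port = None

    def call(self):
        if self.cached_port is None:
            self.cached_port = self.target.portmapper.lookup(NLOCKMGR)
        return self.target.deliver(self.cached_port)


hpclnt = ClientHost()
hpclnt.boot(lockd_port=1035)
sunsvr_lockd = CachingCaller(hpclnt)
print(sunsvr_lockd.call())   # True: the cached port 1035 is still valid

hpclnt.boot(lockd_port=1032)  # client reboot, lockd gets a new port
print(sunsvr_lockd.call())   # False: still sending to the stale port 1035
```

The second call corresponds to the "Bad port 1035" reply in the trace: the portmapper already knows the new port, but the caller never asks again.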
In my opinion, the proper solution in the case of rpc.lockd would be
for Sun to extend the lockd protocol/mechanism so that the lockd on a
server is notified of client reboots (e.g. by the mountd, which knows
when a client mounts or remounts a filesystem). After such a
notification the NFS server's lockd could reinitialize its 'connection' to
the NFS client's lockd.
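
The idea of such a notification can be sketched as a port cache with an explicit invalidation hook. This is only an illustration of the suggestion, not a description of how any existing lockd behaves: `PortCache`, its `client_rebooted` method and the `registry` dictionary standing in for the portmapper are all hypothetical.

```python
class PortCache:
    """Cache of (host, program) -> port with explicit invalidation."""

    def __init__(self, lookup):
        self.lookup = lookup  # callable standing in for a portmapper query
        self.ports = {}

    def port_for(self, host, program):
        key = (host, program)
        if key not in self.ports:
            self.ports[key] = self.lookup(host, program)
        return self.ports[key]

    def client_rebooted(self, host):
        """Hypothetical notification hook: forget all ports cached for host."""
        for key in [k for k in self.ports if k[0] == host]:
            del self.ports[key]


# Stand-in for the portmapper on hpclnt (assumption, no real RPC involved).
registry = {("hpclnt", 100021): 1035}
cache = PortCache(lambda host, program: registry[(host, program)])

print(cache.port_for("hpclnt", 100021))  # 1035, looked up once and cached

registry[("hpclnt", 100021)] = 1032      # client reboots, lockd moves
print(cache.port_for("hpclnt", 100021))  # 1035, stale without notification

cache.client_rebooted("hpclnt")          # the proposed notification arrives
print(cache.port_for("hpclnt", 100021))  # 1032, fresh lookup
```

The cost of the extra lookup is paid only after a reboot notification, so the performance argument for caching is left intact.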
Several people mentioned patches: 100075-07 (not for this problem) and
100075-08 (rpc.lockd JUMBO patch), which needs 100173-07 (NFS Jumbo).
I decided not to pursue the patch, because the problem does not occur
spontaneously and is under control now that I know what can cause the
hanging processes.
Fix: reboot the NFS server, or restart the server's lockd (dangerous?!?)
Gerhard Hertlein
hertlein@pki-nbg.philips.de