My inquiry included:
----------------------
A user here frequently runs a large Fortran program, and once in every
50-100 times it hangs in an uninterruptible wait state. Here's the
output of ps -uax from the current instance:
USER       PID %CPU %MEM    SZ  RSS TT STAT START  TIME COMMAND
mizzi    12860  0.0  6.8 79936 8540 p3 D N  13:26  0:18 model
The machine is a Sun 6/490 with 4 cpus running 4.1.2, with 128 MB of
physical memory.
We think the program is bombing on a read statement, but the exact location
has not been determined. The file being read resides on a disk attached to
the same machine, so it's apparently not NFS-related. If the program is
restarted after a copy goes into the wait state, the second instance does
the same almost immediately.
----------------------
A few responses indicated that it may be a heap allocation bug which can
be fixed by upgrading to 4.1.3 or installing patches 100330 (which may have
been superseded), 100516-01, 100537-01, 100570-01 and/or 100689. We plan to
examine the bugs these patches are supposed to fix and install those that
may be relevant.
Several responses suggested using trace, adb or ps -axl to help determine
where the problem occurs. The user has been provided a script that will
record a trace when he runs the model (while still writing the model output
to his xterm), and we'll try some of the other ideas should the model hang
again.
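For anyone wanting to do something similar, here is a rough sketch of the
"record it while still showing it on the terminal" idea. It is not the
script the user was given; it is written in Python for illustration, and
the names run_and_log, log_path and echo are made up for this example. It
copies everything the command prints (stdout and stderr together) to both
the terminal and a log file, flushing after every line so the log is
current if the program hangs. How trace output and model output would be
kept apart is not covered here.

```python
import subprocess
import sys


def run_and_log(command, log_path, echo=sys.stdout):
    """Run `command` (a list of arguments), copying each line it prints
    to both `echo` and the file at `log_path`. Returns the exit status.

    All names here are hypothetical; this is a sketch, not the script
    mentioned in the summary.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same stream
            text=True,
        )
        for line in proc.stdout:
            echo.write(line)
            echo.flush()
            log.write(line)
            log.flush()  # keep the log up to date in case of a hang
        return proc.wait()
```

The command to run would be passed as a list, for example the tracing
tool followed by the program name. If the process ends up in an
uninterruptible wait, the last lines in the log show how far it got.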
Thanks to all who responded:
leclerc@eps.slb.com (Leclerc Francois)
cyerkes@jpmorgan.com (Chuck Yerkes)
jpd@ucs.usl.edu (James Dugal)
tonyr@tekadg.adg.tek.com (Tony Rick)
srm@shasta.gvg.tek.com (Steve Maraglia)
kevin%ups.uucp@fourx.Aus.Sun.COM (Kevin Sheehan)
Amir Ilbeig <ilbeig@math.UH.EDU>
panissec@nms.otca.oz.au (Colin Panisset)
stern@sunne.East.Sun.COM (Hal Stern)
trdlnk!mike@uunet.UU.NET (Michael Sullivan)
wwtz@ciba-geigy.ch (Wolfgang Wetz)
--------------------
Chuck D'Ambra
National Center for Atmospheric Research
P.O. Box 3000
Boulder, CO 80307-3000
USA