SUMMARY: frequent sun4m hangs (network?)

From: Margarita Suarez (marg@watsun.cc.columbia.edu)
Date: Mon Jun 09 1997 - 15:07:35 CDT


[ My original post follows. ]

It may be too early to tell, but we seem to have solved memory
allocation hangs by upgrading the kernel to SunOS 4.1.4. We're running
4.1.3_u1 binaries and libc with up-to-date recommended patches, except
the kernel is 4.1.4 with the following patches installed:

        102394-02 NFS Jumbo
        102436-02 kernel hangs
        102516-05 UFS patch (statd security)

We have now gone four days without a hang (they had been happening once
or twice a day). Let's see if posting this message causes the hangs to
recur. :-)

Kudos to Rich Kulawiec <rsk@itw.com> for the suggestion to upgrade to
SunOS 4.1.4:

> I'd suggest going to SunOS 4.1.4 plus the patches; that's a pretty easy
> move from 4.1.3_U1 and at worst, should do no harm -- at best, it may
> solve the problem. The reason that I suggest this is that I've observed
> shared memory allocation problems in SunOS 4.1.3 that don't seem to
> be there in 4.1.4. The situation *was* somewhat unique, but in short
> attempting to deallocate a shared memory segment which was still attached
> to a process didn't seem to work, and the consequence of this was
> memory exhaustion -- although on Sun4/Sun4c hardware, it didn't result
> in hangs.

> So, in other words, I have circumstantial evidence that suggests some
> sort of memory allocation problem in the 4.1.3 kernel as well; but I've
> never managed to nail the bug exactly. I just stepped around it by
> going to 4.1.4, where it doesn't seem to be happening. It's possible
> that you're seeing symptoms of the same or related bug, and that going
> to 4.1.4 will get you through the next few months.

When I called to close our call with Sun, the engineer stated that this
explanation was congruent with evidence from the crash dumps we sent
him. I asked him to put this information in the call.

Additional thanks to Larry L. Loreman <lloreman@mail.state.tn.us> who
suggested increasing maxusers. We already had maxusers set to 256, which
seems to be above even the maximum of 200 that the Sun engineer
recommended for sun4m. Values higher than 256 made the kernel too big to
boot. Also, mbufs apparently do not depend on maxusers.

Alan Thew <Alan.Thew@liverpool.ac.uk> said he had experienced similar
hangs, which stopped after he upgraded to Solaris 2.5.1.

thanks everyone!

marg@columbia.edu

Margarita M. Suarez
Columbia University
UNIX Systems Group

--------------
Original Post:

hello sun managers,

we have a bunch of sparc20's with 200 MB of memory running SunOS
4.1.3_U1 with current recommended patches. each has around 35
X-terminals hanging off of it. most sessions are running netscape 3.01.

several times a week (almost daily), one of these machines will hang and
need to be crashed from the prom. they still answer pings, but
processes seem to have stopped (no response on console, cron stops
running, can't telnet -- nothing seems to be getting any cycles). at
the time we crash it, the system appears to be running in kernel space,
allocating mbufs madly.

for example, just before the system hung, netstat -m showed the
following:

    1701/2016 mbufs in use:
            299 mbufs allocated to data
            233 mbufs allocated to packet headers
            448 mbufs allocated to socket structures
            685 mbufs allocated to protocol control blocks
            10 mbufs allocated to routing table entries
            3 mbufs allocated to socket names and addresses
            20 mbufs allocated to zombie process information
            3 mbufs allocated to interface addresses
    10/36 cluster buffers in use
    288 Kbytes allocated to network (77% in use)
    0 requests for memory denied
    0 requests for memory delayed
    0 calls to protocol drain routines

    streams allocation:
                                           cumulative allocation
                       current   maximum       total    failures
    streams                 13        16         476           0
    queues                  52        64        1902           0
    mblks                   26       177      693436           0
    dblks                   26       177      693436           0
    streams buffers:
    external                 0         0           0           0
    within-dblk              0       132      278216           0
    size <= 16               0         5        5762           0
    size <= 32               0         3       24954           0
    size <= 128             26        86      364673           0
    size <= 512              0         2        8375           0
    size <= 1024             0         2       11126           0
    size <= 2048             0         1         330           0
    size <= 8192             0         0           0           0
    size > 8192              0         0           0           0
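
the totals in the mbuf section above can be cross-checked with a little
arithmetic. this is only a minimal python sketch: it assumes 128 bytes per
mbuf and 1024 bytes per cluster buffer (sizes inferred from the figures
themselves, not read from the kernel headers), and the function names are
made up for illustration.

```python
# assumed sizes, inferred from the netstat -m totals (not from kernel headers)
MBUF_BYTES = 128       # assumption: bytes per mbuf
CLUSTER_BYTES = 1024   # assumption: bytes per cluster buffer


def network_kbytes(mbufs_total, clusters_total):
    """memory set aside for mbufs plus cluster buffers, in whole Kbytes."""
    return (mbufs_total * MBUF_BYTES + clusters_total * CLUSTER_BYTES) // 1024


def percent_in_use(mbufs_used, mbufs_total, clusters_used, clusters_total):
    """share of that memory currently in use, truncated to a whole percent."""
    used = mbufs_used * MBUF_BYTES + clusters_used * CLUSTER_BYTES
    total = mbufs_total * MBUF_BYTES + clusters_total * CLUSTER_BYTES
    return 100 * used // total


# figures from "1701/2016 mbufs in use" and "10/36 cluster buffers in use"
print(network_kbytes(2016, 36))            # 288
print(percent_in_use(1701, 2016, 10, 36))  # 77
```

under those assumed sizes the sketch reproduces the reported "288 Kbytes
allocated to network (77% in use)", so the pool itself looks internally
consistent at this point.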

when we crashed the machine about 1/2 hour later, we saw this (gotten
from the crash dump):

    12729/12800 mbufs in use:
            11214 mbufs allocated to data
            224 mbufs allocated to packet headers
            435 mbufs allocated to socket structures
            664 mbufs allocated to protocol control blocks
            10 mbufs allocated to routing table entries
            159 mbufs allocated to socket names and addresses
            20 mbufs allocated to zombie process information
            3 mbufs allocated to interface addresses
    21/36 cluster buffers in use
    1636 Kbytes allocated to network (98% in use)
    0 requests for memory denied
    0 requests for memory delayed
    0 calls to protocol drain routines

    streams allocation:
                                           cumulative allocation
                       current   maximum       total    failures
    streams                 13        16         476           0
    queues                  52        64        1902           0
    mblks                   44       177      693645           0
    dblks                   44       177      693645           0
    streams buffers:
    external                 0         0           0           0
    within-dblk             12       132      278335           0
    size <= 16               6         6        5769           0
    size <= 32               0         3       24962           0
    size <= 128             26        86      364742           0
    size <= 512              0         2        8380           0
    size <= 1024             0         2       11127           0
    size <= 2048             0         1         330           0
    size <= 8192             0         0           0           0
    size > 8192              0         0           0           0

note that the number of mbufs in use has grown from 1701 to 12729 (about
7.5 times), and nearly all of that growth is in mbufs allocated to data
(299 to 11214).
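
to see where the growth is, the mbuf sections of the two listings can be
compared category by category. a rough python sketch follows; parse_mbufs
and growth are hypothetical helper names, and the input is simply the
netstat -m text pasted from the two listings above.

```python
import re

BEFORE = """
1701/2016 mbufs in use:
        299 mbufs allocated to data
        233 mbufs allocated to packet headers
        448 mbufs allocated to socket structures
        685 mbufs allocated to protocol control blocks
        10 mbufs allocated to routing table entries
        3 mbufs allocated to socket names and addresses
        20 mbufs allocated to zombie process information
        3 mbufs allocated to interface addresses
"""

AFTER = """
12729/12800 mbufs in use:
        11214 mbufs allocated to data
        224 mbufs allocated to packet headers
        435 mbufs allocated to socket structures
        664 mbufs allocated to protocol control blocks
        10 mbufs allocated to routing table entries
        159 mbufs allocated to socket names and addresses
        20 mbufs allocated to zombie process information
        3 mbufs allocated to interface addresses
"""


def parse_mbufs(text):
    """return (in_use, total, {category: count}) from netstat -m text."""
    in_use = total = None
    by_type = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"(\d+)/(\d+) mbufs in use", line)
        if m:
            in_use, total = int(m.group(1)), int(m.group(2))
            continue
        m = re.match(r"(\d+) mbufs allocated to (.+)", line)
        if m:
            by_type[m.group(2)] = int(m.group(1))
    return in_use, total, by_type


def growth(before, after):
    """per-category change in mbuf count, largest increase first."""
    names = list(before) + [n for n in after if n not in before]
    deltas = [(n, after.get(n, 0) - before.get(n, 0)) for n in names]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


used0, total0, types0 = parse_mbufs(BEFORE)
used1, total1, types1 = parse_mbufs(AFTER)

print(f"{used1 / used0:.1f}x")          # 7.5x
for name, delta in growth(types0, types1)[:2]:
    print(f"{delta:+6d}  {name}")       # +10915 data, then +156 socket names and addresses
```

the per-category counts add up to the "in use" totals in both listings, and
the comparison puts essentially all of the increase in data mbufs, while
socket structures and protocol control blocks are slightly down.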

sun tells us it's a bug in netscape, our machines are too small to
handle the load, etc. but these machines run fine for a few days before
this happens, so we think there's either a kernel memory leak or some
other kernel bug causing it to freak out suddenly.

we plan to upgrade to solaris this summer, but in the meantime we need
to get some sleep.

can anyone help?

thanks

marg@columbia.edu

Margarita M. Suarez
Columbia University
UNIX Systems Group


