SUMMARY: Sun4 dies spontaneously

From: Gustavo Vegas (gustavo@davinci.concordia.ca)
Date: Mon Oct 05 1992 - 15:58:05 CDT


Hello to all sunmanagers,
        This is a long-overdue summary. A long while ago, I wrote:

        We are having the most bizarre experience with our main server
here. The machine has crashed about 6 times today, and I hope to be able
to put this message on the wire before it does so again. I apologize
for the length of the message, but I want to cover everything that
may help in diagnosing this problem.
[....stuff deleted...]

Background: This machine had been working fine with the current setup for
                a number of months. The last modification, about a week
                ago, was the addition of a previously used and recycled
                CDC 9720-1230MB disk drive, which gave me a headache
                some people here may remember :-). This disk seemed to
                be behaving well. Today, out of the blue, the machine
                crashed with a "BAD TRAP" Data fault (message from dmesg):
                BAD TRAP
                pid 148, `nfsd': Data fault
                kernel read fault at addr=0xe0000007, pme=0x70000080
                Bus Error Reg 0
                pc=0xf8006bac, sp=0xf8114fa0, psr=0x11900ec0, context=0x0
                g1-g7: e0000007, 8000000, ffffffff, 85, f82cf000, f813dc00, f813dc00
 
                The machine rebooted, saved the kernel, and crashed again
                with the same error; halfway through rebooting it crashed
                once more, same thing. We brought it back up in single-user
                mode and analyzed the core dumps from the 2 crashes, finding
                that the error was always in the same place, in a kernel
                routine. Running adb -k vmunix.X vmcore.X (X from 0 to 6),
                we always found very similar stuff, like:

                adb -k vmunix.2 vmcore.2
                physmem ffd
                $c
                _panic(0xf813551b,0xf8114f54,0xe0000007,0x0,0x7c08b,0xf82ceb28) + 6c
                _trap(0x9,0xf8114f54,0xe0000007,0x0,0x1,0x50) + 2a0
                fault(0x0,0x1000,0x23a000,0x110005e0,0xff034198,0x80) + 94
                _vme_read_vector(?)
                pri_set_common(0x700,0x119007e2,0x119000e2,0xf813e71a,0xfd8092f0,0xf813e718) + 48
                _rrokfree(0xff656c80,0x0,0x5c8,0x0,0x110000e2,0x0) + 108
                _svckudp_send(0xfd844d08,0xf82cec48,0x1,0xfd846fc4,0xff65ba80,0x2aa8cc60) + 140
                _svc_sendreply(0xfd844d08,0xf80334ac,0xfd86cde0,0xfd8549e0,0x0,0xfd846fac) + 48
                _rfs_dispatch(0xf8124edc,0xfd85cfa0,0xfd844d08,0xf812d330,0x1,0xfd85cfa0) + 790
                _svc_getreq(0xfd844d08,0xfd8472b0,0x0,0x0,0xffffffff,0xfd84ffb8) + 11c
                _svc_run(0xfd844d08,0x186a3,0x2,0xf8028b5c,0x0,0x110000e7) + 4c
                _nfs_svc(0xf82cefe0,0x4d8,0xf8125fb8,0xf8126490,0xf82cf000,0xf8126490) + 260
                _syscall(0xf82cf000) + 3b4

                Checking the pc where the fault happened, it was the
                same one for every crash (since it was a kernel routine,
                this may be obvious):
                f8006bac,4?ia
                _vme_read_vector:
                _vme_read_vector: lduba [%g1] 0x2, %l5
                _vme_read_vector+4: wr %l0, 0x20, %psr
                _vme_read_vector+8: cmp %l5, 0xff
                _vme_read_vector+0xc: bg _spurious
                _vme_read_vector+0x10:

Answers: My thanks to all who answered. Three people answered my query:
================================================================================
From: Kevin Sheehan {Consulting Poster Child} <ups!kevin@fourx.Aus.Sun.COM>
I think you tickled the "bad vector" bug. On sun4 machines with a VME
bus, they will handle a spurious interrupt okay (the spurious routine
gets called) but spurious/bad VME vectors cause a problem.

In the vector table are a pointer to function and pointer to arg. The
system sets uninitialized vectors to be a pointer to _<something>_spurious.
The problem is that the arg pointer is 0, and the routine dereferences it
first!!

My guess is that you are not getting 0xff (no vector at all) but a bad
vector, and the system is going mammary glands skyward as a result...

        l & h,
        kev

> to which I responded:
> O.k., this sounds good, but is there any way to keep this from
> happening again?

From: Kevin Sheehan {Consulting Poster Child} <ups!kevin@fourx.Aus.Sun.COM>

One way is to patch all the 0's to the value of something (I wrote a
quick hack to just move the routine address into the arg address) that
won't bite it. As far as I know, the bug wasn't fixed. A quick adb
of the vme_vector table should tell you pretty quickly.

                l & h,
                kev
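
The failure mode and workaround Kevin describes can be sketched in
user-space C. This is only an illustration under stated assumptions, not
SunOS source: the names vec_entry, vec_table, spurious and
vec_patch_null_args, and the 256-slot table size, are all hypothetical,
and the NULL check in spurious() exists only so the sketch can run
without faulting. Kevin's hack stored the routine address in the arg
slot; the sketch points empty slots at a placeholder structure instead,
which is the same idea of giving the routine something readable to
dereference.

```c
#include <assert.h>
#include <stddef.h>

#define NVEC 256   /* assumed: one slot per possible 8-bit vector */

/* Hypothetical slot layout: pointer to function plus pointer to arg. */
struct vec_entry {
    int (*handler)(void *arg);
    void *arg;
};

struct dev_state { int unit; };

static struct vec_entry vec_table[NVEC];
static struct dev_state placeholder = { -1 };  /* harmless target for unused slots */

/*
 * Stand-in for the spurious routine. It reads through arg first.
 * The NULL test is here only to keep the sketch runnable; it returns
 * -2 at the point where, per the description, the kernel would fault.
 */
static int spurious(void *arg)
{
    struct dev_state *st = arg;
    if (st == NULL)
        return -2;
    return st->unit;
}

/* Mark every vector as unused: spurious handler, arg pointer 0. */
static void vec_init(void)
{
    for (size_t i = 0; i < NVEC; i++) {
        vec_table[i].handler = spurious;
        vec_table[i].arg = NULL;
    }
}

/* Workaround: replace every 0 arg with something safe to dereference.
 * Returns the number of slots patched. */
static int vec_patch_null_args(void)
{
    int patched = 0;
    for (size_t i = 0; i < NVEC; i++) {
        if (vec_table[i].arg == NULL) {
            vec_table[i].arg = &placeholder;
            patched++;
        }
    }
    return patched;
}

/* Call whatever is registered for the vector the bus handed back. */
static int vec_dispatch(unsigned char vec)
{
    assert(vec_table[vec].handler != NULL);
    return vec_table[vec].handler(vec_table[vec].arg);
}
```

Before patching, dispatching any unused vector reaches spurious() with
a 0 arg; after vec_patch_null_args() the same dispatch reads the
placeholder and returns normally.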

================================================================================
From: Perry_Hutchison.Portland@xerox.com
On an SS1, I have seen "data fault" and "text fault" problems become
much less frequent after swapping CPU and memory, then go away entirely
after upgrading from 4.0.3c to 4.1.1b. The OS upgrade necessitated
upgrading a third-party streams driver, so the finger cannot be pointed
with certainty.
================================================================================
From: Hal Stern - NE Area Systems Engineer <stern@sunne.East.Sun.COM>

looks like a known bug fixed with the NFS jumbo patch, but to
be sure, generate a symbolic traceback (directions attached)
and see where it died. that's the only thing that's useful
for locating the exact problem.
================================================================================

Following these welcome answers, I mailed a second post, which was answered
only by Kevin Sheehan; his second answer is included above with his first. My
original post included symbolic traces of the crashed kernel and a detailed
description of the problem. The kernel that crashed has an NFS Jumbo patch
installed, and it is the same kernel that is now working fine with a new CPU
board. In any case, we contacted Sun, and their technical rep. came over with
a new CPU board. The machine has stayed up ever since the CPU board was
changed (Sept. 21st).

Thanks,
--------
  ==== === ==== =======================+===========================
  = = = = Gustavo Vegas gustavo@davinci.concordia.ca
  === = === Systems Analyst Concordia University
  = = = = Dept. of E&CE Montreal, Canada
  ==== === ==== =======================+===========================


