Summary: SPARC 1000 Crashing, Cause ??

From: Gregory M Polanski (gmp@adc.com)
Date: Tue Aug 06 1996 - 10:00:22 CDT


The crashing S1000 has been fixed.

The SPARC 1000 crashes may have had two causes. Sun Support replaced
the HSI card, which was definitely bad, and the OS was patched to
101945-41, which is part of the 2.4_Recommended.tar.Z cluster.

1. Bad HSI/S card. The card finally failed outright after
        a reboot (accidentally done before patching the OS).
        The following messages appeared on the console,
        and the system did NOT load the HSI driver.
        
        Invalid FCode start byte at feee0000
        
        Unable to install/attach driver 'HSI'

2. Need For Jumbo Kernel Patch 101945-41. The patch
        included a fix for another S1000, running Ingres,
        that was crashing.
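
A quick way to confirm that a machine is at or beyond the required
kernel patch revision is to scan a list of installed patch IDs for the
base number and compare revisions. The sketch below is only an
illustration: the helper names and the sample input are hypothetical,
and it assumes the patch list has already been captured as text (on
Solaris this would typically come from `showrev -p`, whose exact output
format is not reproduced here).

```python
import re

# Patch IDs look like "101945-41": a six-digit base and a two-digit revision.
PATCH_RE = re.compile(r"\b(\d{6})-(\d{2})\b")


def patch_revisions(text):
    """Map each patch base ID found in `text` to its highest revision."""
    found = {}
    for base, rev in PATCH_RE.findall(text):
        found[base] = max(found.get(base, 0), int(rev))
    return found


def has_patch(text, required):
    """Return True if `text` lists `required` at that revision or newer."""
    match = PATCH_RE.fullmatch(required)
    if match is None:
        raise ValueError("expected a patch ID such as 101945-41")
    base, rev = match.group(1), int(match.group(2))
    return patch_revisions(text).get(base, -1) >= rev


# Hypothetical sample input, not taken from the crashing machine.
sample = "Patch: 101945-34\nPatch: 999999-01\n"
needs_patching = not has_patch(sample, "101945-41")
```

With the hypothetical sample above, `needs_patching` is true because
only revision 34 of 101945 is listed.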

RESPONSES

> From steve_turgeon@ppc-191.putnaminv.com Wed Jul 31 16:46:09 1996
>
> First apply all of the recommended patches, especially the 101945-41 kernel
> patch.
> Run /usr/kvm/prtdiag -v and see if there are any errors.
________________________________________________________________

> I saw our Sparc 1000 crash with a Bad Trap Data Fault before. What was
> happening was a program was chewing up all the free ram until the machine
> panic'd. The message >>Xremote: Data fault<< makes me think that it
> is a remote xterm or x-emulation software that is creating the panic.
>
> Don Catey Internet: catey@wren.geg.mot.com
> Systems Administrator Compuserve: 103533.2772@compuserve.com

________________________________________________________________

> From twhite@bear.com Wed Jul 31 11:28:59 1996
>
> 1150668 bug Data fault panic in strpoll()
________________________________________________________________

> From vpopa@dss.mc.xerox.com Wed Jul 31 11:56:42 1996
> I had the same problem a year or so ago with a ss2000. The problem
> was with a bad CPU. ( I had 4 of them twice !!!)
________________________________________________________________

> From Maryanne_Baker@natwest.com.au Wed Jul 31 19:50:53 1996
     If you really want to get down and dirty there is a book called Panic!
     published by O'Reilly which is very useful for deciphering messages
     like the ones below.
________________________________________________________________

> From root@gmsn0008.gmsn.uk.eds.com Mon Aug 5 05:30:51 1996

>
> I had a very similar problem recently,
> with almost identical error messages -
>
> The friendly neighbourhood Sun hardware engineer replaced almost
> every single part of the machine, until
> we traced it down to a bad disklabel on one of the
> disks. Every time the machine tried to probe the hardware
> it would cause the crash.
>
> Adrian Singh -- Freelance System Admin
> ________________________________________________________________

PROBLEM STATEMENT

> A SPARC 1000 has crashed twice in 1 week. This is very unusual.
> Should I call field service to replace parts? Which part is most likely?
> What is the most likely cause? Or where should we look first?
>
> We also use NCD PC-Xremote and 'Xremote' shows up in the log.
> Has anyone had any experience with this software causing problems?
> (I doubt it, we have been using PC-Xremote for 2+ years.)
>
> /var/adm/messages from one fault and dmesg from the other fault follow.
>
> Thanks
>
> greg
>
> gmp@adc.com
>
> dmesg output
>
> dump on /dev/dsk/c1t11d0s1 size 262568K
> BAD TRAP: cpu_id=0 type=9 <Data fault> addr=18 rw=1 rp=e1da8c74
> MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
> MMU sfsr=0x326<FAV>
> Xremote: Data fault
> kernel read fault at addr=0x18, pte=0x3
> MMU sfsr=0x326: ft=<Invalid address error> at=<supv data load> level=3
> MMU sfsr=0x326<FAV>>
> This fault shows cpu_id=3. The previous crash was cpu_id=0.
>
>
> /var/adm/messages, July 31
>
> Jul 31 10:21:38 nelogix unix: BAD TRAP: cpu_id=3 type=9 <Data fault> addr=18
> rw=1 rp=e1eafc74
> Jul 31 10:21:38 nelogix unix: MMU sfsr=0x326: ft=<Invalid address error>
> at=<supv data load> level=3
> Jul 31 10:21:38 nelogix unix: MMU sfsr=0x326<FAV>
> Jul 31 10:21:38 nelogix unix: Xremote: Data fault
> Jul 31 10:21:38 nelogix unix: kernel read fault at addr=0x18, pte=0x3
> Jul 31 10:21:38 nelogix unix: MMU sfsr=0x326: ft=<Invalid address error>
> at=<supv data load> level=3
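
For anyone comparing panics like the two above, the sketch below pulls
the CPU and fault address out of a BAD TRAP line and splits the MMU
sfsr value into its fields. It is a minimal illustration, not a
diagnostic tool: the function names are hypothetical, and the bit
layout is an assumption based on the SPARC V8 Reference MMU fault
status register (L in bits 9:8, AT in bits 7:5, FT in bits 4:2, FAV in
bit 1, OW in bit 0). That layout is consistent with the text the
kernel printed next to sfsr=0x326 in the logs.

```python
import re

# Fault type (FT) and access type (AT) names, assuming the SPARC V8
# Reference MMU encoding; only the values seen in the log are confirmed.
FAULT_TYPES = {
    0: "none", 1: "invalid address error", 2: "protection error",
    3: "privilege violation error", 4: "translation error",
    5: "access bus error", 6: "internal error", 7: "reserved",
}
ACCESS_TYPES = {
    0: "load from user data", 1: "load from supervisor data",
    2: "load/execute from user instruction",
    3: "load/execute from supervisor instruction",
    4: "store to user data", 5: "store to supervisor data",
    6: "store to user instruction", 7: "store to supervisor instruction",
}

TRAP_RE = re.compile(
    r"BAD TRAP: cpu_id=(\d+) type=(\d+) <([^>]*)> addr=([0-9a-fA-F]+)"
)


def decode_sfsr(value):
    """Split a synchronous fault status value into its assumed fields."""
    ft = (value >> 2) & 0x7
    at = (value >> 5) & 0x7
    return {
        "level": (value >> 8) & 0x3,   # page table level of the fault
        "at": ACCESS_TYPES[at],
        "ft": FAULT_TYPES[ft],
        "fav": bool(value & 0x2),      # fault address valid
        "ow": bool(value & 0x1),       # overwrite
    }


def parse_bad_trap(line):
    """Return cpu_id, trap type, description and address, or None."""
    match = TRAP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu_id": int(match.group(1)),
        "type": int(match.group(2)),
        "desc": match.group(3),
        "addr": int(match.group(4), 16),  # the log prints addr in hex
    }
```

Feeding each crash's BAD TRAP line to `parse_bad_trap` makes it easy to
see that the two panics hit different CPUs (0 and 3) at the same low
address (0x18), which points away from a single bad CPU module.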


