Summary: E450 experiencing panics

From: George DONE (george@romsys.ro)
Date: Thu May 18 2000 - 04:13:30 CDT


Hi Sun Managers!
Here is the summary for the above problem:

The original question:
>Dear Friends,
>I have an E450 server with 4 x 400MHz/4MB cache CPUs installed. The server
>also has 1GB RAM (2 x X7005), 4 x 9.1 GB/10000 rpm drives, and one external
>A1000 with several Dual Differential SCSI controllers.
>It runs Solaris 2.5.1 with 103640-32, plus Volume Manager 3.0.2, Veritas
>File System 3.2.6 and Raid Manager 6.1.1 Update 2. The system is used
>as a server for Catia 5 and Euclid binaries.
>This system has been panicking for some time (several times per week).
>I would like a suggestion for solving this problem. Some colleagues of mine
>think we have a faulty CPU, but as you can see from the messages below,
>different CPUs are involved each time.
>In my opinion, it could be the motherboard, the memory (more probable) or
>Solaris 2.5.1 (even though the latest patches are installed).
>Please let me know what you think is causing the problem.
>Without advice, all I can do is enter an endless test
>procedure, replacing CPUs one by one, then memory, etc.

The solution:
We noticed that most panics occurred when the hme2 interface was plumbed
up.
So we removed the hme2 PCI network interface, and since then the server
has not crashed again.
Strangely enough, the server was not reporting any network errors, and that
interface had been used to transfer large files without problems.

Many thanks to:

Birger Wathne

First of all, enable savecore in /etc/init.d/sysetup.
This will allow the use of tools such as iscda after the next crash.

Second, check whether powerd is running. If it is, make the necessary config
changes to make sure it does not run. I have seen odd crashes with powerd
running on a 450. You don't want power management on a server anyway.

 
           Michael Faucher

I have a system with almost exactly the same configuration, which
experienced very similar problems about one year ago. Sun replaced every
component on this system, including the motherboard. What it came down to
was a CPU, but Sun needed to replace both of them at the same time, as well
as the DC/DC converters. That is critical. One thing that helped me in the
diagnosis was that I could take one CPU off-line and the system would not
crash. I know it sounds fluky that I could take either CPU off-line and it
would run, but that is what worked for me.

                Scott Clark

You should enable savecore and send the core files to Sun for cause
analysis. This should be covered by most of the support contracts.
Your machine has a pretty complex set of drivers and your panics
could be caused by software incompatibilities.

                -Rob Rheault

I believe it has something to do with the model of 400MHz CPUs you're
using. Sun has identified some that are known to have problems. You
should probably check with them.

                Mark Deady

I've just experienced a similar problem with a 1x400MHz E250. Our machine
would panic at bootup with a message similar to the one you experienced:
unix: BAD TRAP: cpu=1 type=0x31 rp=0x3047d578....
This occurred whether we booted from disk, CD-ROM or net. Occasionally it
would boot OK, and then it would run fine until the next reboot. We had the
CPU replaced, and this appeared to fix it for a while. It started happening
again, and more frequently, so we had the main board replaced. It now
appears to be working OK. We had Sun study the diagnostics, but they just
decided to replace the CPU, followed by the motherboard.
Hope that helps.

 
        "Kulp, Scott (Scott)** CTR **"

Privileged UE errors, in my experience, have always been memory. Reseat and
rearrange the memory so a new bank is in the 0 position and see if it
exposes the error.

Sometimes non-Sun and Sun memory together will cause UE problems.

        
        nasser@who.net

I faced a similar problem. Mine is a one-CPU system with
1GB RAM, Solaris 2.6, Veritas 3.0.2. I emailed
sun-managers but got no very specific response. The messages I
get at the panic sound like a CPU/cache problem, but I'm
not sure. The last thing that happened (I do not
know if this is related or not) was that the power
supply burned out and the system went down for good.
I'm not saying this will happen to you, but keep
in mind that you may experience
this...

I called Sun and we're in the process of changing power
supplies right now, so I have no newer info about the
CPU card.

Anything specific you'd like to know about my case?
Please keep me informed of your progress, maybe we
both have the same problem...

       Thomas Carter

We had a 450 with the same panic error that you have:

       panic[cpu0]/thread=0x30023ec0: CPU0 Priv. UE Error <misc numbers snipped>

The box would panic and reboot every 4-6 hours. That was 5 months ago, and
the machine has been up ever since. Call Sun support and see if this is the
case on your box.

 
           Brian Scanlan:

type=0x31 means Data access MMU miss. Faulty CPU?

> May 9 03:25:29 arges unix: panic[cpu0]/thread=0x30023ec0: CPU0 Priv. UE
> Error: AFSR 0x00000000 80200000 AFAR 0x00000000 0dd89f08 SIMM 190x

This looks like a faulty 400MHz CPU. Log a Sun support call.

Sun have had more than their fair share of problems with their 400MHz+
CPUs.
The problems have been sorted, but there are still broken chips out there.
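
When comparing several such panics, it can help to pull the AFSR and AFAR
values out of the message as numbers. The Python sketch below is only an
illustration: the regular expression is an assumption based on the single
"Priv. UE Error" line quoted in this thread (with the wrapped line joined),
and the two AFSR bit positions (31 = PRIV, 21 = UE) are assumed from the
UltraSPARC-I/II register layout and should be checked against the processor
manual. The names parse_ue_panic and AFSR_BITS are hypothetical.

```python
import re

# Assumption: message layout taken from the one sample line in this thread.
# AFSR and AFAR are each printed as two 32-bit hex words (high, then low).
UE_RE = re.compile(
    r"panic\[cpu(?P<cpu>\d+)\]/thread=0x[0-9a-fA-F]+:.*?"
    r"AFSR\s+0x(?P<afsr_hi>[0-9a-fA-F]{8})\s+(?P<afsr_lo>[0-9a-fA-F]{8})\s+"
    r"AFAR\s+0x(?P<afar_hi>[0-9a-fA-F]{8})\s+(?P<afar_lo>[0-9a-fA-F]{8})"
)

# Assumption: bit positions as in the UltraSPARC-I/II AFSR; verify before use.
AFSR_BITS = {31: "PRIV", 21: "UE"}


def parse_ue_panic(message):
    """Return cpu, 64-bit AFSR/AFAR and decoded flags, or None if no match."""
    m = UE_RE.search(message)
    if m is None:
        return None
    # Join the high and low words into one 64-bit value.
    afsr = int(m.group("afsr_hi") + m.group("afsr_lo"), 16)
    afar = int(m.group("afar_hi") + m.group("afar_lo"), 16)
    flags = [name for bit, name in sorted(AFSR_BITS.items(), reverse=True)
             if (afsr >> bit) & 1]
    return {"cpu": int(m.group("cpu")), "afsr": afsr, "afar": afar,
            "flags": flags}


line = ("May 9 03:25:29 arges unix: panic[cpu0]/thread=0x30023ec0: "
        "CPU0 Priv. UE Error: AFSR 0x00000000 80200000 "
        "AFAR 0x00000000 0dd89f08 SIMM 190x")
info = parse_ue_panic(line)
print(info["cpu"], hex(info["afsr"]), hex(info["afar"]), info["flags"])
```

If the same AFAR (or nearby addresses) shows up across panics on different
CPUs, that would point more towards memory than towards one particular CPU.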

        Hooman Abrishami

I suspect you have one or more bad CPUs. I think I heard there is a
problem with a series of 400MHz/4MB CPUs. Call your Sun support.

        Fernando Nantes de Souza

Start with the memory. We had the same problem and after a long and
painful process where we replaced everything, including the motherboard
and CPUs, the problem finally disappeared when the memory was
replaced.

 
        "Balfour, Scott (Eurosoft)"

Looks like a memory problem. Sometimes it shows up as a CPU
error and moves around because all CPUs access the same memory.
Sun should be able to tell you which SIMM it is using the AFSR and AFAR.
>>May 9 03:25:29 arges unix: panic[cpu0]/thread=0x30023ec0: CPU0 Priv. UE
>>Error: AFSR 0x00000000 80200000 AFAR 0x00000000 0dd89f08 SIMM 190x

This error could be cpu1 getting bad data from RAM:
>>May 8 09:18:30 arges unix: BAD TRAP: cpu=1 type=0x31 rp=0x3047d578
>>addr=0x216c5f0c mmu_fsr=0x0
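
To check whether the errors really move between CPUs, the panic messages can
be tallied per CPU. The Python sketch below is only an illustration: the
regular expressions are assumptions based on the two log lines quoted in this
thread (with wrapped lines joined), the function name tally_by_cpu is
hypothetical, and the trap-name table contains only the one type named
earlier in this summary (0x31, data access MMU miss). It counts one event per
matching line, so if a BAD TRAP line is followed by its own panic[cpuN] line
in a real log, that crash would be counted twice.

```python
import re
from collections import Counter

# Assumption: patterns derived from the two sample messages in this thread.
PANIC_RE = re.compile(r"panic\[cpu(\d+)\]")
TRAP_RE = re.compile(r"BAD TRAP: cpu=(\d+) type=(0x[0-9a-fA-F]+)")

# Only the trap type identified in this thread; extend from the SPARC manual.
TRAP_NAMES = {0x31: "data access MMU miss"}


def tally_by_cpu(messages):
    """Count panic/trap messages per CPU and list decoded trap types."""
    counts = Counter()
    traps = []
    for msg in messages:
        m = TRAP_RE.search(msg)
        if m:
            cpu = int(m.group(1))
            trap_type = int(m.group(2), 16)
            counts[cpu] += 1
            traps.append((cpu, TRAP_NAMES.get(trap_type, "unknown")))
            continue
        m = PANIC_RE.search(msg)
        if m:
            counts[int(m.group(1))] += 1
    return counts, traps


log = [
    "May 8 09:18:30 arges unix: BAD TRAP: cpu=1 type=0x31 rp=0x3047d578 "
    "addr=0x216c5f0c mmu_fsr=0x0",
    "May 9 03:25:29 arges unix: panic[cpu0]/thread=0x30023ec0: "
    "CPU0 Priv. UE Error: AFSR 0x00000000 80200000 "
    "AFAR 0x00000000 0dd89f08 SIMM 190x",
]
counts, traps = tally_by_cpu(log)
print(dict(counts))
print(traps)
```

Errors spread fairly evenly over all CPUs would fit the shared-memory
explanation above; errors concentrated on one CPU would not.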

 
        Nathan Dietsch

Do you not have a Sun contract? The field engineers usually figure this
out quick smart.
Have you got the crash dumps? If you do, send them to Sun. If not, there
is a book on the market entitled Panic! which will teach crash dump
analysis.
I will be buying it soon.

Regards,
George Done


