SUMMARY: system panic and mpstat

From: Schultz, Juergen (Juergen.Schultz@m.dasa.de)
Date: Wed Jun 30 1999 - 10:44:00 CDT


Dear sun-managers,

My original question was:

I have an Ultra-1 SPARC machine with a 200 MHz CPU (properly patched) which
panicked with an Ecache Data Parity Error. I found some entries in the
sun-managers archive saying that some CPUs had a problem with the
cache, but I do not know whether 270-2702-04 Rev.: 01 (printed
on the bottom of the processor card) is part of the defective series. Can
anyone tell me, please?

What does the number 270-2702-04 Rev.: 01 mean, field by field? Is there
any special meaning in these numbers at all?

Another question is about the output of mpstat:
My server is an Ultra-2 with 2 x 300 MHz CPUs (also properly patched). In the
last few days we had some panics with "Copyout Data Parity Error", and mpstat
reports up to 160 minor and 1600 (!!) major faults on the CPU. What do
minor and major faults mean? Do we have a hardware failure?

Thank you for all replies!

Answers, in no particular order, came from:
Ray Delany
Kevin
Robert Hill
Earl Locken
Thomas Anders

The answer (from Earl Locken) was:

>Dear sun-managers,
>
>I have an Ultra-1 SPARC machine with a 200 MHz CPU (properly patched) which
>panicked with an Ecache Data Parity Error. I found some entries in the
>sun-managers archive saying that some CPUs had a problem with the
>cache, but I do not know whether 270-2702-04 Rev.: 01 (printed
>on the bottom of the processor card) is part of the defective series. Can
>anyone tell me, please?

     The CPU is bad. It may not be a manufacturing defect; it could
just be an electrical component on the module starting to fail.
...

>Another question is about the output of mpstat:
>My server is an Ultra-2 with 2 x 300 MHz CPUs (also properly patched). In the
>last few days we had some panics with "Copyout Data Parity Error", and mpstat
>reports up to 160 minor and 1600 (!!) major faults on the CPU. What do
>minor and major faults mean? Do we have a hardware failure?

    A copyout error occurs when one CPU reports that another did not
respond within a timeout. The destination CPU, not the one reporting,
is the one that is defective. Odds are the ecache errors and the copyout
errors show the same defective CPU.

     Major faults and minor faults are paging statistics. A major fault
means the OS had to go all the way to disk to get the page. A minor fault
means the OS found the page still cached in RAM even though the page was
no longer referenced. Minor faults are a significant performance win
compared with major faults, since no disk I/O is needed.
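
     For illustration, the same two kinds of fault can be watched for a
single process rather than per CPU. The sketch below is only an
assumption-laden example, not something from the reply above: it uses
Python's standard resource module on a Unix-like system, and the helper
name fault_counts is made up here. It reads ru_minflt and ru_majflt before
and after touching some freshly allocated memory. The counts it prints
vary from system to system, and these per-process counters are not the
per-CPU minf/mjf columns that mpstat reports.

```python
import resource


def fault_counts():
    """Return (minor, major) page-fault counts for the current process.

    Hypothetical helper for illustration; it wraps getrusage(), which
    reports ru_minflt (faults served without I/O) and ru_majflt
    (faults that required I/O).
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt, usage.ru_majflt


if __name__ == "__main__":
    before = fault_counts()

    # Allocate a buffer and write one byte per 4 KB step so that the
    # kernel has to back the touched pages with real memory.
    buf = bytearray(16 * 1024 * 1024)
    for offset in range(0, len(buf), 4096):
        buf[offset] = 1

    after = fault_counts()
    print("minor faults during test:", after[0] - before[0])
    print("major faults during test:", after[1] - before[1])
```

     A steadily climbing major-fault count for a process means it keeps
waiting on disk for its pages; a climbing minor-fault count alone is
usually harmless.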

Many thanks again; that solved the problem.
Thanks also, by this route, to Mr. Locken.

Juergen Schultz
Juergen.Schultz@m.dasa.de


