Summary: E4500 Reboots on fatal error

From: Mohamed Lrhazi <mohamed_at_fluidsoft.com>
Date: Fri Oct 11 2002 - 10:48:34 EDT
Hello all,

You won't believe this, but in addition to several suggestions via email
on how to go about diagnosing this issue, we also received a phone call
from the people we purchased the server from: they are sending us a new
system board! These people are serious, aren't they? Oddly enough, I do
not have the name of the company, so I cannot mention them.

Anyway, here are the suggestions. I will go through them after I get
the new system board. I also installed SunVTS 5.0 and will have it check
the whole thing.

Also, prtdiag -v gives this unequivocal report:

Failed Field Replaceable Units (FRU) in System:
==============================================
SUNW,UltraSPARC-II unavailable on CPU Board #0
        PROM fault string: fail
        Failed Field Replaceable Unit is UltraSPARC module Board 0 Module 1


Thank you all,
Mohamed~


On Thu, 2002-10-10 at 19:07, Tony Walsh <Tony.Walsh@Sun.COM> wrote:
> 
> The "(Score 05)" part of this particular message indicates that CPU1 has a
> 5% chance of being the cause of this Ecache error, so in this context CPU1
> is NOT a target for replacement. At some point earlier in this stream of 
> messages you should see a "(Score 95)" indicating a particular CPU has a 
> 95% chance of being faulty. If you find this "Score 95" then you should 
> change that CPU out, but if you don't see it, you may then have a memory 
> issue or some other condition to indicate what your original fault may be.
> 
> You will need to find this "Score 95" message to be more sure.
> 
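
A minimal sketch of the search Tony describes, assuming the messages were
logged to /var/adm/messages (the usual Solaris syslog file) and that the
score always appears in the form "(Score NN)" as quoted above. The function
name find_high_scores and the AWK variable are made up for illustration. The
stock awk on Solaris of that era may lack match(), so nawk is assumed to be
the safer choice there.

```shell
# Print every log line carrying a "(Score NN)" tag whose value is at
# least the given threshold (default 95).
# Usage: find_high_scores LOGFILE [THRESHOLD]
# AWK is a hypothetical override for the awk binary (e.g. AWK=nawk).
find_high_scores() {
    logfile=$1
    threshold=${2:-95}
    "${AWK:-awk}" -v min="$threshold" '
        match($0, /\(Score [0-9]+\)/) {
            # RSTART/RLENGTH delimit "(Score NN)"; skip the 7-character
            # "(Score " prefix and drop the trailing ")".
            score = substr($0, RSTART + 7, RLENGTH - 8)
            if (score + 0 >= min + 0) print
        }
    ' "$logfile"
}
```

For example, AWK=nawk find_high_scores /var/adm/messages would list only the
lines scored 95 or higher; no output means no such message is in that file.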

On Thu, 2002-10-10 at 13:03, kboykin  <kboykin@coserv.net> wrote:
... 
> You might need to limit the ecache to 4MB (if they are 8MB) as a
> workaround to an ecache scrubbing problem.
> 
> I don't see a CPU panic in there...but it's possible that CPU1 is bad.
> You can disable a CPU from the OS:
> 
> psrinfo to see the status
> psradm -f <id>  (the id of the CPU you want to take offline, e.g. 1)
> psradm -n <id>  (the id of the CPU you want to bring online)
> 
> And you can always try to reseat the CPUs; sometimes there are contact
> problems with 4500 CPUs.
> 
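
The sequence above can be sketched as a small wrapper. This is only an
illustration: offline_cpu, run_or_echo and the DRY_RUN switch are made-up
names, and the real psrinfo/psradm calls need a Solaris host and root
privileges. With DRY_RUN=1 the commands are printed instead of executed.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run_or_echo() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take one CPU offline and show processor status before and after.
# Usage: offline_cpu CPU_ID
offline_cpu() {
    cpu_id=$1
    run_or_echo psrinfo                 # status before the change
    run_or_echo psradm -f "$cpu_id"     # take the CPU offline
    run_or_echo psrinfo                 # check how the CPU is now reported
}
```

Bringing the processor back is the same call with psradm -n in place of
psradm -f.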


On Thu, 2002-10-10 at 12:42, mike.salehi@kodak.com wrote:
> 
> It could be the fan...
> Anyway, if you do not or cannot fix it, you have to get
> that board out of there; you could transfer all memory to the
> other board.


On Thu, 2002-10-10 at 12:25, Tim Chipman <chipman@ecopiabio.com> wrote:
> Based on this line,
> 
> Oct 10 03:39:51 ganymede     E$tag 0x00000000.0e402006 E$State: Shared E$parity 0x07
> 
> it suggests that you may have an E-cache error on one of your CPUs. This
> is a pretty common problem with E3500 (8MB cache) UltraSPARC-II CPUs.



On Thu, 2002-10-10 at 20:53, Hichael Morton <mh1272@yahoo.com> wrote:
...

> the first thing to do is retorque all the CPUs.  (the user/service
> manual and the system engineer handbook will have information on
> this.  it requires specific torque settings and a torque wrench.)
> 
> if re-torquing doesn't help, you can try swapping the boards to see
> if the error message follows the CPU.
> 
> while you have the server "open", make sure the memory modules are
> configured properly.  (the above manuals/documentation will have this
> information also.)
> 
> if you are in the Knoxville, TN area, let me know.
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers
Received on Fri Oct 11 10:47:10 2002
