SUMMARY: Non-memory-related Correctable ECC error

From: Deb <deb_at_tickleme.llnl.gov>
Date: Fri Jul 19 2002 - 14:33:19 EDT
The original post:

In the overnight logs on one of our E250's running Sol 5.7, this error
was logged:

unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction
unix: AFSR=40380000.1f800000 AFAR=00000000.66b98d00, double word offset =0, Memory Module U0804 id 31.
unix: syndrome bits 38
unix: Non-memory-related Correctable ECC Error.
unix: ECC Data Bit 15 was corrected

What does this error mean, and does this mean that MM U0804 ought to be replaced?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There were 2 answers, both along the same lines as I was thinking.  But
to be more complete, I did some research and found that these warnings
indicate that the system was able to "fix" a flipped bit.  I believe
that only a single-bit error can be "corrected" in this way; an error
affecting more than one bit cannot.
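
To illustrate the single-bit limit, here is a minimal sketch of a
single-error-correcting, double-error-detecting code.  It is a toy
extended Hamming code on 4 data bits, chosen only because it is small;
it is NOT the code the E250 hardware actually uses, and the function
names encode() and decode() are made up for this example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0: overall parity, 1..7: Hamming positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # check bit covering positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]   # check bit covering positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]   # check bit covering positions 4,5,6,7
    code[0] = sum(code[1:]) % 2             # overall parity over the other 7 bits
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions holding a 1 bit
    parity_bad = sum(code) % 2 == 1     # True when an odd number of bits flipped

    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # One bit flipped: the syndrome is its position (0 = the parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with good overall parity: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


if __name__ == "__main__":
    word = encode(0b1011)
    one_flip = list(word)
    one_flip[5] ^= 1                    # single-bit error
    two_flips = list(one_flip)
    two_flips[6] ^= 1                   # second error in the same word
    print(decode(one_flip))
    print(decode(two_flips))
```

In this sketch a single flipped bit is located by the syndrome and
repaired, while two flipped bits are only detected.  That matches the
behaviour described above: the warning in the log is the "corrected"
case.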

I look at this error as an indicator that the U0804 module is suspect;
if the error starts showing up again soon, we will replace the
module.  (Sun may suggest removing and replacing the entire bank, though.)
Errors like this have also been known to indicate a CPU problem, but I
have to research this more.  It sounds intriguing.
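
To watch for the error coming back, something like the following could
tally how often each module is named in the logged warnings.  This is
only a sketch: it assumes the messages keep the "Memory Module <name>
id <n>" wording shown in the original post, and count_modules() is a
hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Matches the module name and id as they appear in the warning quoted above.
MODULE_RE = re.compile(r"Memory Module (\S+) id (\d+)")


def count_modules(lines):
    """Count how many log lines name each memory module."""
    counts = Counter()
    for line in lines:
        match = MODULE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input: the lines from the original post.
sample = [
    "unix: WARNING: correctable error from pci0 (upa mid 1f) during dvma read transaction",
    "unix: AFSR=40380000.1f800000 AFAR=00000000.66b98d00, double word offset =0, "
    "Memory Module U0804 id 31.",
    "unix: syndrome bits 38",
    "unix: Non-memory-related Correctable ECC Error.",
    "unix: ECC Data Bit 15 was corrected",
]

if __name__ == "__main__":
    for module, hits in count_modules(sample).most_common():
        print(module, hits)
```

In practice the lines would come from the system log (on Solaris that is
normally /var/adm/messages) rather than from a hard-coded list.  A count
that keeps climbing for the same module would be the cue to replace it.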


Many thanks to my two respondents who had this to say:

Kevin Buterbaugh -

"I wouldn't schedule maintenance to replace that SIMM since the error
was detected and corrected.  However, if you did already have maintenance
scheduled, I would go ahead and replace it.  It might be failing and it
definitely needs to be monitored closely.  HTH..."

Hichael Morton -

"1. This is a "WARNING" not an error!

2. Your memory system worked the way it was designed:
   "unix: ECC Data Bit 15 was corrected"
   Your memory system corrected the problem.
   This is why ECC memory is installed

3. If there were an ERROR or failure,
   "Memory Module U0804" only indicates the bank that
   is reporting the problem and does not necessarily
   indicate the offending memory module.
   When we replace memory, we replace the entire Bank!
   This is what Sun teaches in their hardware classes."
Received on Fri Jul 19 14:38:24 2002
