SUMMARY: ultra60 reboots with memory errors

From: Surendar Dinkar <surender_at_ikos.com>
Date: Wed Jan 16 2002 - 07:04:19 EST
Hi,

My problem is still not solved. I got a few replies, which suggest that it
is purely a hardware problem. I will call Sun the next time my machine
reboots (it hasn't rebooted since I posted this message). I am sending this
summary to share the suggestions I received, and I will post another summary
when my problem is completely solved.

Thanks to all those who replied.

I got replies from:
roman.pestka@CommerzbankIB.com,
pmora@cgob.junta-andalucia.es,
joe.fletcher@metapack.com,
willief@base-2.com

Roman suggested that "This is a reboot initiated by the improved kernel
Ecache error handling, it is most likely not a memory issue but a CPU
issue." and "this is a hardware issue, nothing to do with a PCI card."

pmora had faced similar problems with a 450 and had his CPU replaced three
times before the problem was solved, but he doesn't believe that all of the
previously replaced CPUs were bad.

Joe Fletcher told me to install the latest GLM patches if a Symbios Logic
differential SCSI card was installed in my machine. I don't have any such
card in my system.

willief suggested trying a few things, such as checking that the memory
modules are seated perfectly in their sockets. He doubted that the
motherboard could be faulty. He also believed that a PCI card could not cause
this problem, although he later suggested that I give it a try (replacing or
taking out the PCI card). He writes: "The last thing I believe you
mentioned was a PCI card. You never know. Although the PCI bus is isolated
from memory via the PCI interface chip on the motherboard. Its worth a
try."

Thanks
Surender


My original posting

> Hello Managers,
>
> I have an ultra60 machine that keeps rebooting with memory errors.
> I called Sun; they tried to troubleshoot the issue and eventually
> changed everything, including the system board, CPU, memory, and even
> the power supply! The OS was also reloaded. But the problem persisted
> even after that. Has any of you ever faced such a problem?
> As all my hardware has been changed, I don't want to believe this could
> be a hardware issue. Is there any patch available for this problem? I
> also have a suspicion about a PCI card that is installed in this
> system: could the PCI card be the culprit? I can't think beyond patches
> and that little PCI card. Please help me. FYI, this error occurs
> randomly, with just about any process, and the frequency of the error is
> not fixed either. This time it happened after a month but repeated
> within an hour.
>
> Please help me
> Will summarize
>
> Thanks
> Surender
>
> Errors shown in /var/adm/messages:
> _______________________________________________________________________
>
> Jan  4 11:19:40 jughead unix: WARNING: [AFT1] Uncorrectable Memory Error on CPU0 Data access at TL=0, errID 0x00007d45.c379075d
> Jan  4 11:19:40 jughead     AFSR 0x00000000.00300000<UE,CE> AFAR 0x00000000.091dec88
> Jan  4 11:19:40 jughead     AFSR.PSYND 0x0000(Score 05) AFSR.ETS 0x00 Fault_PC 0x1625fc
> Jan  4 11:19:40 jughead     UDBH 0x0164<CE> UDBH.ESYND 0x64 UDBL 0x03ed<UE,CE> UDBL.ESYND 0xed
> Jan  4 11:19:40 jughead     UDBL Syndrome 0xed Memory Module U0701 U0702 U0703 U0704
> Jan  4 11:19:40 jughead unix: [AFT2] errID 0x00007d45.c379075d PA=0x00000000.091dec88
> Jan  4 11:19:40 jughead     E$tag 0x00000000.18c00123 E$State: Exclusive E$parity 0x0c
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x00): 0xd8c2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x08): 0xccc2c4c2.c8c2c402 *Bad* PSYND=0x00ff
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x10): 0xd0c2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x18): 0xccc2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x20): 0xd4c2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x28): 0xccc2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x30): 0xd0c2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x38): 0xccc2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: [AFT2] E$Data (0x38): 0xccc2c4c2.c8c2c4c2
> Jan  4 11:19:40 jughead unix: NOTICE: Scheduling clearing of error on page 0x00000000.091de000
> Jan  4 11:19:40 jughead unix: [AFT3] errID 0x00007d45.c379075d Above Error is in User Mode
> Jan  4 11:19:40 jughead     and is fatal: will reboot
> Jan  4 11:19:40 jughead unix: WARNING: [AFT1] initiating reboot due to above error in pid 12253 (verilog.exe)
> Jan  4 11:19:45 jughead unix: NOTICE: Previously reported error on page 0x00000000.091de000 cleared
> Jan  4 11:19:56 jughead syslogd: going down on signal 15
> Jan  4 11:20:15 jughead unix: automountd not running, retrying
> Jan  4 11:20:24 jughead unix: syncing file systems...
> Jan  4 11:20:24 jughead unix:  done
> _______________________________________________
> sunmanagers mailing list
> sunmanagers@sunmanagers.org
> http://www.sunmanagers.org/mailman/listinfo/sunmanagers
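
Since the error is intermittent, it may help to keep a tally of each event
(errID, CPU, memory modules named, and the process that was running) to hand
to Sun. Below is a minimal Python sketch of how that could be done. The
regular expressions are inferred only from the log excerpt in the original
posting, not from any Sun documentation, and the names summarize_aft and
SAMPLE_LINES are hypothetical. It expects raw lines as they appear in
/var/adm/messages itself, not a quoted or re-wrapped mail copy.

```python
import re

# Patterns inferred from the log excerpt in this post (an assumption,
# not taken from Sun documentation).
AFT1_EVENT = re.compile(
    r"\[AFT1\] (?P<kind>.+? Error) on CPU(?P<cpu>\d+)"
    r".*?errID (?P<errid>0x[0-9a-fA-F]+\.[0-9a-fA-F]+)"
)
MODULES = re.compile(r"Memory Module ((?:U\d+\s*)+)")
REBOOT_PID = re.compile(r"in pid (?P<pid>\d+) \((?P<cmd>[^)]+)\)")


def summarize_aft(lines):
    """Return one dict per [AFT1] error event found in the given log lines."""
    events = []
    current = None
    for line in lines:
        event = AFT1_EVENT.search(line)
        if event:
            # A new error report starts here; later lines add detail to it.
            current = {
                "kind": event.group("kind"),
                "cpu": int(event.group("cpu")),
                "errid": event.group("errid"),
                "modules": [],
                "pid": None,
                "cmd": None,
            }
            events.append(current)
            continue
        if current is None:
            continue  # ignore lines before the first error report
        modules = MODULES.search(line)
        if modules:
            current["modules"] = modules.group(1).split()
        pid = REBOOT_PID.search(line)
        if pid:
            current["pid"] = int(pid.group("pid"))
            current["cmd"] = pid.group("cmd")
    return events


# Three lines copied from the log in the original posting, unwrapped.
SAMPLE_LINES = [
    "Jan  4 11:19:40 jughead unix: WARNING: [AFT1] Uncorrectable Memory Error"
    " on CPU0 Data access at TL=0, errID 0x00007d45.c379075d",
    "Jan  4 11:19:40 jughead     UDBL Syndrome 0xed Memory Module"
    " U0701 U0702 U0703 U0704",
    "Jan  4 11:19:40 jughead unix: WARNING: [AFT1] initiating reboot due to"
    " above error in pid 12253 (verilog.exe)",
]

if __name__ == "__main__":
    for ev in summarize_aft(SAMPLE_LINES):
        print(ev["errid"], ev["kind"], "CPU", ev["cpu"], ev["modules"], ev["cmd"])
```

To run it against the real log, pass an open file instead of the sample
list, for example summarize_aft(open("/var/adm/messages")). If the same
modules or the same CPU number show up across several reboots, that would be
worth mentioning when the service call is logged.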
Received on Wed Jan 16 06:07:21 2002
