SUMMARY: asynchronous memory fault (bad SIMM?)

From: Carlos Canau (canau@dawn.EUnet.pt)
Date: Thu Jul 09 1998 - 10:33:55 CDT


        Many thanks to the following people:

Dave Floyd
Theng LY
Caleb Warner
Val
Damon LaCaille
Don Cheesman
Doug Otto
Jonathan.Loh
Bismark Espinoza

        I have not yet solved the problem or reached a conclusion. I'll
wait for the next panic to check the message and then probably
replace the SIMM. I will look into the bad-disk possibility after
replacing the SIMM.

        best regards,
        </canau

        Here are the answers:

        Pointing to SCSI or disk error:
        """"""""""""""""""""""""""""""

-----------------------------------------------------------------------
        This is a SCSI error: "Cmd dump for Target 2 Lun 0:"

        You may have a disk that is failing, or other SCSI bus problems.
-----------------------------------------------------------------------
Your disk could also be bad.
-----------------------------------------------------------------------
Have you added any new disks? The one at SCSI id=2.
-----------------------------------------------------------------------

        Pointing to a bad SIMM:
        """"""""""""""""""""""

-----------------------------------------------------------------------
The MFAR is the important value. It tells what memory address had the problem.
To map the memory address to a SIMM use the following table:

Slot Memory Address
J0301 02000000 - 03ffffff
J0302 04000000 - 05ffffff
J0303 06000000 - 07ffffff
J0400 08000000 - 09ffffff
J0401 0a000000 - 0bffffff
J0402 0c000000 - 0dffffff
J0403 0e000000 - 0fffffff

In your case, with an MFAR value of c609f00, the SIMM in slot J0402 should be
replaced.
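
The lookup described above can be sketched in a few lines of Python. This is
only an illustration: the table is copied from this reply, and the function
name slot_for_mfar is made up for the example.

```python
# Slot table copied from the reply above (start and end address, inclusive).
# The reply does not list slot J0300, so addresses below 0x02000000 return None.
SLOTS = [
    ("J0301", 0x02000000, 0x03ffffff),
    ("J0302", 0x04000000, 0x05ffffff),
    ("J0303", 0x06000000, 0x07ffffff),
    ("J0400", 0x08000000, 0x09ffffff),
    ("J0401", 0x0a000000, 0x0bffffff),
    ("J0402", 0x0c000000, 0x0dffffff),
    ("J0403", 0x0e000000, 0x0fffffff),
]

def slot_for_mfar(mfar_hex):
    """Return the slot label whose range contains the MFAR, or None."""
    addr = int(mfar_hex, 16)  # the panic message prints MFAR in hex, no 0x prefix
    for name, low, high in SLOTS:
        if low <= addr <= high:
            return name
    return None

print(slot_for_mfar("c609f00"))  # J0402
```

The MFAR from the panic message is passed in exactly as printed; leading
zeros are not needed because the value is parsed as a hexadecimal number.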
-----------------------------------------------------------------------
Best bet, and probably the most accurate, is to run SYMON (bundled with the OS).
I saw errors like yours that ended up being faulty CPUs, not SIMMs.
-----------------------------------------------------------------------
  This is EXACTLY what happened to us on our Sparc 5, actually two of them!
 It started happening to us right around summer time, because the air
conditioners here in the building are shut off at night during the summer,
and they act as a "booster" for the computer room to keep the computers
cool. So during the winter, no problem, but during the summer it
gets awfully hot in there.
  Anyhow, we believe the SIMMs went bad due to temperature. HOWEVER, they
were also in a 3-SIMM configuration for the longest time, and apparently
folks recommend upgrading these in pairs. So we replaced the 96MB
configurations with brand new 128MB configurations (4 x 32MB SIMMs).
Everything is working beautifully. We were getting the exact same errors
you are getting. Try taking memory out of another computer where you know
the memory is good, and try that. Unfortunately your boot parameters
(check-#megs or whatever on bootup) don't catch the bad memory, though I
don't know why.
-----------------------------------------------------------------------
Check out this excerpt from INFODOC 11510 which can
be retrieved from sunsolve.sun.com

[...]
> Aug 9 01:22:10 wsplcp7 unix: panic: asynchronous memory
> fault:
> MFSR=80802820 MFAR=6190710
>
[...]
>
> example of SS5 memory slot layout:
>
> J0403 SIMM7 RAS 7 0e000000 - 0fffffff
> J0402 SIMM6 RAS 6 0c000000 - 0dffffff
> J0401 SIMM5 RAS 5 0a000000 - 0bffffff
> J0400 SIMM4 RAS 4 08000000 - 09ffffff
> J0303 SIMM3 RAS 3 06000000 - 07ffffff ** 0x06190710 falls here
> J0302 SIMM2 RAS 2 04000000 - 05ffffff
> J0301 SIMM1 RAS 1 02000000 - 03ffffff
> J0300 SIMM0 RAS 0 00000000 - 01ffffff
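
Because the layout in the excerpt is regular (each RAS bank spans 0x02000000
bytes, i.e. 32 MB), the bank number can also be computed by integer division
rather than looked up. The Python sketch below assumes every slot holds a
32 MB SIMM addressed as in the excerpt; the names BANK_SIZE, SLOT_BY_RAS and
ras_and_slot are made up for the example.

```python
BANK_SIZE = 0x02000000  # 32 MB per RAS bank, as in the excerpt's layout

# Slot labels in RAS order 0..7, as listed in the excerpt.
SLOT_BY_RAS = ["J0300", "J0301", "J0302", "J0303",
               "J0400", "J0401", "J0402", "J0403"]

def ras_and_slot(mfar_hex):
    """Return (RAS number, slot label) for an MFAR hex string.

    Returns None if the address lies beyond the eight banks.
    """
    ras = int(mfar_hex, 16) // BANK_SIZE
    if ras >= len(SLOT_BY_RAS):
        return None
    return ras, SLOT_BY_RAS[ras]

print(ras_and_slot("6190710"))  # (3, 'J0303')
```

With other SIMM sizes or a different bank layout the division no longer
applies and the table for that machine has to be consulted instead.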
-----------------------------------------------------------------------
There is a Sun white paper on this. The cause is almost always 3rd
party 32 meg SIMMs. We've had decent luck with Kingston and Dataram.
Viking on the other hand seems especially prone to the problem. Sun's
recommended fix (of course) is to buy Sun RAM, which typically costs
twice as much. The problem only appears in 32 meg sticks.
-----------------------------------------------------------------------
Yes, this probably is a bad SIMM. As to how to narrow it down, I do not
know. You might try memconf or some other memory utility. Sun just came out
and replaced all the memory for us.
-----------------------------------------------------------------------

        Original message asking for help:
        """"""""""""""""""""""""""""""""

=======================================================================

        Hi,

        I have two machines giving me random reboots (several days
between each one). One of them is a SPARC 5 - 170 and the other one a
SPARC 5 - 75 of which I attach the dmesg portion of the crash. Both
machines are running Solaris 2.6 with the recommended patches from two or
three months ago.

        Could this be a bad SIMM? If so, how do I identify which one? I
think this line points to it:
asynchronous memory fault: MFSR=81002040 MFAR=c609f00

        Thanks in advance,
        </canau

--------------------------------------------------------------------------
10:01:25 ~/sps-sol2 canau@bertha$ uname -a
SunOS bertha 5.6 Generic_105181-04 sun4m sparc
--------------------------------------------------------------------------
panic: asynchronous memory fault: MFSR=81802040 MFAR=c609f00
syncing file systems...WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
        dma error: current esp state:
esp: State=DATA_DONE Last State=DATA
esp: Latched stat=0x0 intr=0x0 fifo 0x80
esp: last msg out: <unknown msg>; last msg in: IDENTIFY
esp: DMA csr=0xa4240212<EN,INTEN,ERRPEND>
esp: addr=fc01eee6 dmacnt=b600 last=fc01aa00 last_cnt=b600
esp: Cmd dump for Target 2 Lun 0:
esp: cdblen=6, cdb=[ 0xa 0xb 0x88 0xdc 0xf8 0x0 ]
esp: pkt_state=0xf<XFER,CMD,SEL,ARB> pkt_flags=0x4000 pkt_statistics=0x3
esp: cmd_flags=0xc62 cmd_timeout=60
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
        Unrecoverable DMA error on dma
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
        dma error: current esp state:
esp: State=DATA_DONE Last State=DATA
esp: Latched stat=0x0 intr=0x0 fifo 0x80
esp: last msg out: <unknown msg>; last msg in: IDENTIFY
esp: DMA csr=0xa4240212<EN,INTEN,ERRPEND>
esp: addr=fc01eee5 dmacnt=b600 last=fc01aa00 last_cnt=b600
esp: Cmd dump for Target 2 Lun 0:
esp: cdblen=6, cdb=[ 0xa 0xb 0x88 0xdc 0xf8 0x0 ]
esp: pkt_state=0xf<XFER,CMD,SEL,ARB> pkt_flags=0x4000 pkt_statistics=0x3
esp: cmd_flags=0xc62 cmd_timeout=60
WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
        Unrecoverable DMA error on dma
panic: asynchronous memory fault: MFSR=81002040 MFAR=c609f00
 4568 static and sysmap kernel pages
   48 dynamic kernel data pages
  388 kernel-pageable pages
    0 segkmap kernel pages
    0 segvn kernel pages
  496 current user process pages
 5500 total pages (5500 chunks)

dumping to vp f5b1d454, offset 481052
5500 total pages, dump succeeded
SunOS Release 5.6 Version Generic_105181-04 [UNIX(R) System V Release 4.0]
Copyright (c) 1983-1997, Sun Microsystems, Inc.
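
When waiting for the next panic, the MFSR/MFAR pairs can be pulled out of
saved dmesg output with a small script. The Python sketch below assumes the
message appears on one line in the form shown in the log above; the names
PANIC_RE and find_faults are made up for the example.

```python
import re

# Matches the panic line format seen in the log above, e.g.
# "panic: asynchronous memory fault: MFSR=81002040 MFAR=c609f00"
PANIC_RE = re.compile(
    r"asynchronous memory fault:\s*MFSR=([0-9a-fA-F]+)\s+MFAR=([0-9a-fA-F]+)"
)

def find_faults(log_text):
    """Return a list of (MFSR, MFAR) hex strings, in order of appearance."""
    return PANIC_RE.findall(log_text)

# Three lines taken from the log above.
sample = """\
panic: asynchronous memory fault: MFSR=81802040 MFAR=c609f00
esp: Cmd dump for Target 2 Lun 0:
panic: asynchronous memory fault: MFSR=81002040 MFAR=c609f00
"""

for mfsr, mfar in find_faults(sample):
    print(mfsr, mfar)
```

If the same MFAR keeps coming back across panics, that supports the bad-SIMM
reading; widely scattered addresses would point elsewhere.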


