SUMMARY: panic: asynchronous memory fault

From: Michael Hawk (mike@gi.net)
Date: Mon Feb 03 1997 - 12:17:53 CST


My original question:

>
> Hello,
> One of our Sparc 5's rebooted the other day due to a memory fault. The
> log files follow:
>
> Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: MFSR=81804040 MFAR=c8d620
> Jan 28 10:46:42 jayhawk unix: 4164 static and sysmap kernel pages
> Jan 28 10:46:42 jayhawk unix: 108 dynamic kernel data pages
> Jan 28 10:46:42 jayhawk unix: 170 kernel-pageable pages
> Jan 28 10:46:42 jayhawk unix: 2 segkmap kernel pages
> Jan 28 10:46:42 jayhawk unix: 0 segvn kernel pages
> Jan 28 10:46:42 jayhawk unix: 0 current user process pages
> Jan 28 10:46:42 jayhawk unix: 4444 total pages (4444 chunks)
> Jan 28 10:46:42 jayhawk unix: dumping to vp fc1e4d0c, offset 271864
> Jan 28 10:46:42 jayhawk unix: WARNING: /iommu@0,10000000/sbus@0,10001000/espdma@5,8400000/esp@5,8800000 (esp0):
> Jan 28 10:46:42 jayhawk unix: Unrecoverable DMA error on dma
> Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: MFSR=81004040 MFAR=c8d620
>
> My question is...does this indicate bad memory, which should be replaced?
> Or is this just something that happened, and will likely not happen again?

The people kind enough to reply:

RAVKRISH.IN.ORACLE.COM.ofcmail@in.oracle.com
kwong@scis.acast.nova.edu
raju@ecologic.net
css1dw@ee.surrey.ac.uk
peter.allan@aeat.co.uk
sweh@mpn.com
reynolds@acetsw.amat.com

The answers:

1) Many people recommended going to the ok> prompt and running some
   diagnostics, such as:

setenv selftest-#megs 64 (or whatever)
test-memory

I did this, and it tested clean.

2) One person said that it's possibly a symptom of a motherboard problem.

3) Here is a full explanation provided by RAVKRISH.IN.ORACLE.COM.ofcmail@in.oracle.com, who found it on comp.unix.solaris:

>
> I have worked on Sun workstations for about 2 years but
> only encountered a "Level 15 interrupt" a couple of times
> (now, being the second time). [...]
>
> Can anyone give me a "clear" explanation of this...?
 
Sure. The level 15 asynchronous interrupt is caused by
a memory error. I don't know what precise error you have
seen but you may have MFAR and MFSR values displayed in
the error message. These are the Memory Fault Address
Register and the Memory Fault Status Register respectively.
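
As a rough illustration (not part of the original reply), the two
register values can be pulled out of a panic line and compared with a
few lines of Python. The function name parse_fault is hypothetical, and
the sketch assumes the message format shown in the log excerpt of the
original question. It makes no claim about what the individual MFSR
bits mean; it only shows which bits differ between the two panics.

```python
import re

# Pattern for the panic line format seen in the log excerpt of this post.
PANIC_RE = re.compile(
    r"asynchronous memory fault:\s*MFSR=([0-9a-fA-F]+)\s+MFAR=([0-9a-fA-F]+)"
)

def parse_fault(line):
    """Return (mfsr, mfar) as integers, or None if the line does not match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)

# The two panic messages from the log in the original question.
first = parse_fault("panic: asynchronous memory fault: MFSR=81804040 MFAR=c8d620")
second = parse_fault("panic: asynchronous memory fault: MFSR=81004040 MFAR=c8d620")

# XOR leaves a 1 in every bit position where the two status values differ.
changed_bits = first[0] ^ second[0]
print(f"MFSR differs in bits: {changed_bits:#010x}")
print(f"same fault address: {first[1] == second[1]}")
```

For the two lines logged here, the status values differ only in bit
0x00800000 and the fault address is the same in both.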
 
You may well be experiencing this as a fatal error (depending
upon whether your machine is ECC or parity). If it's fatal
(i.e. a panic) and it happens again and again, test the memory
from the ok prompt or use the values from the MFAR and MFSR
 
This explanation is taken from the Sun information document
repository:
 
There are two main causes of asynchronous memory fault panics.
 
 
1) The CPU cache did not flush properly to main memory.
 
The CPU can modify cache rows in its cache, such as cached data
which has been changed by a program. This data must be written
out to main memory at some point if it is to be accessed by other
processors or stored onto disk. The write that takes place is
asynchronous to the part of the CPU that uses the data (the part
which makes calculations, etc). It takes place from an on-chip
write buffer, to which cache rows are queued; writes from this
buffer out to main memory are completed by a different part of
the chip. The "asynchronous memory fault" occurs when the
asynchronous write from the cache to main memory terminates with an
error.
 
(Note that the actual write always takes place from the on-chip
write buffer regardless of whether the MMU is in write-through or
copy-back mode, or uses data that is marked non-cacheable.)
 
The error can be due to any hardware along the path between the
cache itself and the memory, including the CPU module, the
motherboard or the memory. Look elsewhere for more clues as to
what could be causing the problem, to narrow down the bad hardware.
Check the /var/adm/messages files and dmesg output for other kinds
of errors: ECC memory errors, which would indicate memory problems;
other kinds of CPU errors, which would indicate a bad CPU module;
MBus timeout errors, which point to a potentially bad motherboard;
and so on.
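
A minimal sketch of that kind of log triage, in Python, might look like
the following. The keyword list and the classify function are
assumptions made for illustration; the actual wording of ECC or MBus
messages varies between systems and is not given in this post, so the
keywords would need adjusting to match real log output.

```python
# Hypothetical keyword -> suspect mapping, following the hints in the text.
# Real message wording differs between systems; adjust the keywords to match.
HINTS = [
    ("ecc", "memory"),
    ("mbus timeout", "motherboard"),
    ("asynchronous memory fault", "unknown (CPU module, motherboard or memory)"),
]

def classify(lines):
    """Return (suspect, line) for every line containing a known keyword."""
    results = []
    for line in lines:
        lowered = line.lower()
        for keyword, suspect in HINTS:
            if keyword in lowered:
                results.append((suspect, line.rstrip("\n")))
                break  # first matching hint wins
    return results

# Two lines taken from the log excerpt in the original question.
sample = [
    "Jan 28 10:46:42 jayhawk unix: panic: asynchronous memory fault: "
    "MFSR=81804040 MFAR=c8d620",
    "Jan 28 10:46:42 jayhawk unix: 0 segvn kernel pages",
]

for suspect, line in classify(sample):
    print(f"[{suspect}] {line}")
```

To run it against a real log, pass an open file instead of the sample
list, for example classify(open("/var/adm/messages")).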
  
 
2) An external device attempted to read or write a bad memory
   address.
 
This could be a hardware problem where the device was properly set
up but accessed a bad address, or the memory could be bad; or it
could be a software problem, because a device driver did not set up
its device to access the proper part of memory. Such a memory fault
is asynchronous with respect to the CPU, because the device tried to
do DMA to the memory, independent of the CPU.
 
To tell whether the cause is (1) or (2), observe the circumstances in
which the problem occurs. Does this problem happen
consistently while a particular thing is going on? Software problems
tend to be consistent, predictable and replicable; whereas hardware
problems tend to be more random. Things to look for:
 
- Is there a third party device, which, when operated, triggers this
  panic condition?
 
- Are there DMA errors in the /var/adm/messages file to point the way
  to a suspect device?
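
The second check can be sketched in Python as follows. This is only an
illustration: dma_suspects is a hypothetical helper, and it assumes the
layout seen in the log excerpt of the original question, where a
WARNING line naming the device path is followed directly by the DMA
error line.

```python
import re

# Matches e.g. "WARNING: /iommu@0,.../esp@5,8800000 (esp0):"
WARNING_RE = re.compile(r"WARNING:\s+(\S+)\s+\((\w+)\):")

def dma_suspects(lines):
    """Return (device_path, instance) pairs whose WARNING line is
    immediately followed by a line mentioning a DMA error."""
    suspects = []
    pending = None  # device named by the previous line, if it was a WARNING
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            pending = (m.group(1), m.group(2))
        elif pending and "DMA error" in line:
            suspects.append(pending)
            pending = None
        else:
            pending = None
    return suspects

# Lines taken from the log excerpt in the original question.
sample = [
    "Jan 28 10:46:42 jayhawk unix: dumping to vp fc1e4d0c, offset 271864",
    "Jan 28 10:46:42 jayhawk unix: WARNING: /iommu@0,10000000/sbus@0,10001000/"
    "espdma@5,8400000/esp@5,8800000 (esp0):",
    "Jan 28 10:46:42 jayhawk unix: Unrecoverable DMA error on dma",
]

for path, instance in dma_suspects(sample):
    print(f"suspect device: {instance} at {path}")
```

On the log from this post it reports esp0, the on-board SCSI controller
path named in the WARNING line.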

-mike

-----------------------------------------------------------------------------
Michael Hawk
mike@gi.net


