SUMMARY: e3500 reboot after "fatal error FATAL" // CPU address controller issue (??)

From: Tim Chipman <chipman_at_ecopiabio.com>
Date: Thu Oct 09 2003 - 12:08:45 EDT
Many thanks to those who responded (in no particular order): Stephen 
Kives, David Price, Tony Magtalas.



Comments include,
=================

-This is typical behaviour associated with hardware failure on this 
platform. In this specific case, it may suggest the board in question 
will be the source of future failures/downtimes and that replacement may 
be a requirement "soon".

-Consider running SunVTS overnight to see if the problem re-surfaces?

-It also seems that some 3500s have been observed exhibiting problems
with self-diagnosis // sporadic reboots // hardware failure issues.
(Direct quote from one reply follows below on this topic)

In our case, the system has now been up 8 days and running smoothly 
without further errors. Based on replies I got, it may mean that another 
crash looms around the corner -- or -- things may continue to run as 
they are.  For the moment, I'm going to wait and see what happens.
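
For anyone in the same wait-and-see position, here is a minimal Python
sketch of scanning the lines of /var/adm/messages for the two messages
quoted in the original posting below. The function names are made up for
illustration, and the patterns are assumptions based only on those two
log lines; other hardware failures may well log differently.

```python
import re

# Patterns assumed from the two log lines quoted in the original posting.
FATAL_BOOT = re.compile(r"System booting after fatal error")
FAILED_BOARD = re.compile(r"NOTICE: failed (\w+) board in slot (\d+)")


def find_hardware_events(lines):
    """Return the log lines that match either pattern, newline stripped."""
    return [
        line.rstrip("\n")
        for line in lines
        if FATAL_BOOT.search(line) or FAILED_BOARD.search(line)
    ]


def failed_boards(lines):
    """Return (board_type, slot) pairs for every 'failed ... board' notice."""
    found = []
    for line in lines:
        match = FAILED_BOARD.search(line)
        if match:
            found.append((match.group(1), int(match.group(2))))
    return found
```

Run from cron against the open messages file, something like this would
flag a repeat of the failure without anyone having to watch the console.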

Hope this summary is of some use to others.

Tim Chipman





--direct quote from one reply--

We have 2 E3500's in use and while I have not experienced your exact 
problem we have had board problems that were intermittent in nature. The 
machine would crash and there would be very few diagnostic messages to 
point to the real problem.

Machine would boot OK and then run for hours or weeks before it crashed
again.  We replaced component after component with no luck.  Eventually 
the problem became bad enough that the system would fail to reboot and 
the error messages finally pointed to the real problem.

Only then were we able to locate the failed piece of hardware.  In our 
case this was a bad I/O board.

We have also had CPU's fail in the same manner.  The machine would 
reboot OK and would run for weeks before crashing again.

Overall it left me with a poor feeling on the diagnostic ability of 
these machines.

Good luck.

---endquote---
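
Since a box with a failed board can still come back up and run (ours
booted with 2 of 4 CPUs), one way to notice is to compare the CPU table
in prtdiag -v against the CPUs that should be there. What follows is
only a sketch: the row layout is assumed from the partial prtdiag output
pasted in the original posting below, the function names are
hypothetical, and other platforms or prtdiag versions may print the
table differently.

```python
import re

# One CPU row as it appears in the posting:
# Brd, CPU, Module, Run MHz, Ecache MB, CPU Impl., CPU Mask
CPU_ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+(\S+)\s+([\d.]+)\s*$"
)


def cpus_by_board(prtdiag_text):
    """Map board number -> list of CPU ids found in the CPU table."""
    boards = {}
    for line in prtdiag_text.splitlines():
        match = CPU_ROW.match(line)
        if match:
            board, cpu = int(match.group(1)), int(match.group(2))
            boards.setdefault(board, []).append(cpu)
    return boards


def missing_cpus(prtdiag_text, expected):
    """Return the expected CPU ids that do not appear in the output."""
    seen = {cpu for cpus in cpus_by_board(prtdiag_text).values() for cpu in cpus}
    return sorted(set(expected) - seen)
```

With the 2-CPU output from the posting and an expected list of
6, 7, 14, 15, missing_cpus() would report CPUs 14 and 15, i.e. the pair
on board 7.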



===========================================================================
ORIGINAL POSTING FOLLOWS:
===========================================================================
] Hi all. Googled and searched listarchives to no avail (along with
] sunsolve) so I'm pestering folks here.
]
] We've got an e3500 (4x400mhz 2 gigs ram solaris 8 with recommended
] patch-cluster applied this friday past) which spontaneously rebooted
] yesterday morning. Prior to this, the machine hasn't had a surprise
] crash in ages (~>16 months?).
]
] Logged on the console at the time was a comment more-or-less to the
] effect of, "NOTICE: failed cpu board in slot 7"
]
] The system came back up on its own with 2 of 4 CPUs online.
]
] Logged in /var/adm/messages at this time of boot:
]
] unix: [ID 796976 kern.notice] System booting after fatal error FATAL
] ...
] fhc:  [ID 744982 kern.notice] NOTICE: failed cpu board in slot 7
]
] Once booted, examination of prtdiag -v confirmed this (see output
] below, "2-cpu prtdiag -v"). Machine ran "smoothly" all day on 2 CPUs.
]
] Last night, when a bit of downtime was available, I fully powered the
] machine down; popped out the board in question & confirmed CPU & memory
] were all seated well and that nothing was obviously "fishy" in
] appearance; replaced the board and brought it back up.
]
] It came back up with all 4 CPUs running, and no errors logged // nothing
] fishy in prtdiag -v (see below for output, "4-CPU prtdiag -v"). Since
] that time (~16 hours so far) the machine has been running smoothly.
]
] Has anyone else ever seen this kind of behaviour // have any ideas? Not
] exactly a happy-dandy thing to have the machine crash like this, and
] somewhat disturbing that it appears (?) to be a "false positive" for
] detection of a problem.
]
] Any thoughts / comments / etc are certainly greatly appreciated.
]
] Thanks,
]
]
] Tim Chipman
]
]
] -8<----8<--------8<----paste---8<------8<-----8<-----
]
]
] 2-cpu prtdiag -v (partial output):
]
] ========================= CPUs =========================
]
]                      Run   Ecache   CPU    CPU
] Brd  CPU   Module   MHz     MB    Impl.   Mask
] ---  ---  -------  -----  ------  ------  ----
]   3     6     0      400     8.0   US-II    10.0
]   3     7     1      400     8.0   US-II    10.0
]
] ...
]
] Analysis of most recent Fatal Hardware Watchdog:
] ======================================================
] Log Date: Tue Sep 30 09:16:07 2003
]
]
]   Analysis for Board 7
] --------------------
] AC: P_FERR error P_REPLY received from UPA Port
]          The error could be caused by:
]                  CPU
]                  Address Controller
] AC: Illegal P_REPLY received from UPA Port
]          The error could be caused by:
]                  CPU
]                  Address Controller
]
]
] ------end-of-this-bit.
]
] then following hard reboot in evening - all is well ? -
]
] 4-CPU prtdiag -v (partial output):
] ========================= CPUs =========================
]
]                      Run   Ecache   CPU    CPU
] Brd  CPU   Module   MHz     MB    Impl.   Mask
] ---  ---  -------  -----  ------  ------  ----
]   3     6     0      400     8.0   US-II    10.0
]   3     7     1      400     8.0   US-II    10.0
]   7    14     0      400     8.0   US-II    10.0
]   7    15     1      400     8.0   US-II    10.0
] _______________________________________________
] sunmanagers mailing list
] sunmanagers@sunmanagers.org
] http://www.sunmanagers.org/mailman/listinfo/sunmanagers
]
Received on Thu Oct 9 12:08:40 2003
