Summary: Detecting failures of redundant hardware

From: Thomas M. Payerle <payerle_at_phys-mail1.physics.umd.edu>
Date: Tue Sep 25 2001 - 14:25:40 EDT
There have been a number of replies with regard to a question I asked about
detecting failures in mirrored disks, redundant power supplies, and dual 
CPUs on an Enterprise 220R server running Solaris 2.7.

Steve Camp's answer sounded like he knew what he was talking about, and
basically stated that one would need a higher class machine (e.g. E250 or
Ex000/Ex500) to detect a failed power supply.  I guess I should be glad that
they at least moved the status LEDs to the front of the machine.  He also,
probably correctly, pointed out that the CPUs are not really redundant,
and that the machine would probably go down if one failed.

A number of people suggested the Sun Management Center, which is free for
the basic functionality product.  I have not tried this, but it is not clear
that this will detect what I want either.

A number of people also suggested the Big Brother semi-freeware product
(http://www.bb4.com).  I currently use the freeware product Netsaint 
(http://www.netsaint.org) for most of my monitoring, and did not see much if
anything in the Big Brother description to tempt me to switch.  I did examine
the modules which were supposed to check hardware like power supplies in Sun
hardware, but these use the prtdiag command, and both my experience and the
scripts themselves indicate that it would not detect power supply failures
on a 220R.

One or two people also recommended swatch 
(ftp://coast.purdue.edu/pub/tools/unix/swatch) to check for errors related to
such items in the logs or console messages.  However, I am having an odd 
problem in that none of my simulated failures (unplugging a power supply,
offlining a CPU, offlining a submirror) appeared in the logs.  I am not sure
if that is due to the inadequacy of my simulations, or something more 
fundamental (and needless to say I am reluctant to make these simulations
much more realistic).

I have set up a cron job to check the output of metadb and metastat, thereby
covering disk problems (I also have mdlogd on, but so far it hasn't helped,
possibly because no errors are showing up in the logs).  I will likely add
checks of psrinfo (which should indicate a CPU problem if it doesn't crash
the machine), and prtdiag -v (not entirely sure what problems that will
detect, but am pretty sure I want to check it; it does NOT appear to detect
power supply failures, at least in my tests).
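
For anyone wanting to try something similar, here is a rough sketch of the
kind of check such a cron job might do.  It is NOT the actual job described
above, just an illustration in Python.  The function names
(metastat_problems, psrinfo_problems, collect_problems) are made up for this
example, and it assumes that metastat prints indented "State: ..." lines
under each metadevice and that psrinfo prints one line per processor with
the status in the second column.  Check it against the real output on your
own machine before trusting it.

```python
import re
import subprocess

# Assumed output layouts (verify on your own system):
#   metastat:  "d0: Mirror" / "    Submirror 0: d10" / "      State: Okay"
#   psrinfo:   "0       on-line   since 09/25/01 10:00:00"
STATE_RE = re.compile(r"^\s*State:\s*(.+?)\s*$")
SUBMIRROR_RE = re.compile(r"^\s+Submirror \d+:\s*(\S+)")
TOPLEVEL_RE = re.compile(r"^(\S+):\s")


def metastat_problems(text):
    """Return (metadevice, state) pairs whose state is not 'Okay'."""
    problems = []
    current = "unknown"  # most recently named metadevice
    for line in text.splitlines():
        m = STATE_RE.match(line)
        if m:
            if m.group(1) != "Okay":
                problems.append((current, m.group(1)))
            continue
        m = SUBMIRROR_RE.match(line) or TOPLEVEL_RE.match(line)
        if m:
            current = m.group(1)
    return problems


def psrinfo_problems(text):
    """Return (cpu_id, status) pairs for processors that are not on-line."""
    problems = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1] != "on-line":
            problems.append((int(fields[0]), fields[1]))
    return problems


def collect_problems():
    """Run both commands (assumed to be on PATH) and gather any problems."""
    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    return {
        "disks": metastat_problems(run(["metastat"])),
        "cpus": psrinfo_problems(run(["psrinfo"])),
    }
```

A cron-driven wrapper could call collect_problems() and send mail only when
either list is non-empty.  Anything other than "Okay" gets reported, so a
resync in progress would also show up, which seems the safer default.  Note
that this does nothing for the power supply problem.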

For now I will have to hope that someone notices the amber LED in order to
catch power supply failures.  Maybe eventually I can rig a phototransistor
or some such to detect the LED :).

Thanks to all those responding, including Elizabeth Lee, Cristophe Dupre, 
Steve Camp, Gary Losito, Bertand Hutin, Kevin Buterbaugh, and Arthur Aldridge.


Tom Payerle 	
Dept of Physics				payerle@physics.umd.edu
University of Maryland			(301) 405-6973
College Park, MD 20742-4111		Fax: (301) 314-9525
Received on Tue Sep 25 13:23:52 2001
