SUMMARY: Redundant power supplies aren't

From: Skip Hammack <skip_at_hammack.com>
Date: Tue Jun 22 2004 - 12:59:53 EDT
First, thanks to Eric Cortes Trujillo, Casper Dik, Tim Chapman,
Chris Hoogendyk, Dan Green, Ryan Krenzischek and Lar Hecking.

I have my latest SunFire 280R back in service at this time.

Sun replaced both power supplies, stating that both had failed at the same
time.  Guess it's not going to be redundant when there isn't anything to be
redundant with.  The new power supplies are not the same part number; I'm
not sure if that's a Sun thing or just a different batch.

My biggest issues were the lack of error messages or any warning that my
power supplies had failed or were failing, along with a prtdiag that was
very incomplete on the 4 failed servers.  There were also differing packages
for power in /var/sadm/pkg.  I also found out that many of my other 280R
servers had incomplete prtdiags along with differing power packages.
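
One way to spot differing packages is to collect the package directory names
from each server (for example, the output of "ls /var/sadm/pkg") and compare
them.  The following is only a rough sketch of that comparison, not what was
actually run here; the host names and package names in the example are
placeholders, and packages_missing_somewhere is a made-up helper name.

```python
def packages_missing_somewhere(listings):
    """Report packages that are not present on every host.

    listings maps a host name to an iterable of package names, e.g. the
    lines of one "ls /var/sadm/pkg" listing per host.
    Returns {package: [hosts that lack it]}.
    """
    sets = {host: set(names) for host, names in listings.items()}
    all_pkgs = set().union(*sets.values())
    report = {}
    for pkg in sorted(all_pkgs):
        missing = sorted(h for h, s in sets.items() if pkg not in s)
        if missing:
            report[pkg] = missing
    return report


# Placeholder data: hypothetical hosts and package names.
example = {
    "good280r": ["PKGbase", "PKGpower"],
    "bad280r": ["PKGbase"],
}
for pkg, hosts in packages_missing_somewhere(example).items():
    print(pkg, "missing on", ", ".join(hosts))
```

Anything reported is a package that at least one server lacks, which is a
starting point for finding out why the builds diverged.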

All servers were built from a jumpstart and are identical except for
additional disk space layouts.  All servers were patched with the Solaris 8
recommended cluster patches.

Since I hadn't turned this latest failed server back over to the team yet,
I went ahead and rebuilt it using the jumpstart and then applied the cluster
patches again.  My results were mixed, with an incomplete prtdiag and patch
level.  I applied the patch cluster again, looking for failures or
successes, and checked again.  Better, but not identical to the good
servers.  My logs showed success but the patches were incomplete.  Casper
recommended not using the install cluster script due to problems seen with
loading these cluster patches.  I went ahead and did another jumpstart and
a clean build, then applied the patches without the install script.
Voila!  Success.

I then pulled the primary power supply and got loads of errors, but the
server stayed up.  prtdiag showed the failed power supply.  I then put the
power supply back in; the maintenance light went out and prtdiag showed
clean, no failures.  I then pulled the second power supply and the server
shut down.  I checked the power cords, made sure everything was installed
correctly and hit the power button.  The server came up fine to single user
and all my errors were that the second power supply was bad.  I then put
the second power supply in again and all errors cleared.  It's been up and
running since early this morning.  I've been told that I took the power
supply out too quickly while testing and that I didn't leave sufficient
time to allow the system to catch up.  I haven't had time to check it.
Also of note, with the system up and the second power supply pulled, there
wasn't a maintenance light on the front panel.

I built another 280R using the 02/02 release and the patch cluster and it
seems to be working out so far.  I don't think it's an issue with the
jumpstart since some of the servers are fine, but am leaning towards patch
cluster issues at this time.  On the previous build I tried downloading
another recommended cluster and running it, but it showed what appeared to
be identical issues when installed as a cluster.

Summary:  I will be applying patches across the board to all servers and
ensuring that they are all identical and have the same packages and patch
level.
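
As a rough illustration of checking that patch levels match, the sketch
below compares saved "showrev -p" output from a known-good server against
another server.  It assumes each patch line starts with "Patch: " followed
by a six-digit ID and a two-digit revision; the patch IDs in the example are
dummies, and parse_patches and patch_differences are made-up helper names.

```python
import re

# Assumed line shape: "Patch: NNNNNN-RR Obsoletes: ... Packages: ..."
PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})\b")


def parse_patches(showrev_text):
    """Return {patch_base_id: highest_revision} found in the text."""
    patches = {}
    for line in showrev_text.splitlines():
        m = PATCH_RE.match(line.strip())
        if m:
            base, rev = m.group(1), int(m.group(2))
            patches[base] = max(rev, patches.get(base, -1))
    return patches


def patch_differences(reference, other):
    """Return {base_id: (reference_rev, other_rev)} where the two differ.

    A revision of None means the patch is absent on that host.
    """
    diffs = {}
    for base in sorted(set(reference) | set(other)):
        ref_rev, oth_rev = reference.get(base), other.get(base)
        if ref_rev != oth_rev:
            diffs[base] = (ref_rev, oth_rev)
    return diffs


# Dummy sample text, not real patch data.
good_text = (
    "Patch: 100001-02 Obsoletes:  Requires:  Incompatibles:  Packages: PKGbase\n"
    "Patch: 100002-01 Obsoletes:  Requires:  Incompatibles:  Packages: PKGpower\n"
)
bad_text = (
    "Patch: 100001-01 Obsoletes:  Requires:  Incompatibles:  Packages: PKGbase\n"
)
print(patch_differences(parse_patches(good_text), parse_patches(bad_text)))
```

An empty result would mean the two servers report the same patches at the
same revisions, at least as far as this simple parse can tell.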

Also, for those using BigBrother: with everything correct on the server,
BB picked up the failed disk when it was pulled.

thanks again to all.
Skip
Received on Tue Jun 22 12:59:44 2004
