SUMMARY: General System Reliability

Date: Tue Nov 10 1998 - 20:18:52 CST

It has been over three weeks since I posted the original message, which
asked something to the effect of "Is there a command (or commands) I can
run that will show me what is wrong with a system and how reliable it is?"
I asked because we were having problems on two E3000 boxes. One box had
nothing wrong with it; its problems were definitely user error. The other
3000, however, had SCSI errors in the
/var/adm/messages file, ranging from "timeouts" and "reset" timeouts to
"resyncing" messages, all the way to "random position errors" on the hard
drive. Sun Service recommended we make sure the external SCSI port on I/O
Board 1 was terminated, so we terminated (in the SCSI sense) that port on
all of our 3000s (none of them had been terminated on that board). Then,
after more drive replacements and SCSI errors, Sun finally figured out what
was wrong: we were using the external SCSI port on I/O Board 1 with an
external DDS 4mm
Sun tape drive. The 3000 that was having all of the hardware errors had
all internal drive bays filled (ten 9 GB drives), so more than likely that
is why that machine specifically was reporting errors: possibly a heavier
load on the SCSI chain, making it less reliable when it is not terminated.
Now that we've terminated all I/O Board 1 SCSI ports and are now using I/O
Board 2's external SCSI ports, we don't seem to be getting any more error
messages. It has only been 8-10 hours so far, but that's a promising start.

Many thanks to the people who suggested checking the SCSI connections,
verifying termination, checking the /var/adm/messages files, using SyMON,
making sure the cables weren't too long, and making sure that no pins were
bent or missing on the cables.
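
For anyone who wants to automate the "check /var/adm/messages" step, here
is a rough Python sketch that counts log lines mentioning the kinds of SCSI
trouble described above. It is only an illustration, not something we ran
on these machines: the function names are made up, and the search patterns
are assumptions based on the phrases quoted in this post, so the exact
wording in your own log may differ and the patterns may need adjusting.

```python
import re
from collections import Counter

# Assumed patterns, based only on the phrases quoted in this post
# ("timeouts", "reset", "resyncing", "random position errors").
# The exact text logged on a real system may differ.
SCSI_PATTERNS = {
    "timeout": re.compile(r"timeout", re.IGNORECASE),
    "reset": re.compile(r"reset", re.IGNORECASE),
    "resync": re.compile(r"resync", re.IGNORECASE),
    "positioning": re.compile(r"position", re.IGNORECASE),
}


def count_scsi_messages(lines):
    """Count how many lines match each pattern.

    A single line can match more than one pattern (for example a
    "reset" timeout counts under both "reset" and "timeout").
    """
    counts = Counter()
    for line in lines:
        for name, pattern in SCSI_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_log(path="/var/adm/messages"):
    """Scan a syslog-style text file and return the per-pattern counts."""
    with open(path, errors="replace") as handle:
        return count_scsi_messages(handle)
```

Calling scan_log() on each machine and comparing the counts would be one
way to spot the box that is logging far more SCSI complaints than the
others. The patterns are deliberately loose, so non-SCSI lines containing
words like "reset" will be counted too; treat the numbers as a hint about
where to look, not as a diagnosis.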

Thanks so much for everyone's help!


