SUMMARY(2): help with 690MP system hangs

From: Timothy Baum 432-2765 (satmb@gauss.med.harvard.edu)
Date: Wed Jul 27 1994 - 09:14:43 CDT


A belated followup to my previous posting regarding 690MP system hangs:

The machine had been hanging at approximately 9:00am and 5:00pm. We
traced the problem to a single ALM-2 serial port to which a PC was
attached -- when the PC user turned the machine on or off, the system
would freeze.

At first I thought the problem was with the PC, but I could find nothing
wrong with it. I then tried connecting a terminal directly to the ALM-2
port with a null-modem cable and found that simply turning the terminal
on or off would cause the same problem. We replaced that ALM-2 card and have
been running without incident for the last two weeks. My guess is that
the ALM-2 was reacting strangely to a break or transition of voltage on
that serial port and paralyzing the CPU with interrupts -- but who
really knows?

Thanks to everyone who wrote with help and suggestions. Several said this
sounded like the infamous "break on serial console interrupts system" problem,
which it does, except this was not a ttya or ttyb port and was due to bad
hardware.

This leads me to another question.

Apparently three VME boards in our 690MP went bad around the same
time. The problems started when one ALM-2 board stopped working; then
an IPI disk controller failed when the computer was power-cycled; then
after those two boards were replaced the replacement IPI controller
went bad (the replacement may have been bad to begin with); and finally
the intermittent system hang problems began, only to be fixed by
replacing another ALM-2 board.

Any ideas what could have caused several components to go bad around
the same time? It's too large a coincidence for me to accept that they
all failed independently. Is it possible that one bad component could
also have damaged the others? I don't recall any lightning strikes
around that time (but can't be 100% sure). No other equipment in the
building has had such problems.

Finally, on the chance that the problems were caused by a power surge,
what is the most reasonable way of protecting equipment from damage?
I've gotten recommendations for isolation transformers and for
full-blown online UPS systems. The cost difference is dramatic -- under
$2000 for an isolation transformer vs. over $10000 for a UPS. Based on
over a week of power monitoring, the incoming power looks clean -- no
spikes during the period of observation. Ours is not a critical
operation that needs 100% uptime. Would an isolation transformer with
its own ground be adequate protection against equipment damage, or
should I spring for a UPS? Any recommendations for brands/vendors?

Thanks again for help and advice.

Tim Baum
System Administrator
Channing Laboratory
satmb@gauss.med.harvard.edu


