2nd SUMMARY on 4/280 problem

Our problem with a Sun 4/280 server started after the Northridge
earthquake.

At first the problem appeared to be with the memory board(s) or the
MMU, since the server would work for a few minutes and then just
shut itself off; i.e., no activity on the CPU LED, as if the power
had been turned off. However, that was not the case, since everything
else stayed on: the disk drives, all the fans (including the power
supply's fan), and the power switch light (the power switch "knob"
has a light, so it acts as a power indicator).

The weird thing is that after a few minutes of no activity (we simply
left it alone, with the power switch in the on position), the server
would suddenly jump back on-line. And the server almost always passed
the extended self-diag.

(Another reason we suspected a problem with either the memory
board(s) or the MMU is that we saw messages like:

Dec 29 15:09:43 hto-e vmunix: mem3: soft ecc addr 3bf7c8 syn 2<S0> 65 U1665

before this happened, so we thought the board or the chip(s) had
finally died. Also, because of the earthquake, the A/C unit that
normally cools the room was damaged and inoperative for more than
24 hours. The room temp was more than 80 degrees for some time before
we shut the server off, and it is the oldest server we have.)
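
For anyone who wants to check whether messages like the one above pile
up on a single board, here is a rough sketch in Python. The pattern is
only a guess at the general format, based on the one line quoted above;
the names SOFT_ECC and count_soft_ecc are made up for this example.

```python
import re
from collections import Counter

# Pattern inferred from the single log line quoted in the post; other
# kernels or boards may print this message differently (assumption).
SOFT_ECC = re.compile(r"vmunix: (mem\d+): soft ecc addr ([0-9a-fA-F]+)")

def count_soft_ecc(lines):
    """Return a Counter mapping board name (e.g. 'mem3') to soft ECC count."""
    counts = Counter()
    for line in lines:
        match = SOFT_ECC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The only sample available is the line from the post itself.
sample = [
    "Dec 29 15:09:43 hto-e vmunix: mem3: soft ecc addr 3bf7c8 syn 2<S0> 65 U1665",
]
for board, count in count_soft_ecc(sample).items():
    print(board, count)
```

If the counts are spread over every board rather than one, that would
point away from a single bad memory board, which is what we ended up
finding.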

Anyhow, the first thing we did was reseat everything; e.g.,
memory boards, cpu boards. OK. Still crashing.

Fine. Pull out the memory boards. Try one memory board at a time to
see which board is bad. All boards appear to be bad. Not likely, we

So I asked the net. And many people said it's a heat problem. Check
the fan, etc. All fans are working.

We concluded that there were 3 possibilities: power supply, memory
boards, cpu boards. Since the power supply is the least expensive
item to replace and a loaner was available to us locally, we decided
to test that first.

By that time, I received a reply from Fons Ullings about testing the
power supply. This is what he said:

   you could try to remove the front of the machine, so that you
   can see the backplane, and measure the 5Volt on the main VME power rails
   (preferably with a scope to see AC too)
   I really suspect the power-supply
   you could also try to let the test mode go for more than 1 cycle
   (if I remember correctly, that is settable in the EEPROM or with
   the 'x' command from the EPROM boot)

   and maybe you can check all the connections+connectors
   between the power supply and the backplane

   hope this helps

   Fons Ullings, VU, Amsterdam

So we voltmetered the 5V. And guess what. The 5V comes
and goes at random. And when it goes, the machine dies.
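
If you can log the readings instead of watching the meter, a rough
sketch like this one (Python) would flag the dropouts. The +/-5% band
around 5.0 V is an assumption (a common figure for 5 V logic supplies),
not something from Fons or from Sun's documentation, and the readings
below are made-up placeholders, not what we measured.

```python
# Assumed tolerance: +/-5% around a nominal 5.0 V. This is a common
# figure for 5 V logic supplies, not a value taken from the post.
NOMINAL_VOLTS = 5.0
TOLERANCE = 0.05

def find_dropouts(readings):
    """Return (index, volts) for every reading outside the assumed band."""
    low = NOMINAL_VOLTS * (1 - TOLERANCE)
    high = NOMINAL_VOLTS * (1 + TOLERANCE)
    return [(i, v) for i, v in enumerate(readings) if not low <= v <= high]

# Hypothetical readings, one per sample interval; not measured data.
readings = [5.02, 5.01, 0.0, 0.0, 4.98]
for index, volts in find_dropouts(readings):
    print(f"sample {index}: {volts:.2f} V is out of range")
```

Lining up the flagged samples against the times the machine died is
the same check we did by eye with the voltmeter.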

I talked to someone on our local Computing Services support
staff about our discovery. He told me that it's very unlikely
that would happen; i.e., that only part of the power supply
would fail. It's either all or nothing, usually.

But guess what, it happened. To confirm this, we borrowed
a power supply from him. And the problem went away.

So finally last week, we ordered a power supply for
it. And it's working fine now.

We think that it's a heat problem, although not
because of a fan problem. We think some capacitors
in the power supply are damaged, although only
marginally. (We took the power supply apart and
didn't see the telltale sign of a blown capacitor:
a black top on the capacitors.) So when the capacitors
heat up, they fail to absorb charge; hence, the power
dies and the machine crashes. But when they cool
down, they work again.

Anyhow, I hope this can help someone who might have
the same problem we had. I guess the moral of the
story is: 1) always check the least expensive item
for faults first (best case); 2) always check with your
local people for resources and help before you go for
outside assistance (usually costs $$$): get a loaner from
your local people (usually free) and test the components
in question (voltmeter them or something).



