Well, unfortunately I can't give a definitive answer as to what our problem was with our SS20 doing hard hangs on a daily basis lately, but it seems to have stopped. (It's been up for about two weeks now with no problems.)
There were a couple of "schools of thought" in the responses I received from you guys. Several of you suggested I upgrade our kernel patch from revision -42 to -44 (101945-44), which we did. We also installed two other patches on Sun's recommendation: 102062-12 and 102001-11. Other possible causes suggested were: 1) overheating due to a fully configured system and 2) a failed power supply.
Thanks to the following for both your suggestions and general words of encouragement:
I apologize if I forgot someone.
Since we suspected the problem might lie in the hardware, we had to swap out every piece of hardware that could even remotely be causing the problem. Unfortunately we did not have the "luxury" of making one change to the system at a time to see if that one change did the trick. We did start by just backing out all of the NEW hardware that had gone in just before the crashes began, but no luck there. To make a long story short, we ended up gutting the entire system (all but the chassis itself and the disks) and replacing every piece of hardware MANY times over. Each time we introduced a new piece of hardware, it caused a NEW problem!!!! You can't imagine how many pieces of hardware arrived from Sun Service that were DOA!! It has (most unfortunately) become a joke around here how each replacement part has to be replaced several times before we get a good one. I have never had this problem in the past, but the poor quality of the hardware we received this time is beyond me.
In Sun's defense, it was difficult for them to diagnose a problem when we kept changing the configuration, but unfortunately we had no choice; we were desperate. Once we were finally able to get some crash dumps, Sun did say that they pointed to a specific CPU as being the problem. Since we swapped this out (a couple of times), the system does appear to be stable. I must say that we are also currently running the system one SIMM short of a full bank (no jokes!) since at one point the system was panicking on a specific SIMM slot. Whether this SIMM being out is helping (heating), or the CPU was always the problem, or some combination of the above, I'm sure I don't know. Wish I could give you all a better answer!
Thanks again to everyone!!!
My original post....
We've got a really nasty problem right now with one of our most
critical servers doing the hard hanging trick. It is a sparc 20 which
has hard hung 4 times in the past 9 days. Technically we got 1
"watchdog reset" and 3 hard hangs. There was nothing on the console
except for previous messages about people logging into the machine,
nothing in /var/adm/messages, no crash dump, and no other indications
that we can find of any clues to the problem.
Here is a little history....
The machine has been running fine for many months. 2 weeks ago (just
before all this started) I added a second cpu, additional memory, and
put a new SCSI cable on one of the controllers (thereby changing the
order of the disks on that scsi chain - we have both 50 pin and 68 pin
connector/disks on that chain). We also increased the swap space and
bumped up the value of SHMMAX.
This is our configuration....
ss20, two 60 MHz cpus, 512 MB memory, solaris 2.4 3/95 with patches
(kernel patch rev -42 and others), 3 additional fast SCSI controllers,
using ODS, WYSE terminal (although now have a "real" sun keyboard
attached to eliminate the terminal as a problem), 1 - internal 1.05 GB
disk, 2 - external 1.05 GB disks, 5 - 2.1 GB external disks, 1 8mm
external tape drive, internal CDROM, primary application is Sybase.
We also have two other "identical" systems at other sites which run
the same applications, same hardware configuration, same
patches/revisions, same o/s, etc.
This is what we've done.....
The first thing we did was attempt to replace all of the new things I
added just before failures began, thinking we had a hardware problem.
Sun swapped out the 3 SIMMs I had added and put in two new cpus (but
we didn't swap the SCSI cable).
The next time it hung, we swapped out the system board, the 5
remaining SIMMs and power supply.
Tonight we plan to swap out the entire chassis and all cables and add
patch #: 102062 (patch for serial device drivers).
My question is....
Has anyone seen any problems like this that would possibly shed some
light on ours????