After replacing the system board, all the controllers, and the cables,
and after talking multiple times to tech support from four companies,
and after having the darn thing crash every day for two months, it turns
out that what was needed was to add a patch that wasn't on the
recommended list. *Sigh* I had gone through the patch list several times
and must have overlooked it every time.
I appreciate the responses I received from:
Eugene Kramer <eugene@uniteq.com>
Glenn Satchell <Glenn.Satchell@uniq.com.au>
Cat Okita <cat@uunet.ca>
Thomas FRANK <thomas.frank@magnet.at>
Bismark Espinoza <bismark@alta.Jpl.Nasa.Gov>
My original question:
> SPARC 10, 128 MB RAM, Solaris 2.5 w/ recommended patches
>
> Soon after installing Cheyenne ARCserve software to run our Qualstar
> 4210A tape changer, our main NFS server started freezing at random
> places during a full backup or sometimes the next day. During the
> backup, there are no other users on the network and no unusual processes
> running. Strangely, it doesn't crash or show any error messages; it
> just freezes completely. I cannot ping the machine and STOP-A does
> nothing. The only way to recover is to cycle the power manually.
>
> Naturally, Sun said it was probably an application problem while
> Cheyenne said it is probably an OS problem. I have syslog logging
> everything from info on up to one huge log, but there are no error
> messages of any kind reported. I have vmstat logging to a typescript
> every five seconds during the backup, and free memory goes down to 1 or
> 2 MB many times during the backup, but there is always over 200 MB
> of swap available when it freezes.
>
> I have tested the hardware with some graphical package from Sun (I forget
> what it's called) and everything checked out OK.
>
> Does anyone have any further troubleshooting techniques or ideas why the
> machine completely freezes at random times during or after a backup? I
> am pretty certain it has to do with the software since the trouble
> started occurring soon after the install and only appears after a backup, but
> how can I confirm this? Does anyone have similar problems with Cheyenne
> ARCserve or can recommend a different software package for running a
> tape changer? Are there any other ways I can probe the OS to isolate
> what is causing the problem?
>
> Thank you.
>
Responses:
-----------------------
I had a problem like that on a Sparc 10/128M/Solaris 2.5 WITHOUT
Cheyenne.
Turned out that we had a disk drive with crappy SCSI. Our Sparc was a
file server and it usually would hang during release time (our software
occupies about 500M) or backup (Networker with 3 simultaneous backup
streams).
Taking the disk out of the picture got rid of the problems.
When the system crashed, the select light on the faulty disk was almost
always lit.
BTW: disk: Micropolis 9G (old 5 inch format). I've just gotten a new one
from Micropolis, but have not installed it yet.
-----------------------
There is a (reasonably well known) bug in the ethernet hardware on the
SS10s. Apparently there isn't enough buffer space devoted to the lance
chip. It was revised in the SS20 which doesn't have this problem. The
symptoms are that the ethernet locks up under heavy load. I've seen
this happen on database servers. Sometimes there's an error message in
/var/adm/messages, other times not.
It's a hardware design issue, so there's no software workaround. What
most folks do is to install an SBus ethernet card and use that
interface. The 10 Mb/s ethernet/SCSI cards don't have this problem and
work just fine. Of course, you could always buy one of the 100 Mb/s fast
ethernet cards too.
-----------------------
This almost sounds like the ethernet problem again - Sun added in a
patch (*not* on the recommended/security list, btw) for their ethernet
interfaces to fix this...
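
Since the fix turned out to be a patch that was easy to overlook, here is
a minimal Python sketch of checking a list of wanted patches against what
is installed. It assumes the output of "showrev -p" has been saved to a
file and that each installed patch appears on a line starting with
"Patch: NNNNNN-RR". The patch IDs and package names below are made up for
illustration; they are not the patch that fixed this problem.

```python
import re
from typing import Dict, Iterable, List

# Assumed line shape from "showrev -p": "Patch: 900001-02 Obsoletes: ..."
PATCH_RE = re.compile(r"^Patch:\s+(\d{6})-(\d{2})")

# Hypothetical sample output; the IDs and package names are invented.
SAMPLE_SHOWREV = """\
Patch: 900001-02 Obsoletes:  Packages: SUNWcsu
Patch: 900001-04 Obsoletes:  Packages: SUNWcsu
Patch: 900002-01 Obsoletes:  Packages: SUNWcsr
"""


def installed_patches(lines: Iterable[str]) -> Dict[str, int]:
    """Map each base patch ID to the highest revision seen."""
    found: Dict[str, int] = {}
    for line in lines:
        match = PATCH_RE.match(line)
        if match:
            base, rev = match.group(1), int(match.group(2))
            found[base] = max(rev, found.get(base, 0))
    return found


def missing_patches(wanted: Iterable[str], installed: Dict[str, int]) -> List[str]:
    """Return wanted patches that are absent or installed at a lower revision."""
    missing = []
    for patch in wanted:
        base, rev = patch.split("-")
        if installed.get(base, -1) < int(rev):
            missing.append(patch)
    return missing
```

Feeding it every patch ID mentioned in a vendor's README or bug report,
not just the recommended list, would show at a glance which ones are
still absent.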
-----------------------
I had possibly the same problem with Solaris 2.5.1 and Legato Networker on
a SS10.
During the backup, or at other times, the server froze. The backup
system had been installed only a few weeks earlier, so we thought the
backup hardware and/or software was to blame.
But we found the fault by swapping the system components: power supply,
motherboard and, last of all, the CPU. And it was the CPU !!!
So, if you have a chance to test the backup system on an identical
machine, then do it, or swap out the system components of the SS10 as
I did.
-----------------------
Look at CPU load with "vmstat 5", I/O load with "iostat 5",
and NFS load with repeated "nfsstat -s".
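
If the "vmstat 5" output is being captured to a file, a small script can
pick out the samples where free memory got very low, which is easier than
reading the log by eye. This is only a sketch in Python: it assumes the
Solaris column order "r b w swap free ..." with swap and free in KB, and
the sample numbers are invented for illustration, not taken from the
machine discussed here.

```python
from typing import Iterable, Iterator, List, NamedTuple, Tuple


class Sample(NamedTuple):
    swap_kb: int
    free_kb: int


# Invented sample log in the assumed Solaris vmstat layout.
SAMPLE_LOG = """\
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s2 s3   in   sy   cs us sy id
 0 0 0 231400 41200   0   3  1  0  0  0  0  0  0  0  0  120  200   80  2  3 95
 1 0 0 215800  1804   2  40 12  9 14  0 30  5  0  0  0  410  900  300 20 35 45
 0 0 0 214900  6100   1  10  2  1  2  0  4  2  0  0  0  250  500  150 10 15 75
"""


def parse_vmstat(lines: Iterable[str]) -> Iterator[Sample]:
    """Yield swap/free values from data rows, skipping header rows."""
    for line in lines:
        fields = line.split()
        # Data rows are entirely numeric; header rows contain words.
        if len(fields) < 5 or not all(f.isdigit() for f in fields):
            continue
        # Assumed positions: field 3 is swap, field 4 is free (both KB).
        yield Sample(swap_kb=int(fields[3]), free_kb=int(fields[4]))


def low_memory(samples: Iterable[Sample],
               threshold_kb: int = 2048) -> List[Tuple[int, Sample]]:
    """Return (sample index, sample) pairs at or below the free-memory threshold."""
    return [(i, s) for i, s in enumerate(samples) if s.free_kb <= threshold_kb]


if __name__ == "__main__":
    for index, sample in low_memory(parse_vmstat(SAMPLE_LOG.splitlines())):
        print(f"sample {index}: free={sample.free_kb} KB swap={sample.swap_kb} KB")
```

With a 5 second interval, the sample index times five gives a rough
offset into the backup, which can be lined up against the time of the
last syslog entry before a freeze.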
-----------------------
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:12:11 CDT