Strange fileserver crashes

From: Grant Schoep
Date: Fri Oct 16 1998

OK. This is my summary of the strange fileserver crashes that I had a week
ago. Thanks to all of you who replied. Specifically:
Max Trummer
Don Cockman
Frank Smith
Eugene Kramer
Earl R. Cooke
Rik Schneider
Andrew Hoerter
Francis Liu
Robert Rose
Reto Lichtensteiger
Steve Turgeon
Jeffery Keyser
Gary Franczyk
Kevin Inscoe
Gwendolynn ferch Elydyr
Asim Zuberi
AL Hopper
Bismark Espinoze
Whew, what a list, hope I didn't miss anyone.

Orig. help question.
>I have had the strangest crashes on my fileserver. It's an SS20 with dual
>75 MHz CPUs. It has 18 SCSI drives on it, and 2 SCSI cards. It's a very
>important system that is crashing almost every day, and it's driving me
>crazy! It's running Solaris 2.5, and it's a NIS+ server and our main file
>server. The crashes happen at many different times of the day, but have
>only happened when people are actually working (no 2am crashes). It doesn't
>log anything strange in /var/adm/messages. Its file systems are NOT
>filling up.
>Basically what happens is that every machine loses contact with it, but
>I can still sit down at it and log in. Sitting at the machine, everything
>seems fine, except I can't get out on the network at all. I monitor the
>server's port on its switch, but there don't seem to be any errors going
>over the network. And other machines that don't rely on the file server (NT
>machines) can all communicate just fine with each other. A reboot of the
>machine brings everything back to life as normal, until maybe 20-40 hours
>later, when it goes into this state again.
>So it seems like the culprit is the server itself, maybe its network
>interface, or a driver. Has anyone seen anything like this before? Is
>there a way I can try to reset the network card itself, bring it down then
>bring it back up?
>The only thing on the network that has really changed is that we added
>another router connecting us to our other offices. I would think maybe it's
>a router problem, but it works just fine for a few days, and then the
>machine dies. I would have thought that if it were a router problem, the
>machine would fail all the time, not just at random times.
>Thanks in advance for your help. I'll summarize.

I never really got my problem solved, but at least the machine is crashing
differently now. Doh. Here's what I did.
I found that I couldn't ping or connect to anything whatsoever. It seemed my
network card or port was bad. The switch I was connected to was just fine.
I tried switching from a 3Com 3300 to a 3Com 1000 and the problem persisted.
I started to log a number of important things. I noticed that the routing
table shown by netstat -r would forget about all the routers we have on our
network. I tried setting a default router, but that did not help.
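
For anyone who wants to watch for the same symptom, here is a minimal sketch
of how saved routing table snapshots could be compared to see which routes
vanished. It is not what I actually ran. The function names and the sample
tables (addresses, interface names) are made up for illustration, and the
column layout is only assumed to look roughly like numeric netstat output
(destination first, gateway second).

```python
def parse_routes(netstat_output):
    """Return the set of (destination, gateway) pairs found in the text.

    Assumes each route line starts with the destination followed by the
    gateway. Header, separator, and blank lines are skipped.
    """
    routes = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # blank line or the short "Routing Table:" banner
        if fields[0] == "Destination" or fields[0].startswith("-"):
            continue  # column header or dashed separator
        routes.add((fields[0], fields[1]))
    return routes


def lost_routes(before, after):
    """Return routes present in the earlier snapshot but not the later one."""
    return sorted(parse_routes(before) - parse_routes(after))


# Hypothetical snapshots, invented for this example.
HEALTHY = """Routing Table:
  Destination           Gateway           Flags  Ref   Use   Interface
-------------------- -------------------- ----- ----- ------ ---------
192.168.1.0          192.168.1.10          U        3    120  le0
10.2.0.0             192.168.1.2           UG       0     40
default              192.168.1.1           UG       0   4500
127.0.0.1            127.0.0.1             UH       0    100  lo0
"""

BROKEN = """Routing Table:
  Destination           Gateway           Flags  Ref   Use   Interface
-------------------- -------------------- ----- ----- ------ ---------
192.168.1.0          192.168.1.10          U        3    120  le0
127.0.0.1            127.0.0.1             UH       0    100  lo0
"""

if __name__ == "__main__":
    for destination, gateway in lost_routes(HEALTHY, BROKEN):
        print("lost route: %s via %s" % (destination, gateway))
```

In practice the two strings would come from routing table output saved at
intervals (from cron, for example), so that the last snapshot before a hang
can be compared against one taken while the machine is cut off.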
        Finally, I got the machine to run without crashing the way it did
before. The bad part is that I made two big changes at once, so I really
don't know which one did it. I switched from the built-in Ethernet port to a
transceiver on the AUI port. I also laid down the law and told everyone
that if they logged on to do anything on the file server I would kill their
process. Well, the system ran stable for a week and never crashed. Until
today, Friday.
        It was a different crash. This time, instead of just losing its network
connection, the server went dead. Stop-A didn't even work. Nobody (I think)
was logged into it. It finally came back up, then panicked 20 minutes later
and crashed.
        I basically decided to call it quits with this machine, and tell the big
heads we gotta replace our aging file server. With 90% of our machines
faster and better than our file server, I see no reason to keep on using
it. Except that it's gonna be an all-nighter switching everything over.
        So thanks everyone for your help. I don't think I really solved the
problem, but hopefully I will with a whole new machine. If anyone has any
detailed questions about some of the little things I did, feel free to ask.
Nothing seemed to work on my end, and I've got users up in arms hounding me
to just replace the damn machine. So that's what I will do.

Grant Schoep,
System/Network Administrator
L3 Communications Telemetry & Instrumentation
San Jose,CA (408)271-0800, Ext. 135

