First, a copy of my original posting:
______________________________________________________________________________
Once in January and twice in the last 2 days, our 4/370 server has
died or come very close to dying due to problems with our disk
controller. The output into /var/adm/messages reads, typically:
Mar 9 23:05:24 sun-valley vmunix: xdc0: returned unmatched iopb addr fff003cc
Mar 9 23:05:32 sun-valley vmunix: xdc0: controller not responding
Mar 9 23:07:34 sun-valley last message repeated 5 times
Mar 9 23:08:18 sun-valley vmunix: xdc0: controller not responding
Mar 9 23:14:49 sun-valley last message repeated 12 times
Mar 9 23:15:19 sun-valley vmunix: xdc0: controller not responding
Mar 9 23:21:06 sun-valley last message repeated 12 times
Mar 9 23:21:24 sun-valley vmunix: xdc0: controller not responding
Mar 9 23:28:10 sun-valley last message repeated 10 times
The unmatched iopb addr message appeared 2 out of 3 times. The other
occurrence simply started with the controller not responding message.
This continued for about seven hours. By then, system performance
had degraded to the point where logging in to the server was impossible.
The first time, a user who was here early in the morning managed
to L1-A and reboot the server. This morning, a different user
reported that L1-A had no effect and he had to resort to power-cycling
the machine. Unfortunately, I haven't been around with the system
in this state to see exactly what's happening on the server.
The machine is a 4/370 running SunOS 4.1. The controller,
sitting in slot 7, is a Xylogics 753 with PROM E2186 2.22.
The 753 has 3 Fujitsu M2382K disks on it.
We've been running this setup since November without any problems.
Has anyone run into this problem before? Is my controller going bad?
Is this a problem with the software driver? Do I have the right
PROM rev?
Any solutions/suggestions/wild guesses gratefully accepted.
-Dave
______________________________________________________________________________
Next, thanks to all those who replied:
tsacas@issy.ilog.fr (Stephane Tsacas)
Paul Quare <pq@computer-science.manchester.ac.uk>
sundev!fletch!kevin@Sun.COM (Kevin Sheehan {Consulting Poster Child})
curt@ecn.purdue.edu (Curt Freeland)
Stuart McRobert <sm@doc.imperial.ac.uk>
Finally, a summary of the responses:
A couple of people suggested checking the bus grant jumper on the
backplane, particularly if we had just installed any new
boards. This was not the case for us. Curt Freeland pointed
out a few problems with the xd disk driver in his response:
>Welcome to the world of the "unfinished" xd disk driver. If you have
>source code, look at xd.c sometime. We have seen this problem, as have
>others. We are working on instrumenting the driver to see how we get into
>those states. It looks like it will happen on 3/XXX 4/XXX machines with
>enough load (we have seen it on everything but a 3/100 system). We have
>inserted halts at the point where those errors occur (and some printf's
>so we can see what is happening). I recommend you do not do a sync
>(or even a g0 to force a sync), as we usually end up with a trashed
>super block upon doing so after one of these errors. We typically do an
>L1-A and k2.
Others suggested possible problems with the disk controller
or the cabling.
Our solution:
Since the 753 is under maintenance, we had a new one sent
out. It featured PROM rev 2.3. We also switched one of
the 3 disks to a 7053 controller. So far, the problem has not
recurred. We'll just have to wait and see.
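For anyone who wants to watch for a recurrence, here is a rough sketch
of a script that tallies the xdc messages in a copy of
/var/adm/messages. It is only an illustration, not something we ran
at the time. It assumes the log lines look like the excerpt in my
original posting, and that a "last message repeated N times" line
refers to the xdc message immediately before it. The function name
tally_xdc_errors and both regular expressions are my own inventions.

```python
import re
from collections import Counter

# Assumed format, taken from the excerpt in the original posting:
#   Mar 9 23:05:32 sun-valley vmunix: xdc0: controller not responding
MSG_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+vmunix:\s+(xdc\d+):\s+(.*)$"
)
# syslogd's compressed form:
#   Mar 9 23:07:34 sun-valley last message repeated 5 times
REPEAT_RE = re.compile(
    r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+last message repeated (\d+) times?$"
)


def tally_xdc_errors(lines):
    """Count xdc controller messages, expanding 'repeated N times' lines.

    Returns a Counter keyed by (controller, message text).
    """
    counts = Counter()
    last_key = None  # most recent xdc message, if the previous line was one
    for line in lines:
        line = line.rstrip("\n")
        msg = MSG_RE.match(line)
        if msg:
            last_key = (msg.group(1), msg.group(2).strip())
            counts[last_key] += 1
            continue
        rep = REPEAT_RE.match(line)
        if rep and last_key is not None:
            # Credit the repeats to the xdc message that preceded them.
            counts[last_key] += int(rep.group(1))
            continue
        # Any other line breaks the chain, so a later "repeated" line
        # is not attributed to an xdc message.
        last_key = None
    return counts
```

Calling tally_xdc_errors(open("/var/adm/messages")) and printing the
result gives a per-message count, which makes it easy to see whether
the "controller not responding" total is climbing again.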
Thanks for all the help,
-Dave
David Meer
Aerospace Robotics Laboratory
Durand Building, Room 017
Stanford University
Stanford, CA 94305
e-mail: meer@sun-valley.stanford.edu