Unfortunately, I don't have a definite answer on this one, but
I do have some good suggestions. We will try shuffling the
controllers in our system the next time it goes down (it's been
running for a week with no problems at all).
Quite a few people who responded indicated they are fighting
the same problem, with no solution either. It looks as if there
definitely is a problem somewhere. The suggestions I got
included (1) the Xylogics PROM needs updating, (2) the SunOS XD
driver is not as robust as it could be, and (3) our board
ordering could be causing it.
I've attached the responses I received. At the bottom of this
message is a copy of my original posting. Thanks *very much* to the
following people who took the time to respond:
"Lawrence R. Rogers" <lrr@Princeton.EDU>
drm@bjerknes.Colorado.EDU (Donald Mock)
wallen@cogsci.UCSD.EDU (Mark R. Wallen)
Ronald Phelan <rjp@deakin.OZ.AU>
Amir Plivatsky <amir@discus.technion.ac.il>
curt@ecn.purdue.edu (Curt Freeland)
pomeranz@isis.dccs.upenn.edu (Hal Pomeranz)
Dianne Clayton <dclayton@BBN.COM>
matt@wbst845e.xerox.com (Matt Goheen)
lkn@s1.gov (Lee)
dmorse@sun-valley.Stanford.EDU (Dennis Morse)
---------------------------------------------------------------------------
Date: Wed, 11 Sep 91 20:27:34 MDT
From: drm@bjerknes.Colorado.EDU (Donald Mock)
Message-Id: <9109120227.AA02245@bjerknes.colorado.edu>
To: Steve.Ackerman@zindigi.msg.uvm.edu
Subject: 4/380 problems
My old 4/280 configuration guide shows the following order for those
boards: alm, xt0, xdc0, xdc1. I don't know if the 4/300 uses the same
relative assignments, but I bet it doesn't show the option to have an
SMD controller on both sides of the xt0. I know from when I upgraded
a similarly configured system from a 3/280 to a 4/280 that the controllers
were very sensitive about their relative positions. I've ordered the
same upgrade you already have, so I hope you work things out.
Donald Mock, Sys. Manager
NOAA Climate Research Div.
Boulder, Colorado
drm@noaacrd.colorado.edu
---------------------------------------------------------------------------
From: wallen@cogsci.UCSD.EDU (Mark R. Wallen)
Message-Id: <9109120402.AA01819@cogsci.UCSD.EDU>
Date: 11 September 1991 2102-PDT (Wednesday)
To: Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu
Subject: Re: xdc0: controller not responding
With 2 controllers, the first thing I would try is
swapping them. That will put xdc0 closer to the
CPU if I read your configuration correctly. That
in itself may help things immensely.
Mark Wallen
Cognitive Science, UCSD
---------------------------------------------------------------------------
Message-Id: <199109120654.AA15114@sol.deakin.OZ.AU>
To: steve@uvm-gen.uvm.edu
Subject: Controller not responding
Date: Thu, 12 Sep 91 16:54:09 +1000
From: Ronald Phelan <rjp@deakin.OZ.AU>
I've also spent a few hours trying to figure this one out.
The problem, as I have found, is the revision of the EEPROM
on the 7053 controller. We purchased a controller (7053) from
a source other than Sun. It was two revisions higher than
other 7053 controllers that were originally sourced from Sun.
From memory, the Sun controller was 2.21 and the other controller
2.23.
Anyhow, this controller gives random
xdc0: controller not responding
errors, and has been moved between three machines with the same
fault moving with the controller.
Apparently, Xylogics know about the problem but are unable to
generate the "good" EEPROM as it is Sun proprietary. That's the
story I've been given.
We've still got the problem, and I'm still chasing our supplier
for a resolution. (It's been six months now.)
Hope that this is of some help,
Regards,
Ron Phelan.
---------------------------------------------------------------------------
Date: Thu, 12 Sep 91 11:46:11 +0200
From: Amir Plivatsky <amir@discus.technion.ac.il>
To: Steve.Ackerman@zindigi.msg.uvm.edu
Subject: Re: xdc0: controller not responding
Organization: Technion, Israel Inst. Tech., Haifa Israel
In article <9109112220.AA01910@MSG.UVM.EDU> you write:
>
>We've been getting the message:
Steve,
We have this problem on our Sun3/260. Several times per month (most
recently last night) it gets stuck on this error (it never continues by
itself, even if we wait a long time). We have one 9720-850 disk and 2
Fujitsu-M2382. (Previously we had only two 9720-850, and the same
problem occurred with the same frequency.)
When the problem occurs, the display on the 9720-850 always, without
exception, shows the code for "off cylinder seek".
Amir
--
BITNET: amir@techunix Phone: +972-4-292658 Fax: +972-4-236212
Domain: amir@techunix.technion.ac.il UUCP: ...!pucc.princeton.edu!techunix!amir
---------------------------------------------------------------------------
Date: Thu, 12 Sep 91 07:23:30 -0500
From: curt@ecn.purdue.edu (Curt Freeland)
Message-Id: <9109121223.AA02644@mischief.ecn.purdue.edu>
To: Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu
Subject: Re: xdc0: controller not responding
We have been seeing this message for a couple of years (4.0.3, and 4.1.1). If you have source code, look at the xd.c module. It is painfully obvious that Sun never really "finished" the xd driver. The author of the code makes some statements in the comments to the effect of "the management said make it work, but don't spend much time making it work right". We have hacked xd.c to try to determine what is happening. So far the only thing we have learned is that Sun is very lucky the driver ever worked at all. The whole problem seems to be aggravated when there is a lot of other bus activity.
It looks like there are a couple of problems. The "not responding" is due to an operation that does not complete within (what the driver considers) a reasonable amount of time. The comments in the code lead one to believe that Sun took a wild a** guess as to what this timeout constant should be. Unfortunately, just changing the timer constant breaks other things in the driver.
Another case where things go south is when the controller does an operation, then either forgets it did that op and does it again, or realizes something is wrong, and panics. You may see an IOPB mismatch error message with these.
Our systems halt when they hit these cases. We manually reboot the systems. We have found that syncing the disks will sometimes cause more problems than a dead halt/reboot. This is particularly true if you get an IOPB mismatch message before you get the not responding message.
Too bad Sun does not seem to think user files are as important as the users do.
If you hear of any fixes for these problems, I would be very interested...
--curt
Curt Freeland
Manager, Systems Engineering
Purdue University Engineering Computer Network
(curt@mischief.ecn.purdue.edu) (317) 494-3715
---------------------------------------------------------------------------
Date: Thu, 12 Sep 91 09:09:01 EDT
From: pomeranz@isis.dccs.upenn.edu (Hal Pomeranz)
To: Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu
Subject: Re: xdc0: controller not responding
Organization: University of Pennsylvania, Philadelphia
I had a similar problem with a 3/260. We replaced tons of parts and eventually the only real fix was to just shuffle the boards around on the backplane until the messages stopped happening. Basically, your best bet is to just move the disk controllers down to slots 1 and 2 so that they get the highest priority.
---------------------------------------------------------------------------
Message-Id: <9109121342.AA28466@uvm-gen.uvm.edu>
Date: Thu, 12 Sep 91 9:38:08 EDT
From: Dianne Clayton <dclayton@BBN.COM>
To: steve@uvm-gen.uvm.edu
Subject: xdc0 woes
Hi Steve,
I experienced this problem starting back in May and it went on, intermittently, for about a month or so before the problem was finally resolved.
The machine I was dealing with was a 4/280 with the following configuration: 4/280 cpu - slots 1 and 2, 32M memory - slot 6, SCSI-3 - slot 7 (4 disks), and Xylogics 753 - slot 8 (1 disk).
The problem was difficult to trace. First, the Xylogics 753 controller was replaced; the problem went away and started to recur a week later. At that point Sun swapped out the CPU module; the problem went away and recurred about a week or two later. Finally, the CPU module was replaced again after this problem started to recur, and then another problem crept up: consistent system crashing, with the CPU module giving out bogus information to both controllers, thus trashing data on the disks, both SCSI and SMD.
I'd tend to try and replace the controller, then the CPU.
Good luck,
Dianne Clayton
Systems Administrator
BBN Systems and Technologies
(dclayton@bbn.com)
---------------------------------------------------------------------------
Date: Thu, 12 Sep 1991 07:55:20 PDT
From: matt@wbst845e.xerox.com (Matt Goheen)
Message-Id: <9109121455.AA25738@voyager.xerox.com>
To: steve@uvm-gen.uvm.edu
Subject: Re: xdc0: controller not responding
This has happened (and is currently happening) to our Solbourne server. We have replaced the controller twice, and will be doing it again today. The only thing I can think of is that it is a heat problem. This is one of three disk controllers and it's always the one in the middle of the other two that dies...
- Matt Goheen
---------------------------------------------------------------------------
Date: Fri, 13 Sep 91 12:19:44 PDT
From: lkn@s1.gov
Message-Id: <9109131919.AA14586@nightfall.s1.gov>
To: Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu (Steve)
Subject: Re: xdc0: controller not responding
In article <106535@lll-winken.LLNL.GOV> you write:
|>
|> We've been getting the message:
|>
|> xdc0: controller not responding
|>
|> every once in a while on our sun 4/380 file server. It usually causes
|> the machine to hang. Sometimes if we wait long enough, it will come
|> back, but usually we run out of patience and reboot it. The machine
|> had been up for 9 days, but twice today, it hung with this problem.
|>
|> This machine is our main fileserver with approximately 6gigs, 4 of
|> which are off of the Xylogics 7053 SMD disk controllers.
|>
|> Here's a brief description of the hardware:
|> Sun 4/380 -- recently upgraded from a 4/280
|> 2 xylogics 7053 smd controllers, with 2 disks per controller
|> 1 xylogics 472 tape controller
|> 1 alm-2 board.
|>
|> xdc0 is in slot #10
|> xdc1 is in slot #4
|> xt0 is in slot #8
|> alm is in slot #12
|> [ dmesg deleted ]
|>
|> --
|> Steve Ackerman (steve@uvm.edu || uunet!uvm-gen!steve)
|> "It makes me angry that in order to get anything published it has to be of
|> 0 value to the programmer" --D.E.Knuth
Steve,
Have you considered pulling xdc0 and reconfiguring xdc1 as controller 0 with 4 disks? You will have to change the mount points for xd4 & 5 (even simpler, swap the controllers). You didn't mention whether you've tried eliminating the controller to determine if it is bad (or the slot, for that matter). Seriously, sometimes these things are that simple. My guess (btw) is that you're using 2 controllers for load balancing. Those are pretty fast, and would work just fine with 4 disks....
I have played a bit with 7053s, and they do _occasionally_ go bad, as do backplane slots. You might want to merely swap the two controllers; the 4/380 can handle them close in to the CPU (i.e. slots 4 & 5).
Good Luck
Lee
---------------------------------------------------------------------------
From: dmorse@sun-valley.Stanford.EDU (Dennis Morse)
To: Steve <Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu>
Subject: Re: xdc0: controller not responding
Date: Tue, 17 Sep 91 10:13:38 MDT
>We've been getting the message:
>
>xdc0: controller not responding
>
We have a 4/370 with 32 Mb of RAM and both a 7053 and Xylogics 753 installed. This same message has come up on our systems and resulted in crashes. One of the other system administrators looked into it and never arrived at a solution. In our case it happens so infrequently (less than once a month) that we dropped it for now. We are very interested in anything you find out. Sorry I cannot be more help.
Dennis Morse Office: 415.723.1260
Stanford Aerospace Robotics Laboratory Lab: 415.723.3608
Department Aeronautics & Astronautics
Durand Building, Room 250
Stanford, CA 94305
E-mail: dmorse@sun-valley.stanford.edu
---------------------------------------------------------------------------
Original posting:
From: Steve <Steve.Ackerman%zindigi.MSG.UVM.EDU@uvm-gen.uvm.edu>
To: sun-managers@eecs.nwu.edu
Subject: xdc0: controller not responding
We've been getting the message:
xdc0: controller not responding
every once in a while on our sun 4/380 file server. It usually causes the machine to hang. Sometimes if we wait long enough, it will come back, but usually we run out of patience and reboot it. The machine had been up for 9 days, but twice today, it hung with this problem.
This machine is our main fileserver with approximately 6gigs, 4 of which are off of the Xylogics 7053 SMD disk controllers.
Here's a brief description of the hardware:
Sun 4/380 -- recently upgraded from a 4/280
2 xylogics 7053 smd controllers, with 2 disks per controller
1 xylogics 472 tape controller
1 alm-2 board.
xdc0 is in slot #10
xdc1 is in slot #4
xt0 is in slot #8
alm is in slot #12
Attached is the latest dmesg. Any and all suggestions are greatly appreciated. The machine crashed during a class this afternoon, so we're anxious to track down the problem.
xdc0: controller not responding
xdc0: controller not responding
xdc0: controller not responding
SunOS Release 4.1.1 (griffin) #1: Sat Aug 17 23:01:32 EDT 1991
Copyright (c) 1983-1990, Sun Microsystems, Inc.
cpu = Sun SPARCsystem 300
mem = 32768K (0x2000000)
avail mem = 30760960
Ethernet address = 8:0:20:8:58:92
xdc0 at vme16d32 0xee80 vec 0x44
xd0 at xdc0 slave 0
xd0: <CDC 9720-1230 cyl 1633 alt 2 hd 15 sec 83>
xd1 at xdc0 slave 1
xd1: <CDC 9720-1230 cyl 1633 alt 2 hd 15 sec 83>
xdc1 at vme16d32 0xee90 vec 0xdc1 slave 0
xd4: <NEC D2363 cyl 964 alt 2 hd 27 sec 67>
xd5 at xdc1 slave 1
xd5: <Hitachi DK815-10 cyl 1735 alt 2 hd 15 sec 67>
sm0 at obio 0xfa000000 pri 2
st0 at sm0 slave 32
st1 at sm0 slave 40
st2 at sm0 slave 24
st3 at sm0 slave 16
sd0 at sm0 slave 0
sd0: <HP 97549T cyl 1906 alt 2 hd 16 sec 64>
sd1 at sm0 slave 1
sd2 at sm0 slave 8
sd2: <Fujitsu 2266SA cyl 1650 alt 3 hd 15 sec 85>
sd3 at sm0 slave at xtc0 slave 0
zs0 at obio 0xf1000000 pri 3
zs1 at obio 0xf0000000 pri 3
zs2 at obio 0xe0000000 pri 3
mcp0 at vme32d32 0x1000000 vec 0x8b
le0 at obio 0xf9000000 pri 3
root on xd0a fstype 4.2
swap on xd0b fstype spec size 40462K
dump on xd0b fstype spec size 40432K
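For anyone tracking how often each controller times out, here is a minimal sketch in Python of tallying the error lines from a saved dmesg log. It is an illustration only: the function name `count_not_responding` and the sample text are hypothetical, and the pattern assumes messages look exactly like the `xdc0: controller not responding` line quoted in this posting, one per line.

```python
import re
from collections import Counter

# Assumed message format, copied from the error line quoted in this posting.
PATTERN = re.compile(r"^(xdc\d+): controller not responding$")


def count_not_responding(log_text):
    """Return a Counter mapping controller name to number of timeout messages."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; in practice this would be read from a saved log file.
    sample = (
        "xdc0: controller not responding\n"
        "xdc0: controller not responding\n"
        "xdc0: controller not responding\n"
        "SunOS Release 4.1.1 (griffin) #1: Sat Aug 17 23:01:32 EDT 1991\n"
        "xdc0 at vme16d32 0xee80 vec 0x44\n"
    )
    for controller, count in sorted(count_not_responding(sample).items()):
        print(controller, count)
```

Comparing such counts before and after swapping boards might help show whether the fault follows the controller or stays with the slot.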
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:19 CDT