To recap:
  We were seeing a great number of SCSI bus resets on our new SS10/51
  with the equally new Seagate ST42100N 2.2 GB disk drive.
Summary (the short version):
  It was the SS10 motherboard.
Summary (long version):
  Our disk drive vendor (CITA Technologies) rushed us a replacement
  disk drive which we installed to no avail.  The problem continued.
  We tried new SCSI cables (always the 3' variety of the official Sun
  cables) and got nowhere.  Sun then sent a replacement motherboard
  which we tested with the replacement disk drive.  No go, but with a
  subtly different set of error messages.  We swapped back in the
  original Seagate drive and all has been happy since.  It appeared
  that the combination of the new Sun motherboard and the original
  Seagate drive did the trick.  Kudos to CITA who stayed on top of
  the problem even when we ruled out their equipment.
The Winner:
  Hans van Staveren <sater@cs.vu.nl> who correctly guessed it was
  the cpu board!  Your prize is waiting, but you need to pick it
  up in person.  Sorry, we can't tell you what it is, but it could
  be valuable.
--mark
---
Mark Morrissey                 Intel Corp.
Senior Engineer                Portland, Oregon USA
SNMP/IP kinda guy              +1 503 696 2068
markm@kandinsky.intel.com      #include <disclaimer.std>
"I don't speak for them.  They don't speak for me."
===================================================================
The Replies:
From: Mike Raffety <miker@il.us.swissbank.com>
Sounds like your SCSI bus might be a tad too long.
--
From: jdr@mlb.semi.harris.com (Jim Ray)
Have you checked the SCSI cable? The only possibilities are the SCSI cable, the terminator, or the external drive. We have lots of SPARCstation 10s and have never had a SCSI problem that was due to the internal SCSI bus.
--
From: weingart@inf.ethz.ch
You write:
> esp0: bad sequence step (0x6) in selection
Had this happen to me with one sparc as well. I never figured out *exactly* what the problem was. However, you definitely want an *active* terminator on that external scsi line.
Also, the drive we had this problem with took its problem with it when we exchanged it for a newer one. I believe that some Seagate drives don't match too well with an SS10. Maybe get your vendor to give you an exchange. The drive is likely fine, just not on an SS10, or not on this particular SS10...
--Toby.
--------------------------------------------------------------
|Tobias Weingartner | PGP2.x Public Key available at     |
| +41'01'632'7205   | 'finger weingart@tau.inf.ethz.ch'  |
--------------------------------------------------------------
--
From: Bertil Roslund <bertilr@dit.lth.se>
About the SCSI problems: we've had the same problems with an SS10/512 and _some_ of our LXes. As in your case, the problems only occur when we connect a non-Sun external disk (we've tried a Fujitsu and a Micropolis disk). It also seems to work if we use a slower (not 10 MB/s) external disk.
Unfortunately I have no solution. We told Sun about this about a month ago, and they keep saying that they are working on it.
So what I really wanted to say is this: if you find a solution, could you please let me know?
Thanks.
Bertil Roslund (bertilr@dit.lth.se)
Computer Engineering
Lund University, Sweden
----
From: kevin@uniq.com.au (Kevin Sheehan {Consulting Poster Child})
As usual, I'd guess cables or cable length. Are you using the shortest cables (like 6 inches) that you can? Oftentimes, Sun ships the big long ones with individual units.
l & h, kev
-----
From: Bill Hart <Bill.Hart@ml.csiro.au>
The problem is in your SCSI cabling. You could try an FPT (Forced Perfect Termination) terminator, and possibly upgrade your cables (Sun has ferrite chokes around all of its cables to stop reflections). The other alternative is to switch to slower SCSI, which involves setting scsi_options in the /etc/system file. I don't have the value handy but can find it if you want.
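The reply does not give a scsi_options value. As a hedged sketch only: assuming the bit definitions from Solaris's <sys/scsi/conf/autoconf.h> (they do not appear anywhere in this thread, so check them against the header shipped with your release), the Python below builds a mask with all the listed options set, clears the Fast SCSI bit (or the synchronous bit as well), and formats the result as an /etc/system line. The helper names slower_scsi_options and etc_system_line are hypothetical.

```python
# Bit values ASSUMED from Solaris <sys/scsi/conf/autoconf.h>; they are not
# given in the original reply, so verify them for your OS release.
SCSI_OPTIONS_DR = 0x8       # disconnect/reconnect
SCSI_OPTIONS_LINK = 0x10    # linked commands
SCSI_OPTIONS_SYNC = 0x20    # synchronous transfers
SCSI_OPTIONS_PARITY = 0x40  # parity checking
SCSI_OPTIONS_TAG = 0x80     # tagged command queueing
SCSI_OPTIONS_FAST = 0x100   # Fast SCSI (10 MB/s)
SCSI_OPTIONS_WIDE = 0x200   # Wide SCSI

# Every option above enabled.
ALL_OPTIONS = (SCSI_OPTIONS_DR | SCSI_OPTIONS_LINK | SCSI_OPTIONS_SYNC
               | SCSI_OPTIONS_PARITY | SCSI_OPTIONS_TAG
               | SCSI_OPTIONS_FAST | SCSI_OPTIONS_WIDE)


def slower_scsi_options(disable_fast=True, disable_sync=False):
    """Hypothetical helper: start from ALL_OPTIONS and clear speed bits."""
    opts = ALL_OPTIONS
    if disable_fast:
        opts &= ~SCSI_OPTIONS_FAST
    if disable_sync:
        # Fast SCSI is a synchronous mode, so turning off synchronous
        # transfers also clears the fast bit.
        opts &= ~(SCSI_OPTIONS_SYNC | SCSI_OPTIONS_FAST)
    return opts


def etc_system_line(opts):
    """Hypothetical helper: format the value as an /etc/system 'set' line."""
    return f"set scsi_options={opts:#x}"


if __name__ == "__main__":
    print(etc_system_line(slower_scsi_options()))
    print(etc_system_line(slower_scsi_options(disable_sync=True)))
```

Changes to /etc/system only take effect at the next reboot, so try the less drastic setting (fast off, synchronous on) first.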
Cheers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Bill Hart                          Internet : hart@ml.csiro.au
Network Manager                    Phone    : +61 02 325 442
CSIRO Division of Oceanography     Fax      : +61 02 325 000
Hobart, Tas., 7000 Australia       Paging   : +61 08 001 234 (quote #29474)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-----
From: eckhard@ts.go.dlr.de (Eckhard Rueggeberg)
Simply send the external disk back. It's defective.
Eckhard Rüggeberg eckhard@ts.go.dlr.de
-----
From: Hans van Staveren <sater@cs.vu.nl>
We had similar problems on one SS10. For us the problem was perhaps stranger, since the errors disappeared as termination got worse. At the extreme, running without a terminator worked, which is of course unacceptable in production. When we tried the same SCSI setup on another SS10, it worked.
Eventually Sun changed CPU boards and the problem went away. We are still not sure what was going on. One local hardware guru suggested that the Sun might be sending select pulses that are too short, but this is unconfirmed.
Good luck,
Hans van Staveren
-----
From: Dan Stromberg - OAC-DCS <strombrg@hydra.acs.uci.edu>
About all I can suggest is:
1) Make sure the cable length is short!
2) Use forced-perfect termination, with Centronics plugs.
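For a rough check of total bus length, the Python below adds up every cable segment on the bus, internal and external, and compares the sum with the commonly cited single-ended SCSI-2 limits of 6 m at standard rates and 3 m for Fast SCSI (10 MB/s). Those limits come from general SCSI-2 practice rather than from this thread; the 0.5 m internal run in the example is a hypothetical figure, not a measured SS10 value, and bus_length_ok is a made-up name for illustration.

```python
# Commonly cited single-ended SCSI-2 bus length limits, in metres.
# Assumption: 6 m at standard rates, 3 m for Fast SCSI; check your manuals.
SINGLE_ENDED_LIMIT_M = {"standard": 6.0, "fast": 3.0}

FEET_TO_M = 0.3048


def bus_length_ok(segments_m, mode="fast"):
    """Return (total_length_m, within_limit) for a list of cable segments.

    segments_m should include every piece of cable on the bus: the internal
    ribbon, each external cable, and any cabling inside external enclosures.
    """
    total = sum(segments_m)
    return total, total <= SINGLE_ENDED_LIMIT_M[mode]


if __name__ == "__main__":
    # Hypothetical bus: 0.5 m internal run (a guess), one 3 ft Sun
    # external cable, and 0.3 m of cable inside the drive enclosure.
    segments = [0.5, 3 * FEET_TO_M, 0.3]
    total, ok = bus_length_ok(segments, mode="fast")
    print(f"total {total:.2f} m, within Fast SCSI limit: {ok}")
```

Passing the check does not rule out cabling trouble (poor termination or a marginal device can still cause resets), but failing it is a strong hint to shorten the bus.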
-----
From: John DiMarco <jdd@db.toronto.edu>
In list.sun-managers you write:
Run scsiinfo on your disks (ftp from ftp.cdf.toronto.edu:/pub/scsiinfo).
There may be a scsi cabling problem.
John
--
John DiMarco                                       jdd@cdf.toronto.edu
Computing Disciplines Facility Systems Manager     jdd@cdf.utoronto.ca
University of Toronto                              EA201B,(416)978-1928
-----
From: louis@andataco.com (Dances on keyboards)
A couple things come to mind:
1. Check that internal termination is *not* present on your external drive.
2. Use shorter cables.
-----
From: griffin@lehman.com (al griffin)
Is your controller SCSI or SCSI-2? If SCSI-2, is it BoxHill or Sun? If it is a SCSI-2 BoxHill controller, all the devices must be configured; if they are not, the devices will shift back into the unconfigured slots.
Al
-----
From: ian@sfu.ca
I had similar messages once and found that the disk drive involved had a jumper to control synchronous SCSI. Whether the jumper forced, disabled, or negotiated synchronous transfers I'm afraid I cannot remember (it was some time ago now), but I DO remember that it was the only jumper that had anything to do with synchronous SCSI. When the jumper was removed, the host system was much happier; as long as the jumper was there, I got continuous "reset" messages at boot time.
Hope this helps...
--
Ian Reddy, Systems Consultant            E-mail: Ian_Reddy@sfu.ca
Academic Computing Services, AD1021              ian@sfu.ca
Simon Fraser University               Telephone: (604) 291-3936
Burnaby, B.C. Canada V5A 1S6                Fax: (604) 291-4242
-----
From: ingram@cs.duke.edu (Robert E. Ingram)
Hi. Have you gotten anything on this? We are experiencing similar problems, although at a much less serious level, with a group of SS2s that have Acropolis 2G disks. I just have messages in /var/adm/messages, but I would hate to see this problem grow to the point that we couldn't pass an fsck!
--
Department of Computer Science, Duke University, Durham, NC 27706
Internet: ingram@cs.duke.edu (Robert Ingram, Systems Programmer)
UUCP: mcnc!duke!ingram
-----
From: Phill St-Louis <phill2@hivnet.ubc.ca>
Look inside the external disk's enclosure to see if there are internal terminators. You might want to check if you received a Differential SCSI disk for the external disk.
Phill
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:08:24 CDT