SUMMARY: question: solaris 2.4 scsi problems

From: Michael Moscovitch (michaelm@citenet.net)
Date: Thu Jul 20 1995 - 10:59:08 CDT


I haven't finished trying all the solutions yet, but many people
with similar problems have asked for a summary so I will try
and give you an update of the suggestions I have received.

I tried increasing the timeout on the tape drive. The
timeout errors did not occur but the machine seemed to lock up
after writing to the tape for a while.

It has been suggested that I try rearranging the physical order of the
devices on the scsi bus such that the disk was closest to the host and
the tape was farthest.

Lewis E. Wolfgang recommended the following:

Try using a "Forced Perfect" terminator. This is NOT just
an active terminator, don't let the sales-critters fool you.
Forced Perfect is a brand name and is manufactured by Aeronics,
Inc. They have worked well for us.

John Payne and Shigeki Misawa suggested increasing the timeout for
the tape drive.

According to Shigeki Misawa:

Several suggestions.
First, make sure you have the relevant patches installed (10200x, I
think x = 3).
 
Second, try replacing the following line:
 
EXBT-4200C=1,0x34,1024,0x0239,4,0x63,0x00,0x43,0x00,3;
 
With either :
 
EXBT-4200C=1,0x34,1024,0x1239,4,0x63,0x00,0x43,0x00,3;
                         ^
                         L note the 1
or :

EXBT-4200C=1,0x34,0,0x1239,4,0x63,0x00,0x43,0x00,3;
                      ^
                      L note the 1
 
The 1 (the 0x1000 bit of the options field) lengthens the scsi timeouts.
The change from 1024 to 0 is required for the HP-C1533A 4mm DAT.
The 1024 works in 2.3 but not in 2.4.
 
Third, try disabling command queuing in /etc/system as outlined in the
Solaris FAQ. (Note that this will conflict with the "set scsi_options
= 0x58" setting you placed in your /etc/system file.)

According to Scott Kamin:

Also, the third parameter in the parameter list should be 0
under 2.4. There was a bug in Solaris 2.3 which required a
non-zero entry for the block size field. It has been fixed
in Solaris 2.4. A zero in the third field (under 2.4) indicates
variable block size, which you want.
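
To see what the suggested st.conf changes amount to, the entry can be split
into its fields and the two option words compared bit by bit. The Python
sketch below is only an illustration. The field names and the option-bit
names (for example ST_LONG_TIMEOUTS for 0x1000) are not from the replies
above; they are assumptions based on the st driver documentation and should
be checked against the st man page and the stdef.h header on your release.

```python
# Sketch: split an st.conf tape entry into fields and compare option words.
# Field names and bit names are assumptions, not taken from the postings.

FIELDS = ("version", "type", "block_size", "options", "density_count")

# Assumed names for the option bits that appear in the entries above.
OPTION_BITS = {
    0x0001: "ST_VARIABLE",
    0x0008: "ST_BSF",
    0x0010: "ST_BSR",
    0x0020: "ST_LONG_ERASE",
    0x0200: "ST_KNOWS_EOD",
    0x1000: "ST_LONG_TIMEOUTS",
}


def parse_entry(line):
    """Return (name, dict of fields) for a line like 'NAME=1,0x34,...;'."""
    name, _, values = line.strip().rstrip(";").partition("=")
    # int(text, 0) accepts both decimal ("1024") and hex ("0x34") fields.
    numbers = [int(v.strip(), 0) for v in values.split(",")]
    entry = dict(zip(FIELDS, numbers[:5]))
    count = entry["density_count"]
    entry["densities"] = numbers[5:5 + count]
    entry["default_density"] = numbers[5 + count]
    return name.strip(), entry


def option_names(options):
    """List the known option bits that are set, lowest bit first."""
    return [n for bit, n in sorted(OPTION_BITS.items()) if options & bit]


old = parse_entry("EXBT-4200C=1,0x34,1024,0x0239,4,0x63,0x00,0x43,0x00,3;")[1]
new = parse_entry("EXBT-4200C=1,0x34,0,0x1239,4,0x63,0x00,0x43,0x00,3;")[1]

added = old["options"] ^ new["options"]
print("changed option bits:", hex(added), option_names(added))
print("block size:", new["block_size"] or "variable")
```

The only option bit that differs between the two entries is 0x1000, and a
block size of 0 is reported as variable, which matches the two replies above.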

According to Steve Pyrczak:

You might be having the same problem. I'm forwarding a second fix
from someone else that seems to have worked for us except during a
brownout when the ups ran out, which kind of made sense as the
computer may have come up before the devices were ready. Sun was
not helpful and kept blaming it on specific devices and only giving
general things to look at. We had an SE and CE out here twice under
support and all they did was change boards to no avail. I can't verify
that the other fix I will send will work unless you take general
precautions about cable lengths and bandwidth restrictions.
        Date: 6/29/1995 8:08 am (Thursday)
        From: Steve Pyrczak
          To: unix:tehoyt01@msuacad.morehead-st.edu
     Subject: scsi errors

We were having similar problems on two Suns which were pretty well
packed with devices ( 8mm/qic 150 tapes, WORM, CD, 2 old SCSI
drives). We were stretching the bandwidth and cable length
restrictions, but that wasn't the real problem. We got one Sun tech rep
who sort of knew what the problem was, but got no solution or promise
for a fix. It is definitely a problem with the Sun 2.4 SCSI device drivers
as we ran the same configuration for 2 years under 4.1.3, although our
service calls began after adding new devices which threw us off the
track. Our second machine which did not have any device upgrades
also had the same problem later.

This isn't really a good technical explanation but ...
In short, when the machine boots up the devices that were there before
are recognized automatically ( boot -r rereads the chain ). The device
driver then appears to adjust speeds to the slowest device it meets as it
does its handshaking up the chain. The devices are polled I believe in the
order of their SCSI address ( 0-7) as they are in parallel logically. If you
have an old slow device like our 4 year old worm, the SCSI device driver
adjusts itself to the slower speed and then does not readjust itself
when talking to newer faster devices. An analogy might be an
old-fashioned modem array that rolls over to the next available modem, but
always starts with the first available one on the chain that answers. If
the first available modem was 1200 baud, and the rest were 9600, all
would be spoken to at 1200. If the first was 9600 and a later one was
1200 you would have the opposite problem. Basically the SCSI device
driver cannot adjust speeds correctly. Our own tests proved that if we
turned off the worm ( scsi id 2) and the Sun cd things and did a boot -r
it worked okay, but we needed these devices. I can't really say
whether this was because the total bandwidth requirements for the
chain were reduced or they were the slow devices the device driver
was limited by. We had physical limitations ( an old Sun P-box) that
prevented us from easily changing scsi IDs on all our devices without
having to open them up to change jumpers, as we had to keep the
machine in production and couldn't play with it much. The problem
manifested itself primarily with the 8mm tape drive which either gets a
hard error when it can't communicate or just times out somehow. "SCSI
transport failed: reason 'reset': giving up" and "retrying command" was
exactly the problem. ufsdump would then abort. If we checked the
times of the abort and the transport failures, they almost always
coincided. Other related messages might include "Unable to
install/attach driver 'isp'" and "failed reselection" during reboots. The
only difference between your errors and ours are the physical
addresses between our machines. We also have a high capacity
Exabyte, but our other machine had failures with an older 2GB Exabyte
that worked for years under 4.1.X . There was a patch that provided
new exabyte support, but that didn't help the problem since it wasn't the
exabyte that was the problem but their device driver support.
We were running under 4.1.3 for at least 2 years with the same SCSI
configuration and devices, but didn't have any problems till running 2.4.
Every time we called Sun, they would shoot in the dark and tell us about
some patch that was totally irrelevant. If I told them tar was the
command aborted that last time, they'd tell me to put the tar patch on.
We were usually running ufsdump when it aborted. Rebooting the
machine only resets the device driver temporarily. The problem was
also intermittent. It might run a day or two before failing, but that could
have been because of the combination of devices being accessed at
the time. In most cases, the tape drive failures were cron routines run
at 2 AM and nothing else was running on the machine.
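
Matching abort times against transport failures, as described above, starts
with pulling the relevant messages out of the system log. A minimal Python
sketch, assuming the log is available as plain text lines (on Solaris they
normally land in /var/adm/messages, but the file location and line layout
are assumptions here). The sample lines are made up from the message
fragments quoted above and carry no real timestamps or device paths.

```python
import re

# Message fragments quoted in the text above.
PATTERNS = (
    "SCSI transport failed",
    "retrying command",
    "failed reselection",
    "Unable to install/attach driver",
)

# Pulls the quoted reason out of a transport failure message.
REASON = re.compile(r"SCSI transport failed: reason '([^']+)'")


def scsi_events(lines):
    """Yield (line_number, reason_or_None, text) for matching log lines."""
    for number, text in enumerate(lines, start=1):
        if any(p in text for p in PATTERNS):
            match = REASON.search(text)
            yield number, (match.group(1) if match else None), text.rstrip()


# Hypothetical sample; real lines carry a timestamp and a device path.
sample = [
    "SCSI transport failed: reason 'reset': giving up",
    "unrelated message",
    "retrying command",
]
events = list(scsi_events(sample))
for number, reason, text in events:
    print(number, reason, text)
```

Because the function accepts any iterable of lines, an open log file can be
passed in directly; the matching lines can then be compared by hand with the
times at which ufsdump aborted.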

Our solution may not be applicable to your situation. We had two scsi
controllers in the machine to begin with and were upgrading devices
anyway ( we got new 8mm 5GB tapes, Worms, and a RAID ). The
RAID is now on a controller by itself, and the other devices are faster
speeds and don't have to contend with the disk drives.

If you have the luxury of 2 scsi controllers, you might want to try
distributing the load, perhaps putting all the older devices on one, or at
least redistributing the disk drives. Keep your external cables as short
as possible, and make sure the last device has a newer lighted
small terminator. This is all Sun told us, and is just standard
commonsense type of SCSI practice. Every SCSI device has a
bandwidth; if you have a streaming tape drive especially, you usually
don't put it together with multiple disk drives. If both drives are actively
hitting and the tape drive is running, you will never get the advertised
maximum bandwidth from any of the devices. I'm not sure if changing
the SCSI ID numbers and doing a reboot -r will help because we did not
try that; it's just my theory.

We went around and around with Sun support on this one also. There
are several other similar complaints that are usually hardware specific
on the Sunsolve BBS. Sun appears to put the responsibility on the
specific device, and the user is led to believe the problem is only
happening to them because of their devices. But as I said, we used
identical setups under 4.1.3 for years and all the peripherals were
supported Sun devices. We had boards changed, changed terminators,
moved around SCSI devices, shortened cables, and a bunch of other
things Sun recommended, but it boiled down to a device driver that can't
hack it. I know this is not a real solution, but maybe you won't get
side-tracked into thinking it is because of your devices like we did. If you
get a better answer from Sun than we did I'd appreciate their answer.

Steve also sent a copy of the following summary:

        Date: 6/30/1995 9:46 am (Friday)
        From: RACE.SMTP."tehoyt01@msuacad.morehead-st.edu"
          To: RACE.SMTP("sun-managers@eecs.nwu.edu")
     Subject: SUMMARY: scsi errors on 9.0Gb drive

Ok, to start with...sorry I confused some of you with my fancy artwork.
Everything was in a one line chain but my diagram looked as if it were
somehow split at the cd-rom. Oh well.

I only tried one recommendation that came from two people that reported
they had the exact same problem. Their recommendation worked and I'm
thankful.

It appears that Sun's default scsi options were set so that everything
in the chain was running at 10 MB/s, which was confusing the tape drive and
causing the errors that I thought were being caused by the disk drive.
The fix for this is to create the file:

/kernel/drv/esp.conf

with this line in it:

scsi-options=0x178;

This line seems to fix things and still supports Fast SCSI-2.
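
The value 0x178 can be read as a bitmask. The Python sketch below decodes it
alongside the 0x58 value mentioned earlier for /etc/system. The bit names and
values are not from the postings; they are assumptions based on Sun's
documentation of the scsi-options property and should be verified against
your own release before relying on them.

```python
# Sketch: decode a scsi-options bitmask into named features.
# Bit names and values are assumptions; verify against your system.

SCSI_OPTION_BITS = {
    0x008: "disconnect/reconnect",
    0x010: "linked commands",
    0x020: "synchronous transfer",
    0x040: "parity",
    0x080: "tagged command queuing",
    0x100: "fast SCSI",
    0x200: "wide SCSI",
}


def decode(options):
    """Return the names of the bits set in options, lowest bit first."""
    return [name for bit, name in sorted(SCSI_OPTION_BITS.items())
            if options & bit]


for value in (0x178, 0x58):
    print(hex(value), decode(value))
```

Under these assumed bit assignments, 0x178 keeps fast and synchronous
transfers enabled and leaves the tagged command queuing bit clear, which is
consistent with Fast SCSI-2 still being supported. The 0x58 value clears the
fast and synchronous bits as well.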

I tested the fix by reading a massive data run through our astronomy
application as well as tar -xvf <really big file>.tar in another window.
I had no errors and everything seemed to work fine.

There were many letters saying I needed to change terminators and shorten
my cables. While this is still true, for the cables at least, it seems
the above fix did the trick. Performance may not be the best but it
works and that's all I can ask for at this point.

Many thanks to:

cbarker@cp.tybrin.com
sitongia@ncar.ucar.edu
Steve_Pyrczak@racesmtp.afsc.noaa.gov
rali@hri.com
tl@gam0.phy.anl.gov
b.king@surrey.ac.uk
cus1hl@surrey.ac.uk
johnm@ntl.co.nz
icarr37863@aol.com
chrisc@Chris.Org
Pell@lysator.liu.se
andreww@adacel.com.au
kevin@uniq.com.au
Rainer.Ullrich@mch.sni.de
john@float.demon.co.uk
Jeff Marble

--
Travis Hoyt <tehoyt01@msuacad.morehead-st.edu>

"The world will be a much better place when the power of love outweighs the love of power" - Gerry Spence


Many thanks to:

Steve_Pyrczak@racesmtp.afsc.noaa.gov
wolfgang@sunspot
misawa@physics.Berkeley.EDU
johnp%baldric@dshroot.co.dsh.oz.au
Scott.Kamin@Central.Sun.COM

--
+---------------------------------------------------------------------------+
| Michael Moscovitch                                   CiteNet Telecom Inc. |
| Tel: (514) 861-5050                                                       |
| michaelm@citenet.net                              #include <disclaimer.h> |
+---------------------------------------------------------------------------+


