SUMMARY: disconnected tagged cmds crashing server on seagate disk

From: Patrick McCook (cookie97@flash.net)
Date: Tue Mar 10 1998 - 19:58:58 CST


Thanks for all of the helpful replies. Unfortunately, none of them
offered an option that we had not already tried. One post, however,
suggested that the Seagate drive is not 100% compatible with the fast
SCSI 2 command set. We may purchase a Quantum drive as an alternate
test. I neglected to mention that our platform is Solaris 2.5.1 on an
Ultra Enterprise 4000.

Here are the responses:

========================================================================
We are also having the same problem with Seagate Barracuda 4.3GB drives
(ST34572WC). I've been playing with it for about three months now.

This is a very common problem; I've seen a lot of people ask this same
question in the last four months, and I'm surprised that there doesn't
seem to be a deterministic answer. I have tried everything from hacking
/etc/system from here to breakfast, removing extra drives, and like you,
swapping many parts. The system had its best life when the CDROM drive
was disconnected (20 days continuous uptime). But it seems to be
random: sometimes hours, sometimes days.

I'm out of ideas, and am now down to grasping at straws, IMHO: checking
grounds, power, heat, etc.

        Patrick Rigney <patrick@evocative.com>

===================================================================================
You could try disabling tagged command queueing in your kernel.
Edit /etc/system and add the following:

* turn off command tag queueing to make the old 3GB drives work
set scsi_options & ~0x80

You will have to reboot to have this change take effect.

        Chris Marble <cmarble@orion.ac.hmc.edu>
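
For anyone unsure what the "& ~0x80" mask does, here is a minimal
sketch in Python. Only the 0x80 bit comes from the reply above; the
name TAG_QUEUEING_BIT and the starting value 0x3f8 are assumptions made
up for illustration, not values read from any system.

```python
# Bit that the reply above clears in scsi_options (tagged command queueing).
# The constant name is chosen for this sketch only.
TAG_QUEUEING_BIT = 0x80


def clear_tagged_queueing(scsi_options: int) -> int:
    """Return scsi_options with only the tagged-queueing bit cleared."""
    return scsi_options & ~TAG_QUEUEING_BIT


# 0x3f8 is an assumed example value, used only to show the arithmetic.
example = 0x3F8
result = clear_tagged_queueing(example)
print(hex(example), "->", hex(result))
```

Every other bit keeps its value, which is why the "&" form is used
rather than assigning a whole new value to scsi_options.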

==================================================================================
 
Disable the tagged command queueing feature in
the kernel's SCSI configuration files.

        bismark@alta.Jpl.Nasa.Gov (Bismark Espinoza)

=================================================================================
We've seen tagged queuing and tagged command timeouts a number of times.
The following might help you out (from Sunsolve I recall):

------------- Begin Forwarded Message -------------

...in Solaris, when the disk controller is fully populated with targets
or has very fast disks (e.g., RAID devices), commands can be queued
up too fast (reaching the limit of 256) for the sd driver to handle.
Once this happens, tagged command timeouts/retries or SCSI
transport failure messages are often displayed:

-> WARNING: /io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000 (esp1):
-> Disconnected tagged cmds (1) timeout for Target 1.0
-> WARNING: /io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@1,0 (sd16):
-> Error for command 'write' Error Level: Retryable
-> WARNING: /io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@3,0 (sd3):
-> SCSI transport failed: reason 'timeout': retrying command
-> WARNING: /io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@3,0 (sd3):
-> unix: SCSI transport failed: reason 'incomplete': retrying command

Setting sd_max_throttle to a much smaller value than the default of 256
can fix the problem.

To what value should sd_max_throttle be set? That depends on how many
SCSI targets are in the system. Keeping the total number of queued
commands below 100 is a workable rule. For example, with 6 fast SCSI
targets and sd_max_throttle set to 16, the total queued commands can
be 6 x 16 = 96. If tagged command timeouts are still seen, then in
/etc/system:

   set sd:sd_max_throttle = 16

PRODUCT AREA: Kernel
PRODUCT: Config
SUNOS RELEASE: Solaris 2.4
HARDWARE: any

------------ End Forwarded Message -------------

I added the following to the machine's /etc/system file, followed by a
reboot:

* Solaris sd driver tag queueing problems/sd_max_throttle (default=256)
* Solution: set sd_max_throttle, in /etc/system, to a lower value
* Total value is this value x no. of SCSI targets:
                set sd:sd_max_throttle = 16

        Kitty Ferguson <ferguson@jabba.hao.ucar.edu>
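
The sizing rule in the forwarded message (total queued commands =
sd_max_throttle x number of SCSI targets, kept under 100) can be
sketched as a small calculation. This is only an illustration in
Python; the function names are hypothetical, and the limit of 100 is the
rule of thumb quoted above, not a hard kernel limit.

```python
def total_queued(targets: int, throttle: int) -> int:
    """Worst-case queued commands: per-target throttle times target count."""
    return targets * throttle


def max_throttle(targets: int, limit: int = 100) -> int:
    """Largest throttle that keeps targets * throttle strictly below limit."""
    if targets < 1:
        raise ValueError("need at least one SCSI target")
    return (limit - 1) // targets


# The example from the forwarded message: 6 fast SCSI targets.
throttle = max_throttle(6)
print(throttle, total_queued(6, throttle))
```

With 6 targets this gives the same figures as the forwarded message: a
throttle of 16 and a total of 96 queued commands.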

==================================================================================
There are known issues around the tagged queueing. Some people turn it
off, some throttle it down. Do you have access to Sunsolve? Do you
know where the sun-managers archives are? Here's what we do, in
/etc/system:

*
* Solaris sd driver tag queueing problems/sd_max_throttle (default=256)
* Solution: set sd_max_throttle, in /etc/system, to a lower value
* Total value is this value x no. of SCSI targets:
*
                set sd:sd_max_throttle = 16

        Leonard Sitongia <sitongia@jabba.hao.ucar.edu>

==================================================================================
If you've already swapped controllers and cables, it can only be a few
other things.

1. Your SCSI chain is too long....

2. Your SCSI chain is not terminated properly. For long chains you
might try a Forced Perfect Terminator.

3. The drive itself is bad.

4. Another device on the chain is causing too much noise within the
chain. Take off every device that is not needed to see if the problem
still occurs with the internal SCSI target 0 device.

        john@starinc.com (John Malick)

=================================================================================
        I've got the same problem, but not to that degree. Usually the
machine will hang for about 2 mins (the timeout length), then pop the
error up and continue normal operation... If you do find an answer to
this problem, please post it in a summary or let me know what you did
to fix it. Thanks!

        Stephen Frost <sfrost@mitretek.org>
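
To judge how often the timeouts hit and which targets are affected, the
"Disconnected tagged cmds" lines quoted earlier in this summary can be
counted per target. The following is a rough Python sketch, not a
tested tool: the function name is hypothetical, the pattern is built
only from the sample message shown above, and feeding it lines from
/var/adm/messages is an assumption about where the warnings are logged.

```python
import re
from collections import Counter

# Pattern derived from the sample line quoted earlier in this summary:
#   "Disconnected tagged cmds (1) timeout for Target 1.0"
TIMEOUT_RE = re.compile(
    r"Disconnected tagged cmds \((\d+)\) timeout for Target (\d+)\.(\d+)"
)


def count_timeouts(lines):
    """Count tagged-command timeout messages per 'target.lun' string."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[f"{match.group(2)}.{match.group(3)}"] += 1
    return counts


sample = [
    "WARNING: /io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000 (esp1):",
    "Disconnected tagged cmds (1) timeout for Target 1.0",
    "SCSI transport failed: reason 'timeout': retrying command",
]
for target, hits in count_timeouts(sample).items():
    print(target, hits)
```

If one target dominates the counts, that points at a single drive; if
the timeouts are spread across targets, the bus or the throttle setting
is the more likely suspect.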

============================================================================
Your problem sounds like your disk does not fully support the fast SCSI
2 command set (which includes tagged queuing). When the Sun comes up
and detects a fast SCSI 2 device, the Sun assumes full compliance. You
can turn off tagged queuing on the Sun by putting:

set scsi_options & ~0x80

in your /etc/system file and then rebooting. This tells the Sun not
to use tagged queuing on the SCSI bus. There is a way to disable it
on a per-device basis, but I don't recall it off the top of my
head. You can check the archives for that one. What I do recall is
that it involves editing /kernel/drv files.

Hope this helps.

        Tom Doong <doong@tomsnet.com>

===============================================================================
Patrick McCook
cookie97@flash.net
Systems Administrator
Flashnet Communications


