Thanks to all those who responded to my query. We have it working now.
The problems turned out to be with the Tagged Command Queueing.
We tried setting the MAX_THROTTLE to 16, but the errors continued.
Turning Tagged Command Queueing off resolved the problem.
The specific fix was to create /kernel/drv/fas.conf with the following entry:
scsi-options=0x378;
We will continue testing with MAX_THROTTLE set to lower values to see
if we can enable tagged queueing at all.
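
For anyone checking the arithmetic, here is a small Python sketch of how the 0x378 value relates to the /etc/system lines quoted in one of the responses further down (start from 0x3f8, then clear bit 0x80). That 0x80 is the tagged queueing bit is taken from that response; the constant and function names are made up for illustration only.

```python
# Illustrative only: the names below are hypothetical, not from any Sun header
# we have checked. The 0x80 value comes from the "& ~0x80" line in the replies.
TAGGED_QUEUEING_BIT = 0x80


def clear_tagged_queueing(options):
    """Return the options value with the tagged queueing bit cleared."""
    return options & ~TAGGED_QUEUEING_BIT


def tagged_queueing_enabled(options):
    """True if the tagged queueing bit is set in the options value."""
    return bool(options & TAGGED_QUEUEING_BIT)


if __name__ == "__main__":
    base = 0x3f8  # value set by the first /etc/system line in the reply
    result = clear_tagged_queueing(base)
    print(hex(base), "->", hex(result))
    print("tagged queueing enabled:", tagged_queueing_enabled(result))
```

Starting from 0x3f8 and clearing 0x80 gives the same 0x378 that we put in fas.conf.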
Original message:
------------------------------------------------------------
We have several systems as follows:
Ultra 2 computer
Sunswift F/W SCSI
4 to 8 IBM 36GB LVD disk drives
striped with DiskSuite
2 AIT tape drives
Solaris 2.5.1
The systems are behaving poorly with random timeouts on the scsi bus
(which reset the tape drives). We have changed every cable, terminator and
scsi card. All the computer systems are getting similar timeouts.
Has anyone else had problems with IBM 36GB LVD drives? Any suggestions?
------------------------------------------------------------
Here are some of the error messages we were getting:
Oct 17 16:49:22 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 16:49:22 mars unix: Disconnected tagged cmd(s) (14) timeout for Target 13.0
Oct 17 16:49:22 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@d,0 (sd42):
Oct 17 16:49:22 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 17:58:42 mars unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000 (fas1):
Oct 17 17:58:42 mars unix: Disconnected tagged cmd(s) (18) timeout for Target 10.0
Oct 17 17:58:42 mars unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@a,0 (sd24):
Oct 17 17:58:42 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:00:23 mars unix: Connected command timeout for Target 9.0
Oct 17 18:00:23 mars unix: fas: Cmd dump for Target 9 Lun 0:
Oct 17 18:00:23 mars unix: fas: cdb=[ 0xa 0x6 0x41 0x60 0x70 0x0 ]
Oct 17 18:00:23 mars unix: fas: State=SELECT Last State=FREE
Oct 17 18:00:23 mars unix: fas: pkt_state=0x0 pkt_flags=0x4000 pkt_statistics=0x60
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@9,0 (sd38):
Oct 17 18:00:23 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@9,0 (sd38):
Oct 17 18:00:23 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@b,0 (sd40):
Oct 17 18:00:23 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@d,0 (sd42):
Oct 17 18:00:23 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@f,0 (sd44):
Oct 17 18:00:23 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:02:43 mars unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000 (fas1):
Oct 17 18:02:43 mars unix: Disconnected tagged cmd(s) (25) timeout for Target 14.0
Oct 17 18:02:43 mars unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000/sd@e,0 (sd28):
Oct 17 18:02:43 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:04:14 mars unix: Connected command timeout for Target 11.0
Oct 17 18:04:14 mars unix: fas: Cmd dump for Target 11 Lun 0:
Oct 17 18:04:14 mars unix: fas: cdb=[ 0xa 0x8 0x73 0xb0 0x70 0x0 ]
Oct 17 18:04:14 mars unix: fas: State=SELECT Last State=FREE
Oct 17 18:04:14 mars unix: fas: pkt_state=0x0 pkt_flags=0x4000 pkt_statistics=0x60
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@9,0 (sd38):
Oct 17 18:04:14 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@b,0 (sd40):
Oct 17 18:04:14 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@b,0 (sd40):
Oct 17 18:04:14 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@d,0 (sd42):
Oct 17 18:04:14 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@f,0 (sd44):
Oct 17 18:04:14 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:09:15 mars unix: Connected command timeout for Target 11.0
Oct 17 18:09:15 mars unix: fas: Cmd dump for Target 11 Lun 0:
Oct 17 18:09:15 mars unix: fas: cdb=[ 0xa 0x11 0xcc 0xb0 0x70 0x0 ]
Oct 17 18:09:15 mars unix: fas: State=SELECT Last State=FREE
Oct 17 18:09:15 mars unix: fas: pkt_state=0x0 pkt_flags=0x4000 pkt_statistics=0x60
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@9,0 (sd38):
Oct 17 18:09:15 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@b,0 (sd40):
Oct 17 18:09:15 mars unix: SCSI transport failed: reason 'timeout': retrying command
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@b,0 (sd40):
Oct 17 18:09:15 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@d,0 (sd42):
Oct 17 18:09:15 mars unix: SCSI transport failed: reason 'reset': retrying command
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000/sd@f,0 (sd44):
Oct 17 18:09:15 mars unix: SCSI transport failed: reason 'reset': retrying command
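
With this many messages it helps to tally which controller and target the timeouts hit. The following is a rough Python sketch that does so for messages in the format above. The regular expressions are written against these sample lines only and are an assumption about the format, not a general fas driver log parser; the function name is hypothetical.

```python
import re
from collections import Counter

# A few of the lines from the log above, used as sample input.
SAMPLE = """\
Oct 17 16:49:22 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 16:49:22 mars unix: Disconnected tagged cmd(s) (14) timeout for Target 13.0
Oct 17 17:58:42 mars unix: WARNING: /sbus@1f,0/SUNW,fas@0,8800000 (fas1):
Oct 17 17:58:42 mars unix: Disconnected tagged cmd(s) (18) timeout for Target 10.0
Oct 17 18:00:23 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:00:23 mars unix: Connected command timeout for Target 9.0
Oct 17 18:04:14 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:04:14 mars unix: Connected command timeout for Target 11.0
Oct 17 18:09:15 mars unix: WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):
Oct 17 18:09:15 mars unix: Connected command timeout for Target 11.0
"""

# Controller warning line, e.g. "... WARNING: /sbus@1f,0/SUNW,fas@2,8800000 (fas2):"
CONTROLLER = re.compile(r"WARNING: \S+ \((fas\d+)\):\s*$")
# Timeout line that follows a controller warning.
TIMEOUT = re.compile(
    r"(Disconnected tagged cmd\(s\) \(\d+\)|Connected command)"
    r" timeout for Target (\d+)\.(\d+)"
)


def tally_timeouts(text):
    """Count timeouts keyed by (controller, target, kind)."""
    counts = Counter()
    controller = None
    for line in text.splitlines():
        match = CONTROLLER.search(line)
        if match:
            controller = match.group(1)  # remember the most recent controller
            continue
        match = TIMEOUT.search(line)
        if match:
            kind = "tagged" if match.group(1).startswith("Disconnected") else "connected"
            counts[(controller, int(match.group(2)), kind)] += 1
    return counts


if __name__ == "__main__":
    for (ctrl, target, kind), n in sorted(tally_timeouts(SAMPLE).items()):
        print(ctrl, "target", target, kind, n)
```

To run it against a real log, read the file (for example /var/adm/messages) into a string and pass it to tally_timeouts.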
============================================================
Here are a few of the key responses that we got. There were also many
others that were helpful.
------------------------------------------------------------
From: Kitty Ferguson <ferguson@hao.ucar.edu>
Subject: Re: 36 GB LVD disk drives
Pete,
Have you set the sd_max_throttle down? Depends on what timeout errors you're
talking about...
Re: Errors concerning Disconnected tagged cmds (1) timeout could have to do with
disk speed, with the fix being:
============================================================
...in Solaris, when the disk controller is fully populated with targets
or having very fast disks (e.g., RAID devices), commands can be queued
up too fast (and reach the limit of 256) for sd driver to handle.
Once this condition is met, tagged command timeouts/retries or SCSI
transport failure messages often are displayed:
-> WARNING: /io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000 (esp1):
-> Disconnected tagged cmds (1) timeout for Target 1.0
-> WARNING: /io-unit@f,e1200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@1,0 (sd16):
-> Error for command 'write' Error Level: Retryable
-> WARNING: /io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@3,0 (sd3):
-> SCSI transport failed: reason 'timeout': retrying command
-> WARNING: /io-unit@f,e0200000/sbi@0,0/dma@0,81000/esp@0,80000/sd@3,0 (sd3):
-> unix: SCSI transport failed: reason 'incomplete': retrying command
Setting sd_max_throttle to use a much smaller value, such as < 256, can fix
the problem.
To what value should sd_max_throttle be set? That depends on how many SCSI
targets are in the system. To have total queued commands < 100 can be a
workable rule (e.g., if there are 6 fast SCSI targets), and if sd_max_throttle
is set to be 16, the total queued commands can be 96. If tagged command
timeouts still are seen, then in /etc/system:
set sd:sd_max_throttle = 16
PRODUCT AREA: Kernel
PRODUCT: Config
SUNOS RELEASE: Solaris 2.4
HARDWARE: any
------------- End Forwarded Message -------------
We added the following to the system's /etc/system file, followed by a
reboot:
* Solaris sd driver tag queueing problems/sd_max_throttle (default=256)
* Solution: set sd_max_throttle, in /etc/system, to a lower value
* Total value is this value x no. of SCSI targets:
set sd:sd_max_throttle = 16
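
The rule of thumb in the forwarded note (per-target throttle times number of targets, kept below 100) can be worked through with a short Python sketch. The limit of 100 and the 6-target example come from the note; the function names and the other target counts are only illustrative assumptions.

```python
def total_queued(throttle, targets):
    """Total queued commands: per-target throttle times number of targets."""
    return throttle * targets


def max_throttle_for(targets, limit=100):
    """Largest per-target throttle that keeps the total strictly below limit."""
    if targets < 1:
        raise ValueError("need at least one target")
    return (limit - 1) // targets


if __name__ == "__main__":
    # 6 targets is the example from the note; 4 and 8 are illustrative.
    for targets in (4, 6, 8):
        throttle = max_throttle_for(targets)
        print(targets, "targets: throttle", throttle,
              "total", total_queued(throttle, targets))
```

With 6 targets this gives the 16 used above (total 96). With 8 targets a throttle of 16 would total 128, so the same rule would suggest a lower value such as 12.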
Kitty
--Kitty Ferguson System Administrator - CSMT
ferguson@hao.ucar.edu NCAR - High Altitude Observatory
tel: (303)497-1556 P.O. Box 3000
fax: (303)497-1589 Boulder, CO 80307-3000
------------------------------------------------------------
Date: Tue, 19 Oct 1999 17:16:42 -0400
From: "Craig H. Anderson" <craiga@ggise.com>
Subject: Re: 36 GB LVD disk drives
I had similar problems with multiple Fast-Wide drives on my Solaris 2.5
box (SPARC-20). You may want to try turning tagged command queueing
off.
You can do so in your /etc/system file with the following lines (what I
use):
set scsi_options = 0x3f8
set scsi_options & ~0x80
See SRDB ID: 10254 on sunsolve.sun.com for details.
Good luck,
Craig
------------------------------------------------------------------------
| Craig H. Anderson                             | craiga@ggise.com     |
| Systems Administrator-------------------------+----------------------|
| Genesis Group /                               | www.ggise.com        |
| Biological Research Associates                | www.biolresearch.com |
|----------------------------------------------------------------------|
| Voice/(813) 620-4500 FAX/(813) 620-4980 Tampa, FL 33619              |
------------------------------------------------------------------------
--
Pete Alleman       C & C Technologies, Inc.   Phone: 318-261-0660
Chief Scientist    Lafayette, LA, USA         Fax: 318-261-0192