The consensus on the problem described below (confirming my own
thoughts) was that the SCSI chain is simply too long.
I've re-formatted and analysed the disk, and that shows no problems
whatsoever. The root file system was recovered from tape, and I put
the system back on line. We've had no further problems, but we HAVE
placed an immediate order for a second SCSI controller which will be
arriving in the next few days and should hopefully clear up the
problems for good.
The range of the replies suggests that there is a great deal of
confusion as to exactly how to manage long SCSI chains. Everyone
agrees that the chain should be as short as possible, and that the
wide drives should be first in the chain, but some issues are just
not well understood.
1) Does a wide-to-narrow SCSI adapter cable properly terminate the
'wide' pins which go no further? This was something I was worried
about from the moment we fitted it, since I suspect not.
2) Someone suggested we synchronise the RPMs of the disk drives on
the chain. Was this a serious suggestion?!
3) Someone said that the speed of the chain is totally dependent
upon the speed of the slowest device on that chain. I don't think
this is true, because the SCSI adapter negotiates transfer rates
with each device entirely separately. (Unlike IDE.)
4) Someone suggested disabling tagged command queueing, but I don't
think this would help. All the disks can handle tagged command
queueing okay with the exception of the CDROM and the Tape drive.
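To illustrate point 3, here is a minimal sketch of per-target negotiation. It is a toy model and not the fas driver's logic: the device names and MB/s figures are assumed round numbers, and it ignores the fact that a slow device still occupies the shared bus for longer while it transfers.

```python
# Toy model of per-target sync negotiation on a shared SCSI bus.
# All rates (MB/s) are illustrative assumptions, not measurements.

def negotiate(host_max, target_max):
    """Each host/target pair settles on the highest rate both support."""
    return min(host_max, target_max)

def negotiated_rates(host_max, targets):
    """Negotiate with every target independently of the others."""
    return {name: negotiate(host_max, rate) for name, rate in targets.items()}

if __name__ == "__main__":
    # Hypothetical chain: one fast wide disk, one slow narrow CD-ROM.
    chain = {"wide_disk": 20.0, "narrow_disk": 10.0, "cdrom": 5.0}
    for name, rate in negotiated_rates(20.0, chain).items():
        print(f"{name}: {rate} MB/s")
```

In this model the slow CD-ROM does not pull the wide disk's negotiated rate down; each target ends up with its own figure.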
Thanks to the following for useful input:
Al Hopper firstname.lastname@example.org
Brad Young email@example.com
Kevin Sheehan firstname.lastname@example.org
Rich Smith email@example.com
--
# -+- Matthew Reynolds, Contract Research Assistant -+- #
# -+- Aston Space Geodesy -+- #
# Email: reynolmd @ sun.aston.ac.uk Web: http://www.sat.aston.ac.uk/ #
# Phone: +44 (0)121-359-3611 x4552 Fax: +44 (0)121-333-3389 #
------- Start of forwarded message -------
Return-path: <firstname.lastname@example.org>
Date: Wed, 20 Jan 1999 13:01:00 GMT
From: Matt Reynolds <email@example.com>
To: firstname.lastname@example.org
CC: Matt Reynolds <email@example.com>, Phil <firstname.lastname@example.org>
Subject: Ultra 2/170 2GB root disk faulty?
I'm looking for some advice on the following problem - hopefully someone can confirm my theories on this.
We've got an Ultra 2/170 workstation with Solaris 2.6 and 105181-04 kernel patch. The internal disk is a 2.1 GB Seagate (I think!). This machine has a FULL SCSI chain, all external apart from the boot disk:
c0t0d0  SEAGATE ST32550W SUN2.1G   2G   (SCSI II, wide)    ****
c0t1d0  SEAGATE ST118273W          18G  (SCSI III, wide)
c0t2d0  MICROP 1991-27 1128RQ      9G   (SCSI II, wide)
c0t3d0  SEAGATE ST12400N SUN2.1G   2G   (SCSI II, narrow)
c0t5d0  MICROP 1991-27 1128RF      9G   (SCSI II, narrow)
c0t6d0  Plextor 4x CDROM drive          (SCSI I, narrow)
c0t4d0  HP 12GB DAT C1537A              (SCSI II, narrow)
The wide disks are first on the chain, and the narrow drives are at the end. The chain is terminated with a Sun narrow SCSI terminator. External cabling does not exceed 4m. The internal disk is marked **** in the above list.
None of the external drives have reported problems to the system. Yesterday, I received the following error message:
Jan 19 14:54:38 geodesy
: WARNING: /sbus@1f,0/SUNW,fas@e,8800000 (fas0):
:     Target 0 reducing sync. transfer rate
: WARNING: /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0 (sd0):
:     Error for Command: write(10)    Error Level: Retryable
:     Requested Block: 1070416    Error Block: 1070416
:     Vendor: SEAGATE    Serial Number: 02540364
:     Sense Key: Aborted Command
:     ASC: 0x47 (scsi parity error), ASCQ: 0x0, FRU: 0x3

Jan 19 14:55:51 geodesy
: fas: 0.0: cdb=[ 0x2a 0x0 0x0 0x10 0x54 0x50 0x0 0x1 0x0 0x0 ]
: fas: 0.0: cdb=[ 0x2a 0x0 0x0 0x10 0x56 0x50 0x0 0x1 0x0 0x0 ]
: fas: 0.0: cdb=[ 0x2a 0x0 0x0 0x10 0x57 0x50 0x0 0x1 0x0 0x0 ]
: fas: 0.0: cdb=[ 0x8 0x7 0xf1 0xee 0x2 0x0 ]
: fas: 0.0: cdb=[ 0x2a 0x0 0x0 0x10 0x55 0x50 0x0 0x1 0x0 0x0 ]
: WARNING: /sbus@1f,0/SUNW,fas@e,8800000 (fas0):
:     Disconnected tagged cmd(s) (5) timeout for Target 0.0
: WARNING: /sbus@1f,0/SUNW,fas@e,8800000/sd@0,0 (sd0):
:     SCSI transport failed: reason 'timeout': retrying command
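For anyone reading the cdb lines above, here is a rough sketch of how they can be decoded by hand. It assumes the standard SCSI layouts for the two opcodes that appear (0x2a WRITE(10) and 0x08 READ(6)); the helper names are made up for illustration, and any other opcode is simply reported as unknown.

```python
# Sketch: decode the two CDB types seen in the fas timeout messages.
# Helper names are hypothetical; only WRITE(10) and READ(6) are handled.

def parse_cdb(text):
    """Turn '0x2a 0x0 0x0 ...' (the text between the brackets) into ints."""
    return [int(tok, 16) for tok in text.split()]

def decode_cdb(cdb):
    """Return (command name, starting block, block count)."""
    op = cdb[0]
    if op == 0x2A and len(cdb) == 10:
        # WRITE(10): bytes 2-5 are the block address, bytes 7-8 the length.
        lba = int.from_bytes(bytes(cdb[2:6]), "big")
        length = int.from_bytes(bytes(cdb[7:9]), "big")
        return ("WRITE(10)", lba, length)
    if op == 0x08 and len(cdb) == 6:
        # READ(6): 21-bit block address; a length byte of 0 means 256 blocks.
        lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
        length = cdb[4] or 256
        return ("READ(6)", lba, length)
    return ("unknown", None, None)

if __name__ == "__main__":
    for line in ("0x2a 0x0 0x0 0x10 0x55 0x50 0x0 0x1 0x0 0x0",
                 "0x8 0x7 0xf1 0xee 0x2 0x0"):
        print(decode_cdb(parse_cdb(line)))
```

Decoded this way, the write whose address bytes are 0x10 0x55 0x50 starts at block 1070416, which is the same block reported as "Requested Block" in the earlier parity error, and the four writes cover consecutive 256-block ranges.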
This disk is now running at 11.430MB/sec SCSI transfer rate, which is SCSI I,wide. 'scsiinfo' reports that the drive is 'noisy'.
Errors have been reported on this disk previously, but they coincided with a cooling fan failing in one of the external drives, causing problems on the SCSI chain, so I assumed this was why the internal disk was re-syncing at a slower SCSI transfer rate.
These error messages seem to have coincided with a problem with a system directory. Users discovered that they could no longer compile software on the system, and I traced this to the following directory:
/usr/ccs/lib:
[snip]
?---------   0 root     root              0 Jan  1  1970 libcurses.a
?---------   0 root     root              0 Jan  1  1970 libform.a
?---------   0 root     root              0 Jan  1  1970 libgen.a
?---------   0 root     root              0 Jan  1  1970 libl.a
?---------   1 root     other    4294967297 Jan  1  1970 libld.so.2
-rwxr-xr-x   1 bin      bin          110696 May  5  1998 liblddbg.so.4
?---------   0 root     root              0 Jan  1  1970 libmalloc.a
?---------   0 root     root              0 Jan  1  1970 libmenu.a
[snip]
Obviously, the directory itself has been overwritten with zeroes. I'll recover this directory from tape, but my question is (finally):
Could the problem be:
1) the length of the SCSI chain (i.e. get another SCSI card), or is it more likely to be (as I suspect)
2) a problem root disk (i.e. replace the root disk and the problems will go away).
Apologies for the length of this posting, but there is a lot of information which I feel is relevant to this problem.
Thanks in advance,
M. Reynolds (part time sys-admin looking for a job!).
--
# -+- Matthew Reynolds, Contract Research Assistant -+- #
# -+- Aston Space Geodesy -+- #
# Email: reynolmd @ sun.aston.ac.uk Web: http://www.sat.aston.ac.uk/ #
# Phone: +44 (0)121-359-3611 x4552 Fax: +44 (0)121-333-3389 #
------- End of forwarded message -------