I posted asking for info on read errors on my IPI system disk
(appended below).  If you have ~2 year old IPI I/O systems on Suns it
is time to start worrying about this one.  I had 20 replies overnight
from people who have/had this problem, including a summary of a long
thread in the Internet news group comp.sys.sun.hardware.  All said
that a reformat was required, though several thought the fact that the
problem was limited to a single partition suggested some specific bad
spots on the drive as well.
There seems to be a disk controller problem.  Sun reformats the drive
and upgrades to a Rev 4 disk controller to solve this.  If you don't
upgrade the controller it will be out of sync again in 6-10 months and
you get to reformat every 6 months forever.  Our maintainers said
there was some sort of cache area on the Sun controller that wasn't
functioning properly and was a major player in the sorts of transfers
that happen on the /usr partition, but not on the swap partition
(which logs no errors but is on the same drive).  A couple of repliers
thought it had to do with head-tracking getting out of sync with the
controller.  I do software; I don't understand any of it.
My third-party maintainer is gonna be on site with an up-to-rev
controller card the first part of July when we go to Solaris (2.2 will be
out and Matlab will be available :-).  We will swap out the
controller, analyse and reformat the disk, and load the new system.
This looks like it will minimize the chances that I will have to do
the rebuild twice.
The list comes through again.  Clearly the best around.  Thanks to the
following.  They are still coming in, but the pattern seems to be clear:
From: jsm@cirrus.com (John Mizzi)
From: B.Rea@csc.canterbury.ac.nz
From: strombrg@hydra.acs.uci.edu
From: "Jim Phillips GE-AIT Workstation System Manager, workstations are us!!" <PHILLIPS@syr.ge.com>
From: tom@yac.llnl.gov
From: Doug Neuhauser <doug@perry.berkeley.edu>
From: Keith Pilotti <kfp@qualcomm.com>
From: p.elliott@trl.OZ.AU (Paul Elliott)
From: markk@internic.net (Mark Kosters)
From: jor@ts.se (Joakim Rastberg)
From: leif@control.lth.se (Leif Andersson)
From: dennett@Kodak.COM (Charles R. Dennett)
From: miesch@bns101 (Ed Miesch ph2493)
From: anonymous
From: daniel@CANR.Hydro.Qc.CA (Daniel Hurtubise)
From: "George E. Turner" <geturner@sol.aer.com>
From: barnes@sde.mdso.vf.ge.com (Barnes William)
From: kd@redwood.cray.com (Kevin Drysdale)
From: Christian Lawrence <cal@soac.bellcore.com>
From: dbare@baosc.com (Dennis Bare)
From: syntllct!zenor@uunet.UU.NET (Jim Zenor - x2871)
I will send the mail folder to anyone who sends me email requesting it.
========= original posting ==========================
On our Sparcstation 4/470, running SunOS 4.1.1 and serving 6 diskless clients
plus user file storage for another 10 machines, I am logging read
errors on the /usr partition on the IPI drive on my file server:
... vmunix: id000g: block 99184 (817720 abs): read: Conditional Success.\
     Data Retry Performed. 
o I have logged ~800 errors in the last 30 days on 350 different blocks.
o I have tried renaming the files containing the bad blocks and putting in
    fresh copies of the bad files.  The error rate continues about the same.
o No other partition on the drive is affected.
o Block numbers are widely scattered.
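
For anyone who wants to check their own logs for the same pattern, here
is a rough sketch of how figures like those above (total errors, distinct
blocks, which partitions are affected) could be tallied.  It is not the
method used for this posting.  The regular expression is modelled only on
the one message quoted above, and the function names count_read_errors
and summarize are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled only on the single message quoted in this posting;
# other controller messages may be worded differently and would be skipped.
ERROR_RE = re.compile(
    r"vmunix: (?P<dev>id\w+): block (?P<block>\d+) \((?P<abs>\d+) abs\): read:"
)


def count_read_errors(lines):
    """Return a Counter keyed by (device, block) for matching read errors."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match.group("dev"), int(match.group("block")))] += 1
    return counts


def summarize(counts):
    """Reduce the per-block counts to totals, distinct blocks, and range."""
    blocks = [block for _dev, block in counts]
    return {
        "total_errors": sum(counts.values()),
        "distinct_blocks": len(counts),
        "devices": sorted({dev for dev, _block in counts}),
        "lowest_block": min(blocks) if blocks else None,
        "highest_block": max(blocks) if blocks else None,
    }
```

Passing an open messages file (or any iterable of log lines) to
count_read_errors and the result to summarize gives the totals; a single
entry under "devices" would match the observation that only one
partition is affected.
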
My hardware maintainer wants me to pull down the server and reformat
the disk.  He says the bad block list is probably exhausted and that
is causing the problem.  Seems unlikely to me since only that
partition is affected and removing bad files from activity does not
affect the error rate.  The disk is the system disk, and I have no
alternate disk that would let me keep the system up while I do the
work.  I don't want the hassle and downtime of rebuilding it unless
that is really likely to solve the problem.
--Grant Basham (305)361-4026 University of Miami grant@oj.rsmas.miami.edu RSMAS Computer Facility/Systems