SUMMARY File-system: Strange errors

From: Morten Krabbe Barfoed (morten@copernicus.dsri.dk)
Date: Tue Jul 12 1994 - 04:09:07 CDT


My original posting:

: Dear sunmanagers.
:
: We're experiencing some very scary errors on one of
: our disks: sometimes when we (for one reason or another)
: reboot the system, we are forced to run fsck to
: repair inconsistencies in the filesystem when it
: comes up.
:
: The inconsistencies are bad reference counts and the
: like. But after doing fsck, and coming up again, we
: often find that files have swapped names! The other
: day it was critical, as the /vmunix-file had been
: replaced by 8 Kbytes of something that definitely
: was no kernel.
:
: It is normally the same disk, which suggests that it
: could be a faulty disk, perhaps suffering from old
: age (it has been heavily used for more than 3 years now).
:
: Still, running format/analyze/read doesn't report
: errors.
:
: So, my questions:
:
: 1: Shouldn't format/analyze/read report errors
: if the disk is faulty? Or should I use
: format/analyze/test instead?
:
: 2: This is perhaps the most interesting question:
: could it be anything other than a faulty disk?
: I have the feeling that there are more errors
: if we have not done a sync before booting, i.e.
: booting just by using 'Stop a', without sync'ing,
: results in more errors. But I'm not sure.
:
: 3: Other suggestions?
:
:
: We're considering two things now: reformatting the disk,
: which is rather cumbersome, or buying a new disk, which
: is expensive.
:
: ----------
: System: SUN Server 470, running SUNOS 4.1.3.
: Disk: SUN0669.
: ----------
:

Now, the answers I got can be summarized as follows:

a: An ungraceful halt will definitely result in lost chains
   and incorrect references. Using 'stop a' is a BAD idea,
   as one advisor puts it. Still, fsck should be able to
   correct errors introduced this way '9 times out of 10'.

b: D.Mitchell@dcs.shef.ac.uk suggested that I use the command
   'dd if=/dev/rsdxn of=/dev/null bs=64k conv=noerror' to check
   the disk. I have done that for each partition on the disk,
   and it did not report any errors.

c: The format/analyze/* tests will not always find errors. The
   most thorough way to check the disk in that program is the
   purge option. I had considered that, but hesitated, as it
   destroys the entire contents of the disk (which is, as
   stated, a boot disk).

   In this context, jhunter@pcs.cnu.edu informed me that some
   errors on the disk may be considered 'repairable' by format,
   but that format will pause for about half a second when it
   finds such a block. So I guess the message is that by
   analyzing the disk and closely following the reports from
   the format program, one can notice if and where (in which
   block) it pauses, and then limit the analysis to the
   neighbouring blocks. If the pause occurs every time, one
   can manually add the block to the defect list.

   epl@Kodak.COM is very critical of format/analyze, even
   calling it pure trash. It would be interesting to get some
   comments on that! Anyway, he's not alone in criticizing it.
   
d: ppt!drc@uunet.uu.net suggested that I evaluate the SCSI
   controller by simply replacing it with another controller
   and seeing whether the problems go away. He experienced
   similar problems, and this solved them. glenn@uniq.com.au
   suggested checking the SCSI cables or, if possible, just
   replacing them.
 
e: One person suggested overlapping partitions as a cause. The
   disk that causes the problems doesn't have overlapping
   partitions, so I rule that out.

f: Many emphasized that I should not buy a disk before I have
   checked the current disk, controllers, cables and whatever
   else thoroughly. Don't act upon mere suspicion, but get proof.
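
The read check from (b) can be wrapped in a small loop when there
are several partitions to go through. What follows is only a
sketch under assumptions: the function name scan_readable is made
up for illustration, the devices to scan are whatever you pass on
the command line, and it relies on dd printing any I/O errors
together with its record counts. It only reads, so it should be
safe on a live disk, but a clean pass proves no more than the
single dd command did.

```shell
#!/bin/sh
# Sketch: read each named device (or file) from start to end and
# throw the data away. scan_readable is a hypothetical helper name.
scan_readable() {
    for dev in "$@"; do
        echo "== $dev"
        # conv=noerror makes dd continue after a read error instead
        # of stopping at the first bad block. dd writes its error
        # messages and record counts to stderr, so fold that into
        # stdout to keep it next to the device name.
        dd if="$dev" of=/dev/null bs=64k conv=noerror 2>&1
    done
}

# Scan only when devices were actually given on the command line.
if [ "$#" -gt 0 ]; then
    scan_readable "$@"
fi
```

Saved as, say, scan.sh, it would be run as 'sh scan.sh' followed
by the raw partition devices of the suspect disk.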

I'll now do the following:

1: Change habits when it comes to rebooting: make sure that all
   users log off before we shut down, and be sure to sync the
   filesystems first.

2: Check if I have all relevant patches to the OS, and if not, get
   and install them.

3: If that doesn't help: try changing the SCSI controller. I
   don't know about the cabling, as it's an internal disk, but
   if there are cables that can be replaced I'll try that as well.

4: If that doesn't help: Do a complete reformatting of the disk.

5: If that doesn't help: Buy a new disk.
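
Step 1 above could look roughly like the following. This is a
sketch and not a tested procedure: the helper names run and
clean_halt, the DRY_RUN variable, the 5 minute warning and the
message text are all made up for illustration, and the path
/usr/etc/shutdown is an assumption about where shutdown lives on
SunOS 4.x. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of a graceful halt, as opposed to 'Stop a'.
# run and clean_halt are hypothetical helper names.

# Print the command unless DRY_RUN=0 is set in the environment.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

clean_halt() {
    # Flush filesystem buffers to disk. Running sync twice is an
    # old habit, not something the advisors asked for.
    run sync
    run sync
    # Warn logged-in users and halt after 5 minutes (assumed path,
    # delay and message).
    run /usr/etc/shutdown -h +5 "Going down for disk maintenance"
}

clean_halt
```

Run as root with DRY_RUN=0 to actually execute the commands.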

Thanks a lot you guys and girls:

harishm@pcsdnfs1.eq.gs.com
se@comp.lancs.ac.uk
zegarac@gdls.com
D.Mitchell@dcs.shef.ac.uk
rob.e.allan@hydro.on.ca
jhunter@PCS.CNU.EDU
ppt!drc@uunet.uu.net
fetrow@biostat.washington.edu
glenn@uniq.com.au
kinscoe@ccmail.crc.com
epl@Kodak.COM
cross%spuddy@britain.eu.net

from:

Morten Krabbe Barfoed

Danish Space Research Institute         phone:  +45 42 88 22 77 (switch-board)
Gl. Lundtoftevej 7                      phone:  +45 45 87 40 77 - 161 (direct)
DK 2800 Lyngby                          FAX:    +45 45 93 02 83
Denmark                                 TELEX:  37 198

                                        e-mail: morten@dsri.dk
 


