SUMMARY: What are "header not found" disk errors?

From: Leonard Sitongia (sitongia@hao.ucar.edu)
Date: Thu Jan 10 1991 - 11:25:29 CST


Subject: SUMMARY: What are "header not found" disk errors?

I thank you all for your thoughts on this problem. There were a variety of
explanations, implicating the disks, the controllers, the cables, and a
controller ECO that had not been applied. A number of people have seen these
errors on Fujitsu and CDC disk drives.

I later noticed that all the errors were on head #11, implying a real disk
problem. Another drive is reporting a few errors each day, mostly on head
#6. A third drive has errors on 5 different heads at different parts of
the disk, so I'm going to replace that one.

I had reported:

me> I've seen problems with disks which produced the following kinds of error
me> messages:
me>
me> Jan 6 03:15:42 hao vmunix: xd0c: write retry (header not found) -- blk #718071, abs blk #718071
me> Jan 6 03:16:03 hao vmunix: xd0c: read retry (header not found) -- blk #500910, abs blk #500910
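To check whether another machine's logs show the same pattern, the retry messages can be tallied mechanically. What follows is only a minimal Python sketch. It assumes the exact wording of the two xd messages quoted above. The regular expression and the function name `tally_retries` are hypothetical, written for this illustration, and other drivers or OS releases may word the messages differently.

```python
import re
from collections import Counter

# Assumed message format, copied from the two xd0c lines quoted above:
#   "... vmunix: xd0c: write retry (header not found) -- blk #N, abs blk #M"
RETRY_RE = re.compile(
    r"vmunix: (?P<dev>\w+): (?P<op>read|write) retry "
    r"\((?P<reason>[^)]*)\) -- blk #(?P<blk>\d+), abs blk #(?P<abs>\d+)"
)


def tally_retries(lines):
    """Count retry messages per (device, reason) and collect absolute blocks."""
    counts = Counter()
    blocks = []
    for line in lines:
        m = RETRY_RE.search(line)
        if m is None:
            continue  # not a retry message in the assumed format; skip it
        counts[(m.group("dev"), m.group("reason"))] += 1
        blocks.append(int(m.group("abs")))
    return counts, blocks


log = [
    "Jan 6 03:15:42 hao vmunix: xd0c: write retry (header not found) -- blk #718071, abs blk #718071",
    "Jan 6 03:16:03 hao vmunix: xd0c: read retry (header not found) -- blk #500910, abs blk #500910",
]
counts, blocks = tally_retries(log)
print(counts)  # Counter({('xd0c', 'header not found'): 2})
print(blocks)  # [718071, 500910]
```

The absolute block numbers collected this way are what you would feed into a block-to-head mapping to see whether the errors concentrate on one surface.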

Here are excerpts from the replies:
--------------------------------------------------

Date: Tue, 08 Jan 91 11:48:49 PST
From: "Dean S. Messing" <deanm%medulla.labs.tek.com@RELAY.CS.NET>

dean> Leonard,
dean> We have been experiencing the problem you describe
dean> along with a host of other (possibly related) problems
dean> for months. Sometimes reformatting fixes the problem,
dean> often for days or weeks. Then header error messages,
dean> bad block messages, or some other message (e.g. inodes full)
dean> will begin again. Sometimes the system ends up crashing.
dean> We are running a CDC SMD 9720-850 - a disk almost identical
dean> to yours except for size. We have replaced cables, controller,
dean> drive boards, and even the drive itself. CDC was good enough to
dean> loan us a spare for a month. The problem did not go away,
dean> although on the loaner disk we ran flawlessly for almost
dean> 4 weeks! After the loaner was returned, CDC (Seagate) did
dean> extensive checks on their disk and found no problems.
dean>
dean> The thing we did learn from all our pain is that the
dean> disk was often (but not always) going off-line when these
dean> problems occurred. One day I just happened to be sitting
dean> near the disk when an error occurred and when I looked at
dean> the disk's front panel, the on-line light was irregularly
dean> blinking on and off. After this, we noticed the same
dean> behaviour very often when disk problems were happening.
dean> The light never blinked when all was well.

From: curt@ecn.purdue.edu (Curt Freeland)
Subject: Re: What are "header not found" disk errors?

curt> I was seeing the same thing on some of our XD disk controllers. We have
curt> been seeing this for 2 years now! I recently got my hands on a Xylogics
curt> Field Change Notice that says (in part):
curt>
curt> Date: 10/10/89 ECO No: 1757 FCN 753-011
curt> Title: Busy hang - disk bus loading problem
curt>
curt> A bad head address is put out by the 753 during head tags. Symptoms
curt> reported due to this occurring include: "disk sequencer errors",
curt> "drive off cylinder", "header not found", and in many cases the
curt> controller will hang busy. One of the most common failures is
curt> during the verify pass of Sun's format - verify will stop running
curt> and the controller's busy LED will be on solid. This condition is
curt> caused by a D.C. loading problem on the 753's internal disk bus.
curt>
curt> The fix is to pull out a SIPP resistor pack, and replace a PAL chip.
curt> To check whether the fix has been applied, look at chip location D5: its
curt> label should carry the number "1085" or "180-001-085", and the SIPP
curt> resistor RP11 should be missing from the board. You should also make
curt> sure you have the large
curt> metallic heat-sink with the diodes in it if you have a 753 controller.
curt> Without the heatsink, you could burn up your controller among other things.

Date: Tue, 08 Jan 91 17:24:59 EST
From: trinkle@cs.purdue.edu

trinkle> What you have is a media failure on the disk. Most likely it is
trinkle> a head crash. This means there is physical damage to part of the
trinkle> recording surface of the disk and/or one of the read/write heads.
trinkle> Once one area of the surface is damaged, there are usually some
trinkle> particles (dust) floating around in the sealed drive as a result of
trinkle> the abrasion of the head against the surface. This dust will then
trinkle> cause more abrasion between the head and other areas of the surface.
trinkle> If the head is badly damaged, then even without dust, the damage to
trinkle> the head may cause physical damage to the surface in other areas.

From: era@niwot.scd.ucar.EDU (Ed Arnold)
Date: Tue, 8 Jan 91 15:30:59 MST

ed> ...get the HDA replaced.

Date: Tue, 08 Jan 91 16:11:28 -0600
From: Gordon C. Galligher <oddjob!oconnor!trevise!gorpong@ncar.UCAR.EDU>

gordon> You have lost, or are losing, your controller, NOT your drive. We see these
gordon> errors all the time with the Xylogics 450/451 controller cards. Replace the
gordon> controller, and things should be fine. Beware that when replacing a
gordon> controller card, it is a "good idea" to reformat the drive. If the drive
gordon> contains data which you cannot do without, then I suggest bringing the system
gordon> up in single-user mode, running dump on what you need, and then reformatting.
gordon> It is an extra step, but you are then guaranteed a clean system.
gordon>
 
Date: Tue, 8 Jan 91 19:08:38 PST
From: aldrich@sunrise.Stanford.EDU (Jeff Aldrich)

jeff> Similar problems I've had in the past have been due to flaky disk
jeff> controllers or, more rarely, bad cabling or a bad connector. Lots of
jeff> luck!

Date: Wed, 09 Jan 91 09:57:23 +0000
From: James Pearson <jcpearso@ps.ucl.ac.uk>

james> I had a very similar problem about a year ago with one of our Eagles
james> (bad blocks appearing all over the disk, reformatting occasionally
james> working and finding new bad blocks, disk being OK for a couple of days
james> then failing with the same problem etc).
james>
james> It turned out to be a cable problem. I replaced all the cables and the
james> problem went away.

From: mailrus!umich!samsung!uunet!anagld.analytics.com!rcsmith@ncar.UCAR.EDU (Ray Smith)
Date: Wed, 9 Jan 91 7:12:45 EST

ray> Leonard,
ray> I can't answer your question directly from first hand experience
ray> but I did run your error message through my full-text archives
ray> of sun-spots, sun-managers, sun-nets and sun-flash. I came up with the
ray> following messages which appeared in August 1990.
ray>
ray> I hope they help,
ray> Ray
ray>

me> I haven't included this text, because you have probably already
me> seen it, and it is available in the archives.

Date: Wed, 9 Jan 91 08:56:21 -0500
From: eap@bu-pub.bu.edu (Eric A Pearce)

eric> It wasn't clear to me from your letter, but it sounded like you have
eric> replaced a disk drive and the new one failed in the same manner as
eric> the last (?). If this is the case, I would look elsewhere for the
eric> problem - such as replacing the disk controller and/or cables.
eric> Large fluctuations in room temperature can also cause errors.
eric> We have many of the drives you mention and they run for years without
eric> errors.

Date: Wed, 9 Jan 91 14:29:39 GMT
From: dit@kc.aberdeen.ac.uk

david> We have had two CDC drives go the same way, and a third disk is away for checking at
david> the moment. My understanding is that each disk surface is divided into tracks, and
david> each track divided into sectors. Each sector is written with certain information,
david> such as the sector number, checksum, etc., and a space is left for data. So each sector
david> is a header followed by a data area of the size you expect. A 512-byte sector may actually
david> occupy between 600 and 700 bytes once you allow for the rest of the junk required.
david>
david> I am told the surface of our disks started to 'flake' or 'peel', and in any case
david> get thinner. This leads to problems reading the information, but not always, hence
david> the intermittent nature of the problem. This generates the 'header not found' errors.
david> Reformatting writes new (seemingly stronger) information to the disk which can be
david> read OK. The final symptoms are the 'flakes' of disk surface contaminating other
david> areas of the disk in a catastrophic manner.
david>
david> As an aside, I have also had disks damaged by being moved suddenly. This can either
david> destroy the disk or the head, obliterate part of the surface, or just make bits of
david> the disk unreliable, but formatting usually fixes all but the destroyed disk head.
david>
david> David Tock dit@uk.ac.aberdeen.kc
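David's description of tracks and sectors is also what lets a logged block number be traced back to a head, which is how errors can be pinned on one surface (head #11 or head #6 above). What follows is a minimal Python sketch under stated assumptions. It assumes a simple linear layout with no spare sectors or slipped tracks. It also assumes a partition that starts at block 0, as the equal blk and abs blk values in the xd0c messages suggest. The geometry of 20 heads and 46 sectors per track is a placeholder, not the real geometry of these drives; use the values from the disk's own label. The names `block_to_chs` and `heads_hit` are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class CHS(NamedTuple):
    cylinder: int
    head: int
    sector: int


def block_to_chs(abs_blk: int, heads: int, sectors_per_track: int) -> CHS:
    """Map an absolute block number to (cylinder, head, sector).

    Assumes blocks fill every sector of a track, then move to the next head
    in the same cylinder, then to the next cylinder. Spare sectors or
    slipped tracks would make this mapping wrong.
    """
    blocks_per_cyl = heads * sectors_per_track
    cylinder, rem = divmod(abs_blk, blocks_per_cyl)
    head, sector = divmod(rem, sectors_per_track)
    return CHS(cylinder, head, sector)


def heads_hit(blocks, heads: int, sectors_per_track: int) -> Counter:
    """Count how many error blocks fall on each head."""
    return Counter(block_to_chs(b, heads, sectors_per_track).head for b in blocks)


# Hypothetical geometry for illustration only; not the real drive's values.
HEADS, SECTORS = 20, 46

for blk in (718071, 500910):
    print(blk, block_to_chs(blk, HEADS, SECTORS))
print(heads_hit([718071, 500910], HEADS, SECTORS))
```

With the real geometry, a tally dominated by a single head points at that head or its surface, while errors spread over many heads and cylinders (as on the third drive above) suggest wider media damage or a problem upstream of the drive.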

Date: Thu, 10 Jan 91 11:46:41 -0500
From: mike@park.bu.edu (Michael Cohen)

mike> I usually see this stuff with a Winchester drive with degenerating media.
mike> I would run format on the disks in question after backing them up.



This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:06:09 CDT