My original query:
> I recently had to repair a bad block on a disk. Unfortunately,
> I now have a file with a block of zeros embedded somewhere in it.
> I know the block number, I need to find which file is using that block.
> I asked sun, they said that there's no command that gives that info.
> Has anyone got a program that can scan the i-nodes to find the block?
> The machine is running 4.0.3, but I have another disk playing up on a 3.4
> machine as well (I know - not even 3.5 !!!)
>
The answer, as many of you pointed out, is
icheck -b <bad_block#> /dev/r...
This lists (amongst other things) the i-node that refers to that block.
You can then do
find /mount/point -xdev -inum nnnn -ls
or
ncheck -i nnnn /dev/r..
to find the file name(s) associated with that inode.
BTW, by a bizarre coincidence, the file I eventually associated with
an (intermittent) bad block was /bin/find itself!
Finally, Keith Farrar <keith%markets@net.uu.uunet> sent me a document
which I have included at the end, on the grounds that it might be
of interest to many people, even though it's quite long.
Thanks to the countless people who replied (too many to list!)
Dave.
----- Begin Included Message -----
What File Has The Disc Error?
by John Walker
Revision 0 -- December 21st, 1989
ABSTRACT
========
When a single block or contiguous area on a Sun (or
other Unix) system's hard disc fails, one of the most
obvious and immediately important questions that
arises is "What file contains the error?". Amazingly,
there is no simple, standard utility that answers this
question, leaving the user knowing that some data have
been destroyed, but not what. If backups are current,
the user doesn't know what files to reload after the
failed area is reassigned to an alternate track or
made unavailable for allocation. This paper presents
a cookbook procedure, based on information provided by
Bob Elman, for determining which file contains a bad
disc block.
INTRODUCTION
============
When my hard disc presented me with its latest holiday surprise, I
ended up with 100% repeatable errors on a specific track, head, and
sector. Immediately after the error occurred, I ran an incremental
backup which, naturally, encountered read errors. At that point I had
a current set of backups from which I was perfectly willing to reload
or rebuild any files that occupied the area of the disc that had
failed, but I didn't know which files were involved. DUMP didn't tell
me, when it so kindly reported an error during the backup; even though
it clearly knows the INODE number it was dumping when the error
occurred, it didn't deign to print it.
Bob Elman explained the procedure one uses to find what file contains
a given disc block, and it worked just fine, telling me that the error
was in an executable file I could simply re-link after I'd fixed the
disc by reformatting the track that failed. Since the procedure is
less than obvious and nowhere explained in the Unix manuals I've seen,
I decided to write it down so I'd have it at hand the next time this
happened, and to help the next poor sucker victimised by a hard disc
failure. You might want to print this message on a piece of paper and
file it in your system administration manual--when you need it, you
may not be able to get it from a file on your disc.
FINDING THE FILE
================
We start out knowing that a hard disc contains one or more bad blocks.
The first symptom that something is wrong is usually Unix console
messages reporting I/O errors on the drive. Most of these I/O error
messages give the block number that failed but since Unix reads and
writes large buffers, these numbers should be considered as giving
only the general area of the actual error. The first step, then, is
to identify the actual blocks that contain the errors.
What Blocks Are Bad?
--------------------
(Sun specific.) Initially, note the drive number from the disc error
message. In a typical message like:
xd1c: write failed (header not found) -- blk #1317140, abs blk #1317140
the drive name is "xd1c". To find out what file system this
corresponds to, type "df", which will print something like:
Filesystem   kbytes    used   avail  capacity  Mounted on
/dev/xd0a     15502    1946   12005     14%    /
/dev/xd0h    514106  430020   32675     93%    /usr
/dev/xd1c    659242  569911   23406     96%    /usr2
/dev/xd0g     42406    8554   29611     22%    /var
In this case, you can see that "xd1c" is mounted as your /usr2
filesystem. (The file systems mounted by default are listed in the
file /etc/fstab, and those currently mounted in /etc/mtab; both are
plain text that you can display.)
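If you'd rather not pick the numbers out by eye, the lookup just described can be sketched in a few lines of Python. This is only an illustration: the pattern is modelled on the single sample message quoted above, other drivers or releases may word their messages differently, and the function names are made up for this example.

```python
import re

# Pattern modelled on the one sample message quoted above (an assumption;
# other controllers or SunOS releases may format their errors differently).
DISK_ERROR = re.compile(
    r"^(?P<device>[a-z]+\d+[a-h]): .*?"
    r"blk #(?P<blk>\d+), abs blk #(?P<abs_blk>\d+)"
)

def parse_disk_error(line):
    """Return (device, blk, abs_blk) from a console error line, or None."""
    match = DISK_ERROR.search(line.strip())
    if match is None:
        return None
    return (match.group("device"),
            int(match.group("blk")),
            int(match.group("abs_blk")))

def find_mount_point(df_output, device):
    """Return the mount point that df lists for /dev/<device>, or None."""
    for line in df_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "/dev/" + device:
            return fields[-1]
    return None

# Sample text copied from the message and the df listing above.
sample_error = ("xd1c: write failed (header not found) -- "
                "blk #1317140, abs blk #1317140")
sample_df = """\
Filesystem kbytes used avail capacity Mounted on
/dev/xd0a 15502 1946 12005 14% /
/dev/xd0h 514106 430020 32675 93% /usr
/dev/xd1c 659242 569911 23406 96% /usr2
/dev/xd0g 42406 8554 29611 22% /var
"""

device, blk, abs_blk = parse_disk_error(sample_error)
print(device, blk, find_mount_point(sample_df, device))
```

With the sample text this reports device xd1c, block 1317140 and mount point /usr2, the same answer reached by reading the df listing.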
Shut down your system and bring it up single user with "b -s". In
single user mode, run "format". When you fire up format, it asks you
to choose the disc you want to work on; pick the one from the error
message. For example:
throop# format
Searching for disks...done
AVAILABLE DISK SELECTIONS:
0. xd0 at xdc0 slave 0
xd0: <CDC 9720-850 cyl 1358 alt 2 hd 15 sec 66>
1. xd1 at xdc0 slave 1
xd1: <Fujitsu-M2372K cyl 743 alt 2 hd 27 sec 67>
Specify disk (enter its number): 1
selecting xd1: <Fujitsu-M2372K>
[disk formatted, defect list found]
Here, I've entered "1" to choose "xd1". (The "c" in the device name
from the error message is a partition name, but at this level format
is working on the whole disc.)
Next, we want to get the physical disc address of the block number
reported in the error message. Enter the "show" command, and type in
the error block number:
format> show
Enter a disk block: 1317140
Disk block = 1317140 = 0x141914 = (728/2/54)
This tells us that the block where Unix encountered the error was on
track 728, head 2, sector 54. Since we don't know precisely where the
error was, we'll sniff around the two surrounding tracks for errors.
Enter the surface analysis command:
format> analyze
and then enter "setup" to specify the parameters for the analysis:
analyze> setup
Analyze entire disk [yes]? no
Enter starting block number [0, 0/0/0]: 727/0/0
Enter ending block number [1347704, 744/26/66]: 729/$/$
Loop continuously [no]?
Enter number of passes [2]: 1
Repair defective blocks [yes]? no <========= INCREDIBLY IMPORTANT!!!! <===
Stop after first error [no]?
Use random bit patterns [no]?
Enter number of blocks per transfer [126, 0/1/59]: 1
Verify media after formatting [yes]?
Enable extended messages [no]?
Restore defect list [yes]?
Restore disk label [yes]?
Here we've set up to scan from the start of track 727 through the end
of track 729 (the "$" means "the highest number valid in this field"),
reading single sectors. If we were to use larger transfers, the
precise location of the errors would be indeterminate. IT IS
ABSOLUTELY ESSENTIAL, SURPASSINGLY SO, THAT YOU ANSWER *NO* TO THE
"REPAIR DEFECTIVE BLOCKS" PROMPT. If you fail to do this, the so-called
"read-only" test will go ahead and "repair" blocks on your disc,
possibly causing loss of data in files. So much for reasonable
defaults!
Now select the read-only surface analysis:
analyze> read
Ready to analyze (won't harm SunOS). This takes a long time,
but is interruptable with CTRL-C. Continue? yes
This will scan the tracks you've specified. Since we're only looking
at a few tracks, the comment about taking a long time is another lie.
This command should report the individual sectors with errors. If it
doesn't, welcome to the world of transient disc errors. If it does,
note the track, head, and sector numbers of all failing sectors on
paper, then leave the analyse command:
analyze> q
You can then convert those addresses back to block numbers with the
"show" command:
format> show
Enter a disk block: 728/2/22
Disk block = 1317108 = 0x1418f4 = (728/2/22)
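The arithmetic behind "show" can be checked by hand, which is handy when you're away from the machine. The sketch below assumes the geometry format reported for this drive (27 heads, 67 sectors per track) and that cylinder, head and sector are all counted from zero, which is consistent with the numbers format printed above. The function names are invented for this example.

```python
# Geometry of the Fujitsu-M2372K as reported by format above: hd 27, sec 67.
HEADS = 27
SECTORS = 67

def chs_to_block(cyl, head, sector, heads=HEADS, sectors=SECTORS):
    """Convert a zero-based cylinder/head/sector address to a block number."""
    return (cyl * heads + head) * sectors + sector

def block_to_chs(block, heads=HEADS, sectors=SECTORS):
    """Convert a block number back to a (cylinder, head, sector) tuple."""
    track, sector = divmod(block, sectors)
    cyl, head = divmod(track, heads)
    return cyl, head, sector

print(chs_to_block(728, 2, 54))   # 1317140, the block in the error message
print(block_to_chs(1317108))      # (728, 2, 22)
print(hex(1317140))               # 0x141914
```

The same functions give the range covered by the analysis set up earlier: 727/0/0 is block 1315143 and 729/26/66 is block 1320569.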
Once you have the failing block numbers in hand, you're done with
format. This example has been for a disc with a single partition that
fills it entirely. If your disc has multiple partitions, you'll have
to convert these absolute block numbers to relative numbers based on
your partitioning of the disc. The partition/print command will show
the current partitioning, which you can use to bias the cylinder
numbers into their partition-relative addresses.
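As a rough sketch of that biasing step: if a partition starts on a cylinder boundary, subtracting the number of blocks in the cylinders before it gives the partition-relative block. The starting cylinder of 100 used below is purely hypothetical, as is the function name; take the real value from partition/print on your own disc.

```python
# Geometry of the example drive as reported by format: hd 27, sec 67.
HEADS = 27
SECTORS = 67

def partition_relative_block(abs_block, start_cyl,
                             heads=HEADS, sectors=SECTORS):
    """Convert an absolute block number to one relative to a partition.

    Assumes the partition begins at the first block of cylinder start_cyl.
    """
    offset = start_cyl * heads * sectors
    if abs_block < offset:
        raise ValueError("block lies before the start of this partition")
    return abs_block - offset

# Partition covering the whole disc (starts at cylinder 0): no change.
print(partition_relative_block(1317108, 0))
# Hypothetical partition starting at cylinder 100.
print(partition_relative_block(1317108, 100))
```

You should also confirm that the block lies below the end of the partition; that check is left out here because it needs the partition size as well.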
What I-Node Owns That Block?
-----------------------------
On Unix, there is no one-to-one mapping of file names to areas on the
disc, since "hard links" can result in a given disc area belonging to
any number of named files. The Unix object that most closely
corresponds to the notion of a file in most operating systems is
called an "I-Node", and it's expressed as a number. The utility
"icheck", which was part of the semi-automatic assault guru-driven
file recovery facilities of Unix later largely supplanted by "fsck",
has the ability to determine what I-Node points to a given block. If
you know, for example, that blocks 1317108 and 1317110 on disc "xd1c"
contain errors, use the command:
/usr/etc/icheck -b 1317108 1317110 /dev/rxd1c
Bizarre, isn't it? It just scans numbers until it hits the "/" at the
start of the disc name. We specified "rxd1c" because naming the "raw
device" makes icheck run faster.
Icheck will crunch for some time, and if the specified blocks are part
of a file, it will print a line that gives, among other things, the
I-node of the file(s) that contain the given blocks. Note the I-nodes
on your paper, next to the block numbers. If no I-nodes were reported
by this procedure, the error block is not part of any currently
existing file.
What File Name(s) Correspond To That I-Node?
--------------------------------------------
With the I-Node number in hand, we can finally find out what file was
hit. If "icheck" has told us the error is in I-Node 87055, we use the
command:
/usr/etc/ncheck -i 87055 -a /dev/rxd1c
to find the file name. After a while, this will print something like:
/dev/rxd1c:
87055 /usr2/kelvin/acadexe/acad
and at last, the inscrutable is unscrewed! The error was in the
AutoCAD executable file, which I can simply re-link. If the file
hadn't been one so easily recreated, it would have to have been
reloaded from the most recent valid backup. Note that if a backup was
made after the error occurred, and that file was present on the
backup, an earlier backup should be used since the copy on the
post-error backup is almost certainly bad.
You can use "ncheck" to search for multiple I-nodes on one pass. For
example:
/usr/etc/ncheck -i 4142 4131 4102 -a /dev/rxd0g
/dev/rxd0g:
4102 /tmp/vm_fonts-n0
4131 /tmp/tty.txt.a00444
4142 /tmp/rmail
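If a long list of I-nodes comes back, the ncheck output is easy to pick apart mechanically. The following Python sketch assumes the output looks exactly like the samples above (a device header line ending in a colon, then one "inode path" pair per line); the function name is invented for this example. Because of hard links, one I-node may be reported with more than one name, so the names are collected in a list.

```python
from collections import defaultdict

def parse_ncheck(output):
    """Map each I-node number in ncheck output to its list of path names."""
    names = defaultdict(list)
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "/dev/...:" header line.
        if not line or line.endswith(":"):
            continue
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue  # not an "inode path" line; ignore it
        names[int(fields[0])].append(fields[1])
    return dict(names)

# Sample output copied from the example above.
sample = """\
/dev/rxd0g:
4102 /tmp/vm_fonts-n0
4131 /tmp/tty.txt.a00444
4142 /tmp/rmail
"""

for inode, paths in sorted(parse_ncheck(sample).items()):
    print(inode, ", ".join(paths))
```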
Repairing And Reloading
-----------------------
After the location and scope of the damage are established, you should
repair the disc errors and restore the damaged files. Since repair
procedures are highly system-dependent and, even on Sun systems,
differ depending on the type of disc controller and drive installed,
you must refer to the hardware documentation for your system for the
appropriate procedures.
Note that the Sun documentation talks about "repairing" sectors with
errors. Nobody I know can say for sure precisely what this means:
whether it's a process of assigning that sector's address to another
sector on an alternate track, clearing its availability bit in the
current bad spot list, marking it in the original defect list, or
what. In addition, the problems I encounter most frequently on hard
discs are destroyed headers due to failed writes (for example, when
the power fails during a write), which are best fixed by reformatting
the area containing the errors rather than discarding sectors which
have no physical defects.
In any case, after you've repaired the problem with the disc, you need
to delete all the files containing destroyed data and reload them from
their most recent backups. As noted above, don't use any backups of
error-containing files made after the error occurred, as they probably
contain the same errors the disc controller was complaining about.
----------------------- End Included Text -------------------------------------
______________________________________________________________________
| Keith Farrar |
| AMIX Corporation |
| Palo Alto, CA "Apple is like the Chinese Cultural |
| (415) 856-1234 x217 Revolution conducted by people in |
| three-piece suits." |
| DOMAIN: keith@markets.amix.com -John Perry Barlow |
| UUCP: {uunet|sun|xanadu!}markets!keith |
----------------------------------------------------------------------
----- End Included Message -----