Hello,
Last week I asked how to force the system to require an
interactive fsck at the next reboot, as part of a break/fix
hardware test for prospective sysadmins. (The full text of my
original message is at the end.)
In the end, mgmt decided that the test would not be
necessary (whew), so I am now able to offer you this
summary. It should be noted that I did not try any
of these suggestions, as the test was cancelled.
My thanks to the following people for their speedy replies.
Kris (Unixboy)
Darren Dunham
Kevin Sheehan {Consulting Poster Child}
Michael Stapleton
Seferino Gardner
Rich Jankowski
Thad MacMillan
Dan Lowe
Dan Lorenzini
Annette Lee
Jerry Lu
Gary Jenson
David B. Harrington
Dan Brown
Toby A. Rider
Brett Lymn
Ian MacPhedran
Here are some of the utilities mentioned that can be used to
corrupt a filesystem:
fsdb(1M)
unlink(1M)
clri(1M)
dd(1M)
Here are their suggestions and comments:
1. You can destroy the primary superblock on a partition, then run
"fsck" to restore it from a backup superblock (block 32 is
the traditional first choice):
i. newfs /dev/rdsk/c?t?d?s? (create a new file system on a
slice if you don't wanna use an existing partition)
ii. fsck /dev/rdsk/c?t?d?s?
iii. dd if=/kernel/genunix of=/dev/rdsk/c?t?d?s? count=32 bs=512
(destroy the primary superblock on this partition)
iv. fsck /dev/rdsk/c?t?d?s? (you'll see errors now)
v. fsck -o b=32 /dev/rdsk/c?t?d?s?
(repair the superblock from block 32)
If you reboot the machine right after step iv, you can force
the fsck to run.
2. Just change the reference to your raw device in the
/etc/vfstab. This will cause the fsck to fail, as the device
won't exist and this will drop you to the shell level. This
is a much safer method than intentionally corrupting a
filesystem. Whoever you interview will have to figure out
the correct device and eventually fsck the correct one. Good
way to test a junior person on fsck, format and basic file
system principles.
3. To do it in a repeatable way, look at fsdb(1M) and try your
hand at manually manipulating the filesystem. Great
opportunity, when you actually *want* to mess up the FS!
4. If you're looking for a real munge, grab 'fastfs', put your
/usr filesystem into the fast state and make changes:
create/delete many files in a directory and pull the plug.
It's not as repeatable, but you'll sure have the filesystem
in a very unclean state when it comes back.
5. I would create a toy file system on another partition first
(a bit safer) and then hit either a directory or inode part of
a cylinder group with (e.g.)
dd if=/dev/zero of=/dev/rdsk/c0t0d0s4
plus seek= and count= values that confine the write to that
region; as written, the command zeroes the whole slice.
This messes up the FS, but it is crude damage, not the finer
stuff like duplicate inodes, etc.
6. Use the unlink command (as root, on a non-empty directory) to
cause lost files. It removes the directory entry without the
checks rmdir makes, so the files inside are orphaned rather
than deleted.
7. Make a directory somewhere in /usr then make another directory
in the one you just created. Get the inode for the first
directory (ls -il) then run clri(1M) on that inode. Make sure
this is on a test system, as there's always the possibility
that it'll really clobber /usr. You probably will still have
to just turn the power off to make sure the FS is dirty so
the fsck will happen on boot.
8. Tell me if I'm off base here, but can't you just make an rc file
to do this before /usr is mounted? The following files may be
places to put something like that: /etc/rcS.d/S30rootusr.sh,
/etc/rcS.d/S40standardmounts.sh [ It's not really what I was
after. I wanted there to be a problem to fix, which requires
the admin to run fsck, not to have fsck run automatically at
reboot time. -- Mike]
9. If it's ugly you want, jerking on the power cord and then
jerking on it again in the middle of reboot ought to do the
trick. Of course, controlled mayhem is probably available as
a package from Sun. (hehehehe)
10. You might go to your hardware guys and see if they have a bad
hard drive somewhere that you can put in the test machine, and
let them play with it that way. Of course, they should never
be able to successfully fsck that drive, but that's real world.
11. Repartition the drive and overlap something on the /usr partition.
Newfs the new partition or use it for swap. Guaranteed to hose
whatever data is there. Don't ask me how I know :-)
12. Perhaps bonus points for someone who can repair a system with
a broken dynamic library system without resorting to booting
from CD - another machine available to copy stuff from helps.
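For suggestion 1, the dd arithmetic can be rehearsed on an ordinary
file before anyone points it at a raw device. This is only a sketch:
the helper names (make_scratch, clobber_front, peek) are made up for
illustration, zeros stand in for /kernel/genunix as the payload, and
it assumes the usual UFS layout, with the primary superblock at byte
8192 and the first backup at 512-byte sector 32 (byte 16384).

```shell
#!/bin/sh
# Rehearsal on a scratch file; no raw device is touched.
# Assumed layout: primary superblock at byte 8192, first backup
# superblock at sector 32 (byte 16384).

make_scratch() {    # $1 = path; fill 64 sectors with 0xff bytes
    dd if=/dev/zero bs=512 count=64 2>/dev/null |
        LC_ALL=C tr '\000' '\377' > "$1"
}

clobber_front() {   # $1 = path; same bs= and count= as step iii
    # conv=notrunc keeps the rest of the scratch file in place, the
    # way the rest of a slice stays in place on a real device.
    dd if=/dev/zero of="$1" bs=512 count=32 conv=notrunc 2>/dev/null
}

peek() {            # $1 = path, $2 = byte offset; print 4 bytes as hex
    od -An -tx1 -j "$2" -N 4 "$1" | tr -d ' \n'
}
```

After make_scratch and clobber_front on the same file, peek at
offset 8192 shows zeros (the primary superblock area was hit) while
peek at offset 16384 still shows ff bytes, which is why "fsck -o b=32"
still has a backup to work from.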
Plus some comments on the validity of the test in general:
- This is kind of a difficult test. How are you going to judge
pass/fail? What if you trash one machine to the point of it
being unrecoverable, and another where fsck recovers without
incident? Then you'll be hiring people based on which machine
they got and not their skills.
- If you have to have a hands-on test for prospective employees,
why not give them a machine and have them bring up networking?
Or change the subnet/router/nameserver/nfs and have them
reconfigure by hand. Do something like move the libraries, so
they have to use the static binaries to recover the filesystem.
This would probably give you a better idea of their problem
solving skills.
- Wow -- this sounds like THE acid test for wanna-bes. If
somebody had shown me just a glimpse 20 years ago of all the
nasty things that might (and did) happen, I would have probably
chosen another vocation.
- I have always considered running fsck interactively to be a simple
task, provided you remember to umount the drive. And I had a 36GB
hard drive I needed to fsck, which is where I found out about the
'-y' parameter (after 5 minutes of hitting y, Enter).
Several people asked for copies of the complete test once I was
done with it. Here it is. You'll notice where I have incorporated
some of the above suggestions into it. I didn't come up with a more
thorough test, as it was cancelled on me.
------------- begin -------------
Screen is blank
- output device is ttya instead of screen. admin needs to
fix eeprom setting.
System doesn't boot up, tries to boot from net.
- usually this means that the eeprom setting "boot-device"
is set to "net" rather than "disk". Instead, what happens
sometimes after an error is that the setting "diag-switch?"
is set to true rather than the default "false", which
means it looks in the setting "diag-device" to decide where
to boot from. "diag-device" is usually set to "net".
To fix, need to change the eeprom setting "diag-switch?"
from true to false.
System can't find boot block
- often happens after restoring the root filesystem, and
forgetting to install the bootblock on the disk. Admin
needs to boot from cdrom, and run the "installboot"
command on the appropriate disk. I can simulate this
condition by doing a dump and restore.
/usr needs an fsck
- the system can't boot up if it can't mount /usr, which it
can't do if there are errors in the filesystem. I can
manually corrupt the filesystem so it requires an fsck.
I want to ensure that /usr is separate from / so that it
doesn't conflict with the previous test.
- we can be really sneaky and also have the system try to mount
the wrong partition. This happens with a typo in the
vfstab file. Will need to use the "format" command to
find the correct filesystem to mount.
system hangs after a reboot
- bad entry in the /etc/system file. set maxusers=0.
Admin will have to boot from cdrom to fix.
can't log in as root -- no shell
- this is a common problem when people improperly change
root's shell from /sbin/sh. Need to boot from cdrom
to fix.
user cannot log in as root from any terminal
- caused by a space CONSOLE= setting in the
/etc/default/login file, boot -s to fix.
convert a system from complete standalone to fully networked.
- given the list of required info, such as NIS domainname,
NIS server name and IP, interface names and IPs, subnet
mask(s), default router, DNS domain name and name servers,
have the admin bring up networking from the ground up.
------------- end -------------
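Two of the scenarios above (the vfstab typo and root's missing
shell) lend themselves to a quick check from a CD-booted shell with
the damaged root slice mounted somewhere. The following is a rough
sketch, not part of the original test: the function names are
hypothetical, the first argument is wherever the root slice is
mounted, and only entries that start with /dev/ are checked.

```shell
#!/bin/sh
# Hypothetical helpers; $1 is the directory where the damaged root
# slice is mounted (an empty string checks the running system).

# Print "mountpoint: device" for every /dev/ path in fields 1-2 of a
# vfstab-format file that does not exist under the given root.
# Comment lines, "-" placeholders and non-/dev entries are skipped.
missing_vfstab_devices() {   # $1 = root prefix, $2 = vfstab path
    while read -r mount_dev fsck_dev mount_point rest; do
        case $mount_dev in ''|'#'*) continue ;; esac
        for dev in "$mount_dev" "$fsck_dev"; do
            case $dev in
                /dev/*) [ -e "$1$dev" ] || echo "$mount_point: $dev" ;;
            esac
        done
    done < "$2"
}

# Succeed if root's shell (field 7 of a passwd-format file) is an
# executable file under the given root. An empty field is treated as
# "system default" and passed; that is an assumption of this sketch.
root_shell_ok() {            # $1 = root prefix, $2 = passwd path
    shell=$(awk -F: '$1 == "root" { print $7; exit }' "$2")
    [ -n "$shell" ] || return 0
    [ -f "$1$shell" ] && [ -x "$1$shell" ]
}
```

With the root slice mounted on /a, for instance, the calls would be
"missing_vfstab_devices /a /a/etc/vfstab" and
"root_shell_ok /a /a/etc/passwd". Neither one repairs anything; they
only point at the line the admin then has to fix by hand.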
Finally, here is my original query:
--- On Dec. 15, 1999, Mike van der Velden <mvanderv@yahoo.com> wrote:
> Hello,
>
> I have to design a small "hands-on" test for some prospective
> system administrators. This test will include troubleshooting
> boot-up problems among other things.
>
> One thing I want to do is corrupt the /usr filesystem enough that
> an interactive fsck is required. I know that just powering off a
> system without a proper shutdown will require an fsck, but this
> usually happens automatically at the next bootup and fsck is able
> to fix it without much fuss.
>
> I want some fuss. Simply removing or renaming certain files will
> undoubtedly impair the bootup process, but it doesn't corrupt the
> filesystem in any way. I want the prospects to have to run fsck.
>
> Does anyone have a command that I can use to accomplish this? How
> can I create duplicate inodes or lost files? Or maybe you know of
> a bug in the OS that will trigger something like this.
>
> Or, perhaps I'm way off base, and I'm better off testing something
> else. Let me know.
>
> Thanks in advance for all your comments. I'll summarize after the
> tests are complete, just in case any of the prospects read this
> list as well. :)
>
> Thanks in advance.
> Mike van der Velden
Mike van der Velden
Insurance Corporation of British Columbia
=====
9 days to Y2k. 345 days until the new millennium.