LONG SUMMARY: Disaster Recovery/Backup System Disk

From: Phil Poole (poole@ncifcrf.gov)
Date: Fri Aug 15 1997 - 10:32:58 CDT

Howdy all.

        Sorry for the delay; I've been pretty busy working on this and
        also starting to investigate Solstice DiskSuite and Veritas
        Volume Manager.

        I noticed a number of the Sun related newsgroups have
        received a flurry of questions along the same lines
        of disaster recovery, disk mirroring, and utilizing
        duplicate boot disks.

        My next related endeavour will be to create a generic
        JAZ Boot disk. Don't know if it's possible yet but
        I'm going to try.
        Got a few quick responses... and not one of them
        has mentioned Solaris DiskSuite or Veritas Volume Manager.

        My solution is right after the original post section.
        I also enclosed a few of the more detailed responses I
        received.
        I did learn ONE thing from my Sun Service Representative:
        apparently there was a design flaw in earlier Sun SPARC 20s,
        which have only two fans in the power supply.

        Later models of Sparc 20's contained 3 Fans in the power
        supply and those only containing two were retro-fitted
        with an additional fan located near the SCSI Disks.

        The additional fan was used to alleviate overheating
        problems that could result in a number of system
        malfunctions/errors. One of the overheating side
        effects was a problem description very similar to
        what I described below.

Original Problem:

        Got to work at 8:45 AM 8/8/97 only to find that the primary
        Mail, NIS, FTP..etc..etc..etc.. machine was down.

        When attempting to boot, the system did NOT see the system
        disk. Every attempt to boot from the 'ok>' prompt resulted
        in a slightly different error message, each to the effect
        that /dev/dsk/c0t3d0sX was not responding or the device
        was not ready.

        A probe-scsi-all command listed all of the SCSI
        devices twice: 5 devices on c0 and 1 device on c1.

        On c0:
        2 x 1.05 GB Sun Internal Drives
        1 Internal Sun CD-ROM
        1 External HP 4mm DDS2 Tape
        1 External 4 GB 3rd Party SCSI II Drive

        On c1:
        1 External 9 GB Drive

        I finally did a power cycle on the system, and that appears to
        have reset the SCSI bus. Even though the device WAS showing
        up during the probe-scsi-all, it strikes me as a little odd
        that all of the devices were listed twice.

        Perhaps the SCSI controller is going bad and the disk is fine.

        After the power cycle a loud grinding noise could be heard
        over the din of the computer room. So, maybe the disk really
        is going bad. Sun Engineers will be here Monday 8/11/97 to
        replace the (bad) disk.
        Since the power cycle the system disk has responded, no errors
        were logged via syslog. Of course there may have been some sort
        of SCSI contention not allowing writes.

        probe-scsi-all listed the devices only once after the
        power cycle.

Original Post: (Intended for FUTURE use/reference)

>What is the most effective method of maintaining and restoring
>a redundant system disk?
>I have a key Sparc 20 server with two internal system disks.
>Both 1.05 GB.
>My primary is running solaris 2.5.1 and I just booted up my redundant
>and it is running Solaris 2.3.
>What would be the quickest method of duping the primary 2.5.1 disk to the
>backup 2.3 disk?
>I realize all partitions need to be the same size.
>Both disks are functional and I also have a complete Level 0 backup of
>the system disk.
>Here's my first thought:
> make all of the slices/partitions on the redundant disk
> /dev/dsk/c0t1d0sX the exact same size as /dev/dsk/c0t3d0sX
> Then use ufsdump and ufsrestore.
>/usr/sbin/ufsdump 0uf - /dev/dsk/c0t3d0sX \
> | (cd /mnt/junk; /usr/sbin/ufsrestore xf - )
>Would that work ? Would there still be a boot block ?
>Is there some method I can use to run suninstall and install a base
>Solaris 2.5.1 onto the redundant disk without having to shutdown and
>then 'boot cdrom' and choose /dev/dsk/c0t1d0sX as the install disk ?
>Looks like time for a disaster recovery section in the Sun-Managers FAQ.
>BTW if this is covered in the FAQ, I have a version from 4/15/1997
> and I don't currently have time to search through it.
> (301) 846-5721 | Frederick MD, 21702

What I did:

        After the power cycle worked, I was able to boot off of the
        default system disk without any problems.

        1) Performed an immediate Level 0 dump of the system disk.
                My system disk contains the following partitions:
                /, swap, /usr, /usr/openwin, /var, and /export

                /opt and /var/mail are on a separate disk for just
                this purpose.

        2) Shut down the system and booted my redundant disk.
           The redundant disk is /dev/dsk/c0t1d0s0.
           At MY prom I simply had to type 'boot disk1'.
           I seem to recall having to set up this disk1 alias
           a while ago; I do not think it is a standard default.

           The steps for creating the NVRAM or PROM alias
           are performed at the prom level. But in case
           you cannot set up an alias, you should be able to
           just do the following:

           boot /iommu/sbus/espdma@4,8400000/esp@4,8800000/sd@X,0:a

           Where 'X' is the SCSI ID of the redundant boot disk.
           Likewise, this is the device path for a disk on SCSI
           controller 0.
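
           A minimal sketch of setting up such an alias at the PROM,
           assuming the device path above and SCSI target 1 for the
           redundant disk (adjust both for your hardware); nvalias
           stores the alias in NVRAM, so it survives power cycles:

```
ok nvalias disk1 /iommu/sbus/espdma@4,8400000/esp@4,8800000/sd@1,0:a
ok boot disk1
```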

        3) 'ok> boot disk1 -s' and determined I was running Solaris 2.3.
           Hm... that's kind of out of date, so I decided to go back to
           running off of the system disk (/dev/dsk/c0t3d0s0).

        4) From the system disk I ran the format command and then
           partitioned my redundant disk /dev/dsk/c0t1d0sX with
           partitions of the exact same size.
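
One way to duplicate the partition table in a single step, rather than re-entering the sizes in format, is prtvtoc piped into fmthard. This is a sketch assuming both disks have identical geometry (true here for two 1.05 GB Sun drives); slice 2 is the conventional whole-disk slice:

```shell
# Copy the primary disk's VTOC (partition table) onto the redundant
# disk, using the raw whole-disk slice (s2) on each side.
# Only valid when both disks share the same geometry.
prtvtoc /dev/rdsk/c0t3d0s2 | fmthard -s - /dev/rdsk/c0t1d0s2
```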

        5) newfs'd each of the new partitions.

        6) Installed a boot block with the following command:
                /usr/sbin/installboot \
                    /usr/platform/`/usr/sbin/uname -i`/lib/fs/ufs/bootblk \
                    /dev/rdsk/c0t1d0s0
           /usr/sbin/uname -i responds with the correct platform,
           so the backquoted command picks the right bootblk.

        7) Then I did the following:
                mount /dev/dsk/c0t1d0s0 /mnt/junk
             ** /usr/sbin/ufsdump 0f - /dev/rdsk/c0t3d0s0 | \
                   (cd /mnt/junk; ufsrestore xf - )
           ** This command comes from the Solaris 2.5.1 man page
              for ufsdump.

           I repeated step 7 for EACH of the filesystems on my
           primary system disk:
                /, /usr, /var, /usr/openwin, /export
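
The repetition of step 7 can be sketched as a loop. The slice numbers here are hypothetical (s0=/, s4=/usr, and so on); substitute the slice-to-filesystem mapping from your own disk layout:

```shell
# Hypothetical slice list for the redundant disk; match your own
# partition layout as shown by the format command.
for s in 0 4 5 6 7; do
    mount /dev/dsk/c0t1d0s$s /mnt/junk
    # Dump the primary slice's raw device, restore into the mounted copy.
    /usr/sbin/ufsdump 0f - /dev/rdsk/c0t3d0s$s | \
        (cd /mnt/junk; ufsrestore xf -)
    umount /mnt/junk
done
```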

        8) fsck'd each of the NEW filesystems after the dump/restore.

        9) Mounted /dev/dsk/c0t1d0s0 on /mnt/etc and modified
           the necessary entries in /mnt/etc/vfstab.
           These modifications are necessary to point to the new
           swap areas and system disk partitions. If you can do
           a global substitution, replace /dev/dsk/c0t3d0 with
           /dev/dsk/c0t1d0 (and likewise for /dev/rdsk). ***

        *** Remember we are changing TWO entries per line:
            the block device and the RAW device. (I almost forgot.)
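
One way to make both per-line changes in a single pass is a global sed substitution over the copied vfstab; the bare pattern c0t3d0 matches the /dev/dsk and /dev/rdsk fields alike (a sketch; review the result before moving it into place):

```shell
# Rewrite every c0t3d0 reference (block and raw device columns alike)
# to point at the redundant disk.
sed 's/c0t3d0/c0t1d0/g' /mnt/etc/vfstab > /tmp/vfstab.new
# After inspecting /tmp/vfstab.new:
# mv /tmp/vfstab.new /mnt/etc/vfstab
```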

        10) Shut down the system and then booted disk1.

        The system came up without a glitch, although I'm not sure
        all of the permissions are correct in the / (root) partition.

What I expected to hear:

        Use Solstice DiskSuite to maintain a duplicate; then there are
        two writes for every system disk update. I.e., if /etc/hosts is
        modified, the modification is automatically made to the other,
        redundant system device.
        All reads come from the primary system disk.

Alternative method:

        If my system disk had NEVER responded, I would have been
        forced to do the following:

        1) Boot cdrom
        2) Choose /dev/dsk/c0t1d0sX as the system install disk.
        3) Refer to my printed hardcopy of the system
           configuration to dupe the disk layout.
           (Good Sys Admins have one... :0) You do too, don't you?!!)

           The hardcopy printouts have the partition sizes and what
           each partition correlates to as a filesystem, and contain
           the starting and ending sector/cylinder for each partition.

        4) Install Solaris 2.5.1
        5) Refer to my hardcopy PATCH print out regarding what patches
           are installed.

        6) Download and install the necessary patches.
        7) Install any local software. (read Sun Compilers and FDDI driver)
        8) Install any necessary Software patches.
        9) Re-install Legato Networker base.
       10) Restore data through Legato Networker from the backup archive.
       11) Read through list of local customizations :0)
           and duplicate where necessary.
       12) Test setup to make sure everything is functional.

Enclosed Responses:

Dave Haut Wrote:

You are on the right track. Use ufsdump piped to ufsrestore.


# mount /dev/dsk/c0t0d0sx /mnt ( mount one of the old Sol2.3 partitions )
# cd /mnt
# ufsdump 0f - /dev/dsk/sol2.5.1part | ufsrestore vrf -

Also, You DO need to install the bootblock.

Check out the man page for installboot and use the example that is provided.

Mark Fromm wrote:

> Would that work ? Would there still be a boot block ?

You need to install the boot block after the fact.

This is how we do it here. We run this once a week from cron,
during a quiet period on the machine:

# Weekly dump of the operating system from c0t3d0s0
# to c0t1d0s0
# Modified on 01/14/97 to newfs drive and changed mount
# point from /mnt to /mntos.
# define variables (values per the comments above)
primaryosdisk=/dev/rdsk/c0t3d0s0
secondaryosdisk=/dev/rdsk/c0t1d0s0
blockdevicename=/dev/dsk/c0t1d0s0
mountpoint=/mntos
export primaryosdisk secondaryosdisk mountpoint blockdevicename
# Newfs the drive before dumping the OS.
# (the here-document answers newfs's confirmation prompt)
/usr/sbin/newfs $secondaryosdisk << EOF
y
EOF
/usr/sbin/fsck $secondaryosdisk
# mount the secondary O/S disk onto temporary mount point
mount $blockdevicename $mountpoint
# Dump the primary O/S disk to the secondary O/S disk
ufsdump 0f - $primaryosdisk | (cd $mountpoint; ufsrestore rf - )
# Install the boot block
# Sun 4m architecture only!
/usr/sbin/installboot /usr/platform/sun4m/lib/fs/ufs/bootblk $secondaryosdisk
rm $mountpoint/restoresymtable
umount $mountpoint

Additional scriptage I have running on a couple of really critical
machines includes having two vfstab files (vfstab.c0t3d0s0 and
vfstab.c0t1d0s0). The vfstab.c0t1d0s0 is set up with the proper
swap and root partitions, and the script copies
/mntos/etc/vfstab.c0t1d0s0 to /mntos/etc/vfstab, so I have a
bootable disk ready to go without cleanup.

Hope that helps
Internet mail - mfromm@physio-control.com

Summary of responses for maintaining a redundant disk:

        1) To those who suggested 'dd': what options would you use to
           optimize the block size for such usage?
           The last time I used 'dd' it took close to 8 hours to dump
           200 Megs. I must have had the wrong parameters, although
           the new area did work without errors.
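
For what it's worth, dd's default block size is only 512 bytes, which issues one tiny I/O per sector and is the usual cause of multi-hour copies; raising bs is the main optimization. A sketch using this post's device names (raw whole-disk s2 slices are an assumption; only sensible for identical disks):

```shell
# Whole-disk copy with a 1 MB block size instead of the 512-byte
# default; bs=1024k means 1024 * 1024 bytes per read/write.
dd if=/dev/rdsk/c0t3d0s2 of=/dev/rdsk/c0t1d0s2 bs=1024k
```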

        2) To those suggesting newfs and dump: that seems like
           quite a bit to do daily, weekly, or even monthly.
           I could see doing that, as I did, in certain cases, but
           not on a regular basis.

        3) I like the suggested method of having the alternate devices
           mounted and then running a 'find -mtime -1 | cpio -p'
           or some derivative thereof, to simply copy the
           updated/newer files into the related redundant location.

        4) It would appear that the creation of the boot block
           can happen before or after the dump/restore.
           Most people suggested that the boot block be created
           last, although I created it first, right after running
           'newfs /dev/dsk/c0t1d0s0'. I do know that adding the
           boot block needs to be done AFTER the newfs, but it
           appears to be irrelevant with regard to data already
           existing on the partition.
           Perhaps someone with more knowledge about disks and
           the Solaris filesystems could elaborate on this matter.

       Phil Poole	| Unix Systems Administrator
     poole@ncifcrf.gov	| Frederick Biomedical SuperComputing Center
      (301) 846-5721	| Frederick MD, 21702

This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:12:00 CDT