SUMMARY: RE Replacing mirrored Sun system disk

From: Clive Elsum <Clive.Elsum_at_CSIRO.AU>
Date: Wed Jul 10 2002 - 23:35:54 EDT
Many thanks to the following for ideas and suggestions, and the quick 
responses.

From: Tim Hespe <t.hespe@unsw.edu.au>
From: Dan Astoorian <djast@cs.toronto.edu>
From: Scott Croft <secroft@micron.com>
From: Kristian Styrvoll <kristian.styrvoll@eterra.no>
From: Tony Walsh - Field Service Engineer <Tony.Walsh@Sun.COM>
From: Matthew Stier <Matthew.Stier@fnc.fujitsu.com>
From: "Thomas M. Payerle" <payerle@physics.umd.edu>
From: "Kevin Buterbaugh" <Kevin.Buterbaugh@lifeway.com>
From: "Mortensen, Henrik" <henrik.mortensen@csfb.com>
From: "Keplinger, Michael A" <michael.keplinger@nmci-isf.com>
From: Eric Shafto <eshafto@mac.com>

I ended up using a combination of these ideas and summarize as follows:

metadb -i  (to check that the metadb's are on both disk slices s7)
prtvtoc /dev/rdsk/c0t1d0s2 > /var/adm/doc/20021007.c0t1d0s2.vtoc
metadetach d0 d2  (this failed: "metadetach: : d0: attempt an operation on a submirror that has erred components")
metadetach d10 d12
metadetach d20 d22
metadetach d30 d32
metadb -d c0t1d0s7
Pulled the old disk
Put in the new disk
fmthard -s /var/adm/doc/20021007.c0t1d0s2.vtoc /dev/rdsk/c0t1d0s2
metattach d0 d2 (failed metattach: : d2: invalid unit)
metattach d30 d32
metattach d20 d22
metattach d10 d12
metadb -a -c 3 c0t1d0s7
To get over the d0 d2 problem:
metareplace -e d0 c0t1d0s0

After all the syncing we finally have a system back again!!!
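
The sequence summarized above can be sketched as a dry-run shell function that only prints the commands for review and runs nothing. The function name print_replace_plan and its argument layout are made up for illustration; the commands it echoes are the ones listed in the summary.

```sh
#!/bin/sh
# Dry-run sketch: print (do not execute) the replacement steps for one disk.
# Hypothetical usage:
#   print_replace_plan <disk> <replica-slice> <vtoc-file> <mirror:submirror>...
print_replace_plan() {
    disk=$1; dbslice=$2; vtoc=$3
    shift 3
    # Save the partition table before touching anything.
    echo "prtvtoc /dev/rdsk/${disk}s2 > $vtoc"
    # Detach every submirror that lives on the failing disk.
    for pair in "$@"; do
        echo "metadetach ${pair%%:*} ${pair##*:}"
    done
    # Drop the state database replicas on the failing disk.
    echo "metadb -d ${disk}${dbslice}"
    echo "# swap the physical disk here"
    # Restore the saved partition table onto the new disk.
    echo "fmthard -s $vtoc /dev/rdsk/${disk}s2"
    # Reattach the submirrors, which starts a full resync.
    for pair in "$@"; do
        echo "metattach ${pair%%:*} ${pair##*:}"
    done
    # Recreate the replicas on the new disk.
    echo "metadb -a -c 3 ${disk}${dbslice}"
}
```

For example, "print_replace_plan c0t1d0 s7 /var/adm/doc/20021007.c0t1d0s2.vtoc d10:d12 d20:d22 d30:d32" prints the plan for the three mirrors that detached cleanly. d0 is left out on purpose, since it needed metareplace -e as noted above.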

Once again many thanks to all,

Clive Elsum



**************************************************************************
From: Tim Hespe <t.hespe@unsw.edu.au>

I have a book, one of the Sun Blueprint series, called "Boot Disk Management",
which covers exactly this scenario. It is well worth getting hold of. It covers
both Disk Suite and Veritas setups.
You seem to be on the right track. The only thing you don't seem to have taken
into account is the removal and restoration of metaDB replicas. The sequence of
events is basically:
metadetach

metaclear

metadb -d           # to remove metaDB replicas on the affected drive

use format or fmthard (using a previously saved copy of the vtoc) to slice the disk

metainit

metattach

metadb -a           # to create metaDB replicas on the new disk

As you have shown in your procedure, metareplace can be used instead of the
metaclear->metainit->metattach sequence of commands, but for some reason they
don't use it. Probably for the sake of clarity.
If you give me your fax number I can send the pages from the book (four of
them) rather than trying to paraphrase them.

**************************************************************************
From: Dan Astoorian <djast@cs.toronto.edu>

You have two copies of the data for /, and both of them have reported
errors.  There are no guarantees: mirroring does not protect against
failures of both copies of your data.

You may wish to run "metastat -t" to see how long each submirror has
been offline.  (It's possible that d1 failed a long time ago, and nobody
noticed.)

You almost certainly don't want to use metaonline and metaoffline.  When
you use metaoffline, the system keeps track of updates to the other
mirror, so it knows which blocks to update when it's brought online.  If
you replace the disk, the system will assume that any blocks that
haven't changed since the disk was taken offline are still in sync.
Since they're not, you'll get data corruption.

> metaoffline d0 d2

I would venture to guess that this command will fail; so would
"metadetach."

d2 is the "last erred" copy of the data, which means it's less outdated
than the data on d1.

Consult the DiskSuite 4.2.1 User's Manual, available from docs.sun.com.
In particular, see page 133 ("Submirror States").
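
To see at a glance which components are in "Maintenance" or "Last Erred" state, the metastat output can be filtered. This is only a sketch: the function name erred_components is made up, and it assumes component rows laid out like the metastat listing in the original question (device, start block, dbase flag, state), with no hot spare name in the last column.

```sh
#!/bin/sh
# Reads `metastat` output on stdin and prints "<component> <state>" for
# every component row whose state is not Okay.
# Assumption: rows look like "c0t1d0s0   0   No   Last Erred".
erred_components() {
    awk '$1 ~ /^c[0-9]+t[0-9]+d[0-9]+s[0-9]+$/ && $4 != "Okay" {
        state = $4
        # States such as "Last Erred" span two fields, so join the rest.
        for (i = 5; i <= NF; i++) state = state " " $i
        print $1, state
    }'
}
```

On the live system this would be run as "metastat | erred_components".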

What I would try is:

Attach a working disk at SCSI target 2, format it the same as sd0, and
try:
    metareplace d0 c0t0d0s0 c0t2d0s0

as per the "invoke" command in the metastat output.

If I couldn't attach three disks at the same time, I would remove the
disk c0t0d0 (after first metadetaching d12, d21, and d32), and instead
use the command:
    metareplace -e d0 c0t0d0s0

Be warned, however, that the system may not allow you to do even this
metareplace command, because there is no error-free copy of your data
anywhere on the system for DiskSuite to use to write a new copy of the
mirror.  In such a case, you may have to metaclear the metadevices and
re-metainit them.  Unfortunately, you can't do that while the
metadevices are in use.

You may ultimately need to go to your backup tapes, and/or reinstall
your operating system.

**************************************************************************
From: Scott Croft <secroft@micron.com>

We have used metadetach to detach the mirrors. Remove any hot spare
devices, remove the copy of the database on the bad disk if you put one
there, take the system down, and replace the disk. Reboot the system
with a -r (you shouldn't have to run devfsadm). Copy the format from the
primary disk to the secondary, re-create the database on the new disk,
metattach, and you should be done. Run metastat to see progress.

**************************************************************************
From: Kristian Styrvoll <kristian.styrvoll@eterra.no>

Go to http://docs.sun.com  go to Solstice DiskSuite 4.2.1 User's Guide,
search for How to Recover From a Boot Device Failure (Command Line)

It works for me.

**************************************************************************
From: Tony Walsh - Field Service Engineer <Tony.Walsh@Sun.COM>


I have not seen a summary regarding this, so I will proffer the 
following advice:-
1 - Use 'metadetach <mirror> <submirror>' for ALL slices on the faulty drive
not already in 'Needs Maintenance' state.
2 - If there are metadb datasets on the failing drive (which there should be),
they need to be removed with the 'metadb -d <component>' command
(eg. metadb -d /dev/dsk/c0t1d0s7 to remove all metadb's in that slice).
3 - Remove faulty drive and replace with new drive. The E420R uses hot-swappable
drives so no power outage is required.
4 - Reformat the drive by copying the VTOC of the good drive onto the new drive.
Use the format command to copy the VTOC, or prtvtoc output as input to fmthard.
5 - Use 'metareplace -e <mirror> <component>' for each slice to be mirrored
again (eg. metareplace -e d0 /dev/dsk/c0t1d0s0).
6 - Re-establish metadb's on new drive with the 'metadb <options> <component>'
command (eg. metadb -a -c 3 /dev/dsk/c0t1d0s7).

Step 5 will take some time to complete the synchronisation process, but step 6
does not have to wait. You should also wait for the sync process to finish and
then schedule a reboot at your convenience if possible.

If you only have the 2 drives in this system, then it is recommended that you
have 3 metadb's on each drive so that you will always have a quorum should one
drive completely fail. These metadb's are usually put on slice 7, but any spare
slice with at least 30 MB available (for SDS 4.2.1) is recommended (30 MB is the
maximum required, but this configuration could get away with 10 MB as a minimum).
If you don't have 3 metadb's on the good drive, fix that first before carrying
on with this process.

Your steps would therefore be as follows:-

metadetach d0 d2 (if it has not already done so)
metadetach d10 d12
metadetach d20 d22
metadetach d30 d32
metadb -d c0t1d0s7 (assuming you have metadb's on this slice)
Replace disk "hot swap" NO POWER OFF
Format the disk as per prtvtoc of old disk
metareplace -e d0 c0t1d0s0
metareplace -e d10 c0t1d0s1
metareplace -e d20 c0t1d0s3
metareplace -e d30 c0t1d0s4
metadb -a -c 3 c0t1d0s7 (if they came from this slice previously)

**************************************************************************
From: Matthew Stier <Matthew.Stier@fnc.fujitsu.com>

metaoffline/metaonline expect that the metapartition being offlined/onlined
basically hasn't changed, and that only changes recorded since the offlining
need to be run against the offlined partition.  This is not what you want.

You need to metadetach and metaclear all metapartitions that are present on that
drive.

Once all the partitions are clear, you can:

1) Replace the drive

2) Partition it

3) Metainit the metapartitions on the drive.

4) Metattach the metapartitions to recreate your mirrors.

Once the metattach has completed syncing, the task will be finished.
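
The detach/clear/init/attach route for a single submirror can be sketched as a dry-run function that prints the commands instead of running them. The name print_rebuild_submirror is hypothetical, and the "1 1" arguments to metainit simply mirror the one-slice entries in the md.tab from the original question.

```sh
#!/bin/sh
# Dry-run sketch: print the metaclear/metainit route for one submirror.
# Hypothetical usage: print_rebuild_submirror <mirror> <submirror> <slice>
print_rebuild_submirror() {
    mirror=$1; submirror=$2; slice=$3
    echo "metadetach $mirror $submirror"
    echo "metaclear $submirror"
    echo "# replace and partition the drive here"
    # Recreate the submirror as a single-slice concat, as in md.tab.
    echo "metainit $submirror 1 1 $slice"
    echo "metattach $mirror $submirror"
}
```

For instance, "print_rebuild_submirror d10 d12 c0t1d0s1" lists the five steps for the swap mirror.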

**************************************************************************
From: "Thomas M. Payerle" <payerle@physics.umd.edu>

I find the advice on
http://www.slacksite.com/solaris/disksuite/SDSrecovery.html
to be pretty good.  I believe I even gave it a test run once.  It may also be
excessive, as it refers to boot/root devices and assumes a 2-disk mirror
setup (wherein there are complications, as you will only have half, not half+1,
of the metadb replicas up).

I believe the metadetach, format, (metattach), metareplace sequence works (I
won't say it is the correct procedure, because the other may work as well).
The question about power on or off depends more on the hardware than on
DiskSuite: is your hardware hot swappable?  If it is not, or you are unsure,
you should power down after detaching, replace the drive, and power back up.
If it is hot swappable, you can just replace the drive with the power on.

If the replacement drive is ID'ed like the original (same SCSI chain, ID,
etc., so the /dev/cntndnsn names are unchanged), I don't believe
metattach is needed, and you can just do a metareplace -e (see the metareplace
man page).  The drive must be labelled first to match the old prtvtoc info
(actually, the affected partition probably just needs to be the same size, but
since one usually mirrors entire drives, the effect is the same).

**************************************************************************
From: "Kevin Buterbaugh" <Kevin.Buterbaugh@lifeway.com>


     Sun's short procedure is wrong.  Your procedure is correct, replacing
the metaoffline and onlines with metadetach and metattach, respectively.

     As an aside, Sun does have an excellent "Blueprints" book on this
(covers mirroring the root disk with both DiskSuite and Veritas).  It's
called "Sun Blueprints Guide to High Availability" by Kobert.  Well worth
it, IMHO...


**************************************************************************
From: "Mortensen, Henrik" <henrik.mortensen@csfb.com>

metaonline only makes sense when the disk you're online'ing 
already contains most of the data for the mirror (i.e. if 
that was the disk previously metaoffline'd); when swapping
you need a full sync-up.  

I'd yank c0t0d0 (you can metadetach all mirrors on that disk 
if you don't trust ODS, but I've never needed to); swap c0t0d0; 
prtvtoc /dev/dsk/c0t1d0s2 | fmthard -s - /dev/rdsk/c0t0d0s2 (if 
you have the same geometry, otherwise do it manually with format 
or re-label); metareplace -e all the now broken mirrors (or the 
one broken and metattach the rest if you metadetached them).

To fix the last-err'ed slice, you can initially try to 
metareplace -e it as it is (since both disks are on the same 
scsi bus the error might be bus related).  If it still fails, 
you'll have to do the above for c0t1d0 as well.
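
After copying the label with prtvtoc and fmthard, it may be worth checking that both disks really carry the same partition table before starting a resync. The sketch below compares two saved prtvtoc listings. The helper names vtoc_rows and same_vtoc are made up, and it assumes the layout shown in the prtvtoc output of the original question (comment lines starting with "*", then one row per partition).

```sh
#!/bin/sh
# Print only the partition rows of a prtvtoc listing read from stdin:
# drop the "*" comment lines and normalise the column spacing.
# The optional mount directory column is ignored.
vtoc_rows() {
    awk 'NF > 0 && substr($1, 1, 1) != "*" { print $1, $2, $3, $4, $5, $6 }'
}

# Succeeds when two saved prtvtoc listings describe the same partitions.
# Hypothetical usage: same_vtoc good.vtoc new.vtoc && echo "labels match"
same_vtoc() {
    [ "$(vtoc_rows < "$1")" = "$(vtoc_rows < "$2")" ]
}
```

The comment lines have to be stripped because the first one names the device, which differs between the two disks even when the tables are identical.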

**************************************************************************
From: "Keplinger, Michael A" <michael.keplinger@nmci-isf.com>


I recently had some similar problems.  Are you looking to replace the
whole drive?  If so, you should be able to do so with just the
metareplace -e command.  However from looking at your md.tab file it
doesn't look like your second mirror is attached.

Verify this with metastat -p;
the output should look like this for d0 (note the main difference: d2 is
listed on the d0 line)
d0 -m d1 d2 1
d1 1 1 c0t0d0s0
d2 1 1 c0t1d0s0
 
If this is okay and you are planning on replacing the whole drive then I
am pretty sure the best way to do it is to just swap it out then copy
over the partition table.  You don't even need to create filesystems.
 
Then for each of the mirrors run the following command
metareplace -e d0 c0t1d0s0
metareplace -e d10 c0t1d0s1
metareplace -e d20 c0t1d0s3
metareplace -e d30 c0t1d0s4
 
You will notice that the first parameter after the -e flag is the mirror
name not the submirror name
This will cause the mirrors to resync, unless they weren't attached to
begin with, in which case you will want to run 
metattach d0 d2
metattach d10 d12
metattach d20 d22
metattach d30 d32
 
Once you are done with this you will want to recreate the metadevice
database on that disk.
See what databases you have with the metadb command,
then destroy and recreate the databases on that disk:
metadb -d c0t1d0s?
metadb -a -c3 c0t1d0s?  (I usually put 3 copies of the DB on each disk
when there are only 2 disks)
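
The metareplace -e lines above can also be derived from the metastat -p output rather than typed by hand. The following is a sketch only: the function name replace_cmds is made up, it prints commands without running them, and it handles just the simple case shown above (a "-m" mirror line plus one-slice "1 1" submirrors).

```sh
#!/bin/sh
# Reads `metastat -p` style lines on stdin and prints one
# "metareplace -e <mirror> <component>" per attached submirror on <disk>.
# Hypothetical usage: metastat -p | replace_cmds c0t1d0
replace_cmds() {
    awk -v disk="$1" '
        $2 == "-m" {                     # mirror line: dN -m sub1 sub2 ... pass
            for (i = 3; i <= NF; i++)
                if ($i ~ /^d[0-9]+$/) parent[$i] = $1
            next
        }
        $2 == 1 && $3 == 1 {             # one-slice submirror: dN 1 1 slice
            comp[$1] = $4
            order[++n] = $1
        }
        END {
            for (k = 1; k <= n; k++) {
                s = order[k]
                c = comp[s]
                sub(/^.*\//, "", c)      # strip a leading /dev/dsk/ if present
                if ((s in parent) && index(c, disk "s") == 1)
                    print "metareplace -e", parent[s], c
            }
        }'
}
```

A submirror that is not listed on any mirror line (as with d2 in the md.tab from the original question) produces no output, which is the case where metattach is needed instead.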
 
**************************************************************************
From: Eric Shafto <eshafto@mac.com>

If you're replacing it with the same disk model in the same slot at the same
SCSI ID, you don't need to do the devfsadm or drvconfig or disks.

I've done this before several times. There may be a quicker way to do it but
here's what worked for me:

1. metadb to remove the metadbs from the failing disk.
2. metadetach each of the submirrors on the failing disk.
3. shut down and replace the disk.
4. format, replicate the partition table from the good disk to the new disk.
5. metadb to add the metadbs to the new disk
6. metattach each of the submirrors on the new disk (you don't have to
create them, since they don't really exist on the disk. Simply having them
in the metadb is sufficient).
7. installboot on disk 2 (saves you a boot cdrom when disk 1 fails).

Before doing step 1, make sure you don't leave yourself with too few
metadbs. If you have only one or two on each disk, and you don't have any on
any other disks, then you don't have enough and your recovery will be more
complicated. Never leave yourself with fewer than three metadbs.
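
The arithmetic behind that warning can be sketched in a few lines. This assumes the replica rule as I understand it from the DiskSuite documentation (the system keeps running with at least half of the replicas, but needs more than half to reboot), so treat it as an illustration and check the manual. The function name replica_status is made up.

```sh
#!/bin/sh
# replica_status <total-replicas> <replicas-still-available>
# Prints one of: ok, runs-no-reboot, panic.
replica_status() {
    total=$1; alive=$2
    if [ $((alive * 2)) -lt "$total" ]; then
        echo "panic"            # fewer than half of the replicas remain
    elif [ $((alive * 2)) -eq "$total" ]; then
        echo "runs-no-reboot"   # exactly half: stays up, will not reboot cleanly
    else
        echo "ok"               # majority (half + 1 or more) remains
    fi
}
```

With three replicas on each of two disks, "replica_status 6 3" describes the state after one disk dies: exactly half remain, which is why a two-disk setup needs extra care at the next reboot.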

**************************************************************************


 
---------------------------------------------------------------------
Clive Elsum BAppSc, RHCE
Systems Engineer - Information Technology Group
CSIRO Atmospheric Research
PMB 1, Aspendale, Victoria, Australia  3195
Phone : (+61 3) 9239 4509
Fax:    (+61 3) 9239 4444
E-mail Clive.Elsum@csiro.au
---------------------------------------------------------------------

Original question:

Hi ,

I am having problems getting a definitive approach to replacing a mirrored 
system disk on our Sun 420R.

We are running Solaris 8 on a Sun 420R with 2 18GB disks mirrored via
DiskSuite 4.2.1. The second disk is showing errors and needs to be replaced.
The problem is I keep getting conflicting information on the correct procedure.
Sun basically gave "short shrift" saying use metaoffline, metaonline,
metareplace.

1 - use the command metaoffline <mirror name> ...to offline the mirror
(the secondary one. )
2 - Shutdown and replace the faulty disk and run devfsadm or drvconfig ; disks 
3 - Up the system and run the command metaonline <mirror name>
4 - when disks are synced run the command metareplace -e 
The mirror will then eventually recover .

This does not seem correct, as metaonline would enable at bootup and a boot -r
would reconfigure the disks etc. Also no mention of formatting the disk.

Other material I have looked at says to metadetach, then replace the faulty
disk (some say power down, others say on-line), format the disk as per the
failed disk's prtvtoc, then metattach, then metareplace.


I really need a definitive method of attack that will work. 

Given the md.tab file is:
#       Mirror for /
#
d0      -m      d1
d1      1 1 /dev/dsk/c0t0d0s0
d2      1 1 /dev/dsk/c0t1d0s0
#
#
#       Mirror for swap
#
d10     -m      d11
d11     1 1 /dev/dsk/c0t0d0s1
d12     1 1 /dev/dsk/c0t1d0s1
#
#
#       Mirror for /usr/local
#
d20     -m      d21
d21     1 1 /dev/dsk/c0t0d0s3
d22     1 1 /dev/dsk/c0t1d0s3
#
#
#       Mirror for /it
#
d30     -m      d31
d31     1 1 /dev/dsk/c0t0d0s4
d32     1 1 /dev/dsk/c0t1d0s4


Would the correct procedure be:

metaoffline d0 d2
metaoffline d10 d12
metaoffline d20 d22
metaoffline d30 d32
Replace disk "hot swap" NO POWER OFF
Format the disk as per prtvtoc of old disk
metaonline d0 d2
metaonline d10 d12
metaonline d20 d22
metaonline d30 d32
metareplace -e d2 c0t1d0s0
metareplace -e d12 c0t1d0s1
metareplace -e d22 c0t1d0s3
metareplace -e d32 c0t1d0s4

OR do I replace metaoffline with metadetach and metaonline with metattach
and if so are there any other steps I am missing.

Much thanks in advance

Clive


Output info shows:


# iostat -E
sd0      Soft Errors: 48 Hard Errors: 0 Transport Errors: 0 
Vendor: IBM      Product: DDYST1835SUN18G  Revision: S96H Serial No: 157444           
Size: 18.11GB <18110967808 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 48 Predictive Failure Analysis: 0 
sd1      Soft Errors: 48 Hard Errors: 35 Transport Errors: 16 
Vendor: IBM      Product: DDYST1835SUN18G  Revision: S96H Serial No: 10K705           
Size: 18.11GB <18110967808 bytes>
Media Error: 30 Device Not Ready: 0 No Device: 5 Recoverable: 0 
Illegal Request: 48 Predictive Failure Analysis: 0 
sd6      Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: TOSHIBA  Product: DVD-ROM SD-M1401 Revision: 1007 Serial No: 06/22/00 
Size: 18446744073.71GB <-1 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd30     Soft Errors: 1 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: OPENstorage 9176 Revision: 0401 Serial No: 1T03310196       
Size: 365.06GB <365061079040 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 1 Predictive Failure Analysis: 0 
sd46     Soft Errors: 0 Hard Errors: 1 Transport Errors: 0 
Vendor: STK      Product: OPENstorage 9176 Revision: 0401 Serial No: 1T02811801       
Size: 365.06GB <365061079040 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 1 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd68     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: OPENstorage 9176 Revision: 0401 Serial No: 1T03310196       
Size: 220.09GB <220091908096 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd74     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: OPENstorage 9176 Revision: 0401 Serial No: 1T02811801       
Size: 220.09GB <220091908096 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd330    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: Universal Xport  Revision: 0401 Serial No: 1T03310196       
Size: 0.02GB <18874368 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
sd474    Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: Universal Xport  Revision: 0401 Serial No: 1T02811801       
Size: 0.02GB <18874368 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 0 Predictive Failure Analysis: 0 
st15     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: 9840             Revision: 1.30 Serial No: .109 
st16     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: 9840             Revision: 1.30 Serial No: .109 
st17     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: 9840             Revision: 1.30 Serial No: .109 
st18     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: T9940A           Revision: 1.30 Serial No: .210 
st19     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: T9940A           Revision: 1.30 Serial No: .210 
st20     Soft Errors: 0 Hard Errors: 0 Transport Errors: 0 
Vendor: STK      Product: T9940A           Revision: 1.30 Serial No: .210 


# metastat
d0: Mirror
    Submirror 0: d1
      State: Needs maintenance 
    Submirror 1: d2
      State: Needs maintenance 
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 16779432 blocks

d1: Submirror of d0
    State: Needs maintenance 
    Invoke: metareplace d0 c0t0d0s0 <new device>
    Size: 16779432 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t0d0s0                   0     No    Maintenance  


d2: Submirror of d0
    State: Needs maintenance 
    Invoke: after replacing "Maintenance" components:
                metareplace d0 c0t1d0s0 <new device>
    Size: 16779432 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t1d0s0                   0     No    Last Erred   


d10: Mirror
    Submirror 0: d11
      State: Okay         
    Submirror 1: d12
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 4198392 blocks

d11: Submirror of d10
    State: Okay         
    Size: 4198392 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t0d0s1                   0     No    Okay         


d12: Submirror of d10
    State: Okay         
    Size: 4198392 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t1d0s1                   0     No    Okay         


d20: Mirror
    Submirror 0: d21
      State: Okay         
    Submirror 1: d22
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 8392072 blocks

d21: Submirror of d20
    State: Okay         
    Size: 8392072 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t0d0s3                   0     No    Okay         


d22: Submirror of d20
    State: Okay         
    Size: 8392072 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t1d0s3                   0     No    Okay         


d30: Mirror
    Submirror 0: d31
      State: Okay         
    Submirror 1: d32
      State: Okay         
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 5955968 blocks

d31: Submirror of d30
    State: Okay         
    Size: 5955968 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t0d0s4                   0     No    Okay         


d32: Submirror of d30
    State: Okay         
    Size: 5955968 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c0t1d0s4                   0     No    Okay         


# prtvtoc /dev/rdsk/c0t1d0s0
* /dev/rdsk/c0t1d0s0 partition map
*
* Dimensions:
*     512 bytes/sector
*     248 sectors/track
*      19 tracks/cylinder
*    4712 sectors/cylinder
*    7508 cylinders
*    7506 accessible cylinders
*
* Flags:
*   1: unmountable
*  10: read-only
*
*                          First     Sector    Last
* Partition  Tag  Flags    Sector     Count    Sector  Mount Directory
       0      2    00          0  16779432  16779431
       1      3    01   16779432   4198392  20977823
       2      5    00          0  35368272  35368271
       3      4    00   20977824   8392072  29369895
       4      0    00   29369896   5955968  35325863
       7      0    00   35325864     42408  35368271
# 


Thanks in advance

Clive
Received on Wed Jul 10 23:47:03 2002
