SUMMARY: D1000 power failure with Disksuite: how to restore to running state?

From: David Graves <dsgraves_at_gmail.com>
Date: Mon Feb 27 2006 - 22:20:16 EST

Many thanks to all who replied.  This story has a happy ending.

When an array loses power but the server does not, every read the system
attempts against a disk in that array fails.  In a mirror, it is therefore
possible to have multiple read failures.  For each failure that is not
fatal (i.e. the server still thinks there's an available mirror or slice),
it marks the disk as 'maintenance'.  When a read from the last available
slice also fails, that disk is placed into the 'last erred' state.

In a RAID 5 system, only 1 disk will enter the 'maintenance' mode.  The next
failed read places the failed disk into the 'last erred' mode, and the
entire metadevice is taken offline.  Further attempts at reads result in IO
errors.

As dersmythe pointed out, the Disksuite user manual refers to a power
failure like this.  The procedure is to use the metareplace command with the
-e switch on the disk in 'maintenance' mode.  An example is:  metareplace -e
dx cxtxdxsx (replacing the x's with the metadevice and slice that have
failed).

It is important to use this command first on the 'maintenance' disk before
attempting to enable the 'last erred' disk.
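
As a rough sketch of that ordering, the shell function below only prints
the metareplace commands; it does not run them.  The metadevice name d0 is
an assumption (the listing in my original question does not give it), and
the input is the abbreviated "slice state" listing from that question, not
raw metastat output.

```shell
#!/bin/sh
# Dry run: print (do not execute) the metareplace commands in the safe order.
# MD is a hypothetical metadevice name; the original post does not give it.
MD=${MD:-d0}

# Reads "slice state" lines on stdin (the abbreviated listing format from the
# original question, NOT raw metastat output) and prints one metareplace -e
# command per errored slice: every 'maintenance' slice first, then any
# 'last erred' slice.
plan_enable() {
    awk -v md="$MD" '
        { state = tolower($2 (NF > 2 ? " " $3 : "")) }
        state == "maintenance" { first[++m] = $1 }
        state == "last erred"  { last[++e]  = $1 }
        END {
            for (i = 1; i <= m; i++) print "metareplace -e " md " " first[i]
            for (i = 1; i <= e; i++) print "metareplace -e " md " " last[i]
        }'
}

# Usage with the two errored slices from the original question:
plan_enable <<'EOF'
c3t1d0s0 okay
c3t3d0s0 Maintenance
c4t3d0s0 last erred
EOF
# prints:
#   metareplace -e d0 c3t3d0s0
#   metareplace -e d0 c4t3d0s0
```

Printing the commands rather than executing them leaves room to review the
plan against the real metastat output before touching the metadevice.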

I personally ran into a problem with this method: execution of metareplace
-e failed and reported that I must use the -f (force) switch.  Feeling
uncomfortable with proceeding, I held off to do more research.

A SECOND method of recovery is available as well: it is possible to CLEAR
the metadevice with the metaclear command, and rebuild it, as long as the
slices/disks are ordered just as they were prior to failure.  The order is
revealed with the metastat -p command.  The manual recommends using this
method only when the first method is unsuccessful.  According to the manual,
all metadevices (mirrors, raids, concats, etc) can be rebuilt in this
fashion.
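
A minimal sketch of this second method, again as a dry run that only prints
commands.  It assumes the metastat -p line was saved before clearing; the
name d0, the slice list, and the interlace value in the usage example are
made up.  For RAID 5 it refuses to go on unless the saved line carries -k,
on my understanding that metainit would otherwise re-initialize the slices;
check metainit(1M) on your release before relying on that.

```shell
#!/bin/sh
# Dry run: print (do not execute) a clear-and-rebuild plan for one metadevice.
# Argument: one line saved from `metastat -p` BEFORE clearing anything.
plan_rebuild() {
    line=$1
    md=${line%% *}    # first word of the line is the metadevice name

    # RAID 5 lines contain -r. Without -k, metainit is assumed to initialize
    # the slices, so refuse to emit a plan in that case.
    case " $line " in
        *" -r "*)
            case " $line " in
                *" -k "*) ;;
                *)
                    echo "refusing: RAID 5 line for $md has no -k" >&2
                    return 1
                    ;;
            esac
            ;;
    esac

    echo "metaclear $md"
    echo "metainit $line"
}

# Usage with a hypothetical saved line (d0, slices, and 32b are made up):
plan_rebuild 'd0 -r c3t1d0s0 c3t3d0s0 c3t4d0s0 -k -i 32b'
# prints:
#   metaclear d0
#   metainit d0 -r c3t1d0s0 c3t3d0s0 c3t4d0s0 -k -i 32b
```

Reusing the saved line verbatim keeps the slices in exactly the order they
had before the failure, which is the condition the manual sets for this
method.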

There are valuable documents on docs.sun.com as well as an article on
sunsolve.sun.com referring to the second method of recovery.  Sunsolve
requires a subscription.

As it happens, I ended up using neither procedure as planned.  In my case,
a coincidental memory error caused a panic (I wonder whether this was due
to the original power outage, as this machine is otherwise quite stable).
Upon reboot, the metadevice was online and the 'last erred' state had
cleared.  I used the metareplace -e command to clear the 'maintenance' disk
and all was well.

Thanks again to Prabir Sarkar, dersmythe, Michael T Pins, and Damian Wiest

-dave

---------- Forwarded message ----------
From: David Graves <dsgraves@gmail.com>
Date: Feb 21, 2006 10:11 PM
Subject: D1000 power failure with Disksuite: how to restore to running
state?
To: sunmanagers@sunmanagers.org

I have an Ultra 30 connected to a D1000 with 2 controllers (D1000 is split
in 2, 6 disks to a controller).  All the disks are configured as one Raid 5
metadevice.

I experienced what appears to be a power glitch: enough to power down the
D1000, but not powering down the server.

This is my guess at what happened next: The server tried to write to the
first disk and, being unable to, marked it 'maintenance'.  The next disk
write produced a 'last erred' error, and took the metadevice (Raid 5)
offline.

I powered the D1000 back up and it came back to life.

My _guess_ is that the data is intact.

A metadb shows the following:

flags        first blk   block cnt
Wm  pc l     16          1034        /dev/dsk/c3t4d0s0
W   pc l     16          1034        /dev/dsk/c4t1d0s0
a   pc luo   16          1034        /dev/dsk/c0t1d0s7
a   pc luo   1050        1034        /dev/dsk/c0d0s7
a   pc luo   2084        1034        /dev/dsk/c0t1d0s7

and a metastat shows:

c3t1d0s0 okay
c3t3d0s0 Maintenance
c3t4d0s0 okay
c3t5d0s0 okay
c4t0d0s0 okay
c4t1d0s0 okay
c4t3d0s0 last erred
c4t4d0s0 okay

QUESTION:  While I have backups, they're older than I'd like them to be, and
I have reason to believe that the only thing wrong here is that the D1000
powered down, and the data on the array is good.   WHAT is the best way to
attempt to recover?   Is using the metainit -k command a safe way to
proceed?   Is that the best way to proceed?
Will post all answers in summary.

TIA

-dave