SUMMARY: Off-line it, or Detach it?

From: Mark A. Bialik <mbialik_at_infinityhealthcare.com>
Date: Tue May 28 2002 - 16:08:55 EDT
Thanks to:

Guy Purcell, Scott Howard, Neil Harrison, James Brown, Dan Lorenzini,
Tom Payerle, Gregg Mackenzie, Richard Eisenman, and John Eisenschmidt.

Turns out the disk must have been bad.  Following the advice below,
I tried to metareplace the mirrors with themselves, but the resync
failed and I started getting SCSI errors.

So, I metadetached the mirrors on the problem disk, shut down,
slapped in a new disk, and partitioned the new disk with the same
slice info.  (FYI: I tried an fmthard with the info from the failed
drive, but since the new disk was a different type/geometry, this
failed.  So I recreated the partitions by hand, making sure they were
slightly larger than the old partitions.)  Then simply re-attaching
the mirrors rebuilt the info.
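
For reference, the usual way to clone a slice layout from one disk to
another looks roughly like this (a sketch; as noted above, it only
works when the two disks share the same type/geometry):

        prtvtoc /dev/rdsk/c2t1d0s2 > /tmp/c2t1d0.vtoc   # save the old label
        # ...swap the disk...
        fmthard -s /tmp/c2t1d0.vtoc /dev/rdsk/c2t1d0s2  # stamp it on the new disk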

Thanks very much to everyone for their help.

Mark

=======================================================================
Sometimes, but it almost never tells you what the problem was, so you 
won't know how to fix it the next time it happens.

Personally, I'd try an approach somewhere in between 2 & 3 first.  If 
the disk has physical problems, then #2 is a waste of time.  But if the 
problems aren't severe enough to require replacement, then #3 is 
overkill--at least for now.  (If the problems are physical, I'd 
definitely want the disk replaced; it's just better to do replacements 
when you _want_ to than when you _have_ to.)

I'd metadetach the submirrors on the bad disk (all of 'em).  Then 
reformat the disk to find/remove bad regions.  And finally, metattach 
the submirrors again.  All of that can be done without taking any 
services down.  If format reports tons of bad blocks, or you continue to 
see SCSI errors, replace the disk.  You don't say what system houses the 
disk in question; if it's hot-swappable, you should be able to do a 
complete disk replacement & mirror resync while the system is up & 
running.
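
Putting that sequence into commands (a sketch, using the device names
from the metastat output at the bottom of this mail):

        metadetach -f d2 d1     # detach every submirror on the bad disk;
        metadetach -f d8 d7     # -f is needed while they're errored
        format                  # select c2t1d0, then analyze/read to
                                # find bad blocks and map them to spares
        metattach d2 d1         # reattach; the resyncs run while the
        metattach d8 d7         # system stays up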

--
Guy
(guy@extragalactic.net)

=======================================================================
There are two real options you can take here...

1. Reattach the mirrors.  The best way to do this is with metareplace:

        metareplace -e d2 c2t1d0s0
        metareplace -e d8 c2t1d0s3

2. Swap the disk.

Personally, I'd go for number 1 and see what happens.  If the disk
really is bad, it will fail again either during the resync or shortly
afterwards, at which point you'll be no worse off than you are now and
you'll have to take option 2.
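
One way to tell whether the disk is really going bad during the resync
(a sketch; iostat -E reports the per-device soft/hard/transport error
counters):

        iostat -En c2t1d0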

  Scott

=======================================================================
First thing to try with your DiskSuite problem would be to do a virtual
replace of the dodgy metadevices, i.e.:

        for the d1 submirror, run "metareplace -e d2 c2t1d0s0"
        for the d7 submirror, run "metareplace -e d8 c2t1d0s3"

A "metastat" command should show the mirrors syncing, 
there is no need to reboot....
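
While a submirror is rebuilding, the metastat output looks roughly
like this (a sketch, not captured output):

        # metastat d2
        d2: Mirror
            Submirror 0: d0
              State: Okay
            Submirror 1: d1
              State: Resyncing
            Resync in progress: 42 % done
            ...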


Hope this helps

Neil Harrison

=======================================================================
Yes, I have that happen all the time, and I don't know why.  Just one
slice of a disk will go offline while the other slices are fine.  This
is how you correct it in the DiskSuite GUI:

1) Bring the mirror into the main window.

2) Right-click on the offending slice that is offline and click on
Info.

3) Click on ENABLE.

4) Commit the transaction.

If all is fine with the disk, it should begin mirroring again and all
will be fine.  I hope this is actually your problem.  In fact, I just
did it 5 minutes ago myself.
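
For reference, the command-line equivalent of the GUI's Enable on an
errored component should be an in-place metareplace (a sketch, using
the slices from the question below):

        metareplace -e d2 c2t1d0s0
        metareplace -e d8 c2t1d0s3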

<James Brown>

=======================================================================
The first thing I would try is to use format(1M) to "repair" the disk.
The safest way to do this is to use the "read" command of the "analyze"
menu.  If it finds a bad block it will attempt to "repair" it (actually
it maps it to a spare sector).  This is the default behavior unless you
change it using the "setup" command.  If the read pass goes through
without errors you might consider doing one of the write options.  In
this case you can use setup to limit the range of the test to the
affected partitions.  Since they are in "maintenance" mode, disksuite
will not be updating them while you run your test.
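
The session looks roughly like this (a sketch; the "read" pass is
non-destructive, the write options are not):

        # format
        (choose c2t1d0 from the disk menu)
        format> analyze
        analyze> setup          # optionally limit the block range tested
        analyze> read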

I have used this many times with good success.  However, sometimes it
does not work, so you must replace the disk.  If that is the case, you
need to use metadetach rather than metaoffline for all metadevices on
the affected drive, and then metareplace -e after the new drive is
installed and properly partitioned.

Regards,

Dan Lorenzini           Greenwich Capital Markets

=======================================================================
Don't believe this will work, but who knows.  Reminds me of the old joke
about what an IT person does when they get a flat tire: turn the car off
and restart it to see if the problem goes away.

Assuming the disk is OK, I believe this will solve the problem.  You
could also go a bit further: detach the mirror, then re-init the mirror
and reattach.  I would probably do the re-init since it isn't much more
work, and it should really clean up any data corruption issues (assuming
a good disk).

There should not be any problem doing this even on root.  After all, the
mirrors are bad, so they should not be in use by anything anyway.  Even
if they were in use, this is the point of mirroring.

The question is whether the old disk is bad or not, and whether the cost
of a new disk exceeds the cost of a possible disk failure.  Since you
are mirroring to begin with, it sounds like an important system, and I
would tend to replace the disk (I might put the old disk to duty in a
less critical situation).

BTW, you should be able to offline the working mirrors on the problem
disk and replace the disk (if it is not hot-swappable, this will require
rebooting; you should ensure that you have more than 50% of the state
database replicas on other disks before rebooting, and delete the
replicas on the problem disk).  Then run metareplace for each of the
mirrors and the resync should start.
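
Checking the replica quorum and dropping the bad replica would look
like this (a sketch, using the metadb output from the question below):

        metadb -i               # six good replicas remain on c0/c1; only
                                # the one on c2t1d0s7 is write-errored
        metadb -d c2t1d0s7      # delete the replica on the failing disk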

<Tom Payerle>

=======================================================================
I would be inclined to first try a fourth option:

 - metareplace the "bad" submirrors in place:

        metareplace -e d2 c2t1d0s0
        metareplace -e d8 c2t1d0s3

If disksuite kicks it/them back out again, you
probably do have something wrong with the disk,
but you could also try option #5:

 - detach/unmirror the bad submirrors (it's been
   a while since I've had to try this, so I can't
   remember if it will let you detach a bad
   submirror...maybe with the -f option...I dunno);
 - metaclear the bad submirrors;
 - either fsck or newfs (your choice) the bad
   partitions, the idea being to "clean up" any
   residual filesystem bugginess;
 - metainit the bad submirrors;
 - metattach the new submirrors (the whole
   sequence is sketched below).
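
Pulling option #5 together for the d1/c2t1d0s0 case (a sketch, using
the device names from the metastat output at the end of this mail;
repeat for d7/c2t1d0s3):

        metadetach -f d2 d1        # -f forces detach of the errored submirror
        metaclear d1               # remove the old metadevice
        newfs /dev/rdsk/c2t1d0s0   # or fsck, to clean up the partition
        metainit d1 1 1 c2t1d0s0   # recreate the submirror
        metattach d2 d1            # reattach; the resync starts automatically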

If that doesn't work, option #2 would be my next
choice, then option #3.

Option #1 doesn't work because the mddb retains its
state between reboots.  It would still think that
the components are bad.

Good luck.

Gregg Mackenzie

=======================================================================
I would probably do:

Detach the submirrors on the failing disk (metadetach -f ...; see the
command sketch after these steps)

Clear failing disk (metaclear ...)

Get rid of Replica Dbs on failing disk (metadb -d ....)

Edit /etc/vfstab and change back to standard device names

Shutdown, remove failing disk, reboot (and check that everything comes
up OK).

Shutdown, put in a new disk (be sure it's clean; if it happens to have
replica DBs on it from some other previous configuration you may have
some trouble).  Reboot.

Setup the mirror configuration again ...
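
The first few steps as commands (a sketch, using the names from the
metastat/metadb output below; metaroot rewrites /etc/vfstab and
/etc/system for the root slice, while other filesystems such as /var
are edited back by hand):

        metadetach -f d2 d1          # detach the failing submirror
        metaclear d1                 # clear it
        metadb -d c2t1d0s7           # drop the replica on the failing disk
        metaroot /dev/dsk/c2t0d0s0   # revert / to the surviving good slice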

Richard Eisenman

=======================================================================
This might be a little late, but I thought it might help.

We have some V880s running DiskSuite 4.2, and we've seen one quirk.
When we were building the systems and rebooting a lot (screwing with
kernel parameters), we found that if the system came up with the
mirrors out of sync despite a normal reboot, they would be out of sync
every time we rebooted.  So we'd reboot, DS would tell us they needed
maintenance, we'd metareplace the disk with itself, let it do a full
rebuild until DS said they were consistent, then reboot again and the
same disk would be out of sync.  If we detached the mirror and
reattached it (letting it sync, obviously), that would fix the problem
and every reboot after that would come up clean.  Strange, but I've
seen it on a couple of different Solaris installs on a couple of
different boxes.

Aside from that, DS is great.  If you're still having problems it might
be worth detaching and reattaching the mirror to see if it fixes the
problem before you do something crazy like reboot.

Best,
John 

=======================================================================

Original Question:

Hello:

I have a problem with a DiskSuite 4.2 mirror, and I'd like some advice
on how to tackle it.  I have a few two-way mirrors.  I recently
discovered that some of the submirrors went into a
"Maintenance/Critical" state.  One mirror is mounted as / and the other
as /var.

In each case, the failed submirror is on the same disk.  However, the
same disk also has another submirror which is working just fine, so I'm
guessing the disk may not actually be bad (then again, it could become
a problem).

I have included my metastat, metadb, and syslog output detailing the
errors at the bottom of this email.  In each instance, the bad
submirror is on c2t1d0.  The metadb replica I also had on this disk is
bad, but I've got six other ones spread across two other controllers.

My question is this:  What is my best approach?  I can see three
options:

1) Reboot and hope the problem clears itself up  :)  Does this actually
work sometimes?

2) Offline the submirrors and then "online" them.  Since one of the
submirrors is for / I'm not exactly sure if this is a good idea.  If it
matters, the problem disk is not the primary boot disk.  Is this a good
option to try before breaking the root mirror and going through the
hassle?

3) Detach/Unmirror the root, reboot, edit the correct files, come up
unmirrored, slap in a new disk, etc.

Again, I'm not sure the disk is actually bad since another submirror is
OK.  But there could be some bad sectors.

This is my first problem under DiskSuite in about two years, so I guess
I've been pretty lucky.  It obviously saved my butt, and I don't want
to make matters worse by doing something stupid.  Any help is greatly
appreciated.  I have an hour of scheduled downtime starting in about 8
hours :)

Will summarize.

Thanks very much,
Mark


# metadb -i
        flags           first blk       block count
     a m  p  luo        16              1034            /dev/dsk/c0t0d0s7
     a    p  luo        16              1034            /dev/dsk/c0t1d0s7
     a    p  luo        16              1034            /dev/dsk/c0t2d0s7
     a    p  luo        16              1034            /dev/dsk/c1t0d0s7
     a    p  luo        16              1034            /dev/dsk/c1t1d0s7
     a    p  luo        16              1034            /dev/dsk/c1t2d0s7
      W   p  l          16              1034            /dev/dsk/c2t1d0s7

# metastat | more
d2: Mirror
    Submirror 0: d0
      State: Okay
    Submirror 1: d1
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 24578400 blocks
 
d0: Submirror of d2
    State: Okay
    Size: 24578400 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c2t0d0s0                   0     No    Okay
 
 
d1: Submirror of d2
    State: Needs maintenance
    Invoke: metareplace d2 c2t1d0s0 <new device>
    Size: 35549760 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c2t1d0s0                   0     No    Maintenance

d8: Mirror
    Submirror 0: d6
      State: Okay
    Submirror 1: d7
      State: Needs maintenance
    Pass: 1
    Read option: roundrobin (default)
    Write option: parallel (default)
    Size: 4097920 blocks
 
d6: Submirror of d8
    State: Okay
    Size: 4097920 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c2t0d0s3                   0     No    Okay
 
 
d7: Submirror of d8
    State: Needs maintenance
    Invoke: metareplace d8 c2t1d0s3 <new device>
    Size: 4097920 blocks
    Stripe 0:
        Device              Start Block  Dbase State        Hot Spare
        c2t1d0s3                   0     No    Maintenance




May  9 08:57:22 emsdb3 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/scsi@1/sd@1,0 (sd46):
May  9 08:57:22 emsdb3  SCSI transport failed: reason 'incomplete': retrying command
May  9 08:58:27 emsdb3 scsi: [ID 365881 kern.info] /pci@4,2000/scsi@1 (glm3):
May  9 08:58:27 emsdb3  Cmd (0x708fc320) dump for Target 1 Lun 0:

May  9 08:58:51 emsdb3 scsi: [ID 107833 kern.warning] WARNING: /pci@4,2000/scsi@1/sd@1,0 (sd46):
May  9 08:58:51 emsdb3  Error for Command: write(10)               Error Level: Fatal
May  9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]    Requested Block: 12028560                  Error Block: 12028560
May  9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]    Vendor: SEAGATE                            Serial Number: 3AK0E8CY
May  9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]    Sense Key: Not Ready
May  9 08:58:51 emsdb3 scsi: [ID 107833 kern.notice]    ASC: 0x4 (<vendor unique code 0x4>), ASCQ: 0x1, FRU: 0x2
May  9 08:58:51 emsdb3 md_stripe: [ID 641072 kern.warning] WARNING: md: d1: write error on /dev/dsk/c2t1d0s0
May  9 08:58:56 emsdb3 md_mirror: [ID 104909 kern.warning] WARNING: md: d7: /dev/dsk/c2t1d0s3 needs maintenance
May  9 08:58:56 emsdb3 md_mirror: [ID 104909 kern.warning] WARNING: md: d1: /dev/dsk/c2t1d0s0 needs maintenance
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers