I asked for best practices and the supported method for replacing A1000 batteries (with a focus on whether or not it's wise to do it online). I received a number of very helpful replies, which I summarize below.

First, apologies for the very late summary; unfortunately our vendor took >2 months to provide us with the 4 necessary batteries 8^(

===== Responses (email addresses removed for spam prevention) =====

Albert Musin wrote:

Hi,
I did it online as written in RM instruction (recovery guru). It was OK.

-------------------------------------------------
ponder wrote:

While Sun still claims it is impossible, we have changed about 20 batteries without turning the cabinets off in half a day. Don't forget to reset the battery age (can't remember the command at the moment, check the osa docs). And of course this procedure is completely unsupported by Sun...

-------------------------------------------------
Sudesh Shetty told me:

Hi Daniel,
You can replace the battery online, as I have done it before on an A1000 connected to a production server.

-------------------------------------------------
Valuable insight from Sun came from Tony Walsh:

It has never been a Sun engineering supported procedure to "hot swap" batteries in an A1000 array (rumoured to be litigation related..). The Sun statement of support states that a battery exchange should be made with all power removed from the array.

I have found that this policy can often lead to much more drastic failures if the array has other issues that are not apparent (or not checked) before power is removed. It also gives disk drives a chance to cool down (even for a short period) and makes it much more likely that a drive will not spin up when power is reapplied.

Not to be deterred by any of this, the approach I usually use is to give the customer the choice and advise that data loss is a possibility, and then use the following process:-

- Determine if the host can be shut down to the OK> prompt.
  If it can, this will stop all IO to the array and prevent data loss. At this point I just "warm swap" the battery and make sure it begins charging.
- If the host must remain active, the next level of data safety is to unmount all the file systems on the host(s), wait 10-15 minutes to ensure the cache is flushed, then exchange the battery and make sure it starts charging.
- If all filesystems must remain mounted, I then turn off all "write through" caching using the RM6 command line or GUI, wait at least 10-15 minutes to ensure the cache is flushed, and then change batteries.

I have never yet had a data loss event or any other electrical misadventure, but it must be said again, the risk is all in your hands as the batteries are not designed as a "hot swap" item.

In your case, as the batteries are already well beyond their "use-by" date, the cache "write through" has already been turned off, so I would just pull each battery in turn and replace it with a new one (replace the first battery before removing the next one). The only caution I would give is to make the change during a low-activity period, just in case there is an electrical misadventure.

One other warning is to make sure the battery has been recharged within the last 6 months, or the recharge operation in the array may take too long and then cause the "new" battery to be marked bad again.

-------------------------------------------------
Gene Beaird has a different practice:

I have been working with A1000s since about 2000, and have never heard of doing the battery change online. We have always scheduled an outage, powered the system off and swapped the batteries. Should have maybe tried it on one of our standby systems when I had the chance, but never did. I guess the potential power surge when you pull the battery out can fry some of the electronics on the system.

-------------------------------------------------
Tim Chapman pointed out:

For what it is worth, a semi-recent summary to the list reiterates the requirement, "turn off, swap battery, power on, issue 'battery age reset' command" ...

http://www.sunmanagers.org/pipermail/summaries/2002-July/002014.html

Maybe others will suggest otherwise?

-------------------------------------------------
Tod Sandman sent me this very helpful recipe:

I always replace live by first turning off caching. I too heard rumors etc. and so recently verified my procedure with our Sun onsite guy. For what it's worth, he says yep, that's the way to do it. Let me know if you hear otherwise. Here are my typical steps:

## Define these 2:
device=c1t5d0s0; luns=0,1,2,3,4

## To check battery:
raidutil -c ${device} -B

## To replace battery:
## Turn off caching if it is still enabled:
raidutil -c ${device} -w off ${luns}
## Replace the battery.
## Reset battery age:
raidutil -c ${device} -R
## Re-enable cache after replacing battery:
raidutil -c ${device} -w on ${luns}

-------------------------------------------------
Dirk Boenning summarized like this:

Hello,
Sun: You have to shut down the whole system.
Experience: Be sure to disable the cache. Change the battery and enable the cache again. Done a couple of times without any problem.

-------------------------------------------------
Gene Siepka knew:

You can definitely do it online. Our local Sun Field Engineer says it's like an urban legend that you must power off... I've done it on an A1000 just 2 months ago. No problem.

-------------------------------------------------
My original mail:

Hello List,

we have a couple of A1000 arrays with 8 and 12 disks. All of their batteries exceeded their lifetime long ago and are to be considered dysfunctional. As we hope to gain performance from a working cache, we want to replace them; however, as they are attached to HA systems, we would clearly prefer to do that "online", i.e. without shutting them down.
I recall it used to be a supported procedure to replace the batteries like that, but I heard rumors that Sun has backed off from that. Does anyone have insights, experiences, rumours (s)he could share on that topic?

-------------------------------------------------
What I did:

- used rm6 (shame on me for using a GUI) to turn the cache off (of course it had been disabled for years already; the oldest of the replaced batteries dated back to 1998)
- got myself a console window to see kernel output, and also kept an eye on messages
- pulled the old battery online (but, as recommended, at a time when it would hurt the fewest users if something bad happened). Pulling it out proved to be somewhat difficult for two of them because the adhesive labels on top had suffered and rolled themselves up as I dragged the canister out; I had to use some force to get them out the last few centimeters.
- slid the new one in
- watched console and messages
- eventually 3 messages per battery showed up, looking like this:

  raid: [ID 702911 user.error]
  Sense=700006000000009800000000A0000000000000000000000000000000000000000000800
  000082C000000000000000000000000000B053154383032303130303720202020202003010300
  00010000000000000000000000000000000000000000000000000005000000000000000000000
  0000000000000000000000000000000000000000000000009A412A93039313730342F30393339
  353400000000000000

  No need to panic though, it's just this one message.
- reset the battery age as suggested by Tod by doing
  raidutil -c ${device} -R
- turned caching on again (with rm6 again, which told me right away that the cache was temporarily disabled again while the batteries were charging)
- checked cache status 20 min later; it was on and working
- tested performance with iozone; the performance improvement was about a factor of 2.5

Thanks everyone for the insights.

Daniel
_______________________________________________
sunmanagers mailing list
sunmanagers@sunmanagers.org
http://www.sunmanagers.org/mailman/listinfo/sunmanagers

Received on Wed Sep 22 10:31:45 2004
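
For anyone who wants to rehearse Tod's recipe before touching a production array, here is a minimal dry-run sketch. It uses only the raidutil options quoted in this thread (-B, -w on/off, -R); the helper names run, pre_swap and post_swap and the DRY_RUN switch are made up for illustration and are not part of RM6. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the online battery swap sequence from Tod's recipe.
# Assumption: only the raidutil options quoted in the thread are used.
# run, pre_swap, post_swap and DRY_RUN are hypothetical helper names.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands (default), 0 = execute

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Before pulling the battery: check its status, then turn caching off.
pre_swap() {            # usage: pre_swap <device> <luns>
    run raidutil -c "$1" -B
    run raidutil -c "$1" -w off "$2"
}

# After the new battery is in: reset the battery age, re-enable caching.
post_swap() {           # usage: post_swap <device> <luns>
    run raidutil -c "$1" -R
    run raidutil -c "$1" -w on "$2"
}
```

Typical use would be to source the file, call pre_swap c1t5d0s0 0,1,2,3,4, wait the 10-15 minutes Tony recommends for the cache to flush, swap the battery by hand, and then call post_swap with the same arguments. Set DRY_RUN=0 only once the printed commands look right for your device and LUNs.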