I received 23 responses to my survey, plus a couple of other opinions
(from people who didn't answer the survey questions themselves).
I'm not a social scientist, so my survey wasn't exactly designed for
easy summarizing. I'll give you an idea of the overall feeling for the
SSA's here (which was what I sought when I sent out the survey). Also,
there are a few important problems (the NVRAM bug was already posted
here) which SSA managers should know about.
Please visit http://www.columbia.edu/~marg/misc/ssa/ if you want to see
the original texts of the responses, as well as my summary, and other
important information and more detailed discussion regarding all of the
following issues.
Hardware
	Many disks have died, but even the sites losing the most disks
	feel the failure rate is within the MTBF. Some people had
	problems with flaky controllers similar to ours, but it seems
	that once you get a good batch of hardware, things work pretty
	well.
NVRAM
	If you are running RAID5, TURN FAST_WRITES OFF until you get the
	new firmware from Sun on April 15. Many, many respondents were
	running with NVRAM enabled, apparently without problems, but
	the only way to ensure that data corruption will not happen
	is to turn it off.
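	(For reference: on our arrays, fast writes are toggled with the
	ssaadm utility. Treat the snippet below as a sketch, not a
	recipe -- the controller path c1 is a placeholder, and you
	should double-check the options against the ssaadm man page on
	your own system before running anything.)

```
# Disable fast writes (NVRAM write caching) on one SSA controller.
# "c1" is a made-up controller name; substitute your own.
ssaadm fast_write -d c1

# Once the fixed firmware is installed, fast writes can be
# re-enabled; see ssaadm(1M) for the enable option on your release.
```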
RAID5
	There is a general consensus that the VxVM RAID5 implementation
	may have a bug causing data corruption. Those running VxVM
	RAID0+1 have not lost data.
Hot Spares
In short, the failure of VxVM to automatically employ hot spares
is now a well-known bug. It will be fixed in the next release
of VxVM, due out in the summer.
Performance
Half of the people were satisfied with the performance (speed)
of the SSA. A quarter were very satisfied with the performance.
Another quarter couldn't really say (hadn't tested, or SSA's
were too new).
Overall Satisfaction
More than half said they were satisfied, would keep their
arrays, and might buy more. Nine said they weren't satisfied and
probably wouldn't buy more.
My Personal Opinion
We are continuing to evaluate the most current technologies
for NFS file service and Web service, as well as for grander
projects like our Digital Library Project (hierarchical storage
management, archiving, etc.). We were doing these evaluations even
before we had trouble with our arrays, but we've been researching
more vigorously lately :-) In the meantime, we will have to keep
these arrays and make them work as best we can. One result of
conducting this survey is that I've convinced myself that striping
and mirroring should not present the data corruption problems that
we saw when we were running RAID5 (I hope I'm right!). They really
are nifty machines (physically), and using vxmake and
configuration makefiles, they're pretty easy to configure and
maintain.
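To give an idea of the makefile approach: here is a rough sketch of the
kind of configuration makefile I mean, building one side of a striped,
mirrored volume. All the object names (disk01, vol01, etc.), lengths,
and the stripe-width attribute are made up for illustration, and the
exact attribute names vary by VxVM release, so check vxmake(1M) and
vxvol(1M) before using anything like this for real.

```
# Hypothetical configuration makefile for a striped volume on the SSA.
# Names and sizes are examples only -- adjust for your own disks.

VOL = vol01
LEN = 2048000          # length in 512-byte sectors

build:
	# one subdisk on each of two disks
	vxmake sd disk01-01 disk=disk01 offset=0 len=$(LEN)
	vxmake sd disk02-01 disk=disk02 offset=0 len=$(LEN)
	# a striped plex across the two subdisks (attribute names
	# for layout/stripe width differ between releases)
	vxmake plex $(VOL)-01 layout=stripe sd=disk01-01,disk02-01
	# the volume itself; a second plex for the mirror would be
	# attached afterward with vxplex
	vxmake -U fsgen vol $(VOL) plex=$(VOL)-01
	vxvol start $(VOL)
```

The nice thing about keeping this in a makefile is that the whole array
layout lives in one version-controllable file you can replay after a
rebuild.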
This is all I'll summarize here. Again, please visit my web page or
drop me a line if you want more details. I'll be updating it
periodically as I learn more about what was going wrong with our arrays.
It occurred to me that a mailing list for SSA managers might be useful
for the discussion of not only reliability issues but also
configuration, performance tuning, etc. For example, we have a "clever"
of how to perform nearly online backups using dual-ported hosts and
arrays, extra "slush" disks, and a third backup host. Please let me
know if you are interested in being on (or running) such a list.
By the way, my survey "caused quite a stir" within Sun (according to my
sales rep). Apparently Scott McNealy has read it! :-) Too bad for us
they weren't so interested six months and six corrupted filesystems ago.
At any rate, they're interested now. I hope they'll be able to fix these
bugs because the SSA really is a cool product.
Thanks to everyone who responded!
Margarita Suarez
Columbia University
Academic Information Systems
UNIX Systems Group
marg@columbia.edu
This archive was generated by hypermail 2.1.2 : Fri Sep 28 2001 - 23:10:56 CDT