Classic one: bitten by RAID :(

Posted by: julf

Classic one: bitten by RAID :( - 25/11/2012 10:56

Ouch. How much more textbook can you get?

Have 4-drive RAID server as main storage (and staging post for long-term backups - yes, RAID is not a backup, except temporarily). So last week I got a couple of SMART notices of recoverable errors on one of the disks. Time to replace it... Get new disk, pull old - but get bitten by the usual inconsistent mapping of logical to physical drives, so pull the wrong one. And having a brain fart, put it back. Rebuild triggered.
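On the logical-to-physical mapping problem: on a Linux box with udev, the symlinks under /dev/disk/by-id normally embed each drive's model and serial number, which can be checked against the label on the physical disk before pulling anything. A minimal sketch, assuming that directory exists on the server (the thread doesn't say which distribution or controller is in use); the function name is made up for illustration:

```python
import os

def map_ids_to_devices(by_id_dir="/dev/disk/by-id"):
    """Return {persistent id name: resolved device node}, skipping partitions."""
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping  # no udev-style directory here, nothing to report
    for name in sorted(os.listdir(by_id_dir)):
        if "-part" in name:
            continue  # partition entries look like <id>-part1, <id>-part2, ...
        path = os.path.join(by_id_dir, name)
        if os.path.islink(path):
            mapping[name] = os.path.realpath(path)
    return mapping

if __name__ == "__main__":
    for name, device in map_ids_to_devices().items():
        print(f"{device}  <-  {name}")
```

The serial in the id name is what to match against the sticker on the drive; the /dev/sdX letter on its own can change between boots.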

Oh well, the failing disk hasn't failed yet, so just have to wait for rebuild to finish before replacing the right disk. Except... Yes, you know where this is going... At "96.7% complete", there is a non-recoverable bad sector and the second disk gets failed out of the array, with rebuild of array not completed.

OK, now running ddrescue to recover as much as I can from the disk with bad sectors before I do anything else...

Posted by: mlord

Re: Classic one: bitten by RAID :( - 25/11/2012 12:15

Then patch the kernel (Linux, right?) to just ignore the bad sector and continue, instead of voiding the entire fricken array, and you can then recover nearly all of the data.
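To illustrate why skipping the bad sector loses so little, here is a rough sketch. It assumes a single-parity (RAID 5 style) layout, where the missing block in each stripe is the XOR of the surviving blocks; the thread doesn't state the RAID level. This is a toy model in Python, not the kernel md code, and all names are hypothetical:

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_missing_disk(surviving_disks, block_size):
    """Rebuild the one missing disk of a single-parity array, stripe by stripe.

    surviving_disks: one list of blocks per surviving disk; a block of None
    marks an unreadable sector. Returns (rebuilt_blocks, lost_stripe_indexes).
    """
    stripe_count = len(surviving_disks[0])
    rebuilt, lost = [], []
    for stripe in range(stripe_count):
        blocks = [disk[stripe] for disk in surviving_disks]
        if any(block is None for block in blocks):
            # Cannot reconstruct this stripe: zero-fill it and carry on
            # instead of abandoning the whole rebuild.
            rebuilt.append(bytes(block_size))
            lost.append(stripe)
        else:
            rebuilt.append(xor_blocks(blocks))
    return rebuilt, lost

# Toy 4-disk array: three data blocks plus one parity block per stripe.
BLOCK = 4
data_disks = [[bytes([16 * d + s] * BLOCK) for s in range(5)] for d in range(3)]
parity_disk = [xor_blocks([disk[s] for disk in data_disks]) for s in range(5)]

# Disk 1 is being rebuilt; disk 2 has an unreadable sector in stripe 3.
damaged = list(data_disks[2])
damaged[3] = None
rebuilt, lost = rebuild_missing_disk([data_disks[0], damaged, parity_disk], BLOCK)
print("stripes lost:", lost)
```

Only the stripe containing the bad sector is unrecoverable; every other stripe rebuilds normally, which is the "nearly all of the data" above.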

Next time, use unRAID rather than RAID. Or mhddfs.
Or *something* (anything) other than horribly unrobust RAID.
It simply is not suitable for huge TB+ drives at home.
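The drive-size point can be put in rough numbers. The sketch below assumes the often-quoted datasheet figure of one unrecoverable read error per 10^14 bits for consumer drives, and that errors are independent. Both are assumptions (real drives vary and errors tend to cluster), and the 2 TB drive size is hypothetical since the thread doesn't give one:

```python
import math

def p_rebuild_hits_bad_sector(bytes_to_read, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while reading
    bytes_to_read bytes, assuming independent errors at ure_per_bit."""
    bits = bytes_to_read * 8
    # 1 - (1 - p)^n, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))

TB = 10**12
# Rebuilding one disk of a 4-drive array means reading the other three in full.
p = p_rebuild_hits_bad_sector(3 * 2 * TB)
print(f"chance of at least one unreadable sector: {p:.0%}")
```

Under those assumptions the figure comes out a little under 40% for a single rebuild, and it grows with drive size.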

Cheers

Posted by: julf

Re: Classic one: bitten by RAID :( - 25/11/2012 12:47

Originally Posted By: mlord

Then patch the kernel (Linux, right?) to just ignore the bad sector and continue, instead of voiding the entire fricken array, and you can then recover nearly all of the data.

Yes, definitely tempted. But not entirely trivial (until now I have had no need to look at the kernel RAID code).

Quote:

Next time, use unRAID rather than RAID. Or mhddfs.
Or *something* (anything) other than horribly unrobust RAID.
It simply is not suitable for huge TB+ drives at home.

Have to agree - it's the classic problem of starting out using something that worked OK under the then prevailing conditions, and then doing small upgrades without biting the bullet and replacing the thing...

Posted by: peter

Re: Classic one: bitten by RAID :( - 25/11/2012 13:18

Originally Posted By: julf

pull the wrong one. And having a brain fart, put it back. Rebuild triggered

Why a brain fart? It was game over then anyway, wasn't it? Unless your array has a "whoops, sorry, didn't mean to eject that" button (which seems unlikely, especially if any writes have happened in the interim), even copying the degraded array off to a known good location would have failed 96.7% of the way through reading the duff disk.

Peter

Posted by: julf

Re: Classic one: bitten by RAID :( - 25/11/2012 13:47

Originally Posted By: peter

Why a brain fart? It was game over then anyway, wasn't it?

Well, a very quick stopping/unmounting of the array might just have saved the situation, if there were no dirty buffers to write out...

Also, the disk that got pulled was of course 100% OK at that point; it was just that the other disks considered the pulled disk failed/unclean - patching that (instead of allowing reconstruction) would probably have been a way out.

Posted by: julf

Re: Classic one: bitten by RAID :( - 26/11/2012 14:41

Originally Posted By: mlord

Next time, use unRAID rather than RAID. Or mhddfs.

I guess unRAID is not available just as a file system - you have to run the complete dedicated server/utility OS?

I don't see how mhddfs solves the "reliable redundancy" issue.

Might have to look into ZFS...

Posted by: mlord

Re: Classic one: bitten by RAID :( - 26/11/2012 21:21

Originally Posted By: julf

I don't see how mhddfs solves the "reliable redundancy" issue.

It doesn't. It solves the "make one big filesystem from a bunch of drives" problem, without losing everything when one drive goes bad.

unRAID is similar, except they add a parity drive in parallel with the data drives, permitting loss of a single drive with no data loss, and loss of subsequent drives without losing everything.
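A toy model of that layout, for illustration only. It assumes plain XOR parity and equal-size drives, and treats each data drive as one blob that is readable on its own; it is not unRAID's actual implementation, and the names are made up:

```python
from functools import reduce

def xor_all(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))

def recover_drives(data_drives, parity):
    """data_drives: list of bytes, with None for each failed drive.

    One failure is rebuilt from the survivors plus parity. With two or more
    failures the dead drives stay lost, but the survivors are returned
    untouched because each one stands alone.
    """
    failed = [i for i, drive in enumerate(data_drives) if drive is None]
    recovered = list(data_drives)
    if len(failed) == 1 and parity is not None:
        survivors = [drive for drive in data_drives if drive is not None]
        recovered[failed[0]] = xor_all(survivors + [parity])
    return recovered

drives = [b"photos..", b"music...", b"backups."]
parity = xor_all(drives)

one_lost = recover_drives([drives[0], None, drives[2]], parity)
two_lost = recover_drives([None, None, drives[2]], parity)
print(one_lost[1])  # rebuilt contents of the single failed drive
print(two_lost)     # two drives gone, the third still intact
```

Contrast this with a striped array, where every file is spread across all members and a second failure takes out the lot.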

Yeah, pity unRAID wants to be standalone (or in a VM).

Posted by: julf

Re: Classic one: bitten by RAID :( - 27/11/2012 11:39

Right, ZFS looks like the best solution right now.

Anyway, I am a happy bunny - (g)ddrescue managed, after a couple of tries, to read all blocks off the failing disk. Replaced failing disk with ddrescued copy, forced a resync, and everything is hunky dory again. For now.
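The "after a couple of tries" part is the interesting bit: marginal sectors often read on a later attempt. A much simplified sketch of that idea follows. It is not GNU ddrescue's actual algorithm, which is considerably more elaborate; the simulated disk and all names are hypothetical:

```python
class FlakyDisk:
    """Simulated disk: selected blocks fail a set number of reads."""

    def __init__(self, data, block_size, failures):
        self.data = data
        self.block_size = block_size
        self.failures = dict(failures)  # block index -> reads that will fail

    def read_block(self, index):
        if self.failures.get(index, 0) > 0:
            self.failures[index] -= 1
            raise OSError(5, "Input/output error")
        start = index * self.block_size
        return self.data[start:start + self.block_size]

def rescue_copy(read_block, n_blocks, block_size, passes=3):
    """Copy every readable block first, then retry only the failed ones.

    Returns (image_bytes, still_unreadable_block_indexes). Blocks that
    never read successfully are zero-filled in the image.
    """
    image = [None] * n_blocks
    pending = list(range(n_blocks))
    for _ in range(passes):
        still_bad = []
        for index in pending:
            try:
                image[index] = read_block(index)
            except OSError:
                still_bad.append(index)  # skip it for now, come back later
        pending = still_bad
        if not pending:
            break
    filled = b"".join(b if b is not None else bytes(block_size) for b in image)
    return filled, pending

original = bytes(range(32))
disk = FlakyDisk(original, block_size=4, failures={5: 2})
image, unreadable = rescue_copy(disk.read_block, n_blocks=8, block_size=4)
print("unreadable blocks:", unreadable)
```

Grabbing the easy blocks first and leaving the bad spots for later passes means a dying disk is stressed as little as possible before most of the data is safe. The real tool also keeps a map file so an interrupted run can resume.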

Posted by: drakino

Re: Classic one: bitten by RAID :( - 27/11/2012 16:14

Please share your experience with ZFS when you implement it. I've been thinking of doing similar here, but haven't implemented it yet. There is even a ZFS stack that has been evolving a bit for OS X that I'm following.

Posted by: julf

Re: Classic one: bitten by RAID :( - 27/11/2012 17:18

Will do!

Posted by: andy

Re: Classic one: bitten by RAID :( - 27/11/2012 18:46

I haven't tried the OS X ZFS stack myself, but this is a quote from a friend who tried it back in June:

"Currently shuffling all the data off my Zevo zfs formatted drive on my MBP so I can reformat it back with bad old HFS+

I can cause system crashes that are directly caused by Zevo not handling particularly intensive bouts of disk activity on large numbers of tiny files. Which pretty much describes how Aperture hits its database. Which is pretty mission critical for me - could do without any instability, least of all something that hits my photo databases."