Originally Posted By: msaeger
I did not know RAID was that fragile. I have no experience with it, but the way it was talked about when I went to school was that you get a disk failure, you just slap in a new one, and it fixes itself with no downtime.

RAID can survive the failure of some number of drives, depending on the type of RAID, without downtime, assuming that there are no further issues. (Sometimes a single failed drive can take out an entire IO channel.) When you replace the failed drive, the array reconstructs that drive's data onto the replacement.
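
To make "reconstructs" a bit more concrete, here is a toy sketch of the XOR parity idea behind single-parity RAID such as RAID5. The block contents and the name xor_blocks are made up for the example, and a real array rotates parity across all of the drives rather than keeping it on one, so treat this as an illustration rather than how any particular controller does it.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, one per drive (toy values, not real disk contents).
data = [b"\x01\x02\x03", b"\x10\x20\x30", b"\xaa\xbb\xcc"]

# Parity is the XOR of all data blocks; here it sits on a fourth drive.
parity = xor_blocks(data)

# The drive holding data[1] dies. XOR of the survivors and the parity
# gives back the missing block, because x ^ x cancels to zero.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)

print(rebuilt == data[1])  # True
```

Note that the rebuild has to read every surviving drive in full, which is why it takes as long as it does.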

The problem these days is that drives are so large that it takes a long time to rebuild that data, and you run the risk of another drive failing before the replacement disk has been fully rebuilt. If more drives fail than the RAID set can handle, all of the data in the set is gone and you have to restore from backup. On top of that, drives often fail in clusters: drives from a single production run tend to have very similar lifespans, and usage patterns are very similar for all members of a RAID set. So you can get into trouble more quickly than you'd like.
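
For a rough feel of "a long time", here is a back-of-the-envelope sketch. The drive sizes and the 100 MB/s sustained rate are assumptions picked for illustration, not measurements, and rebuild_hours is just a name I made up. It only gives a lower bound, since a real rebuild competes with normal IO on the array.

```python
def rebuild_hours(capacity_tb, rate_mb_per_s):
    """Hours needed to write an entire drive at a sustained rate.

    Uses decimal units (1 TB = 1e12 bytes, 1 MB = 1e6 bytes), the way
    drive vendors label capacity.
    """
    capacity_bytes = capacity_tb * 1e12
    seconds = capacity_bytes / (rate_mb_per_s * 1e6)
    return seconds / 3600

# Assumed drive sizes and an assumed 100 MB/s sustained rebuild rate.
for tb in (1, 4, 8):
    print(f"{tb} TB at 100 MB/s: about {rebuild_hours(tb, 100):.1f} hours")
```

The whole time that is running, the set is degraded and one more failure away from trouble.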

There are a number of ways to help alleviate those issues. One is by using "hot spares", where you have one or more drives in the system running idle, waiting for another drive to go bad. That minimizes the amount of time that a RAID set runs in degraded mode, since it doesn't have to wait for a person to physically replace the bad drive. Another is using a RAID type that can tolerate more drive failures. Typically, if someone says "RAID" without any qualifier, he probably means RAID5, which can tolerate the failure of a single drive. Other versions can tolerate other fixed numbers of drive failures per RAID set (that is: two, three, etc.). And you can even combine the types together. Still other versions keep complete duplicates of the data, so that every disk might have one or more mirror copies. The tradeoff here is between the number of drives needed to store a given amount of data and the number of drives that can be lost without resorting to a backup.
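
That tradeoff is easy to put in numbers. This is only a sketch for three common layouts (RAID5, RAID6, and a plain RAID1 mirror) using the standard textbook figures; the function name raid_tradeoff is made up, and combined types are left out.

```python
def raid_tradeoff(level, drives):
    """Return (drives' worth of usable space, failures always survivable)."""
    if level == "RAID5":      # one drive's worth of parity
        minimum, result = 3, (drives - 1, 1)
    elif level == "RAID6":    # two drives' worth of parity
        minimum, result = 4, (drives - 2, 2)
    elif level == "RAID1":    # every drive holds a full copy
        minimum, result = 2, (1, drives - 1)
    else:
        raise ValueError(f"level not covered by this sketch: {level}")
    if drives < minimum:
        raise ValueError(f"{level} needs at least {minimum} drives")
    return result

# Compare layouts for a hypothetical six-drive set.
for level in ("RAID5", "RAID6", "RAID1"):
    usable, tolerated = raid_tradeoff(level, 6)
    print(f"{level}: {usable} of 6 drives usable, survives {tolerated} failure(s)")
```

More drives given over to redundancy means fewer drives' worth of space for data, which is exactly the cost side of the tradeoff.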

In Mark's system, any drive failure results in a restore from backup, but only for the data that happened to reside on that one drive. It doesn't affect all of the data on all of the drives, which is what would happen on a RAID set that lost more drives than it could tolerate.
_________________________
Bitt Faulk