DonChron - Slashdot User

Comment MTBF rate calculation method is flawed (Score 2, Insightful) 283

by DonChron on Saturday April 05, 2008 @04:22PM (#22974826) Attached to: Disk Failure Rates More Myth Than Metric

Drive manufacturers take a new hard drive, run a hundred drives or so for some number of weeks, and measure the failure rate. Then they extrapolate that failure rate out to thousands of hours... So, let's say one in 100 drives fail in a 1000-hour test (just under six weeks). MTBF = 100,000 hours, or 11.4 years!

To make this sort of test work, it must be run over a much longer period of time. But in the process of designing, building, testing and refining disk drive hardware and firmware (software), there isn't that much extra time to test drive failure rates. Want to wait an extra 9 months before releasing that new drive, to get accurate MTBF numbers? Didn't think so. How many different disk controllers do they use in the MTBF tests, to approximate different real-world behaviors? Probably not that many.

Could they run longer tests, and revise MTBF numbers after the initial release of a drive? Sure, and many of them do, but that revised MTBF would almost always be lower, making it harder to sell the drives. On the other hand, newer drives are certainly available every quarter, so it may not be a bad idea to lower the apparent value of older drive models.

So, it's better to assume a drive will fail before you're done using it. They're mechanical devices with high-speed moving parts, very narrow tolerable ranges of operation (that drive head has to be far enough away from the platters not to hit them, but close enough to read smaller and smaller areas of data). Anyone who's worked in a data center, or even a small server room, knows that drives fail. When I've had around two hundred drives, of varying ages, sizes and manufacturers, in a data center, I observed a failure rate of five to ten drives per year. This is well below the MTBF for enterprise disk array drives (SCSI, FC, SAS, whatever), but drives fail. That's why we have RAID. Storage Review has a good overview of how to interpret MTBF values from drive manufactures.

Slashdot Top Deals