Brian from Backblaze here. You assume we use RAID (inside of one computer), which is incorrect. We wrote our own layer where any one piece of data is Reed-Solomon encoded across 20 different computers in 20 different locations in our datacenter (which borrows some of the excellent ideas from RAID and ditches some of the parts that don't work well in our particular application). Our encoding happens to be 17 data drives plus 3 parity. We can make our own decisions about what to do with timeouts. When doing reads, we ask all 20 computers for their piece, and THE FIRST 17 THAT RETURN are used to calculate the answer. If one of the computers does not respond at all, we send a datacenter tech to replace it. But if it was just momentarily slow a few times a day, we let it be (we don't eject it from the Reed-Solomon group).
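To make the read path concrete, here's a toy Python sketch of the "first 17 of 20" idea. This is NOT our production code, and fetch_shard() / rs_decode() are hypothetical stand-ins for the real shard fetch and Reed-Solomon decode:

    from concurrent.futures import ThreadPoolExecutor, as_completed

    DATA_SHARDS = 17    # enough shards to reconstruct the file
    TOTAL_SHARDS = 20   # 17 data + 3 parity, one shard per computer

    def read_file(shard_locations, fetch_shard, rs_decode):
        """Ask all 20 computers for their shard; decode from the first 17 back."""
        pool = ThreadPoolExecutor(max_workers=TOTAL_SHARDS)
        futures = {pool.submit(fetch_shard, loc): i
                   for i, loc in enumerate(shard_locations)}
        shards = {}
        try:
            for fut in as_completed(futures):
                try:
                    shards[futures[fut]] = fut.result()
                except Exception:
                    continue  # a dead or erroring machine just loses the race
                if len(shards) == DATA_SHARDS:
                    break     # any 17 of the 20 shards are enough
        finally:
            # Don't wait around for the slow stragglers (Python 3.9+).
            pool.shutdown(wait=False, cancel_futures=True)
        if len(shards) < DATA_SHARDS:
            raise IOError("fewer than 17 shards available")
        return rs_decode(shards)

The key property is that a momentarily slow machine costs you nothing on reads: it simply loses the race, and nothing has to be ejected from the group.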
> These drives are only meant to be powered on a few hours a day and consumer workload duty cycles
I think a really interesting study would be to power a few thousand drives up once per day for an hour, then shut them down, and compare them to a control group of the same drives left running so their temperature never fluctuates. See which group lasts longer without failure. I honestly don't have the answer. (Really, I don't.) What I do know is that Backblaze has left 61,590 hard drives continuously spinning, most of which are labeled as "consumer drives", and the vast majority last so long that we copy the data off onto massively denser drives (like moving all the data off a 1 TByte drive onto an 8 TByte drive), not because the 1 TByte drive fails, but because it ECONOMICALLY MAKES SENSE: an 8 TByte drive uses less electricity per TByte, takes 1/8th the rack space rental, etc. So Backblaze honestly wouldn't care if the "Enterprise Drives" lasted 10x as long in our environment; we would STILL replace them at the same moment.
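To put rough numbers on that economics argument (the dollar figures below are invented for illustration, not our real costs):

    def monthly_cost_per_tbyte(drive_tbytes, slot_rent, watts, dollars_per_watt_month):
        # One drive occupies one rack slot and draws roughly constant power,
        # so the cost per TByte falls almost linearly with drive density.
        return (slot_rent + watts * dollars_per_watt_month) / drive_tbytes

    print(monthly_cost_per_tbyte(1, slot_rent=2.00, watts=7, dollars_per_watt_month=0.10))  # 2.70
    print(monthly_cost_per_tbyte(8, slot_rent=2.00, watts=7, dollars_per_watt_month=0.10))  # ~0.34

Same slot, roughly the same power draw, 8x the TBytes: the per-TByte cost drops about 8x, which is why the migration pays for itself no matter how long the old drives would have kept spinning.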