Eh?
> At some point you have to ask why you're using RAID at all. If it's for always-on, avoiding data loss due to hardware failures, and speed, then RAID 6 isn't really a great solution for avoiding data loss when disks get to these kinds of sizes; the chance of more than one disk failing simultaneously is approaching one, and obviously it was never great for speed.
If you're at this point, then using drives at all is probably already off the table. But I think this position is ridiculous.
I have many years of experience managing file clusters at scales ranging from SOHO to serving up to 15,000 people at a time in a single cluster. In a cluster of 24 drives under these constant, enterprise-level loads, I saw maybe one drive failure per year.
I've heard this trope about "failure rate approaching 1" since 500GB drives were new. From my own experience, it wasn't really true then, any more than it's true now.
Yes, HDDs have failure rates to keep in mind, but outside the occasional "bad batch", they are still shockingly reliable. Failure rates per drive haven't changed much; rising capacities do mean each failure puts more data at risk and rebuilds take longer, but it still doesn't matter as much as you think.
In my experience, you can have a great time if you follow a few rules:
1) Engineer your system so that any drive cluster going truly offline is survivable. AKA "DR" or "Disaster Recovery". What happens if your data center gets flooded or burns to the ground? And once you have solid DR plans, TRUMPET THE HECK OUT OF THEM and tell all your customers. Let them know that they really are safe! It can be a HUGE selling point.
2) Engineer your system so that likely failures are casually survivable. For me, this was ZFS/RAIDZ2, with 6- or 8-drive vdevs, on "white box" 24-bay SuperMicro servers with redundant power (see the pool-layout sketch after this list).
3) If 24x7x365 uptime is really critical, have 3 levels of redundancy, so even in a failure condition you fail to a redundant state. When I was engineering at the "enterprise" level, we used application-layer logic so there were always at least 2 independent drive clusters containing full copies of all data. We had 3 drive clusters using different filesystem technologies (ZFS, XFS/LVM), and sometimes we chose to take one offline to do filesystem-level processing or analysis.
4) Backups: You *do* have backups, and you do adhere to the 3-2-1 rule, right? In our case, we used ZFS replication and merged backups and DR (see the replication sketch below). Combined with automated monitoring, this ensured that we were ready for emergencies, which did happen and were always handled satisfactorily.
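
To make rule 2 concrete, here's roughly what that pool layout looks like: a 24-bay box carved into four 6-drive RAIDZ2 vdevs, so any two drives in a vdev can die without data loss. This is just an illustrative sketch, not the exact commands we ran; the pool name and device names are placeholders (on a real system you'd use /dev/disk/by-id paths).

```sh
# 24 bays split into 4 x 6-drive RAIDZ2 vdevs;
# each vdev survives 2 simultaneous drive failures.
zpool create tank \
  raidz2 da0  da1  da2  da3  da4  da5 \
  raidz2 da6  da7  da8  da9  da10 da11 \
  raidz2 da12 da13 da14 da15 da16 da17 \
  raidz2 da18 da19 da20 da21 da22 da23

# When a drive does fail, swap it in the same bay and resilver;
# the pool keeps serving data the whole time.
zpool replace tank da3
zpool status tank   # watch resilver progress
```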
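And rule 4 in practice: ZFS replication is just periodic snapshots piped to another box with zfs send/receive, which is how backups and DR end up merged. A minimal sketch of the idea, with made-up dataset names and hostname; a real setup wraps this in scheduling, retention, and monitoring (or a tool like sanoid/syncoid):

```sh
# Timestamped, recursive snapshot of the dataset we care about.
SNAP="tank/data@$(date +%Y%m%d-%H%M)"
zfs snapshot -r "$SNAP"

# First run: full replication stream to the backup/DR host.
zfs send -R "$SNAP" | ssh backup-host zfs receive -F backup/data

# Later runs: incremental from the last replicated snapshot,
# so only changed blocks cross the wire.
PREV_SNAP="tank/data@20240101-0000"   # whatever was replicated last time
zfs send -R -i "$PREV_SNAP" "$SNAP" | ssh backup-host zfs receive -F backup/data
```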