Comment Re:Original blog post (Score 1) 239

We don't have any data yet. The oldest Backblaze pods contain hard drives that are not quite 4 years old, so we haven't seen any old age mortality yet.

Here is another totally random thought: We pay $1,400 / month / cabinet in physical space rental plus electricity, which comes to about $5 / drive / month. Even if the old (smaller) drives last forever, there will come a point where it is just a good financial decision to copy all the data off of them onto denser drives, because after some number of months the savings in physical space rental and electricity (assuming energy use is fixed per drive) pay for the new drives. If a new hard drive is 10 times as dense, it saves Backblaze $4.50 / drive / month in physical space and electricity rental. In 22 months that pays for the $100 replacement drive. (I did that math super quick, so let me know if I'm off by a factor of 10.)
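
In rough Python, that same back-of-the-envelope math looks like this, so anyone can check me (the dollar figures are the ones above; the 10x density and $100 drive price are just the example numbers):

cost_per_drive_month = 5.0     # ~$ / drive / month in space + electricity today
density_factor       = 10      # new drive holds 10x the data of an old one
new_drive_price      = 100.0   # $ per replacement drive

# A 10x denser drive does the work of ten old ones, so per unit of data you
# keep paying only 1/10th of the space/power bill.
savings_per_month = cost_per_drive_month * (1 - 1.0 / density_factor)  # $4.50
months_to_break_even = new_drive_price / savings_per_month             # ~22
print(f"saves ${savings_per_month:.2f}/drive/month, "
      f"pays for itself in about {months_to_break_even:.0f} months")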

Comment Re:Anything over 2TB should be ZFS... (Score 2) 239

Using JFS instead of ZFS is the biggest mistake for this build.

(Disclaimer: I work at Backblaze) - We no longer deploy new pods with JFS, but over half our fleet of 200 pods is running JFS and we are perfectly happy with it. We worked through a couple of bugs related to large volumes, but beyond that our main reason for using EXT4 going forward is that, in our application, EXT4 is measurably faster than JFS, and it is reassuring to be on a filesystem used by more people, so it (hopefully) has more bugs fixed, etc.

Earlier we were totally interested in ZFS, as it would replace RAID & LVM as well (and ZFS gets great reviews). But (to my understanding) native ZFS is not available on Linux and we're not really looking to switch to OpenSolaris.

ANOTHER option down this line of thinking is switching to btrfs, but we haven't played with it yet.

Comment Re:Meh (Score 1) 239

The drives do not look to be hot swapable

(Disclaimer: I work at Backblaze) All SATA drives are inherently hot swappable, including the ones in the Backblaze pod. We have tried it, and it worked the few times we did. But for normal operations, we shut the pod down completely to swap drives. The first reason is that because the pods are stacked on top of each other and the drives are replaced from the top, we have to slide the pod halfway out of the rack like a drawer. It feels kinda wrong to slide servers around like that while the drives are spinning, so we avoid it (I have no proof it actually causes significant problems). Another reason is that with the top of the pod open, the cooling airflow isn't the same and some of the drives in the center start rising in temperature. This isn't fatal, but it puts you on a "timer" where you want to get the hot swap done within a reasonable amount of time (like 5 minutes) and get the pod closed back up again. Finally, it just seems safer to let the machine come up cleanly with the drive replaced. For our application it doesn't matter at all; no customer can possibly know or care if one, two, or ten pods are offline during a reboot.

Comment Re:Original blog post (Score 1) 239

RAID6 uses 2 drives for data parity, so I believe you would need 3 drives out of 45 to fail within a week to actually lose data. I suspect they would shut a pod down if 2 drives in it failed at the same time.

(Disclaimer: I work at Backblaze) We have 3 RAID groups inside each 45-drive pod; each RAID group is 15 drives. So you need 3 drive failures within one single 15-drive group to lose data. So... when the FIRST drive fails in one 15-drive RAID group, our software automatically stops accepting any more customer data on that particular 15-drive group and the management software puts the file system sitting on top of that RAID group into read-only mode. This may seem obvious in retrospect, but we found writing to drives causes them to fail or pop out of RAID arrays at more than 100 times the frequency of just keeping them spinning and reading the information off of them. So by doing this, customers can still restore data from that pod, and we're pretty relaxed about replacing that particular drive sometime in the next few days.

When a second drive subsequently fails inside a pod, pagers start going off and a Backblaze employee starts driving towards the datacenter.
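
Purely as an illustration of that policy (this is NOT our actual management code, just a toy Python sketch of the idea), the logic is roughly:

import subprocess

DEGRADED_THRESHOLD = 1   # first failed drive: stop taking new customer data
PAGE_THRESHOLD     = 2   # second failed drive: wake somebody up

def handle_raid_group(mount_point, failed_drives):
    # React to drive failures in one 15-drive RAID6 group.
    if failed_drives >= PAGE_THRESHOLD:
        print(f"PAGE ON-CALL: {mount_point} has {failed_drives} failed drives")
    if failed_drives >= DEGRADED_THRESHOLD:
        # Remount read-only: customer restores keep working, but no new data
        # is written, because writing is what knocks marginal drives out.
        subprocess.run(["mount", "-o", "remount,ro", mount_point], check=True)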

With that said, it is worth noting that multiple simultaneous drive failures in one pod are WAAAY more common than pure statistics would indicate. If a SATA card fails, it has three SATA cables plugged into it leading to three separate port multipliers and ultimately is talking with 15 hard drives. So we'll see 15 drives simultaneously drop out of the RAID arrays in one pod and it's pretty obvious what just happened. No big deal, it doesn't (necessarily) corrupt any data. I'm just mentioning you can't take the random drive failure rates of one single drive and do straight multiplication to get to pod failure rates.
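
For what it's worth, here is the straight-multiplication number people tend to compute (using a made-up 4% annual failure rate purely for illustration), which is exactly the calculation that understates reality because it assumes the failures are independent:

from math import comb

afr = 0.04            # assumed annual failure rate per drive (illustrative only)
p_week = afr / 52     # crude chance one drive dies in a given week
n = 15                # drives per RAID6 group (2 can fail, 3 lose data)

# Probability of 3 or more independent failures in one group in one week.
p_data_loss = sum(comb(n, k) * p_week**k * (1 - p_week)**(n - k)
                  for k in range(3, n + 1))
print(f"naive P(3+ failures in one group in a week): {p_data_loss:.1e}")

The naive answer comes out vanishingly small, which is the point: correlated failures (one SATA card taking 15 drives with it) dominate in practice.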

Comment Re:Anything over 2TB should be ZFS... (Score 4, Interesting) 239

... if you really care about the data.

(Disclaimer: I work at Backblaze) - If you really care about data, you *MUST* have end-to-end application level data integrity checks (it isn't just the hard drives that lose data!).

Let's make this perfectly clear: Backblaze checksums EVERYTHING on an end-to-end basis (mostly we use SHA-1). This is so important I cannot stress it highly enough: each and every file and portion of a file we store has our own checksum on the end, and we use this all over the place. For example, we pass over the data every week or so reading it and recalculating the checksums, and if a single bit has been flipped we heal it, either from our own copies of the data or by asking the client to re-transmit that file or part of that file.
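
If you want the flavor of it, a minimal scrub pass looks something like this (the file layout and the ".sha1" sidecar files here are made up for the sketch; only the re-read / re-hash / heal-on-mismatch idea is what we actually do):

import hashlib
from pathlib import Path

def scrub(data_dir):
    # Re-read every stored blob, re-hash it, and compare to the checksum
    # recorded when the data was originally written.
    for blob in Path(data_dir).glob("**/*.blob"):
        expected = blob.with_suffix(".sha1").read_text().strip()
        actual = hashlib.sha1(blob.read_bytes()).hexdigest()
        if actual != expected:
            # Heal from another copy, or ask the client to re-send the file.
            print(f"bit flip detected in {blob}, scheduling repair")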

At the large amount of data we store, our checksums catch errors at EVERY level - RAM, hard drive, network transmission, everywhere. My guess is that consumers just do not notice when a single bit in one of their JPEG photos has been flipped -> one pixel gets ever so slightly more red or something. Only one photo changes out of their collection of thousands. But at our crazy numbers of files stored we see it (and fix it) daily.
