Best Server Storage Setup?

new-black-hand asks: "We are in the process of setting up a very large storage array and we are working toward the most cost-effective setup. Until now, we have tried a variety of different architectures, using 1U servers or 6U servers packed with drives. Our main aims are to get the best price per GB of storage that we can, while having a reliable and scalable setup at the same time. The storage array will eventually become very large (in the PB range), so saving just a few dollars on each server means a lot. What do people out there find is the most effective hardware setup? Which drives, and of what size? Which motherboards, etc? I am familiar with the Petabox solution, which is what the Internet Archive uses; they have made good use of Open Source software. So what are some of the architectures out there that, together with Open Source, can give us a storage array that is much better than the $3-plus per GB that the commercial vendors ask for?"
This discussion has been archived. No new comments can be posted.

  • Coraid (Score:5, Informative)

    by namtro ( 22488 ) on Wednesday June 14, 2006 @12:04AM (#15529710)
    If $/GB is a dominant factor, I would suggest Coraid's [coraid.com] products. They have a pretty nifty technology which is dead simple and extensively leverages OSS. From my personal experience as a customer, I think they are a bunch of good folks as well. They also seem to constantly be wringing more and more performance out of their systems. Anyway, it's something I'd explore if I were you.
  • Xserve RAID? (Score:4, Informative)

    by CatOne ( 655161 ) on Wednesday June 14, 2006 @12:13AM (#15529756)
    It's just fibre attached storage, so you can use whatever servers you want as the head units. List is 7 TB for $13K... if you're going to scale up a lot you can certainly ask Apple for a discount.

    Not *the* cheapest out there, but fast, reliable, and works well with Linux or whatever server heads you want.
  • It all depends (Score:3, Informative)

    by ar32h ( 45035 ) * <jda@ta p o d i . net> on Wednesday June 14, 2006 @12:16AM (#15529768) Homepage Journal
    It all depends on what you are trying to do

    For some workloads, many servers with four drives each may work. This is the Petabox [capricorn-tech.com]/Google model. This works if you have a parallelisable problem and can push most of your computation out to the storage servers.
    Remember, you don't have a 300TB drive; you have 300 servers, each with 1TB of local storage (a rough node-count sketch follows at the end of this comment).

    For other workloads, you need a big disk array and SAN, probably from Hitachi [hds.com], Sun [sun.com], or HP [hp.com]. This is the traditional model. Use this if you need a really big central storage pool or really high throughput.
    Many SAN arrays can scale into the PB range without too much trouble.

    Nowadays, a PB is enough to arch your eyebrows, but otherwise not that amazing. It seems that commercial storage leads home storage by a factor of about 1000. When home users had 1GB drives, 1TB was amazing. Now that some home users have 1TB, many companies have 1PB.
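
    As a rough sketch of the node-count arithmetic behind this scale-out model, here is a minimal calculation in Python; the target capacity, drive size, drives-per-node, and replication figures are made-up assumptions, not recommendations.

      # Rough capacity arithmetic for a scale-out "many small nodes" design.
      # All figures are illustrative assumptions, not recommendations.
      import math

      TARGET_PB = 1.0          # desired usable capacity, in petabytes
      DRIVE_TB = 0.5           # per-drive capacity (e.g. a 500GB SATA drive)
      DRIVES_PER_NODE = 4      # Petabox-style 1U node with four drives
      COPIES = 2               # keep two copies of everything for redundancy

      raw_tb_needed = TARGET_PB * 1000 * COPIES      # raw TB including copies
      node_tb = DRIVE_TB * DRIVES_PER_NODE           # raw TB per node
      nodes = math.ceil(raw_tb_needed / node_tb)

      print(f"{nodes} nodes ({nodes * DRIVES_PER_NODE} drives) "
            f"for {TARGET_PB} PB usable with {COPIES} copies")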
  • Lots of copies. (Score:3, Informative)

    by Anonymous Coward on Wednesday June 14, 2006 @12:24AM (#15529806)
    Redundancy, availability, and reliability are best served, in the most cost-effective way, by using the 'lots of copies' method of storage.

    You simply get the cheapest PCs you can find. The Petabox uses Mini-ITX systems due to their low power and thermal requirements. Stuff them with the biggest ATA drives you can get for a good price.

    Then fill up racks and racks and racks with machines like that. As long as the hardware is fairly uniform, you can keep spares on hand and maintenance is taken care of.

    Then you copy your information onto it and use a database (also redundant) to keep track of your files, with scripts and programs that keep files in sync and spread across multiple machines so there is no single point of failure (a minimal sketch of this bookkeeping appears at the end of this comment). I don't know of any OSS software out there that does this, but I would bet that some place like Archive.org will share what they use.

    Think of it like having a huge library of tapes for backup, with an automated method to use a robot to fetch, copy, and return various tapes as you need them, except that instead of automated hardware and tapes, you're using software and hard drives.

    This sort of system will scale much higher than any currently available SAN solution.

    NOW... If you want all the reliability, availability, and redundancy with SPEED, then that is going to fucking cost you huge amounts of money.

    This is purely for STORAGE, not for work scratch space.

    Personally, where I work we have 3 terabytes' worth of disk space for two copies of our major production database and scratch work space. The information is completely refreshed every 2-3 months; no data is kept older than that. In the back we have about 22 thousand tapes for longer-term storage and backups for weekly jobs...

    Personally, I'd rather have a room full of Mini-ITX machines to go along with the SAN than a bunch of fucking tapes.
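
    A minimal sketch of the bookkeeping described in this comment, assuming a simple catalogue that maps each file's MD5 checksum to the hosts holding a copy; the hostnames, checksums, and replication factor below are made-up examples, not anyone's actual software.

      # Sketch of "lots of copies" bookkeeping: track which hosts hold each
      # file and flag anything that has fewer copies than we want.
      import hashlib

      REPLICAS_WANTED = 2     # assumed replication factor

      def md5_of(path, chunk=1 << 20):
          """Return the MD5 hex digest of a file, read in 1 MB chunks."""
          h = hashlib.md5()
          with open(path, "rb") as f:
              while block := f.read(chunk):
                  h.update(block)
          return h.hexdigest()

      # catalogue: checksum -> set of hosts known to hold a good copy
      catalogue = {
          "9e107d9d372bb6826bd81d3542a419d6": {"storage01", "storage07"},
          "e4d909c290d0fb1ca068ffaddf22cbd0": {"storage02"},
      }

      for csum, hosts in catalogue.items():
          if len(hosts) < REPLICAS_WANTED:
              print(f"{csum} has only {len(hosts)} copy(ies) on "
                    f"{', '.join(sorted(hosts))}; schedule a re-copy")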
  • What's it for ? (Score:5, Informative)

    by drsmithy ( 35869 ) <drsmithy@nOSPAm.gmail.com> on Wednesday June 14, 2006 @01:58AM (#15530177)
    You firstly need to assess what the storage is for - in particular, its *requirements* for performance and reliability/availability/uptime.

    If you require high levels of performance (=comparable to local direct-attached disk) or reliability (=must be online "all the time") then stop right now and go talk to commercial vendors. You will not save enough money doing it yourself to make up for the stress, the people-power overheads, and the losses the first time the whole house of cards falls down.

    However, if your performance or reliability requirements are not so high (ie: it's basically being used to archive data and you can tolerate it going down occasionally and unexpectedly) then doing it yourself may be a worthwhile task. I get the impression this is the kind of solution you're after, so you'll be looking at standard 7200rpm SATA drives.

    Firstly, decide on a decent motherboard and disk controller combo. CPU speed is basically irrelevant; however, you should pack each node with a good 2G+ of RAM. Make sure your motherboards have at least two 64-bit/100MHz PCI-X buses. I recommend (and use) Intel's single-CPU P4 "server" motherboards and 3ware disk controllers. I believe the Areca controllers are also quite good. You will have trouble on the AMD64 side finding decent "low end" motherboards to use (ie: single CPU boards with lots of I/O bandwidth). Do not skimp on the motherboards and controllers, as they are the single most important building blocks of your arrays.

    Secondly, pick some disks. Price out the various available drives and compare their $/GB rates. There will be a sweet spot where you get the best ratio, probably around the 400G or 500G size these days (it's been several months since I last put one of these together). A quick $/GB and RAID capacity sketch follows at the end of this comment.

    Thirdly, find a suitable case. I personally don't like to go over 16 drives per chassis, but there are certainly rackmount cases out there with 24 (and probably more) hotswap SATA trays.

    Now, put it all together and run some benchmarks. In particular, benchmark hardware RAID vs Linux software RAID and see which is faster for you (it will probably be software RAID, assuming your motherboard is any good). Bear in mind that some hardware RAID controllers do not support RAID6, but only RAID5. Prefer a RAID6 array to a RAID5 + hotspare array.

    You now have the first component of your Honkin' Big Array. Install a suitable Linux distribution onto it (either use a dedicated OS hard disk, get some sort of solid-state device, or roll a suitable CD-ROM based distro for your needs). Set up iSCSI Enterprise Target.

    Finally, you need a front-end device to make it all usable. Get yourself a 1U machine with gobs of network and/or bus bandwidth. I recommend one of Sun's x4100 servers (4xGigE onboard + 2 PCI-X). Throw some version of Linux on it with an iSCSI initiator. Connect to your back-end array node and set it up as an LVM PV, in an LVM VG. Allocate space from this VG to different purposes as you require.

    When you start to run out of disk, build another array node, connect to it from the front-end machine and then expand the LVM VG. As you expand, investigate bonding multiple NICs together and adding dual- or quad-port NICs to supplement the onboard ones. I also recommend keeping at least one spare disk in the cupboard at all times for each of your storage nodes, and also a spare motherboard+CPU+RAM combo, to rebuild most of a machine quickly if required. Ideally you'd keep a spare disk controller on hand as well, but these tend to be expensive, and if you're using software RAID, any controller with enough ports will be a suitable replacement for any failures.

    We do this where I work and have taken our "array" from a single 1.6T node (12x200G drives) to 10T split amongst 3 nodes. We are planning to add another ~6T node before the end of the year. *If* this is the kind of solution that would meet your needs, I can offer a lot of help, advice and experience to you.

    However, our "array" has neither high perfo

  • by TTK Ciar ( 698795 ) on Wednesday June 14, 2006 @04:51AM (#15530627) Homepage Journal

    It's been almost exactly two years since we put together the first petabox rack [ciar.org], and both the technology and our understanding of the problem have progressed since then. We've been working towards agreement on what the next generation of petabox hardware should look like, but in the meantime there are a few differing opinions making the rounds. All of them come from very competent professionals who have been immersed in the current system, so IMO all of them are worthy of serious consideration. Even though we'll only go with one of the options, a different option might be better suited to your specific needs.

    One that is a no-brainer is: SATA. The smaller cables used by SATA are a big win. Their smaller size means (1) better air flow for cooling the disks, and (2) fewer cable-related hardware failures (fewer wires to break, and more flexibility). Very often when Nagios flags a disk as having gone bad, it's not the disk, but rather the cable that needs reseating or replacing.

    Choosing the right CPU is important. The VIA C3's we started with are great for low power consumption and heat dissipation, but we underestimated the amount of processing power we needed in our "dumb storage bricks". The two most likely successors are the Geode and the (as-yet unreleased, but RSN) new low-power version of AMD's dual-core Athlon 3800+. But depending on your specific needs you might want to just stick with the C3's (which, incidentally, cannot keep gig-e filled, so if you wanted full gig-e bandwidth on each host, you'll want something beefier than the C3).

    It has been pointed out that we could get better CPU/U, disks/U, $$/U, and power/U by going with either a 16-disks-in-3U or 24-disks-in-4U solution (both of which are available off-the-shelf), compared to 4-disks-in-1U (our current hardware). This would also make for fewer hosts to maintain, and to monitor with the ever-crabby Nagios, and require us to run fewer interconnects. Right now it looks like we'll probably stick with 4-in-1U, though, for various reasons which are pretty much peculiar to our situation.

    I heartily recommend Capricorn's petabox hardware for anyone's low-cost storage cluster project, if for no other reason than because a lot of time, effort, brain-cycles, and experimentation was devoted to designing the case for optimizing air flow over the disks (and figuring out which parts of the disks are most important to keep cool). Keeping the disks happy will save you a lot of grief and money. When you're keeping enough disks spinning, having a certain number of them blow up is a statistical certainty. Cooler disks blow up less frequently. Most cases do a really bad job of assuring good air flow across the disks -- the emphasis is commonly on keeping the CPU and memory cool. But in a storage cluster it's almost never the CPU or memory that fails, it's the drives.

    Even though the 750GB Seagates appear to provide less bang-for-buck than smaller solutions (400GB, 300GB), the higher data storage density pays off in a big way. Cramming more data into a single box means amortizing the power/heat cost of the non-disk components better, and also allows you better utilization of your floorspace (which is going to become very important, if you really are looking to scale this into the multi-petabyte range).

    When dealing with sufficiently large datasets, silent corruption of data during a transfer becomes inevitable, even using transport protocols which attempt to detect and correct such corruptions (since the corruption could have occurred "further upstream" in the hardware than the protocol is capable of seeing). We have found it necessary to keep a record of the MD5 checksums of the contents of all our data, and add a "doublecheck" step to transfers: perform the transfer (usually via rsync), make sure the data is no longer being cached in memory (usually by performing additional transfers), and then recompute the MD5 checksums on the destination host and compare them against the checksums recorded for the source.
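
    A minimal sketch of that doublecheck step, assuming the source host has already recorded MD5 checksums in a simple manifest (path to digest); the manifest format, paths, and digests here are hypothetical.

      # After an rsync-style transfer, recompute MD5s on the destination
      # and compare them with the checksums recorded on the source host.
      import hashlib
      import os

      def md5_of(path, chunk=1 << 20):
          h = hashlib.md5()
          with open(path, "rb") as f:
              while block := f.read(chunk):
                  h.update(block)
          return h.hexdigest()

      def verify(manifest, dest_dir):
          """manifest: relative path -> MD5 recorded on the source host."""
          mismatches = []
          for relpath, expected in manifest.items():
              actual = md5_of(os.path.join(dest_dir, relpath))
              if actual != expected:
                  mismatches.append((relpath, expected, actual))
          return mismatches

      source_manifest = {"item_0001/data.tar": "6f1ed002ab5595859014ebf0951522d9"}
      for relpath, want, got in verify(source_manifest, "/mnt/node03"):
          print(f"MISMATCH {relpath}: expected {want}, got {got}; re-copy needed")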

  • by georgewilliamherbert ( 211790 ) on Wednesday June 14, 2006 @01:21PM (#15533471)
    Do not use Xserve RAID. It's the worst possible pseudo-enterprise SAN product. This is not to rag on Apple in general - the company is full of smart people, many of whom are friends. This is just a lame product in an otherwise excellent product line. There are plenty of SATA-based SAN storage devices out there which are cheap. I'm partial to Nexsan [nexsan.com], having worked with them, and if you need slightly higher quality, the Sun StorageTek or EMC/Dell boxes, etc. Software RAID (Veritas or open source) striping on top of large HW RAID (RAID 5 or RAID 10) SAN storage array stacks works just fine.
  • by Anonymous Coward on Wednesday June 14, 2006 @05:36PM (#15535349)
    Pick any two.

    My personal suggestion is to find a bunch of older FC arrays (Compaq RA4100?) on eBay and load them up. But here's where you get into trouble:

    Fast, reliable, cheap. You can have any two. We stripe for speed. We mirror for reliability. We parity-stripe for cheap.

    Here's what you give up with each choice.

    Striping alone is fast and cheap, but you have no fault tolerance.
    Mirroring doubles your cost, but it is reliable and reasonably fast.
    Striping with parity is cheap, and reasonably reliable, but you pay a huge performance penalty in write operations.

    When reliability and speed are the utmost concerns, use SAME (Stripe And Mirror Everything). A worked capacity comparison follows at the end of this comment.

    As for your hardware, find whatever is cheap on eBay and run with it until you make enough money for the Tested, Supported Commercial Solutions.
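
    As a worked example of those trade-offs, here is a sketch comparing the usual layouts for a hypothetical shelf of eight 500GB drives; the usable-capacity and write-cost figures follow textbook behaviour, and real arrays will vary.

      # "Fast, reliable, cheap: pick any two" as capacity/fault-tolerance math
      # for a hypothetical 8 x 500GB shelf.
      N_DRIVES, DRIVE_GB = 8, 500

      layouts = {
          # name: (usable GB, failures always survived, relative write cost)
          "RAID0 (stripe)":          (N_DRIVES * DRIVE_GB,        0, "1 write per write"),
          "RAID10 (mirror, SAME)":   (N_DRIVES * DRIVE_GB // 2,   1, "2 writes per write"),
          "RAID5 (parity stripe)":   ((N_DRIVES - 1) * DRIVE_GB,  1, "read-modify-write, ~4 I/Os"),
          "RAID6 (double parity)":   ((N_DRIVES - 2) * DRIVE_GB,  2, "read-modify-write, ~6 I/Os"),
      }

      for name, (usable, failures, write_cost) in layouts.items():
          print(f"{name:24s} {usable:5d} GB usable, "
                f"survives {failures} failure(s), {write_cost}")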
