Open Source Moving in on the Data Storage World

pararox writes "The data storage and backup world is one of stagnant technologies and cronyism. A neat little open source project, called Cleversafe, is trying to dispel that notion. Using the information dispersal algorithm originally conceived by Michael Rabin (of RSA fame), the software splits every file you back up into small slices, any majority of which can be used to perfectly recreate all the original data. The software is also very scalable, allowing you to run your own backup grid on a single desktop or across thousands of machines."
  • by Ohreally_factor ( 593551 ) on Wednesday April 26, 2006 @05:54PM (#15208120) Journal
    The data storage and backup world is one of stagnant technologies and cronyism.
  • Backup for Backuper? (Score:3, Interesting)

    by foundme ( 897346 ) on Wednesday April 26, 2006 @05:57PM (#15208130) Homepage
    I can't find this in the FAQ -- is there a "creator/seeder" in the whole process? That is, can a particular group of slices only be unlocked by a particular seeder created by Turbo IDA?

    If there is a creator/seeder, then we are still burdened by having to keep this seeder safe so that we can retrieve the distributed slices.

    If there is no creator/seeder, is this safe enough so that people cannot patch slices together by way of trial-and-error?
  • by El Cubano ( 631386 ) on Wednesday April 26, 2006 @06:06PM (#15208204)

    Using the information dispersal algorithm originally conceived by Michael Rabin (of RSA fame), the software splits every file you back up into small slices, any majority of which can be used to perfectly recreate all the original data.

    It seems like this can be tuned to provide varying levels of fault tolerance. According to the abstract (I don't have an ACM web account, and I couldn't find the full text), I can take a file and make it so that any four chunks can be used to rebuild it, then distribute eight such chunks to different machines. Thus, five of the eight machines would have to be rendered inoperable before I was unable to retrieve my data.

    If I understand it correctly, then this is really slick.
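
A quick way to sanity-check that reasoning is to enumerate failure sets for a k-of-n threshold scheme. The parameters below match the comment's 4-of-8 example; this is only a sketch of the parameter math, not anything from Cleversafe's code.

```python
# Sanity check of the 4-of-8 reasoning above: a k-of-n scheme survives
# the failure of up to n - k machines and breaks at n - k + 1.
# Hypothetical parameters for illustration only.
from itertools import combinations

n, k = 8, 4  # eight slices stored, any four suffice to rebuild

def recoverable(failed_machines):
    """Data survives iff at least k slices remain reachable."""
    return n - len(failed_machines) >= k

# Every choice of 4 failed machines still leaves the data recoverable...
assert all(recoverable(f) for f in combinations(range(n), 4))
# ...but any 5 failures lose it, exactly as the comment says.
assert not any(recoverable(f) for f in combinations(range(n), 5))
```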

  • by DigitalRaptor ( 815681 ) on Wednesday April 26, 2006 @06:17PM (#15208272)
    This sounds like Rar, Par, and BitTorrent got merged in some freak transporter accident...

    Par files (for use with QuickPar, etc) are great, saving all sorts of extra posting on binary newsgroups.

  • by Anonymous Coward on Wednesday April 26, 2006 @06:20PM (#15208296)
    While the R in RSA stands for Ron Rivest, it is Adi Shamir (the S of RSA) you have in mind. He came up with a wonderful secret sharing scheme which allows a bunch of folks or computers to keep pieces of a secret in such a way that no N of them have any idea what the secret is, even if they collude. OTOH, any N+1 of them can easily figure out the secret. Secret sharing can help you keep important secrets safe this way: while the owner is around, the secret cannot be recreated behind his back; if the owner quits or dies, the secret holders can together recover his password and decrypt critical company data. And if a couple of them cannot participate, you can still get your secret back.

    Even more amazingly, Shamir's secret sharing scheme allows computing math functions, such as digital signatures, without ever recovering the secret keys. This is called threshold cryptography; some of you may be interested to learn about its many wonders. Shamir rocks, and so does threshold crypto!
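
For the curious, here is a minimal sketch of Shamir's (k, n) secret sharing as described above (the comment's N corresponds to k - 1 here). It uses a toy prime field; a real deployment would use a cryptographically sized prime and a secure randomness source.

```python
# Minimal sketch of Shamir's (k, n) secret sharing over a prime field.
# Illustrative only: real code needs a large prime and secure randomness.
import random

P = 2**31 - 1  # a Mersenne prime; assume the secret is < P

def split(secret, k, n):
    """Hide `secret` in the constant term of a random degree-(k-1) polynomial."""
    coeffs = [secret] + [random.randrange(P) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation mod P
            y = (y * x + c) % P
        shares.append((x, y))
    return shares

def recover(shares):
    """Lagrange interpolation at x = 0 from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, -1, P)) % P
    return secret

shares = split(123456789, k=3, n=5)
assert recover(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
assert recover(shares[2:]) == 123456789
```
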
  • innovation (Score:2, Interesting)

    by Ajehals ( 947354 ) on Wednesday April 26, 2006 @06:24PM (#15208320) Journal
    Any innovation (if that's what this is -- no doubt it will turn out to be something that someone else thought of in the '80s...) is welcome in this area.

    Maybe one day vendors will stop pushing overly expensive and utterly bland storage solutions. The last time I had a meeting about storage, the product was two servers and two disk arrays holding a little under 2TB (24 80GB SCSI HDDs) in RAID 5, presented to the OS as 4 x 500GB drives (some proprietary thing). All in at a cool £27,000, and that was before the CIFS license. Guess how it was billed: innovative... It's a joke. So what's the solution? In the meantime, lots of SATA drives and file replication; eventually, maybe we can make use of all that storage that sits unused on every machine on the LAN...

  • by dfloyd888 ( 672421 ) * on Wednesday April 26, 2006 @06:28PM (#15208342)
    In the early 90s, a company made a virtual file server for networked Macs. Each client Macintosh had a file on its hard drive, and when a request was made through the driver, a number of Macs were contacted, and files were read and written in a fairly load-balanced fashion. I'm pretty sure it used some decent (think single DES) encryption at the time too, so someone couldn't just dig through the server's file on their Mac's hard disk and glean important data. It also added some redundancy, so if a Mac or two wasn't up on the network, it wouldn't kill the virtual AppleShare folder.

    By chance, does anyone remember this technology? I have no idea what happened to it, but it would be a blockbuster open source app if done today and made platform independent. If done right, one could create data brokerage houses, where people could buy and sell storage space, and also reliability, where space on a RAID or server array would be of higher value than space on a laptop that is rarely on the Internet.
  • by Jake73 ( 306340 ) on Wednesday April 26, 2006 @06:32PM (#15208362) Homepage
    Really? This is just error correction. Reed-Solomon [wikipedia.org] error correction, and even the Chinese Remainder Theorem [psu.edu], can be applied to reconstruct data when some of it has been intentionally or unintentionally punctured.
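
As an illustration of the Chinese Remainder Theorem angle: store a value's residues modulo several pairwise-coprime moduli, and any subset of residues whose moduli multiply out past the value reconstructs it exactly. The moduli and data value below are made up for the example.

```python
# Sketch of recovery from "punctured" data via the CRT: any 4 of the
# 5 residues below reconstruct the value, because any 4 moduli have a
# product larger than it. Illustrative parameters only.
from math import prod

moduli = [253, 255, 256, 257, 263]  # pairwise coprime
data = 16_000_000                   # smaller than any 4-moduli product

residues = [(m, data % m) for m in moduli]

def crt(pairs):
    """Combine (modulus, residue) pairs into the unique value mod prod(moduli)."""
    M = prod(m for m, _ in pairs)
    x = 0
    for m, r in pairs:
        Mi = M // m
        x = (x + r * Mi * pow(Mi, -1, m)) % M
    return x

# Puncture any one residue -- the remaining four still recover the data.
assert crt(residues[1:]) == data
assert crt(residues[:-1]) == data
```
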
  • by dracken ( 453199 ) on Wednesday April 26, 2006 @06:33PM (#15208366) Homepage
    Rabin's algorithm relies on a nifty trick. If you take a k-dimensional vector and store its dot products with k linearly independent vectors, the vector can be reconstructed from just those dot products. This is a fancy way of saying any point on the x-y plane can be located if you have the x-coordinate and the y-coordinate. However, if you take a k-dimensional vector and compute its dot products with l vectors (where l > k) chosen so that any k of them are linearly independent, then any k of the dot products are enough to reconstruct the original vector.

    Rabin showed how to come up with l such vectors, any k of which are linearly independent.
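
A sketch of that trick in Python, using Vandermonde rows (any k rows with distinct evaluation points are linearly independent) and exact rational arithmetic. Production IDA implementations work over finite fields instead, for speed and compact slices.

```python
# Dot-product trick from the comment above: encode a k-dimensional data
# vector as l > k dot products against Vandermonde rows, then rebuild
# from any k of them by solving a k x k linear system.
from fractions import Fraction

k, l = 3, 5
data = [Fraction(7), Fraction(42), Fraction(99)]    # the "file"

# Row for evaluation point x is [1, x, x^2]; any k rows are independent.
rows = [[Fraction(x) ** p for p in range(k)] for x in range(1, l + 1)]
slices = [(r, sum(a * b for a, b in zip(r, data))) for r in rows]

def solve(subset):
    """Gauss-Jordan elimination on any k (row, dot-product) pairs."""
    A = [list(r) + [y] for r, y in subset]
    for col in range(k):
        pivot = next(i for i in range(col, k) if A[i][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for i in range(k):
            if i != col and A[i][col] != 0:
                A[i] = [u - A[i][col] * v for u, v in zip(A[i], A[col])]
    return [A[i][k] for i in range(k)]

assert solve(slices[:3]) == data    # any 3 of the 5 slices suffice
assert solve(slices[2:]) == data
```
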
  • by Saturn49 ( 536831 ) on Wednesday April 26, 2006 @07:01PM (#15208523)
    This can be done quite easily with Reed-Solomon coding. In fact, you don't need a majority of the nodes, but simply an arbitrary set of N nodes, with an arbitrary M nodes of redundancy. N=1 and M=1 is basically RAID 1, N=n and M=1 is simply RAID 5, and N=n and M=2 is RAID 6 (a toy sketch of the M=1 case follows this comment).

    In fact, I wrote an RSRaid driver for Linux for my thesis and did some performance testing on it. I'll save you the 30 pages and just tell you that the algorithm is far too CPU-intensive to scale up very well for fileserver use (my original intent), but I did conclude it could be used as a backup alternative to tape. Hmmmm.

    Direct Link [dyndns.org]
    Google Cache [72.14.203.104]
    Please forgive the double brackets; I fought with Word and lost.
    Contact me if you'd like to play with the code. I never did any reconstruction code, but the system did work in a degraded state, and was written for the Linux 2.6 kernel.
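
For concreteness, the N = n, M = 1 (RAID 5-like) case mentioned above can be sketched with plain XOR parity; anything beyond one redundant block needs real Reed-Solomon syndromes, which is where the CPU cost the thesis measured comes from. This is a toy example, not code from the thesis.

```python
# Toy sketch of the N = n, M = 1 case: one XOR parity block rebuilds
# any single lost data block. M >= 2 requires Reed-Solomon proper.
from functools import reduce

def xor_blocks(*blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

data = [b"disk0", b"disk1", b"disk2"]   # N data blocks of equal size
parity = xor_blocks(*data)              # the single redundant block

lost = 1                                # lose any one data block...
survivors = [d for i, d in enumerate(data) if i != lost]
assert xor_blocks(*survivors, parity) == data[lost]   # ...and rebuild it
```
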
  • by jd ( 1658 ) <imipak@yahoGINSBERGo.com minus poet> on Wednesday April 26, 2006 @07:04PM (#15208542) Homepage Journal
    The basis of the method lies in the Byzantine Generals Problem and related mathematical puzzles. A derivative is used in cryptography for distributed keys. As a backup strategy it looks interesting - you don't need any higher level of trust than you would need in the Byzantine Generals Problem, for exactly the same reasons. This includes not just backup devices but also all connections to backup devices (so you have security against SAN failures, packet corruption and other such problems). The price you pay for this added security and reliability is that it is going to be either extremely slow or more expensive.
  • Publius (Score:3, Interesting)

    by twitter ( 104583 ) on Wednesday April 26, 2006 @07:05PM (#15208554) Homepage Journal
    AT&T has something like this called Publius [att.com]. Scientific American reviewed it [essential.org] and, in a most unscientific and un-American opinion, called it "irresponsible." The goal was not just storage, but publication.

    It's nice to see another attempt that's free. Free speech requires anonymity.

  • by kbahey ( 102895 ) on Wednesday April 26, 2006 @07:43PM (#15208745) Homepage
    Slashdotted! Can't check the site contents or the wiki.

    From the summary: "the software splits every file you back up into small slices, any majority of which can be used to perfectly recreate all the original data."

    So basically it is like RAID 5 striping and parity [wikipedia.org] applied at the file level.

    Neat concept.
  • by Josmul123 ( 857709 ) on Thursday April 27, 2006 @03:06AM (#15210493)
    Aside from saying RTFM, let me, as a Cleversafe employee, try to explain a bit of what's happening. Cleversafe technology allows for a client-server application where the data to be backed up is sliced into eleven pieces using our OWN information dispersal algorithm... This is not RSA, as some previous posts would lead you to believe. Once the data is split using this algorithm, it is sent out to eleven different sites running our server software.

    When you want to restore your data (say, after recovering from a hard drive failure), you begin downloading your chunks of data, which you cannot access without your private key information. When retrieving your data, up to five of the eleven "dispersed grid" servers can be absolutely unresponsive and you can still re-assemble your data (similar to .PAR files or RAID 5, only with an open-source algorithm created by us). This allows us to have a dispersed storage mechanism with no single point of failure. In actuality, grid nodes could be running different operating systems and be located around the world.

    A breach at a single point would, assuming someone could decrypt a slice of someone's data (not very easy to do, I'll tell you), give them 1/11 of that data. For example, you'd be able to know there's a 3 in my credit card number if it was stored on the grid. This makes the technology not only more failsafe (over 99.9% uptime, I believe, was the calculation), but also extremely secure.
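
The uptime figure is easy to sanity-check as a binomial tail: with 11 independent nodes and any 6 sufficient to reconstruct (5 may fail), even fairly unreliable nodes yield very high grid availability. The 0.9 per-node availability below is an assumed number for illustration, not a Cleversafe figure.

```python
# Back-of-envelope availability for an 11-node grid where any 6 nodes
# suffice. Per-node availability of 0.9 is an assumption, not a quote.
from math import comb

def grid_availability(n, k, p):
    """P(at least k of n independent nodes are up), each up w.p. p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(grid_availability(11, 6, 0.90))   # ~0.9997, i.e. "over 99.9%"
```
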
  • by jabuzz ( 182671 ) on Thursday April 27, 2006 @04:33AM (#15210642) Homepage
    No, you would be wrong: the RSA algorithm was first described by Clifford Cocks, a British mathematician working for GCHQ, in 1973, four years before the 1977 description by Ron Rivest, Adi Shamir and Len Adleman.

    It is a classic example of a bad patent. There was prior art (though admittedly this was kept top secret until 1997), and it also failed the obviousness test: if someone else came up with the same algorithm four years earlier, it was clearly obvious to someone skilled in the art of cryptography. In fact, Cocks invented the algorithm "overnight" after being told about James H. Ellis's (another GCHQ worker) concept of non-secret encryption, which had occurred to Ellis after reading a World War II paper by someone at Bell Labs describing a way to protect voice communications by having the receiver add (and later subtract) random noise.
