sethjn - Slashdot User

Not the same thing, but related. (Score 1)

by sethjn on Monday August 14, 2006 @11:34AM (#15903047) Attached to: Compress Wikipedia and Win AI Prize

Well, while it isn't the same thing this guy is looking for, I have created a new semi-random-access compression format specifically because I was having to process Wikipedia data. A friend of mine and I were working on some data mining in Wikipedia and found that uncompressing the 300 GB file was unrealistic. So I created RAZ (Random Access Zip) to help us out. It is a very simple system: it uses another compressor to compress relatively small chunks, strings them together, and prepends a header with index information. Using normal 7-Zip, we get the 300 GB down to about 2.1 GB. Using the RAZ format with 7-Zip as the underlying compressor, we get it down to about 2.4 GB. We've written a Python module that "opens" the file and allows for random access through seek and read. I'm eventually going to put RAZ on SourceForge when time allows. For now, I've just written a post about it on my blog (sethnielson.com). The code is still experimental, so I haven't posted it, but you can email me if you're interested in it and I'll send you a copy. -- Seth N.
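
For readers who want to see the chunk-and-index idea concretely, here is a minimal sketch in Python. It is not the RAZ code (which was not posted) and does not reproduce the RAZ file layout: the header fields, the 64 KiB chunk size, and the names `pack` and `ChunkedReader` are all assumptions made for illustration. It uses the standard-library `lzma` module as a stand-in for the 7-Zip compressor and works on in-memory bytes rather than a file.

```python
import lzma
import struct

# Hypothetical chunk size; the post only says "relatively small chunks".
CHUNK_SIZE = 64 * 1024

# Hypothetical fixed header: chunk size (u32), total length (u64), chunk count (u32).
_FIXED = struct.Struct("<IQI")


def pack(data: bytes, chunk_size: int = CHUNK_SIZE) -> bytes:
    """Compress data chunk by chunk and prepend an index header."""
    chunks = [lzma.compress(data[i:i + chunk_size])
              for i in range(0, len(data), chunk_size)]
    header = _FIXED.pack(chunk_size, len(data), len(chunks))
    # Index: the compressed length of every chunk, in order.
    header += b"".join(struct.pack("<I", len(c)) for c in chunks)
    return header + b"".join(chunks)


class ChunkedReader:
    """Random access over a blob produced by pack(), via seek() and read()."""

    def __init__(self, blob: bytes):
        self._blob = blob
        self._chunk_size, self._length, count = _FIXED.unpack_from(blob, 0)
        sizes = struct.unpack_from(f"<{count}I", blob, _FIXED.size)
        # Turn the per-chunk sizes into (start, size) positions inside the blob.
        self._index = []
        pos = _FIXED.size + 4 * count
        for size in sizes:
            self._index.append((pos, size))
            pos += size
        self._pos = 0

    def seek(self, offset: int) -> None:
        # Clamp to the valid range of the uncompressed data.
        self._pos = max(0, min(offset, self._length))

    def read(self, n: int) -> bytes:
        out = bytearray()
        end = min(self._pos + n, self._length)
        while self._pos < end:
            # Only the chunks that overlap the requested range are decompressed.
            idx, within = divmod(self._pos, self._chunk_size)
            start, size = self._index[idx]
            chunk = lzma.decompress(self._blob[start:start + size])
            piece = chunk[within:within + (end - self._pos)]
            out += piece
            self._pos += len(piece)
        return bytes(out)
```

Usage would look like `r = ChunkedReader(pack(data)); r.seek(offset); r.read(n)`. The trade-off matches the numbers in the comment: compressing small chunks independently costs some compression ratio, but a read only has to decompress the chunks it touches instead of the whole archive.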
