New 25x Data Compression? 438
modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet, so take it with the requisite grain of salt.
What kind of data? (Score:4, Insightful)
Breaking news! (Score:3, Insightful)
Seriously though: gzip can shave 98% off a file's size... if your data is mostly redundant. The chance that they're doing this on the arbitrary data they claim in the article is nil.
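To make the point concrete, here's a quick sketch using Python's zlib (which implements the same DEFLATE algorithm as gzip): highly redundant input compresses enormously, while random input doesn't compress at all.

```python
import os
import zlib

# 1 MB of the same byte: about as redundant as data gets.
redundant = b"A" * 1_000_000
# 1 MB of random bytes: no redundancy for the compressor to exploit.
random_data = os.urandom(1_000_000)

small = len(zlib.compress(redundant, 9))    # a few KB at most
big = len(zlib.compress(random_data, 9))    # essentially incompressible

print(small, big)
```

The redundant megabyte shrinks by more than 99%; the random one comes out effectively the same size (often a few bytes larger, thanks to framing overhead).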
*sniff* (Score:5, Insightful)
I smell
Re:What kind of data? (Score:5, Insightful)
it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.
In other words, they're full of crap.
Re:What kind of data? (Score:2, Insightful)
Dubious (Score:5, Insightful)
Re:What kind of data? (Score:4, Insightful)
25x compression for something repeated 25 times (Score:2, Insightful)
Where have we heard this one before? (Score:4, Insightful)
A cow-orker asked if it could be used on its own output.
Re:Limited application (Score:3, Insightful)
Re:What kind of data? (Score:5, Insightful)
20 email accounts subscribed to the same mailing list? Store the bodies of those e-mails only once, and you save a big chunk of disk space. A bunch of people downloaded the same MP3 file? We only need one copy in the archive. As long as there are multiple copies of the same data, it can compress any type of data.
The difference here is that they are taking advantage of the redundancy of files across an entire filesystem (and a HUGE one), rather than the redundancies within an individual file. (I would assume they also do the latter type of compression with a conventional algorithm.) 25x compression seems extreme, but I am sure they can achieve some extra compression here.
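The cross-filesystem de-duplication described above can be sketched as a content-addressed store keyed by a cryptographic hash: identical content is stored once, no matter how many users "own" a copy. (The store, users, and filenames here are hypothetical, purely for illustration.)

```python
import hashlib

store = {}    # digest -> file contents (one physical copy per unique content)
catalog = {}  # (user, filename) -> digest (a cheap reference, not a copy)

def save(user, name, data):
    # Key the contents by their SHA-256 digest; store them only if new.
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)
    catalog[(user, name)] = digest

# Three users each "save" the same popular MP3.
song = b"...one popular MP3's bytes..." * 1000
for user in ("alice", "bob", "carol"):
    save(user, "song.mp3", song)

print(len(store))  # one physical copy serves all three users
```

Real systems typically chunk files and de-duplicate at the block level rather than whole files, which is what lets them also exploit *similar* (not just identical) data.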
TFA (Score:4, Insightful)
To those who're wondering... (Score:3, Insightful)
Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.
Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.
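The counting argument above fits in one loop: there are 2**n distinct n-bit messages, but only 2**n - 1 distinct bit strings of length n-1 or shorter, so no lossless scheme can shrink every n-bit input.

```python
n = 16
inputs = 2 ** n                                  # all possible n-bit messages
shorter_outputs = sum(2 ** k for k in range(n))  # all strings of length 0 .. n-1

# There is always exactly one more input than there are shorter outputs,
# so at least one input cannot be compressed.
print(inputs, shorter_outputs)
```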
Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.
It's the same with text. You only need six bits to store most of the frequently-used characters in text, but we sometimes use more than just the standard characters, so they get written to disk using eight bits each. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table, optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.
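A toy version of that card-catalog idea, assuming a fixed word list: common words get a short fixed-width index, and anything not listed falls back to the raw text behind an escape marker (the tiny dictionary here stands in for a real ~262,144-entry, 18-bit one).

```python
DICTIONARY = ["the", "of", "and", "to", "a"]  # imagine ~2**18 entries
INDEX = {word: i for i, word in enumerate(DICTIONARY)}

def encode(word):
    if word in INDEX:
        return ("idx", INDEX[word])  # short fixed-width index: compressed
    return ("raw", word)             # not in the table: stored uncompressed

print(encode("the"))
print(encode("uyazxavzfnnzranghrrt"))
```

``the`` encodes as a small number; the gibberish word costs its full length plus the escape marker, i.e. "compression" makes it slightly bigger.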
Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.
Cheers,
b&
Re:Breaking news! (Score:3, Insightful)
Re:Breaking news! (Score:3, Insightful)
Re:Vist the Diligent WebSite and learn.... (Score:3, Insightful)
That blog entry smells artificial, though. Very calculated. Right about here, I become wary:
"The way Diligent achieves its exceptional compression ratio is by comparing all incoming data to the data that has already arrived. When it finds an incoming stream of bytes similar to an existing series of bytes, it compares the two and stores the differences. The magic comes in a couple of areas, as near as I can make out given Neville's natural reticence on the "how" of the technology.
First, one has to be smart about how big the series of bytes is before worrying about trying to compress it, since if it's too short there won't be much or any compression. Secondly, the system needs a very fast and efficient method of knowing what it has already received so it can know when it is receiving something similar. And it all has to be optimized to run in-line at data-rate speeds on a standard server box -- which runs the cool and reliable Linux OS."
Re:What kind of data? (Score:3, Insightful)
I can compress anything you give me by a factor of at least 1 (inclusive of my own output).
"-1 pedantic", I know.
It would be more pedantic if it were accurate...
NO COMPRESSOR IS GENERIC!!! (Score:2, Insightful)
Entropy coders work by making assumptions about the probability distribution of the data they receive. They assume they are working on a set of data in which certain types of data are more likely than others, so they store those more compactly, but as a result they HAVE to store others less compactly. No matter how you slice it, you can not store more than 2^n unique strings in n bits. The only gains you can make are by assuming that you aren't going to be dealing with all possible strings, and compacting the ones that you care about.
That may have actually been what you meant, but I really didn't want anyone reading that to get the impression that there was something magical about entropy that made it a different approach than narrowing the set of data you are storing. The two are fundamentally the same thing.
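The trade-off described above can be made concrete with Shannon entropy, which gives the average bits per symbol an ideal entropy coder achieves: a skewed byte distribution beats 8 bits per byte, while a uniform one cannot be improved at all.

```python
import math

def entropy(probs):
    # Shannon entropy in bits per symbol: -sum(p * log2(p)).
    return -sum(p * math.log2(p) for p in probs if p > 0)

# One byte value occurs 90% of the time; the other 255 split the rest.
skewed = [0.9] + [0.1 / 255] * 255
# All 256 byte values equally likely: no assumption to exploit.
uniform = [1 / 256] * 256

print(entropy(skewed))   # well under 8 bits/symbol: compressible
print(entropy(uniform))  # exactly 8 bits/symbol: no gain possible
```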
Re:Compression hoax number 3 (Score:3, Insightful)
Re:Breaking news! (Score:1, Insightful)
gzip has a 32KB sliding window (so it can see at most 32KB of the file at once when looking for redundancies). It uses the DEFLATE algorithm, which is basic LZ77 compression coupled with Huffman coding. The entropy-coding step could be improved with arithmetic coding, because Huffman requires at least 1 bit per output symbol, while arithmetic coding can spend fractional bits per symbol. PATENTS on arithmetic coding are what stop that from happening in free software. gzip could swap arithmetic coding in place of Huffman coding TODAY and IMMEDIATELY get better compression. It chooses not to, to avoid even the HINT of patent infringement.
bzip2 has a configurable 100KB to 900KB (in steps of 100KB) window, so at most it can see 900KB to look for redundancies. However, it also matches and encodes redundancies in an entirely different way (BWT+MTF+RLE instead of LZ77). The final step is again Huffman, which can't beat arithmetic coding!
LZMA is once again back to the LZ77 style of redundancy searching. True, its window can potentially be huge (2^28 = 256MB, although the Wikipedia page says this can now go to 2^32 = 4GB), but it is doing basically the gzip algorithm with much better designed data structures, so it can search for matches much faster. Also, gzip uses a hash-based matching algorithm (it saves memory -- state-of-the-art for 1991, when most home machines had 1-16MB of RAM total) which can miss potential matches. LZMA uses a trie, which stores ALL potential matches... certainly using more memory, but giving better compression. The REAL enhancement to compression is the use of a range coder (the "Markov" part of the acronym). This is almost IDENTICAL to the banned/patented arithmetic coding, giving much greater entropy-level compression than Huffman. Certainly there's a legal risk in case some unexpected clause in the arithmetic coding patents covers range coding, but most people don't think so.
So the basic answer is that gzip/bzip2 refuse to use the potentially patent infringing methods that lzma dares to use. LZMA also uses a more accurate matching algorithm. That's why it's better. Sure, increasing the window size will definitely find more redundancies (if they're there to find), but the bad matching algorithm in deflate means that most of the gains would be lost.
Remember: gzip/bzip2 are not used for peak data compression! They're used because they're carefully designed to avoid ALL known compression patents, to stick to PUBLIC DOMAIN PRIOR ART. This is done to avoid all lawsuits! That's the design behind them. Best compression is an afterthought.
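The ranking the thread describes is easy to check with Python's stdlib bindings for all three formats on one redundant input. Exact sizes vary by library version; the point is the ordering, with LZMA ahead of DEFLATE.

```python
import bz2
import lzma
import zlib

# ~900KB of highly repetitive text.
data = b"the quick brown fox jumps over the lazy dog. " * 20_000

gz_size = len(zlib.compress(data, 9))           # DEFLATE: LZ77 + Huffman
bz_size = len(bz2.compress(data, 9))            # BWT + MTF + RLE + Huffman
xz_size = len(lzma.compress(data, preset=9))    # LZ77-style + range coder

print(gz_size, bz_size, xz_size)
```

On repetitive data like this, all three crush the input, but LZMA's range coder and exhaustive match-finding put it ahead of gzip.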