New 25x Data Compression?

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete has been released yet, so take it with the requisite grain of salt.
  • What kind of data? (Score:4, Insightful)

    by Short Circuit ( 52384 ) * <mikemol@gmail.com> on Wednesday April 05, 2006 @04:24PM (#15070237) Homepage Journal
    I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.
  • Breaking news! (Score:3, Insightful)

    by ivan256 ( 17499 ) * on Wednesday April 05, 2006 @04:25PM (#15070241)
    Company breaks Shannon Limit. Debunking at 11!

    Seriously though. Gzip can compress data by 98%... if your data is mostly redundant. The chance that they're doing this on the arbitrary data the article claims is nil.
  • *sniff* (Score:5, Insightful)

    by bryanp ( 160522 ) on Wednesday April 05, 2006 @04:25PM (#15070245)
    *sniff* *sniff* *sniff*

    I smell ... vapor.
  • by ivan256 ( 17499 ) * on Wednesday April 05, 2006 @04:27PM (#15070273)
    The article says:

    it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.

    In other words, they're full of crap.
  • by Hao Wu ( 652581 ) on Wednesday April 05, 2006 @04:30PM (#15070302) Homepage
    Like saying that a library card is the same thing as a library.
  • Dubious (Score:5, Insightful)

    by pilkul ( 667659 ) on Wednesday April 05, 2006 @04:32PM (#15070339)
    New compression algorithms generally come out in academic papers and are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. A company concerned mostly with implementation rather than theory isn't likely to come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ-type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well, duh.
  • by slimey_limey ( 655670 ) <slimey.limey@[ ]il.com ['gma' in gap]> on Wednesday April 05, 2006 @04:33PM (#15070355) Journal
    So it can compress its own output? Sweet....
  • by demon411 ( 827680 ) on Wednesday April 05, 2006 @04:40PM (#15070434)
    Yup, let me just add to others saying that 25x compression is impossible for arbitrary data. It's just an indexing problem: if you have a 2-kbyte file (2^16384 possible permutations), it is impossible to map them all to (2048/25 ≈) 82-byte files (2^656 possible permutations). Good thing the article talks about what data this applies to... (sarcasm)
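
    A back-of-the-envelope version of that counting argument in Python (assuming 2 kbytes = 2048 bytes):

        # Pigeonhole check: can every distinct 2 KB file map to a distinct 82-byte file?
        inputs  = 2 ** (2048 * 8)   # number of possible 2 KB files
        outputs = 2 ** (82 * 8)     # number of possible 82-byte files (2048 / 25, rounded up)
        print(inputs > outputs)     # True -- far more inputs than outputs, so no
                                    # lossless mapping can shrink every 2 KB file 25x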
  • by overshoot ( 39700 ) on Wednesday April 05, 2006 @04:41PM (#15070440)
    Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.

    A cow-orker asked if it could be used on its own output.

  • by sprag ( 38460 ) on Wednesday April 05, 2006 @04:46PM (#15070492)
    Wouldn't you get 1/50th, since it seems like every other story is a dupe?
  • by devjoe ( 88696 ) on Wednesday April 05, 2006 @04:52PM (#15070545)
    Well, there's an idea here that might hold some truth. Note that they are marketing it to data centers, people with LOTS and LOTS of files. Because people tend to have multiple copies of the same files, they can achieve great compression by eliminating the duplicate copies in the archive -- or likewise, any files with large sections that are the same among various files.

    20 email accounts subscribed to the same mailing list? Store the bodies of those e-mails only once, and you save a big chunk of disk space. A bunch of people downloaded the same MP3 file? We only need one copy in the archive. As long as there are multiple copies of the same data, it can compress any type of data.

    The difference here is that they are taking advantage of the redundancy of files across an entire filesystem (and a HUGE one), rather than the redundancies within an individual file. (I would assume they also do the latter type of compression with a conventional algorithm.) 25x compression seems extreme, but I am sure they can achieve some extra compression here.
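
    A minimal sketch of that kind of chunk-level de-duplication (a hypothetical store, not the vendor's actual design):

        import hashlib

        store = {}          # chunk hash -> chunk bytes, kept once
        CHUNK = 64 * 1024   # assumed fixed chunk size

        def write(data: bytes) -> list:
            """Keep one copy of each unique chunk; return the hashes
            needed to reassemble the original byte stream."""
            refs = []
            for i in range(0, len(data), CHUNK):
                chunk = data[i:i + CHUNK]
                h = hashlib.sha256(chunk).hexdigest()
                store.setdefault(h, chunk)   # a duplicate costs only a reference
                refs.append(h)
            return refs

        def read(refs: list) -> bytes:
            return b"".join(store[h] for h in refs)

    Twenty identical mailing-list bodies then cost one set of chunks plus twenty short lists of references.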
  • TFA (Score:4, Insightful)

    by pcosta ( 236434 ) on Wednesday April 05, 2006 @05:01PM (#15070639)
    If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on arbitrary data. The algorithm is targeted at data backup, i.e., very large files, and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that it's a breakthrough, but they are not claiming impossible lossless compression on arbitrary data. It might actually be interesting for the application it was designed for.
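
    If that reading is right, the general shape is delta encoding against previously stored streams. A toy sketch (difflib is just a stand-in; a real backup product would need something far faster):

        from difflib import SequenceMatcher

        def delta(base: bytes, new: bytes) -> list:
            """Encode `new` as copy/data operations against `base`."""
            ops = []
            for tag, i1, i2, j1, j2 in SequenceMatcher(None, base, new).get_opcodes():
                if tag == "equal":
                    ops.append(("copy", i1, i2))       # reference a run already stored in base
                else:
                    ops.append(("data", new[j1:j2]))   # store only the new bytes
            return ops

        def apply(base: bytes, ops: list) -> bytes:
            out = []
            for op in ops:
                out.append(base[op[1]:op[2]] if op[0] == "copy" else op[1])
            return b"".join(out)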
  • by TrumpetPower! ( 190615 ) <ben@trumpetpower.com> on Wednesday April 05, 2006 @05:02PM (#15070643) Homepage
    If you're wondering why this is pure bullshit, this might help.

    Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.

    Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.

    Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.

    It's the same with text. You only need about six bits to store each of the frequently-used characters in text, but we sometimes use more than just the standard characters, so each gets written to disk as eight bits. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.

    Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.
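
    To make the card-catalogue idea concrete, a toy dictionary coder (the codebook here is obviously made up and far too small):

        # Common words get short codes; anything not in the book is stored literally.
        CODEBOOK = {"the": 0, "and": 1, "of": 2, "to": 3}
        REVERSE = {v: k for k, v in CODEBOOK.items()}

        def encode(words):
            return [CODEBOOK.get(w, w) for w in words]     # small int if known, literal string if not

        def decode(tokens):
            return [REVERSE[t] if isinstance(t, int) else t for t in tokens]

        print(encode(["the", "uyazxavzfnnzranghrrt"]))     # [0, 'uyazxavzfnnzranghrrt']

    With roughly 2^18 entries, any dictionary word fits in about eighteen bits; the gibberish stays full price.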

    Cheers,

    b&
  • Re:Breaking news! (Score:3, Insightful)

    by jthill ( 303417 ) on Wednesday April 05, 2006 @05:36PM (#15071031)
    If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?
  • Re:Breaking news! (Score:3, Insightful)

    by Savantissimo ( 893682 ) on Wednesday April 05, 2006 @06:06PM (#15071300) Journal
    Your example can be compressed to the minimal algorithm for the pseudorandom number generator you used plus the seed it used to produce your data.
  • by noidentity ( 188756 ) on Wednesday April 05, 2006 @08:26PM (#15072213)
    Now that sounds more reasonable. Instead of putting the incremental backup smarts on the client side, put it on the server side. This way the client can use whatever old scheme is handy, perhaps a plain file copy, and let the server sort out the redundancy with data already copied previously. Only the server has to contain the complex algorithms, so there's less of an opportunity for screw-ups.

    That blog entry smells artificial, though. Very calculated. Right about here, I become wary:

    "The way Diligent achieves it exceptional compression ratio is by comparing all incoming data to the data already arrived. When it finds an incoming stream of bytes similar to an existing series of bytes it compares the two and stores the differences. The magic comes in a couple of areas, as near as I can make out given Neville's natural reticence on the "how" of the technology.

    First, one has to be smart about how big the series of bytes before worrying about trying to compess it, since if it's too short there won't be much or any compression. Secondly, the system needs a very fast and efficient method of knowing what is has already received so it can know when it is receiving something similar. And it all has to be optimized to run in-line at data rate speeds on a standard server box -- which runs the cool and reliable Linux OS."
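
    That "very fast and efficient method of knowing what it has already received" sounds like fingerprint indexing. A crude sketch of the general idea (purely a guess at the technique, not Diligent's method):

        # Hash every fixed-size window of already-stored data with a rolling hash,
        # so matching regions in incoming data can be spotted in one pass.
        W = 48                                    # assumed window size
        MOD = 1 << 32

        def fingerprints(data: bytes) -> set:
            fps = set()
            if len(data) < W:
                return fps
            h = 0
            for b in data[:W]:
                h = (h * 257 + b) % MOD           # polynomial rolling hash of the first window
            fps.add(h)
            top = pow(257, W - 1, MOD)
            for i in range(W, len(data)):
                h = (h - data[i - W] * top) % MOD # drop the byte leaving the window
                h = (h * 257 + data[i]) % MOD     # add the byte entering it
                fps.add(h)
            return fps

        old = b"x" * 1000 + b"old tail"
        new = b"x" * 1000 + b"completely different tail"
        print(bool(fingerprints(old) & fingerprints(new)))   # True: the shared run is detected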

  • by rw2 ( 17419 ) on Wednesday April 05, 2006 @08:52PM (#15072354) Homepage
    I can compress anything you give me by a factor of at least 1 (inclusive of my own output).

    "-1 pedantic", I know.


    It would be more pedantic if it were accurate...
  • by moultano ( 714440 ) on Wednesday April 05, 2006 @09:18PM (#15072483)
    Guess what? It is IMPOSSIBLE to create a generic compression algorithm. Gzip operates by doing exactly what you mention: it works on a particular set of data, namely data with some exploitable redundancy. There are plenty of files that will get bigger when you give them to Gzip.

    Entropy coders work by making assumptions about the probability distribution of the data they receive. They assume they are working on a set of data in which certain types of data are more likely than others, so they store those more compactly, but as a result they HAVE to store others less compactly. No matter how you slice it, you cannot store more than 2^n unique strings in n bits. The only gains you can make are by assuming that you aren't going to be dealing with all possible strings, and compacting the ones that you care about.

    That may have actually been what you meant, but I really didn't want anyone reading that to get the impression that there was something magical about entropy that made it a different approach than narrowing the set of data you are storing. The two are fundamentally the same thing.
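
    Easy to see with the standard library (exact sizes vary a little from run to run):

        import os, zlib

        random_data = os.urandom(1 << 20)          # 1 MiB of random bytes
        redundant   = b"A" * (1 << 20)             # 1 MiB of a single repeated byte

        print(len(zlib.compress(random_data)))     # slightly MORE than 1 MiB -- it grew
        print(len(zlib.compress(redundant)))       # roughly a kilobyte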
  • by bluephone ( 200451 ) <greyNO@SPAMburntelectrons.org> on Wednesday April 05, 2006 @10:38PM (#15072870) Homepage Journal
    News media around the world carried the "news" of the Raelians cloning a little girl. The vast majority of intelligent people knew it was crap, most average people assumed it was crap, and the news media all said to take it with a grain of salt and that they could secure no proof. News is news, whether it's news of a real advance or news that a potentially reliable source is making astounding claims. Only through the analysis of these claims can knowledge grow.
  • Re:Breaking news! (Score:1, Insightful)

    by Anonymous Coward on Thursday April 06, 2006 @07:22AM (#15074660)
    Er, no.

    gzip has a 32KB sliding window (so it can see up to 32KB of the file at once to look for redundancies). It uses the deflate algorithm, which is basic LZ77 compression coupled with Huffman encoding. The entropy encoding step could be improved with arithmetic coding, because Huffman requires at least 1 bit per output symbol, but arithmetic encoding can represent several symbols with just one bit. PATENTS on arithmetic coding are what stops that from happening in free software. gzip could swap arithmetic coding in place of Huffman coding TODAY and IMMEDIATELY get better compression. It chooses not to do so, to avoid even the HINT of patent infringement.

    bzip2 has a configurable 100KB to 900KB (in steps of 100KB) window, so at most it can see 900KB to look for redundancies. However, it also matches and encodes redundancies in an entirely different way (BWT+MTF+RLE instead of LZ77). The final step is again Huffman, which can't beat arithmetic coding!

    lzma is once again back to the LZ77 style of redundancy searching. True, its window can be huge (2^28 = 256MB, although the Wikipedia page says this can now go to 2^32 = 4GB), but it is basically doing the gzip algorithm with much better designed data structures, so it can search for matches much faster. Also, gzip uses a hash-based matching algorithm (it saves memory -- state of the art for 1991, when most home machines had 1-16MB of RAM total) which can miss several potential matches. LZMA uses a trie, which stores ALL potential matches... certainly using more memory, but giving better compression. The REAL enhancement to compression is the use of a range coder (the "Markov" part of the acronym). This is almost IDENTICAL to the patent-encumbered arithmetic coding, giving much better entropy-level compression than Huffman. Certainly, there's a legal risk in case some unexpected clause in the arithmetic coding patents covers range coding, but most people don't think so.

    So the basic answer is that gzip/bzip2 refuse to use the potentially patent infringing methods that lzma dares to use. LZMA also uses a more accurate matching algorithm. That's why it's better. Sure, increasing the window size will definitely find more redundancies (if they're there to find), but the bad matching algorithm in deflate means that most of the gains would be lost.

    Remember: gzip/bzip2 are not used for peak data compression! They're used because they're carefully designed to avoid ALL known compression patents, to stick to PUBLIC DOMAIN PRIOR ART. This is done to avoid all lawsuits! That's the design behind them. Best compression is an afterthought.
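
    The window-size point is easy to demonstrate with Python's standard zlib/bz2/lzma modules (numbers vary since the block is random):

        import bz2, lzma, os, zlib

        # Long-range redundancy: one random 1 MiB block repeated 8 times.
        # Deflate's 32 KB window and bzip2's <=900 KB blocks can't reach back
        # a full megabyte; lzma's multi-megabyte dictionary can.
        block = os.urandom(1 << 20)
        data = block * 8

        for name, fn in (("zlib (deflate)", zlib.compress),
                         ("bzip2", bz2.compress),
                         ("lzma", lzma.compress)):
            print(f"{name:15s} {len(fn(data)):>10,d} bytes")

    Expect deflate and bzip2 to come out near the full 8 MiB while lzma lands near 1 MiB.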
