New 25x Data Compression?

Posted by ScuttleMonkey on Wednesday April 05, 2006 @04:23PM from the make-sure-to-give-it-to-more-than-just-the-corporate-monkies dept.

modapi writes "StorageMojo is reporting that a company at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital data storage. A combination of de-duplication and calculating and storing only the changes between similar byte streams is apparently the key. Imagine storing a terabyte of data on a single disk, and it all runs on Linux." Obviously nothing concrete or released yet so take with the requisite grain of salt.

This discussion has been archived. No new comments can be posted.

What kind of data? (Score:4, Insightful)

by Short Circuit ( 52384 ) * writes: <mikemol@gmail.com> on Wednesday April 05, 2006 @04:24PM (#15070237) Homepage Journal

I can create a compression algorithm that compresses my 2GB of data to 1 bit. But it would be crap for any other datastream fed to it.

- Re:What kind of data? (Score:5, Insightful)
  
  by ivan256 ( 17499 ) * writes: on Wednesday April 05, 2006 @04:27PM (#15070273)
  
  The article says:
  
  it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.
  
  In other words, they're full of crap.
  
  - Re:What kind of data? (Score:4, Insightful)
    
    by slimey_limey ( 655670 ) writes: <slimey@limey.gmail@com> on Wednesday April 05, 2006 @04:33PM (#15070355) Journal
    
    So it can compress its own output? Sweet....
    
    - Re:What kind of data? (Score:5, Informative)
      
      by fyndor ( 895340 ) writes: on Wednesday April 05, 2006 @05:16PM (#15070820)
      
      You hit the nail right on the head. No compression scheme can ever claim that it can compress anything by ANY set value, unless the value you're talking about is zero :) Otherwise it would imply that you could compress the output of a compression process and compress it 25 times more. Then take that output and compress it 25 times more. Then take that output... See where I'm going? You could say that MOST files of DATATYPE_X will compress UP TO 25x, but there will always be the exception to the rule. There is no such thing as a free lunch. You can't have infinite compression... but it'd sure be a lot cooler if ya did :)
      
      - Re:What kind of data? (Score:5, Funny)
        
        by networkBoy ( 774728 ) writes: on Wednesday April 05, 2006 @05:35PM (#15071017) Journal
        
        1.
        I can compress anything you give me by a factor of at least 1 (inclusive of my own output).
        
        "-1 pedantic", I know.
        -nB
        
        Re:What kind of data? (Score:3, Insightful)
        
        by rw2 ( 17419 ) writes:
        
        1.
        I can compress anything you give me by a factor of at least 1 (inclusive of my own output).
        
        "-1 pedantic", I know.
        
        It would be more pedantic if it were accurate...
      - Re:What kind of data? (Score:4, Funny)
        
        by WhiteWolf666 ( 145211 ) writes: <sherwin.amiran@us> on Wednesday April 05, 2006 @08:10PM (#15072139) Homepage Journal
        
        oOo. Sounds like you are going to find the data singularity.
        
        A single byte that is all other data compressed together, and from which all knowledge flows! The universal black hole of data!
        
        Don't tell me .... is this a new MS Vista technology?
        
      - Actually, I once tried that. (Score:5, Interesting)
        
        by MickLinux ( 579158 ) writes: on Wednesday April 05, 2006 @08:28PM (#15072221) Journal
        
        I once used a Huffman data compression algorithm, recursively, in order to see just how much compression I could get. The first round, I got maybe 75% compression on the data I was using. The second round, I got 10%. The third round, I got 3%. The fourth, I got 1%; and after that, I'd typically actually increase the size of the data slightly. Let's not forget that I am including the size of the initial data table.
        
        So then I tried it with LZW compression, and it still eventually grew in size.
        
        The neat thing about doing this, though, is that it taught me something about the mathematical basis for entropy. You see, I couldn't believe that I was getting the diminishing returns, so I wrote some algorithms to output the histogram curves.
        
        What I saw was that the best Huffman compression came when the Histogram was farthest from what I'll call a "perfect bell curve". I don't know if that is the same curve or not, but it looks a lot like one half of a perfect bell; or maybe like the radiation output of a blackbody in physics.
        
        Anyhow, as I successively compressed the data, the data moved towards a tighter bell curve in general, and always towards that perfect bell, in specific (so long as the data would compress, that is.) I didn't do the calculation, but it would be interesting to calculate what the closest bell curve was, and then do a standard deviation of the histogram from the bell curve, and correlate it to compression.
        
        So then I thought "well, I'll compress only a portion of the data, the part that is compressible". But any typical portion of the data still seemed to follow that pesky bell curve. So then I thought to intercept the data, and see if I could visually spot any patterns.
        
        Indeed, I could. Wow -- look at that string of zeros here; and that repeated series 1001001001001, *four times*, there. Surely I could get compression out of that. Funny thing, though. Every time I tried, I could get compression for that data set, but then lousy compression for anything else. When I tried to generalize the compression to include every possibility, I again couldn't get compression. In other words, truly entropic data does have repetition. It does have some item that shows up more commonly than others. It does have patterns. But the patterns are no more than what you would expect, (or actually, if you want to be correct but confusing, only an expectable percentage of the patterns are more than what you would expect, by any given amount.) And when you include all the patterns of length n, including patterns of length n=1, then there just isn't any more entropy possible for the data.
        
        And just as it takes an increase in entropy to drive a heat engine (2nd law of thermo), it also takes an increase in data entropy to get compression.
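
A minimal sketch of the experiment described above, for anyone who wants to repeat it. It assumes Python's zlib (LZ77 plus Huffman coding) as a stand-in for the plain Huffman and LZW coders the post used, and the sample text and function names are made up for illustration. It reports the size and the byte-histogram entropy after each pass, so you can watch the histogram flatten out as the data stops compressing.

```python
import math
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in Counter(data).values())


def recompress(data: bytes, rounds: int = 4):
    """Return (size, entropy) for the input and after each successive zlib pass."""
    stats = [(len(data), byte_entropy(data))]
    for _ in range(rounds):
        data = zlib.compress(data, 9)
        stats.append((len(data), byte_entropy(data)))
    return stats


# Hypothetical sample input: highly repetitive English text.
sample = b"the quick brown fox jumps over the lazy dog. " * 2000
for i, (size, entropy) in enumerate(recompress(sample)):
    print(f"round {i}: {size:6d} bytes, {entropy:.2f} bits/byte")
```

The exact numbers depend on the input and the coder, so treat the output as a qualitative picture rather than a benchmark.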
        
        You geek! (Score:3, Funny)
        
        by thepotoo ( 829391 ) writes:
        
        Sheesh...when did you last get laid?
    - - Re:Compression hoax number 3 (Score:3, Insightful)
        
        by bluephone ( 200451 ) writes:
        
        News media around the world carried the "news" of the Raelians cloning a little girl. The vast majority of intelligent people knew it was crap, most average people assumed it was crap, the news media all said to take it with a grain of salt and that they could secure no proof. News is news, whether it's news of a real advance, or news that a potentially reliable source is making astounding claims. Only through the analysis of these claims can knowledge grow.
  - Re:What kind of data? (Score:5, Funny)
    
    by swimboy ( 30943 ) writes: on Wednesday April 05, 2006 @04:40PM (#15070430)
    
    It can compress anything! At the demo, I saw them compress 25 oz. of snake oil so that it all fit in a 1 oz. jar!
    
    - Re:What kind of data? (Score:3, Funny)
      
      by morcheeba ( 260908 ) * writes:
      
      I've heard you have to do the decompression carefully, though -- If you do it too quickly, you just end up making a big mess.
  - Re:What kind of data? (Score:5, Funny)
    
    by TheNetAvenger ( 624455 ) writes: on Wednesday April 05, 2006 @06:06PM (#15071297)
    
    In other words, they're full of crap.
    
    But the Slashdot Post says that it all runs on Linux. And knowing the infinite power of Linux, I believe them.
    
    In addition to being the best OS in the world, Linux is also the most secure, does everything better than every other OS, and if given the right developers it is the ONLY os that could do something as impressive as compress data past the limits of possibility.
    
    I'm sure with the right developer, Linux could also be used to harness zero point energy, create wormholes for travel in your basement, and possibly cure most diseases... /wink
    
  - - Well that's not surprising. (Score:5, Informative)
      
      by Ayanami Rei ( 621112 ) * writes: <rayanamiNO@SPAMgmail.com> on Wednesday April 05, 2006 @08:15PM (#15072165) Journal
      
      That's called the law of large numbers.
      Systems like this bank on the fact that most enterprise backup systems (that is... Veritas) can't tell when a file is changed slightly between backups. They use a coarser-grained whole-file approach (which is very reliable though, and already only stores one copy of each file). But people who know about the magic of rsync understand the speedups that can be obtained by leveraging rolling hashtables and other tricks to get binary deltas of large files, and only transmitting those changes.
      Given a large enough set of backups and enough time, the potential size savings is enormous.
      
      Veritas should really be implementing this themselves, though.
      
      And I have a feeling this is what's behind the 25x claims of the article. The key is the mention "enterprise"... large data sets... lots of potential redundancy to exploit.
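
For readers who have not looked inside rsync, here is a rough sketch of the kind of weak rolling checksum the post alludes to. It is written from the general description of the technique, not copied from rsync's source, and the modulus, window size and function names are assumptions for illustration. The point is that sliding the window by one byte costs O(1) instead of re-summing the whole block, which is what makes hunting for matching blocks at every offset affordable.

```python
MOD = 1 << 16  # assumed modulus for both halves of the checksum


def weak_checksum(block: bytes):
    """Compute the (a, b) checksum pair of a block from scratch."""
    n = len(block)
    a = sum(block) % MOD
    # The first byte gets weight n, the last byte gets weight 1.
    b = sum((n - i) * x for i, x in enumerate(block)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide an n-byte window one position: drop out_byte, take in in_byte."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - n * out_byte + a) % MOD
    return a, b


def rolling_checksums(data: bytes, n: int):
    """Yield (offset, (a, b)) for every n-byte window in data."""
    if len(data) < n:
        return
    a, b = weak_checksum(data[:n])
    yield 0, (a, b)
    for k in range(1, len(data) - n + 1):
        a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
        yield k, (a, b)
```

A real delta tool would look each weak checksum up in a table of known blocks and confirm candidates with a strong hash before emitting a "copy" instruction.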
      
- Re:What kind of data? (Score:2, Insightful)
  
  by Hao Wu ( 652581 ) writes:
  
  Like saying that a library card is the same thing as a library.
- /dev/zero ? (Score:5, Funny)
  
  by slimey_limey ( 655670 ) writes: <slimey@limey.gmail@com> on Wednesday April 05, 2006 @04:31PM (#15070313) Journal
  
  dd if=/dev/zero bs=1m count=1m | lzop - | gzip -f -| gzip -f - | gzip -f - | wc
  
  gives about three kilobytes for a terabyte of data.
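
A scaled-down way to try the same trick without piping a terabyte through dd. This sketch assumes Python's zlib in place of the lzop and gzip stages, and 8 MiB of zeros instead of a terabyte, so the sizes will not match the three kilobytes quoted above.

```python
import zlib


def squeeze(data: bytes, passes: int = 3):
    """Apply zlib repeatedly and return the size before and after each pass."""
    sizes = [len(data)]
    for _ in range(passes):
        data = zlib.compress(data, 9)
        sizes.append(len(data))
    return sizes


zeros = bytes(8 * 1024 * 1024)  # 8 MiB of zero bytes
print(squeeze(zeros))
```

The second pass helps here only because the first pass's output for all-zero input is itself highly repetitive; that does not carry over to ordinary data.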
  
  - Re:/dev/zero ? (Score:2)
    
    by tigersha ( 151319 ) writes:
    
    What a waste of all the millions of fine engineering man-hours spent at AMD and Intel...
- Re:What kind of data? (Score:5, Insightful)
  
  by devjoe ( 88696 ) writes: on Wednesday April 05, 2006 @04:52PM (#15070545)
  
  Well, there's an idea here that might hold some truth. Note that they are marketing it to data centers, people with LOTS and LOTS of files. Because people tend to have multiple copies of the same files, they can achieve great compression by eliminating the duplicate copies in the archive -- or likewise, any files with large sections that are the same among various files.
  
  20 email accounts subscribed to the same mailing list? Store the bodies of those e-mails only once, and you save a big chunk of disk space. A bunch of people downloaded the same MP3 file? We only need one copy in the archive. As long as there are multiple copies of the same data, it can compress any type of data.
  
  The difference here is that they are taking advantage of the redundancy of files across an entire filesystem (and a HUGE one), rather than the redundancies within an individual file. (I would assume they also do the latter type of compression with a conventional algorithm.) 25x compression seems extreme, but I am sure they can achieve some extra compression here.
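
The whole-file de-duplication idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the `DedupStore` class and its method names are hypothetical, SHA-256 is an arbitrary choice of content hash, and nothing here is taken from the vendor's product.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical contents are kept only once."""

    def __init__(self):
        self.blobs = {}  # content digest -> bytes
        self.names = {}  # file name -> content digest

    def put(self, name: str, content: bytes) -> None:
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # store new content only
        self.names[name] = digest

    def get(self, name: str) -> bytes:
        return self.blobs[self.names[name]]

    def logical_size(self) -> int:
        """Total bytes as the users see them."""
        return sum(len(self.blobs[d]) for d in self.names.values())

    def stored_size(self) -> int:
        """Total bytes actually kept."""
        return sum(len(c) for c in self.blobs.values())


store = DedupStore()
body = b"Subject: list digest\n\n" + b"same message body\n" * 500
for i in range(20):  # 20 accounts subscribed to the same mailing list
    store.put(f"user{i}/inbox/1", body)
print(store.logical_size() / store.stored_size())  # apparent "compression" ratio
```

With 20 identical copies the ratio is 20x without any conventional compression at all, which is why the headline figure says more about the data than about the algorithm.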
  
  - Re:What kind of data? (Score:2)
    
    by a_nonamiss ( 743253 ) writes:
    
    Sounds like the same thing that AMANDA [amanda.org] has been doing since 1997.
    
    What is old is new again.
  - Linux has something related.. RZIP (Score:3, Interesting)
    
    by Convergence ( 64135 ) writes:
    
    Most compression programs use a very limited context. gzip cannot identify and exploit redundancy if it occurs more than 32kb or 64kb apart. bzip2 uses a blocksize of 900kb, and it too cannot identify redundancy more than 900kb apart. rzip [samba.org] however uses a context of 900MB, so it can exploit redundancy within a file, even if it occurs hundreds of megabytes apart.
    Although it's not for every file, sometimes this can be a huge win. In my case, backing up 60 versions of a 700kb XML file, I get 500:1 co
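
A quick way to see the window-size effect for yourself. rzip has no Python binding in the standard library, so this sketch assumes `lzma` as a stand-in for "a compressor with a much larger context"; the block size is arbitrary.

```python
import lzma
import os
import zlib

block = os.urandom(100_000)  # incompressible on its own
data = block + block         # an exact repeat, 100,000 bytes apart

zlib_size = len(zlib.compress(data, 9))  # 32 KiB window: the repeat is out of reach
lzma_size = len(lzma.compress(data))     # default dictionary spans several MiB
print(f"input {len(data)}, zlib {zlib_size}, lzma {lzma_size}")
```

The deflate output stays close to the input size, while the large-dictionary coder should land near the size of a single block, because it can point back at the first copy.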
- Re:What kind of data? (Score:5, Informative)
  
  by tverbeek ( 457094 ) writes: on Wednesday April 05, 2006 @05:00PM (#15070629) Homepage
  
  I just fed Diligent Technology some bogus personal data and downloaded their brochure, and as far as I can tell from a quick gleaning, they achieve these impossible compression ratios across multiple versions of the same data set. So your initial full backup will be compressed at mathematically-possible-in-this-universe ratios, and your subsequent incremental backups - which only store the changes compared to the previous backup - will (with typical data scenarios) be much smaller. It's incremental backups on the byte level, basically.
  So they're not exactly lying about the compression ratios, they're just redefining the term to describe compression not of data-sets but of data-sets-over-time.
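
To make "incremental backups on the byte level" concrete, here is a small sketch. It assumes Python's `difflib.SequenceMatcher` as the differ, which is far too slow for real backups, and the `make_delta` and `apply_delta` helpers are hypothetical names; it is not a description of Diligent's actual method.

```python
from difflib import SequenceMatcher


def make_delta(old: bytes, new: bytes):
    """Describe new as copies from old plus literal bytes."""
    ops = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))      # reuse old[i1:i2]
        elif j2 > j1:
            ops.append(("data", new[j1:j2]))  # bytes not present at this spot in old
        # a pure deletion needs no instruction: those old bytes are never copied
    return ops


def apply_delta(old: bytes, ops) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


# Hypothetical "backup": 200 fixed-format rows, one of which changes overnight.
old = b"".join(b"row %04d: value\n" % i for i in range(200))
new = old.replace(b"row 0100: value", b"row 0100: CHANGED")
delta = make_delta(old, new)
literal = sum(len(op[1]) for op in delta if op[0] == "data")
print(f"full copy: {len(new)} bytes, literal bytes in delta: {literal}")
```

The second "backup" costs only the handful of changed bytes plus a few copy instructions, so the ratio quoted over many generations can look enormous even though the first full backup compresses at ordinary rates.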
  
Breaking news! (Score:3, Insightful)

by ivan256 ( 17499 ) * writes: on Wednesday April 05, 2006 @04:25PM (#15070241)

Company breaks Shannon Limit. Debunking at 11!

Seriously though. Gzip can compress down to 98%... if your data is mostly redundant. The chance that they're doing this on the random data they claim in the article is nil.

- Re:Breaking news! (Score:2)
  
  by zalas ( 682627 ) writes:
  
  This reminds me of the various funny antics people try to claim in comp.compression (for instance, being able to compress everything to something smaller and totally ignoring the pigeon-hole principle) and also of the recent Euclid Discoveries's claim regarding superior quality parametric encoding of video.
- dd if=/dev/urandom of=file bs=10MB count=1 (Score:2)
  
  by vlad_petric ( 94134 ) writes:
  
  compress that :)
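
The same challenge in Python, as a sketch. It assumes `os.urandom` in place of /dev/urandom, zlib as the compressor, and 1 MB rather than 10 MB to keep it quick.

```python
import os
import zlib


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (1.0 or more means no gain)."""
    return len(zlib.compress(data, 9)) / len(data)


random_ratio = ratio(os.urandom(1_000_000))
text_ratio = ratio(b"all work and no play makes jack a dull boy\n" * 20_000)
print(f"random: {random_ratio:.4f}, repetitive text: {text_ratio:.4f}")
```

Random input should come out slightly larger than it went in, while the repetitive text shrinks to a tiny fraction of its size.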
- Re:Breaking news! (Score:5, Funny)
  
  by nizo ( 81281 ) * writes: on Wednesday April 05, 2006 @04:35PM (#15070372) Homepage Journal
  
  Maybe it is lossy compression, which would be really nice when compressing executables and old spreadsheets.
  
  - Re:Breaking news! (Score:3, Funny)
    
    by x2A ( 858210 ) writes:
    
    If only they had this a few years ago during the enron mess, they could have claimed "we didn't fiddle the accounts, we just saved it using lossy compression techniques".
    
    Just like the "our intelligence wasn't wrong about Saddam having WMDs, the satellite images just come to us as lossy JPEGs"
    
    (the point of this post lost due to compression)
- Re:Breaking news! (Score:2)
  
  by alexhs ( 877055 ) writes:
  
  Exactly. And as it is targeted to large data centers, I wonder if they didn't implement some sort of sparse files. Compressing large chunks of 0's sure gives you impressive compression ratios...
- - Re:Breaking news! (Score:5, Interesting)
    
    by Austerity Empowers ( 669817 ) writes: on Wednesday April 05, 2006 @04:54PM (#15070569)
    
    His point is that the Shannon limit provides a mathematical upper bound for how good a lossless compression algorithm can be for arbitrary data sets. gzip gets 98% of that maximum bound, so any algorithm that claims to be 12x that is either not lossless, or not generic. Gzip etc. are all based on several related algorithms known generally as "entropy coders" (http://en.wikipedia.org/wiki/Entropy_coding [wikipedia.org]).
    
    Lossy compression and compression of particular data sets do not have to obey this. With lossy compression you can compress down as far as you can tolerate.
    
    Coding particular sets gets some extra compression by coding some of the data in the compress/decompress utility. For example if all your files have a 1MB standard header and 1KB of data, you can omit the 1MB of header because it's always there, and just send the 1KB of data! Truly amazing compression! Of course it only works under those conditions.
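
The shared-header trick from the last paragraph, as a toy sketch. The `HEADER` contents and the one-byte flag format are invented for illustration; the only point is that the "compression" lives in the tool, not in the file.

```python
# Hypothetical fixed header that every "standard" file starts with.
HEADER = b"STANDARD-HEADER-v1\n" * 1000


def pack(record: bytes) -> bytes:
    """Drop the known header if present; otherwise store the record as-is."""
    if record.startswith(HEADER):
        return b"\x01" + record[len(HEADER):]
    return b"\x00" + record


def unpack(packed: bytes) -> bytes:
    """Restore the original record from the flag byte and the stored body."""
    flag, body = packed[:1], packed[1:]
    return HEADER + body if flag == b"\x01" else body


record = HEADER + b"payload"
print(len(record), "->", len(pack(record)))
```

Any input that does not begin with the expected header grows by one byte, which is the price every lossless scheme pays somewhere.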
    
    - Re:Breaking news! (Score:3, Insightful)
      
      by jthill ( 303417 ) writes:
      
      If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?
      - Re:Breaking news! (Score:3, Interesting)
        
        by evilviper ( 135110 ) writes:
        
        If gzip gets 98% of what's possible, then what the hell are bzip2 and 7zip doing?
        Despite the obvious answer (he's simply wrong), 7zip is somewhat "cheating" in this 3-way comparison, as it uses a much, much, much larger block-size (memory). You can set it to use hundreds of MBs of RAM, whereas gzip and bzip2 are both limited to 9KB max.
        .
        
        Off-topic Rant:
        I was actually quite impressed with 7zip and its lzma/ppmd compression methods when I first saw it compressing better than bzip2. However, once the novelty
- - Re:Breaking news! (Score:3, Insightful)
    
    by Savantissimo ( 893682 ) writes:
    
    Your example can be compressed to the minimal algorithm for the pseudorandom number generator you used plus the seed it used to produce your data.
*sniff* (Score:5, Insightful)

by bryanp ( 160522 ) writes: on Wednesday April 05, 2006 @04:25PM (#15070245)

*sniff* *sniff* *sniff*

I smell ... vapor.

- Re:*sniff* (Score:3)
  
  by darkmeridian ( 119044 ) writes:
  
  My bad.
Limited application (Score:5, Funny)

by Locke2005 ( 849178 ) writes: on Wednesday April 05, 2006 @04:25PM (#15070247)

Yes, it can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.

- Re:Limited application (Score:5, Funny)
  
  by Bull999999 ( 652264 ) writes: on Wednesday April 05, 2006 @04:27PM (#15070275) Journal
  
  I, too, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
  
  - Re:Limited application (Score:3, Funny)
    
    by LNO ( 180595 ) writes:
    
    I, as well, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
    - Re:Limited application (Score:3, Funny)
      
      by Alien Being ( 18488 ) writes:
      
      Wow, *your* algorithm even compresses the moderation!
    - Re:Limited application (Score:5, Funny)
      
      by sprag ( 38460 ) writes: on Wednesday April 05, 2006 @05:00PM (#15070626)
      
      I, as well, welcome our 1/25th of original size overlords... but it only works on hot grits articles, which are highly compressible due to the large amount of petrified data.
      
    - Re:Limited application (Score:2)
      
      by Feanturi ( 99866 ) writes:
      
      I would like to mention that in addition to the others who have voiced their opinion on this subject that likewise, I.. Oh crap.
    - Re:Limited application (Score:4, Funny)
      
      by tshak ( 173364 ) writes: on Wednesday April 05, 2006 @05:08PM (#15070725) Homepage
      
      I, wanting cheap karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
      
      - Re:Limited application (Score:4, Funny)
        
        by networkBoy ( 774728 ) writes: on Wednesday April 05, 2006 @05:44PM (#15071101) Journal
        
        dude, karma whoring funny comments is approaching the usefulness of this compression algo.
        hate to break it to you this way :-)
        -nB
        
      - Re:Limited application (Score:5, Funny)
        
        by complete loony ( 663508 ) writes: <(moc.liamg) (ta) (namekaL.ymereJ)> on Wednesday April 05, 2006 @08:36PM (#15072263)
        
        I, forgetting that funny doesn't give karma, can compress data to 1/25th of original size... but it only works on slashdot articles, which are highly compressible due to the large amount of redundant data.
        
    - - Re:Limited application (Score:2)
        
        by Jason Scott ( 18815 ) writes:
        
        compress25
        
        Re:Limited application (Score:2)
        
        by stupidfoo ( 836212 ) writes:
        
        c25
- Re:Limited application (Score:3, Insightful)
  
  by sprag ( 38460 ) writes:
  
  Wouldn't you get 1/50th, since it seems like every other story is a dupe?
Heard this before (Score:5, Interesting)

by Jordan Catalano ( 915885 ) writes: on Wednesday April 05, 2006 @04:27PM (#15070268) Homepage

Does anyone else remember a "state-of-the-art" fractal compression program that appeared back around 95 or so? It was very impressive at first - you'd compress a four meg file down into a few kilobytes, and it would decompress just fine afterwards... until you deleted the original file. Turns out the program only stored a pointer to the location of the original file on the drive in its output file. I bet more than one person, after thinking they had verified it worked, lost some valuable data.

- Re:Heard this before (Score:2)
  
  by chrismcdirty ( 677039 ) writes:
  
  I don't remember it from the time, but in a post on an inferior "technology" news site last week, there was another bogus compression story in which someone brought up a fractal compression program. But this one moved the original file to a hidden location, and gave you a trojan at the same time! Talk about efficiency!
- Re:Heard this before (Score:2)
  
  by bmwm3nut ( 556681 ) writes:
  
  yeah, i got a copy of that back in the windows 3.1 days, so it was pre 95. i remember that it was shareware and it said that if you bought the full version it would decompress the file even if you deleted the original. i never believed it, and if it did really work, then it'd be around today.
  - Re:Heard this before (Score:2)
    
    by Ex Machina ( 10710 ) writes:
    
    I seem to remember it being in the TigerDirect catalog! :) some things never change
- Re:Heard this before (Score:4, Interesting)
  
  by Orgasmatron ( 8103 ) writes: on Wednesday April 05, 2006 @07:09PM (#15071721)
  
  Yup, that was OWS. You actually could delete the original file, but once it got overwritten, or if it wasn't available, you couldn't deOWS it any more.
  
  Back in the day, I figured out what was going on when I took a disk to another machine and couldn't restore the file. I then tested the disk in the machine I had made the archive on, and it worked fine. It was a good hoax. We all got a good laugh out of it.
  
- - Re:Heard this before - OWS (Score:2, Informative)
    
    by CAR912 ( 788234 ) writes:
    
    This [faqs.org] seems good, otherwise Google for "ows compression OR compress OR compressor [google.com]", and according to this [melbpc.org.au], OWS stands for the author's initials.
The proof... (Score:5, Funny)

by jforest1 ( 966315 ) writes: on Wednesday April 05, 2006 @04:28PM (#15070283)

It's true! It compressed my 10GB collection of ASCII PR0N into 1 meg!

- Re:The proof... (Score:4, Funny)
  
  by Dynedain ( 141758 ) writes: <(slashdot2) (at) (anthonymclin.com)> on Wednesday April 05, 2006 @05:03PM (#15070664) Homepage
  
  The ASCII results:
  
  *
  
  - MOD PARENT DOWN (Score:5, Funny)
    
    by gEvil (beta) ( 945888 ) writes: on Wednesday April 05, 2006 @05:35PM (#15071018)
    
    Mod parent down! Nobody needs to see goatse again...
    
Grain of salt (Score:2)

by GillBates0 ( 664202 ) writes:

Obviously nothing concrete or released yet so take with the requisite grain of salt.
Or at least with 1/25th a grain of salt.
- - Re:Grain of salt (Score:2)
    
    by GillBates0 ( 664202 ) writes:
    
    No, because 25x compression would reduce a hypothetical grain of salt to 1/25 of its size.
right. sure. (Score:3, Interesting)

by Doktor Memory ( 237313 ) writes: on Wednesday April 05, 2006 @04:30PM (#15070296) Journal

Number of companies claiming a breakthrough in compression technology since the release of bzip2: too many to count.

Number of them which were anything other than complete bullshit: 0

I'm not holding my breath.

This post is sooo full of BS (Score:2)

by Khyber ( 864651 ) writes:

They say it will work on anything? Sorry, I don't think so. I can take 2 gigs of straight 0's and compress it into a file with a table, and it would be maybe a kilobyte in size. But, given technology and greed today, I doubt we're breaking the Shannon Limit anytime soon.
- You can do better than that. (Score:2)
  
  by bigtallmofo ( 695287 ) writes:
  
  I can take 2 gigs of straight 0's and compress it into a file with a table, and it would be maybe a kilobyte in size.
  
  Without putting much thought into it, I can even do that. 2 gigs of straight 0's with a real-world algorithm pretty easily compresses down to 12 bytes, far fewer than the kilobyte you quote. You could store it in just: 2000000000x0
  
  Use an abbreviation for 2 billion or other byte-saving tricks and you could compress it down even more.
  
  I suspect such smoke and mirrors is something similar to
  - Re:You can do better than that. (Score:2)
    
    by Directrix1 ( 157787 ) writes:
    
    You must be referring to Run Length Encoding (RLE).
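
For reference, a bare-bones run-length encoder of the kind being discussed. This is a sketch using (count, byte) pairs rather than the "2000000000x0" text notation from the parent post, with 2 MB of zeros standing in for 2 gigs; the function names are made up.

```python
from itertools import groupby


def rle_encode(data: bytes):
    """Collapse each run of identical bytes into a (count, byte) pair."""
    return [(len(list(group)), byte) for byte, group in groupby(data)]


def rle_decode(runs) -> bytes:
    """Expand (count, byte) pairs back into the original bytes."""
    return b"".join(bytes([byte]) * count for count, byte in runs)


zeros = bytes(2_000_000)
print(rle_encode(zeros))                  # one run covers the whole input
print(len(rle_encode(b"abab" * 10)))      # no runs to exploit: one pair per byte
```

The second line of output is the catch: on data without long runs, RLE produces one pair per input byte and makes things bigger, not smaller.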
Currently.. (Score:2)

by Douglas Simmons ( 628988 ) writes:

25 times what? A 25th of the original file? Does it matter if it's already compressed or is it the same on anything? How does bzip stack up on a text file, yo?
Incomplete Article Summary (Score:5, Funny)

by bigtallmofo ( 695287 ) writes: on Wednesday April 05, 2006 @04:31PM (#15070323)

The summary should have read...

StorageMojo is reporting that a company named Practical Nano Cold Fusion Duke Nukem Forever at Storage Networking World in San Diego has made a startling claim of 25x data compression for digital...

Dubious (Score:5, Insightful)

by pilkul ( 667659 ) writes: on Wednesday April 05, 2006 @04:32PM (#15070339)

Stuff like new compression algorithms generally comes out in academic papers, which are then applied in practice by regular programmers. That's what happened with the Burrows-Wheeler algorithm at the core of bzip2. Some company concerned with mostly implementation rather than theory wouldn't come up with a revolutionary advance. The writeup is very vague, but it sounds to me like they're just using a simple LZ type algorithm, and they're only claiming 25x compression if the data is mostly the same already. Well duh.

- Re:Dubious (Score:2)
  
  by glassware ( 195317 ) writes:
  
  Sounds to me like they're a backup company, and they're achieving 25:1 when backing up Windows servers by skipping all the redundant DLLs. Sounds like the author of this article mistook a real company with ridiculous claims about their backup performance for a magic new algorithm.
- Re:Dubious (Score:2)
  
  by tcopeland ( 32225 ) * writes:
  
  > That's what happened with the Burrows-Wheeler algorithm
  
  The Burrows Wheeler Transform is very cool indeed. Brian Ewins used it to make the PMD duplicate code detector [sourceforge.net] much much faster.
sounds like a O(n^n^n) problem. (Score:5, Interesting)

by Ancient_Hacker ( 751168 ) writes: on Wednesday April 05, 2006 @04:33PM (#15070348)
Couple "issues":
- The cost of disk space versus the cost in computer time in finding all the matching substrings. Disk space gets bigger a whole lot faster and easier than CPUs speed up, so even if this idea is economically feasible today, it can only get worse from here.
- This scheme may work just swell with some data streams, but probably pathologically awful with others. A good example: a billion empty records in a database might be compressed to a very few bytes. The system operator relaxes, and lets a log file fill up the rest of the disk. Then a bunch of database records need to be added, or the existing records need some sequential numbering added and guess what? There's no space for the new records, or to expand the existing ones. Argh.
- Re:sounds like a O(n^n^n) problem. (Score:2)
  
  by ignorant_newbie ( 104175 ) writes:
  
  >The system operator relaxes, and lets a log file fill up the rest of the disk.
  
  If your logs are on the same partition (let alone _disk_) as your database files, you deserve this kind of fate.
Shame on you, ScuttleMonkey! (Score:4, Funny)

by RobertB-DC ( 622190 ) * writes: on Wednesday April 05, 2006 @04:35PM (#15070373) Homepage Journal

Posted by ScuttleMonkey on Wed Apr 05, '06 03:23 PM
from the make-sure-to-give-it-to-more-than-just-the- corporate-monkies dept.

You would think that an editor called Scuttle Monkey would know that the correct plural of "Monkey" is "Monkeys", not "Monkies".

"Monkies" would be the plural of "Monkie", which I guess is what you'd call a baby Monk Seal [wheelock.edu], or if you knew him really well, a resident of a Monastery [wikipedia.org]. "Hey, Monkie, nice robe!"

Of course, if you were talking to Michael Nesmith [wikipedia.org], the singular form would be "Monkee". But that's neither here nor there.

- Re:Shame on you, ScuttleMonkey! (Score:2)
  
  by Carthag ( 643047 ) writes:
  
  It could also be the plural of Monky, perhaps?
No, really, it's true! (Score:2)

by dreamchaser ( 49529 ) writes:

Seriously. I hear that they are going to use it with Duke Nukem Forever to fit all the map and texture data onto only 22 DVD's.
A grain of salt? (Score:2)

by thewiz ( 24994 ) * writes:

Obviously nothing concrete or released yet so take with the requisite grain of salt.

Actually, I'd say take the news of this "breakthrough" with a Salt Lick. [wikipedia.org]

I hope it's true, but I'm not holding my breath.
- Re:A grain of salt? (Score:2)
  
  by nizo ( 81281 ) * writes:
  
  If you can't afford your own salt lick, you can probably find one just lying around on the ground at a cattle farm. As an added bonus it is probably chock full of growth hormones and random cow medicines, so enjoy eating it!
Calgary / Canterbury corpus? (Score:4, Interesting)

by Spy der Mann ( 805235 ) writes: <spydermann DOT slashdot AT gmail DOT com> on Wednesday April 05, 2006 @04:36PM (#15070387) Homepage Journal

If they can't compress the canterbury corpus [compression.ca] or calgary corpus [compression.ca] beyond 3X, then it's a SCAM.

Sad truths about data compression. (Score:5, Informative)

by k.a.f. ( 168896 ) writes: on Wednesday April 05, 2006 @04:38PM (#15070405)

1. There can be no algorithm that can compress every stream by a constant factor, let alone by 25. Whoever says otherwise is mistaken or lying.

2. Achievable compression depends on the nature of the input material. Big files (music, movies) these days are already compressed by their respective codecs, so they compress really badly.

3. While there are algorithms that, on average, compress better than others, usually this is paid for by running slower, often much, much slower.

Mmmmmmh, salt.
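
Points 1 and 2 are easy to check with any stock compressor. The sketch below uses Python's standard zlib module purely as a stand-in (it has nothing to do with the product in the article), and both sample inputs are made up for illustration. Random bytes play the role of codec output here, since well-compressed media looks statistically close to random.

```python
import os
import zlib

def ratio(data: bytes, level: int = 9) -> float:
    """Original size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data, level))

# Made-up sample inputs of equal length.
text = b"the quick brown fox jumps over the lazy dog. " * 2000  # highly repetitive
noise = os.urandom(len(text))                                    # no structure to exploit

print(f"repetitive text: {ratio(text):.1f}x")
print(f"random bytes:    {ratio(noise):.3f}x")   # comes out slightly below 1x
```

The same compressor gives wildly different ratios depending only on the input, which is why a single headline number says little without knowing the data.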

OSHI! (Score:2)

by TheRealMindChild ( 743925 ) writes:

de-duplication and calculating and storing only the changes between similar byte streams is apparently the key

Maybe you want to take a gander at RLE [wikipedia.org]
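
For reference, a minimal run-length encoder looks like the sketch below. Note that RLE only collapses adjacent runs of the same byte; de-duplication across similar byte streams, as quoted from the summary, is a different technique, so this is shown only to illustrate what RLE itself does. The function names are hypothetical.

```python
from itertools import groupby

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, byte_value) pair."""
    return [(len(list(run)), value) for value, run in groupby(data)]

def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    """Expand (count, byte_value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)

print(rle_encode(b"aaaabbbcca"))   # [(4, 97), (3, 98), (2, 99), (1, 97)]
print(len(rle_encode(b"abcabc")))  # 6 pairs: no runs, so RLE saves nothing here
```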
25x compression for something repeated 25 times (Score:2, Insightful)

by demon411 ( 827680 ) writes:

Yup, let me just add to others saying that 25x compression is impossible for arbitrary data. It's just an indexing problem: if you have 2 kbyte files (2^16384 possible bit patterns), it is impossible to map them all to (2048/25 ≈) 82-byte files (2^656 possible bit patterns). Good thing the article talks about what data this applies to...(sarcasm)
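
The counting can be checked directly with Python's arbitrary-precision integers. This sketch takes 2 kbyte to mean 2048 bytes (16384 bits) and generously lets the compressor emit any output from 0 up to 82 bytes long; the constant names are just for illustration.

```python
# Counting argument: distinct 2 KiB inputs versus distinct outputs of at most 82 bytes.
INPUT_BYTES = 2048
OUTPUT_BYTES = 82   # 2048 / 25, rounded up

inputs = 2 ** (8 * INPUT_BYTES)                               # every possible 2 KiB file
outputs = sum(2 ** (8 * n) for n in range(OUTPUT_BYTES + 1))  # every length from 0 to 82

print(inputs.bit_length() - 1)    # 16384
print(outputs.bit_length() - 1)   # 656
print(outputs < inputs)           # True
```

Since a lossless scheme must map different inputs to different outputs, only a vanishing fraction of all 2 KiB files can get an output that short.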
Where have we heard this one before? (Score:4, Insightful)

by overshoot ( 39700 ) writes: on Wednesday April 05, 2006 @04:41PM (#15070440)

Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.
A cow-orker asked if it could be used on its own output.

- Re:Where have we heard this one before? (Score:2)
  
  by Prospero's Grue ( 876407 ) writes:
  
  Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.
  Yeah, it's called rm, isn't it? You can even use the flags '-r' for recursive (compress the compression for even more savings) and '-f' for flatten (makes the result occupy even less space than before). Run rm -rf from the root directory and just watch how much disk space frees up. Amazing!
  - Re:Where have we heard this one before? (Score:2)
    
    by ADRA ( 37398 ) writes:
    
    Nah, I'd rather see this recursion:
    
    rm -f /bin/rm
- Re:Where have we heard this one before? (Score:2)
  
  by moochfish ( 822730 ) writes:
  
  Once upon a time, my VP bought into a firm that had discovered a guaranteed-perfect compression algorithm: it would reduce the size of any data file, no exceptions.
  
  A cow-orker asked if it could be used on its own output.
  
  Answer: Sure! But decompressing the data is still under development.
I've always imagined this conversation (Score:5, Funny)

by jfengel ( 409917 ) writes: on Wednesday April 05, 2006 @04:43PM (#15070461) Homepage Journal

Developers: We've got some really good ideas for reducing backup space by using compression and incremental backups.

Marketing: How much in the best conceivable case?

Developers: Oh, I dunno, maybe 25x.

Marketing: 25x? Is that good?

Developers: Yeah, I suppose, but the cool stuff is...

Marketing: Wow! 25x! That's a really big number!

Developers: Actually, please don't quote me on that. They'll make fun of me on Slashdot if you do. Promise me.

Marketing: We promise.

Developers: Thanks. Now, let me show you where the good stuff is...

Marketing (on phone): Larry? It's me. How big can you print me up a poster that says "25x"?

- Re:I've always imagined this conversation (Score:3)
  
  by spun ( 1352 ) writes:
  
  Someone please mod this "insightful" as opposed to funny (which it also is.) Does anyone doubt that this is pretty much how it happened?
  
  Comedian Bill Hicks had the most insightful proposal for marketing types:
  
  "By the way if anyone here is in advertising or marketing... kill yourself. No, no, no it's just a little thought. I'm just trying to plant seeds. Maybe one day, they'll take root - I don't know. You try, you do what you can. Kill yourself. Seriously though, if you are, do. Aaah, no really, there's no
damn people! (Score:2)

by rhaig ( 24891 ) writes:

RTFA.

of course you can do this. Look at datadomain.com.

they expect 20-80x compression because they're marketing themselves as backup to disk (doing repetitive full backups). you get the same patterns over and over again.

and whoever posted the RLE wikipedia article, thank you for understanding the solution.

and no, everything isn't going to compress 25x, but everything will compress some. There are repeated bitstreams in everything. a 64bit string has a finite number of patterns. I don't know how small th
- Re:damn people! (Score:2)
  
  by C_Kode ( 102755 ) writes:
  
  they expect 20-80x compression because they're marketing themselves as backup to disk (doing repetitive full backups). you get the same patterns over and over again.
  
  Hmm, I don't like the thought of all my backups utilizing a single copy of a pattern that happens a million times. Imagine: you have 30 days of backups, and a single pattern occurs 25,000 times between all 30 backups. You get block errors where that single pattern exists on the disk, thereby destroying all 30 backups. Now, I can understand ke
What's that smell in the air? Oh yeah, Bullshit. (Score:2)

by Senjutsu ( 614542 ) writes:

Further, since the software operates on byte-streams, it can compress anything: email, databases, archives, mp3's, encrypted data or whatever weird data format your favorite program uses.

It can compress anything!1111 Even already compressed mp3s and encrypted data, both of which have a high degree of data entropy, and are essentially incompressible!

Magical compression for everyone!!
This definitely works (Score:5, Funny)

by All Names Have Been ( 629775 ) writes: on Wednesday April 05, 2006 @04:45PM (#15070487)

I can tell you, this technology definitely works. I've seen them compress random data streams to 1/25th (even 1/30th!!) their size. This works *TODAY*. Coming out real soon now is the software that allows you to decompress your data. This is still in development.

Great job Slashdot... (Score:2)

by X ( 1235 ) writes:

Sigh, this is nothing more than a non-redundant store. Very similar to stuff already offered by a number of vendors, even Microsoft. The "fast way to know what's already on disk" is just to store hashes of the data in an index. Move along, nothing to see here......
Visit the Diligent WebSite and learn.... (Score:5, Informative)

by sherpajohn ( 113531 ) writes: on Wednesday April 05, 2006 @04:54PM (#15070568) Homepage

....I mean jeez. They are not in the file compression business, they are in the "data protection" business. Specifically disk based backup. They make NO claim regarding "data compression" - the 25X claim is explicitly in regards to the disk space required to back up data. What they say is that using their solution can lead to a 25x smaller disk space requirement for backups. It may involve some new compression algorithms, but appears to be more based on never backing up the same data more than once.

- Re:Visit the Diligent WebSite and learn.... (Score:3, Insightful)
  
  by noidentity ( 188756 ) writes:
  
  Now that sounds more reasonable. Instead of putting the incremental backup smarts on the client side, put it on the server side. This way the client can use whatever old scheme is handy, perhaps a plain file copy, and let the server sort out the redundancy with data already copied previously. Only the server has to contain the complex algorithms, so there's less of an opportunity for screw-ups.
  
  That blog entry smells artificial, though. Very calculated. Right about here, I become wary:
  
  "The way Diligent achie
Results of Search in 1976-present db for: (Score:2)

by Rogerborg ( 306625 ) writes:

"diligent technologies": 0 patents. [uspto.gov]
TFA (Score:4, Insightful)

by pcosta ( 236434 ) writes: on Wednesday April 05, 2006 @05:01PM (#15070639)

If everybody stopped laughing and actually RTFA, they aren't claiming 25x compression on anything. The algorithm is targeted at data backup, i.e. very large files, and works by comparing incoming data patterns to patterns already stored. Looks like a modification of LZH that uses the compressed file as the pattern table. I'm not saying that it works or that it is a breakthrough, but they are not claiming impossible lossless compression on anything. It might actually be interesting for the application it was designed for.

To those who're wondering... (Score:3, Insightful)

by TrumpetPower! ( 190615 ) writes: <ben@trumpetpower.com> on Wednesday April 05, 2006 @05:02PM (#15070643) Homepage

If you're wondering why this is pure bullshit, this might help.

Lossless compression is nothing more than an algorithmic lookup table. It's a substitution cipher like what you find in famous quote puzzles.

Take two different messages. Compress each. When you decompress them, you have to get two different messages back, right? So you need two different messages in compressed form. If your compressed message uses the same symbolic representation as the uncompressed message--and, since we're talking ones and zeros here with computers, that's exactly the case--then it should quickly be apparent that, for any given length message, there're so many possible permutations of symbols to create a message...and you need exactly that same number of permutations in compressed form to be able to re-create any possible message.

Compression is handy because we tend to restrict ourselves to a tiny subset of the possible number of messages. If you have a huge library but only ever touch a small handful of books, you only need to carry around the first drawer of the first card cabinet. You can even pretend that the other umpteen hundred drawers don't even exist.

It's the same with text. You only need six bits to store most of the frequently-used characters in text, but we sometimes use a lot more than just the standard characters so they get written on disk using eight bits each. English doesn't even use every permutation of two-letter words, let alone twenty-letter ones, so there's a lot of wasted space there. You only need about eighteen bits to store enough positions for every word in the dictionary. A good compression algorithm for text will make that kind of a look-up table optimized for written English at the expense of other kinds of data. ``The'' would be in the first drawer of the cabinet, but ``uyazxavzfnnzranghrrt'' wouldn't be listed at all. If you actually wrote ``uyazxavzfnnzranghrrt'' in your document, the compression algorithm would fall back to storing it in its uncompressed form.

Also, don't overlook the overhead of the data of the algorithm itself. If you've got a program that could compress a 100 Mbyte file down to 1 Mbyte...but the compression software itself took several gigabytes of space, that ain't gonna do you much good. It's sometimes helpful to think of it in terms of the smallest self-contained program that could create the desired output. An infinite number of threes is easy; just divide 1 by three. Pi is a bit more complex, but only just. The complete works of Shakespeare is going to have a lot more overhead for a pretty short message. And ``uyazxavzfnnzranghrrt'' might even have so much overhead for such a short message that ``compression'' just makes it bigger.

Cheers,

b&
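
The look-up table with a fallback described above can be sketched as a toy word coder. Everything here is hypothetical: the eight-word table stands in for a full dictionary, and the cost model (one flag bit, a fixed-width index, one length byte per literal) is an assumption chosen for simplicity, not how any real compressor is laid out.

```python
import math

# Hypothetical toy table; a real one would hold a whole dictionary.
WORDS = ["the", "of", "and", "to", "a", "in", "is", "it"]
INDEX = {word: i for i, word in enumerate(WORDS)}
INDEX_BITS = math.ceil(math.log2(len(WORDS)))   # 3 bits for 8 words

def encode(text: str) -> list[tuple[str, object]]:
    """Known words become ('W', index); unknown words fall back to ('L', word)."""
    return [("W", INDEX[w]) if w in INDEX else ("L", w) for w in text.split(" ")]

def decode(tokens: list[tuple[str, object]]) -> str:
    return " ".join(WORDS[v] if kind == "W" else v for kind, v in tokens)

def encoded_bits(tokens: list[tuple[str, object]]) -> int:
    """One flag bit per token, plus an index or a length byte and the raw bytes."""
    total = 0
    for kind, value in tokens:
        total += 1
        total += INDEX_BITS if kind == "W" else 8 * (1 + len(value))
    return total

for word in ("the", "uyazxavzfnnzranghrrt"):
    print(word, 8 * len(word), "->", encoded_bits(encode(word)), "bits")
```

Under these assumptions "the" shrinks from 24 bits to 4, while the nonsense word grows from 160 bits to 169, which is the overhead effect the post ends on. With a table of around 250,000 words the same index calculation gives 18 bits.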

Reminds me of "fractal compression." (Score:2)

by erroneus ( 253617 ) writes:

I remember years ago there was this horrible "joke" program. It claimed to compress files down to some amazingly small sizes. You could "compress" the file, then erase it, and "expand" the compressed file and it seemed to work just fine! It was done by recording the sectors on disk that a file occupied. So yeah, you can delete it and "restore" it... but try emailing that compressed file? Or expanding it a week later!

The description of the process sounds pretty good, but then again, so too does the medi
4000:1 compression (Score:2)

by AYeomans ( 322504 ) writes:

There's a very simple way to get much better compression - simply store the SHA-256 hash of every file instead. My average file size is about 126 Kbyte, so that's a 4000:1 compression.

OK, OK, you still have to store a full version of each file (or a traditionally compressed version). So for a single PC it doesn't make sense. But for an enterprise there are thousands of copies of those Windows OS files, tens or hundreds of those Powerpoint presentations, scatter-gun emails, etc - so why not just store them j
For Christ's sake, Slashdot editors (Score:2)

by osgeek ( 239988 ) writes:
Please add "startling data compression" to a list of filters for obviously bullshit articles that have no business even getting attention from Slashdot. The people who submit the articles are either complete suckers or Google AdSense whores. The list should also contain:
- flying automobile/car
- holographic data storage
- Duke Nukem Forever
- perpetual motion engine
Lossless and Reliable? (Score:2)

by sbaker ( 47485 ) * writes:

It's certainly possible (for some types of data) to perform LOSSY compression down to 25:1 - but this system is a backup system...you don't want lossy compression in a backup system!! So let's assume these guys are talking lossless compression.

The best current compression algorithms for English text come close to 10:1 lossless compression - so there is hope that their system could do that well.

Even simple run-length encoding will manage spectacular compression ratios well over 100:1 on images that are diag
Might work for typical back-up (Score:3, Informative)

by porttikivi ( 93246 ) * writes: on Wednesday April 05, 2006 @05:53PM (#15071189)

The article talks about backup. The idea could be, that instead of managing incremental backups you just optimize compression of data that is similar to old data. In that way you can do "full" backups, but actually save only incremental backup worth of data.

See http://en.wikipedia.org/wiki/Venti [wikipedia.org] for similar ideas in a system that easily achieves 25x compression for typical archival storage. When a file has been changed only those 512 kbyte blocks that are really new are saved, other blocks are just mapped by their SHA1 hashes to existing blocks. So files with small changes, very similar files and files sharing common parts will all compress very nicely. In a multi-user system the files of different users tend to also have lots of similar parts: same emails, same office documents with perhaps minor changes, same reference material / tools / libraries as personal copies etc.

My guess is TFA refers to a re-invention of this wheel, most likely in an inferior way.
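
The hash-addressed block idea can be sketched in a few lines. This is an illustration of the general technique, not Venti's or Diligent's actual design: the BlockStore class, the 4096-byte block size and the made-up nightly workload are all assumptions.

```python
import hashlib
import os

BLOCK_SIZE = 4096   # assumed block size for this sketch

class BlockStore:
    """Keeps one copy of each distinct block, keyed by its SHA-1 digest."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, bytes] = {}   # digest -> block contents

    def put(self, data: bytes) -> list[bytes]:
        """Store data and return the digests needed to rebuild it."""
        recipe = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha1(block).digest()
            self.blocks.setdefault(digest, block)   # only unseen blocks cost space
            recipe.append(digest)
        return recipe

    def get(self, recipe: list[bytes]) -> bytes:
        return b"".join(self.blocks[d] for d in recipe)

    def stored_bytes(self) -> int:
        return sum(len(b) for b in self.blocks.values())

# Made-up workload: 30 nightly "full" backups, one block changing per night.
store = BlockStore()
image = bytearray(os.urandom(100 * BLOCK_SIZE))   # random, so zlib-style tricks would not help
logical = 0
recipes = []
for night in range(30):
    image[night * BLOCK_SIZE] ^= 0xFF
    recipes.append(store.put(bytes(image)))
    logical += len(image)

print(f"{logical / store.stored_bytes():.1f}x")   # about 23x for this workload
```

The ratio comes entirely from repeated fulls; a single backup of this data would not shrink at all. Fixed-size blocks also stop lining up if bytes are inserted in the middle of a file, which is one reason real systems tend to be more elaborate.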

Entirely possible (Score:3, Informative)

by Coward Anonymous ( 110649 ) writes: on Wednesday April 05, 2006 @06:18PM (#15071392)

This is entirely possible and they are not the only ones doing it, for example http://www.datadomain.com/ [datadomain.com] has been doing it for a while. The big storage vendors do it to some extent as well.
The idea is based on "de-duplication" of data and is only really practical for backups (where most data from backup to backup is identical) or central repositories of data for a large organization that has multiple similar data sets, for example, many installations of Windows that are often similar.
From my experience x25 is a bold claim for general data. I've seen small scale tests that showed x30 compression over backup sets but those implementations had performance issues.
From the description in their white-paper, despite their claims, it appears they are, by definition, performing some kind of hash (i.e. mapping a space to a smaller space).

it's a CVS!! (Score:4, Informative)

by TheLoneCabbage ( 323135 ) writes: on Thursday April 06, 2006 @04:47AM (#15074304) Homepage

This is a backup system, not single-file compression (although for framed data like video, email, etc. the compression scheme is still clever).

Basically it's a CVS: if you're backing up multiple computers or user directories, you're going to see tons of repeated files; heck, they'll even have the same names. Saving the diffs is a good idea. And not at all difficult to duplicate.

For instance, what if you were doing backups for a team of animators? Their files are HUGE, but 90% of the frames will be identical between the individual systems. (Indeed, the frames will likely be very similar to one another.) You could get far more than 25x compression that way. The big downside of this idea is the memory & CPU vs. speed trade-off. You can't use this kind of system to back up to a tape or DVD system; it needs to be random access media.

You could probably get nearly the same results by hacking rsync and diffing identical file names in different directories. Possible bonus for diffing files of similar file type.

It's a clever idea, not a radical new technology.
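
"Saving the diffs" can be sketched with Python's standard difflib. This is only an illustration of delta storage under simple assumptions: the function names and the copy/data delta format are invented here, and SequenceMatcher is far too slow for real backup volumes, where rsync-style rolling checksums would be the usual choice.

```python
import os
from difflib import SequenceMatcher

def make_delta(old: bytes, new: bytes) -> list[tuple]:
    """Describe `new` as copies of ranges from `old` plus literal bytes."""
    delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            delta.append(("data", new[j1:j2]))
        # "delete" needs no entry: those old bytes are simply never copied
    return delta

def apply_delta(old: bytes, delta: list[tuple]) -> bytes:
    """Rebuild the new version from the old version and the delta."""
    out = bytearray()
    for op in delta:
        out += old[op[1]:op[2]] if op[0] == "copy" else op[1]
    return bytes(out)

def literal_bytes(delta: list[tuple]) -> int:
    """Bytes that actually have to be stored for the new version."""
    return sum(len(op[1]) for op in delta if op[0] == "data")

# Made-up example: a 2000-byte file with a small insertion in the middle.
old = os.urandom(2000)
new = old[:1000] + b"CHANGED" + old[1000:]
delta = make_delta(old, new)

print(apply_delta(old, delta) == new)                       # True
print(literal_bytes(delta), "new bytes out of", len(new))
```

Only the inserted bytes need storing alongside the old copy, which is where large ratios on near-identical files come from.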

- Re:100X - 1000X (Score:4, Informative)
  
  by irritating environme ( 529534 ) writes: on Wednesday April 05, 2006 @05:00PM (#15070628)
  
  This is completely false. There are fundamental mathematical limits to the amount you can compress data in a lossless format. In fact, each compression format usually has overhead on the file to store the mapping data to decode/decompress it. That overhead plus the compressed file is usually less than the original file, until you run the compressor once or twice. Then the file doesn't compress at all, and the compression record overhead actually increases the overall file size.
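
The effect of running a compressor over its own output is easy to observe. The sketch below uses Python's zlib as a stand-in compressor and a made-up input of random words from a small vocabulary (an assumption chosen so the first pass has something to exploit without being trivially periodic).

```python
import random
import zlib

rng = random.Random(0)
vocab = [b"backup", b"tape", b"disk", b"block", b"hash", b"salt", b"monkey", b"delta"]
data = b" ".join(rng.choice(vocab) for _ in range(20000))

sizes = [len(data)]
for _ in range(3):
    data = zlib.compress(data, 9)   # feed each pass its own previous output
    sizes.append(len(data))

print(sizes)   # the first pass shrinks the data; later passes typically stay flat or grow
```

After the first pass the output is close to random, so later passes have nothing left to remove and can only add their own header and checksum bytes.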
  