First Hutter Prize Awarded

stefanb writes, "The Hutter Prize for Lossless Compression of Human Knowledge, an ongoing challenge to compress a 100-MB excerpt of the Wikipedia, has been awarded for the first time. Alexander Ratushnyak managed to improve the compression factor to 5.86 and will receive a 3,416-Euro award. Being able to compress knowledge well is believed to be related to acting intelligently." The Usenet announcement notes that at Ratushnyak's request, part of the prize will go to Przemyslaw Skibinski of the University of Wroclaw Institute of Computer Science, for his early contributions to the PAQ compression algorithm.
  • by Salvance ( 1014001 ) * on Sunday October 29, 2006 @10:17PM (#16637656) Homepage Journal
    It turns out that Alexander Ratushnyak was the only person to even meet the challenge's guidelines, and one of only 3 people who submitted entries. Seems like a strange thing to award up to 50,000 Euro for ... his compression was only 20% smaller than what I just quickly did with gzip. I'm rather surprised that more people didn't try though ... it's not like people are throwing money at data compression researchers.
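
    For anyone curious about that kind of quick baseline, here is a minimal sketch (it assumes enwik8, the 10^8-byte test file, is sitting in the current directory, and it measures plain gzip only; the Hutter Prize entries are far more elaborate):

        # Quick-and-dirty gzip baseline on enwik8. Loads the whole 100 MB
        # file into memory, so it needs a machine with some headroom.
        import gzip

        with open("enwik8", "rb") as f:
            data = f.read()

        packed = gzip.compress(data, compresslevel=9)
        print("original size:     ", len(data))
        print("gzip -9 size:      ", len(packed))
        print("compression factor:", round(len(data) / len(packed), 2))
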
  • by Raul654 ( 453029 ) on Sunday October 29, 2006 @10:24PM (#16637722) Homepage
    There aren't many Wikipedia-related things I don't know about (I'm a Wikipedia administrator, arbitrator, bureaucrat, and featured article director, and I'm on the Wikimedia Foundation's Press Committee), and this is the first time I've ever heard of this contest. I think it's fair to say it's been relatively low-profile.
  • Lookup table? (Score:0, Interesting)

    by Anonymous Coward on Sunday October 29, 2006 @10:53PM (#16637974)
    I wonder if you could just store all the common English words in a lookup table indexed by a number (is 32 bits enough?). Every word would then get replaced by its number, and the result would be compressed further. The only catch is that the client needs a copy of the lookup table, which might be feasible.
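
    A minimal sketch of that idea (the tiny word list, the tag/length token format, and the gzip pass afterwards are all illustrative assumptions, not a worked-out design):

        import gzip, re, struct

        # Toy "common words" table shared by encoder and decoder. A real
        # version would hold tens of thousands of entries, hence the
        # question about 32-bit indices.
        WORDS = ["the", "of", "and", "in", "to", "a", "is", "was", "for", "as"]
        INDEX = {w: i for i, w in enumerate(WORDS)}

        def encode(text):
            # Replace known words with a 4-byte index; pass everything else
            # through as a length-prefixed literal so the transform stays
            # lossless. A generic compressor then runs over the result.
            out = bytearray()
            for token in re.findall(r"\w+|\W+", text):
                if token in INDEX:
                    out += b"W" + struct.pack(">I", INDEX[token])
                else:
                    raw = token.encode("utf-8")
                    out += b"L" + struct.pack(">I", len(raw)) + raw
            return gzip.compress(bytes(out))

        def decode(blob):
            data = gzip.decompress(blob)
            out, i = [], 0
            while i < len(data):
                tag, i = data[i:i+1], i + 1
                if tag == b"W":
                    (idx,), i = struct.unpack(">I", data[i:i+4]), i + 4
                    out.append(WORDS[idx])
                else:
                    (n,), i = struct.unpack(">I", data[i:i+4]), i + 4
                    out.append(data[i:i+n].decode("utf-8"))
                    i += n
            return "".join(out)

        sample = "the cat sat in the hat"
        assert decode(encode(sample)) == sample
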
  • Done on 9/25/06? (Score:3, Interesting)

    by SillySnake ( 727102 ) on Monday October 30, 2006 @01:02AM (#16638629)
    According to http://prize.hutter1.net/ [hutter1.net], this happened on September 25, 2006.

    The site also gives some of the requirements (a quick check of the payout formula against the summary's award follows the list):
    Create a compressed version (self-extracting archive) of the 100MB file enwik8 of less than 17MB. More precisely:

            * Create a Linux or Windows executable of size S < L := 17'073'018 = previous record.
            * If run, it produces (without input from other sources) a 10^8-byte file that is identical to enwik8.
            * If we can verify your claim, you are eligible for a prize of 50'000€×(1-S/L). Minimum claim is 500€.
            * Restrictions: Must run in 10 hours on a 2GHz P4 with 1GB RAM and 10GB free HD.
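
    Working the payout formula backwards from the 3,416-Euro award in the summary gives a rough feel for the improvement it represents (a back-of-the-envelope check, not official arithmetic):

        # Prize formula from the rules above: payout = 50'000 EUR x (1 - S/L).
        # Solve for the implied fractional improvement over the previous record.
        award = 3416.0
        improvement = award / 50000.0                       # = 1 - S/L
        print(f"implied improvement: {improvement:.1%}")    # roughly 6.8%
        print(f"implied S/L ratio:   {1 - improvement:.4f}")
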
  • by adrianmonk ( 890071 ) on Monday October 30, 2006 @02:58AM (#16639191)
    "How does a computer algorithm for compression relate to the idea that human compression of knowledge corresponds to intelligent action? I think the summary is making a mistake in terminology. An intelligent person is able to compress information and access it quickly, but that doesn't really have much to do with a more efficient compression algorithm, unless we're talking some seriously advanced AI. I just really don't understand the correlation or what the summary is trying to imply..."

    Nope, I had heard about the contest before seeing it today on Slashdot. The summary is fine. That is in fact the thesis that the contest is designed to investigate.

    As for why this is the case, I spent some time studying compression two or three years ago, and one of the things that you quickly realize is that there is no such thing as a truly general-purpose compression algorithm. Compression algorithms work on the principle that there is some underlying pattern or structure to your data. Once the structure is understood, the data can be transformed into a smaller format. Think of compressed files as a set of parameters to a magic function which generates a larger file. The real art is in finding the function.
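
    Run-length encoding is about the simplest concrete instance of that view: the "parameters" are (symbol, run length) pairs, and the "magic function" is the expander that regenerates the larger file. A toy sketch:

        from itertools import groupby

        def compress(data):
            # The "parameters": one (symbol, run length) pair per run.
            return [(sym, len(list(run))) for sym, run in groupby(data)]

        def decompress(params):
            # The "magic function": rebuild the larger file from the parameters.
            return "".join(sym * count for sym, count in params)

        original = "aaaabbbccccccd"
        params = compress(original)
        print(params)          # [('a', 4), ('b', 3), ('c', 6), ('d', 1)]
        assert decompress(params) == original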

    To give a concrete example, one of the simplest forms of compression is Huffman coding. You analyze your stream of symbols, and you realize that some occur more often than others. You then assign shorter bit strings to more frequently-occurring symbols and longer ones to less frequent symbols. This gives you a net gain. You were able to do this because you had the insight that in human language, some letters (like "e") occur more frequently than others (like "q").
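
    A bare-bones sketch of that construction (not any particular real compressor; ties between equal frequencies are broken arbitrarily):

        import heapq
        from collections import Counter

        def huffman_codes(text):
            # Build a prefix code: frequent symbols get shorter bit strings.
            freq = Counter(text)
            # Heap entries: (frequency, tie-breaker, {symbol: code-so-far}).
            heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
            heapq.heapify(heap)
            tie = len(heap)
            while len(heap) > 1:
                f1, _, codes1 = heapq.heappop(heap)
                f2, _, codes2 = heapq.heappop(heap)
                # Prepend one bit to every code in each merged subtree.
                merged = {s: "0" + c for s, c in codes1.items()}
                merged.update({s: "1" + c for s, c in codes2.items()})
                heapq.heappush(heap, (f1 + f2, tie, merged))
                tie += 1
            return heap[0][2]

        codes = huffman_codes("this is an example of huffman coding")
        for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
            print(repr(sym), code)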

    There are, of course, other patterns that can be exploited. You can take the frequency-distribution trick above to another level by noting that the frequency of a symbol is not really independent of the symbol before it. For example, the most likely symbol after a "q" is "u". Sure, other symbols can follow, but "u" is by far the most likely, and you can exploit that to get better compression.
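
    That conditional skew is easy to see just by counting pairs (a rough sketch; it assumes enwik8, or any large chunk of English text, is available locally):

        from collections import Counter, defaultdict

        # Read the first megabyte of enwik8 and count which letter tends to
        # follow which: a simple order-1 (one-symbol context) model.
        text = open("enwik8", "rb").read(10**6).decode("utf-8", "ignore").lower()

        following = defaultdict(Counter)
        for prev, cur in zip(text, text[1:]):
            if prev.isalpha() and cur.isalpha():
                following[prev][cur] += 1

        # After a 'q', 'u' should dominate; after 'e', the spread is flatter.
        for context in "qe":
            counts = following[context]
            total = sum(counts.values())
            if not total:
                continue
            top = counts.most_common(3)
            print(context, "->", [(c, round(n / total, 2)) for c, n in top])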

    But of course, you might be compressing something other than English text, and there are different techniques for whatever kind of data you have. A row of pixels in a faxed (1-bit) image tends to be very close to the same as the previous row. Each pixel is a single bit. If you represent each row not as raw on/off values but as each pixel XORed with the pixel directly above it, you end up with 0 bits for every pixel that hasn't changed and 1 bits for every pixel that has. Presto, you have just skewed the probability distribution radically towards 0 bits, and you can use further tricks to gain compression from that.
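
    A tiny illustration with made-up rows (only the XOR step; the "further tricks" would be an entropy coder running over the result):

        # Two adjacent rows of a 1-bit image, as lists of bits. They differ
        # in only one place, which is typical for scanned pages.
        row_above = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
        row       = [0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

        # XOR each pixel with the one directly above it: unchanged pixels
        # become 0, changed pixels become 1.
        xored = [a ^ b for a, b in zip(row_above, row)]
        print(xored)                                  # [0,0,0,0,0,1,0,0,0,0,0,0]
        print("ones:", sum(xored), "of", len(xored))  # heavily skewed toward 0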

    The point of all this is that to come up with these tricks, you have to understand something about the data. One of the better definitions I've ever heard for data compression was simply "applied modeling".

    Given that, you have to ask a question: have we reached the point today where compression algorithms are as good as they're going to get at compressing English text based on its surface-level characteristics, such as character frequencies and repeated strings? Have we exhausted the low-level modeling that is possible? There has been a lot of work on this, so it's quite possible we have. And if so, then any further gains in the compression ratio could very well come from some sort of higher-level modeling. Maybe even modeling the knowledge rather than just the language. And that, as I understand it, is what this contest is about.

    It's not certain that a winning contest entry necessarily requires the submitter to have developed some kind of machine intelligence. But it's an intriguing enough idea that it might be worthwhile to run the contest just to see if something interesting does happen.

    By the way, for some interesting reading on this subject, look up Kolmogorov complexity. Wikipedia seems to have a pretty decent article [wikipedia.org] on it (although I haven't read the whole thing).
