Forgot your password?
typodupeerror

Compress Wikipedia and Win AI Prize 324

Posted by CmdrTaco
from the what-does-this-mean dept.
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning win some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize including a description of the equivalence of compression and general intelligence."
This discussion has been archived. No new comments can be posted.

Compress Wikipedia and Win AI Prize

Comments Filter:
  • Painful to read (Score:4, Insightful)

    by CuriHP (741480) on Sunday August 13, 2006 @06:54PM (#15899793)
    For the love of god, proofread!

  • lossy compression (Score:5, Insightful)

    by RenoRelife (884299) on Sunday August 13, 2006 @06:55PM (#15899798)
    Using the same data lossy compressed, with an algorithm that was able to permute data in a similar way to the human mind, seems like it would come closer to real intelligence than the lossless compression would
  • by Fred Porry (993637) on Sunday August 13, 2006 @07:08PM (#15899831)
    Then it would be an encyclopedia, not a Wiki, thats another point why I say: forget about it. Would be nice though. ;)
  • by blueadept1 (844312) on Sunday August 13, 2006 @07:20PM (#15899875)
    That is actually an interesting idea. What if you added a layer of compression that converted every possible common acronym, made contractions, etc...
  • Re:Painful to read (Score:2, Insightful)

    by Threni (635302) on Sunday August 13, 2006 @07:24PM (#15899889)
    > For the love of god, proofread!

    Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.
  • by CastrTroy (595695) on Sunday August 13, 2006 @07:25PM (#15899896) Homepage
    Well, since it's currently only 1 Gig, you could probably put it on a flash card and read it from a handheld. It wouldn't be an ipod. but probably wouldn't require destroying a perfectly good piece of equipment either. You could probably even get weekly updates (hopefully in a diff file) to make sure your copy is in sync with the rest of the internet. Now that I think about it, this would be a really good application. There's lots of times when I'd like to look up something off wikipedia, but not connected to the internet.
  • by Anonymous Coward on Sunday August 13, 2006 @07:49PM (#15899959)
    Funny? That's most intelligent and insightful remark I've seen here in months, albeit rather naively stated.
    The human brain is a fuzzy clustering algorithm, that's what neural networks do, they reduce the space of a large
    data set by eliminating redundancy and mapping only the salient features of it onto a smaller data set, which in bio systems
    is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed a neural compression algorithm would be capable of immense data reduction. The downsides are that, like us humans, they may be unreliable/non-deterministic in retrieving data because of this fuzzyness. It would also be able make "sideways" associations and draw inferences from the data set, which in essence would be a weak form of artificial intelligence. Now give him his +5 insightful you silly people.
  • by Vo0k (760020) on Sunday August 13, 2006 @07:54PM (#15899974) Journal
    that's one piece, but not necessarily - "lossy" nature of human mind compression can be overcome by "additional checks".

    Lossy relational/dictionary based compression is the base. You hardly ever remember text by order of letters or sound of voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) to reproduce the exact text from rewording the meaning. So you understand meaning of sentence, store it as relation to known meanings (pointers to other entries!) then when recalling, you put it back in words, and for exact citation you try to match possible wordings against remembered pattern.

    So imagine this: the compressor analyzes a sentence lexically, spots regularities and irregularities, transforms the sentence into a relational set of tokens containing the meaning of the sentence, which are small and easy to store, unambigiously describe the meaning of the sentence but don't contain exact wording. Then an extra checksum of the text of the sentence is added.

    Decompressor tries to build a sentence that reflects given idea according to rules of grammar, picking different synonyms, word ordering and such, and then brute-forcing or otherwise matching against the checksum to find which of the created sentences matches exactly.

    Look, the best compressor in the world:
    sha1sum /boot/vmlinuz
    647fb0def3809a37f001d26abe58e7f900718c46 /boot/vmlinuz

    Linux kernel compressed to set: { string: "647fb0def3809a37f001d26abe58e7f900718c46", info: "it's a Linux kernel for i386" }

    You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically. So theoretically you could use the hash to rebuild THE kernel just by brute-force creating random files and checking them if they match both the hash and "general properties of a kernel".

    The problem obviously lies in unrealistic "brute force" part. The subset of possible rebuilds of the data must be heavily limited. You can do this by lossy compression that allows for limited "informed guess" results - ones that make sense in context of a linux kernel - style, compiler optimizations, use of macros transformed into repeatable binary code. And have the original analysed using the same methods before compression, storing all inconsistencies with the model separately.

    So the compression file would consist of:
    - a set of logical tokens describing meaning of given piece of data (in relation to the rest of "knowledge base"
    - a set of exceptions (where logic fails)
    - a checksum or other pattern allowing to verify an exact match.

    Most of lossy compressions are meant to obfuscate the lost data. If you use one that instead allows for rebuilding lost data according to certain limited ruleset ("informed guess" + verification) you'd get a lossless compression of comparable efficiency.
  • by Anonymous Coward on Sunday August 13, 2006 @08:03PM (#15900005)
    If you're unfamiliar with the current state of the art in a field that's under intensive study, it's a near guarantee that your new ideas either don't work or are already a very basic part of the current technology. There's really no shortcut to doing the background reading and understanding how things are currently done before setting out to improve them. If you want a starting point in this area of study, here's one [wikipedia.org]; once you've gotten through that you can start on the remaining 22 years' worth of research.
  • by Anonymous Coward on Sunday August 13, 2006 @08:21PM (#15900062)
    The premise of this contest shows a remarkable ignorance of compression theory and technology. True, in theory the more knowledge one has about the world, the better one should be able to compress, but only in the most abstract sense. In practice the vast majority of the bits any "image" of the world like a picture or a piece of text are almost wholly unrelated to high-level features of the world, and thus compression algorithms rightly focus on low-level features of the "image" that are almost wholly separate from the features AI researchers care about.

    For example, suppose one wishes to compress two consecutive pictures of a chess tournament. Does knowing the rules of chess help much? Sure, one might use knowledge of chess to deduce a few bits (literally) of information about the second image (what move was made), but when trying to compress two 10-megabit images, who cares? Much better to focus on low-level features such as world smoothnesss, lighting variations, motion flow, et cetera.

    Similarly for text and speech. How much does understanding the topic of conversation help? Not much, compared with the knowledge of the most recent several words, which is why all good compression (and prediction) algorithms essentially ignore "understanding" and focus on carefully calculating the mixing of probabilities derived from low-frequency observations.

    The winner of a Wiki-compression contest is going to be a variation of an ordinary text compression algorithm, and will use techniques that do not in any obvious way translate to "smart" applications, in the same way that the highly successful N-gram models of speech recognition and machine translation do not have any properties one would normally associate with intelligence.

    One can build an "intelligent" compressor, but it's the LAST thing any compression researcher is going to do, because they know high-level intelligence doesn't have many bits of information to provide.

  • by Ignis Flatus (689403) on Sunday August 13, 2006 @08:22PM (#15900071)
    I think the original premise is wrong. Real world intelligence is not lossless. The algorithms only have to be right most of the time to be effective. And our intelligence is incredibly redundant. If you want robust AI, you're going to have to accept redundancy and imperfection. Same goes for data transmission. Sure, you compress, but then you also add in self-error correcting codes with a level on redundancy based on the known reliability of the network.
  • Re:Painful to read (Score:3, Insightful)

    by ameline (771895) <ian.ameline@gmailMOSCOW.com minus city> on Sunday August 13, 2006 @10:49PM (#15900515) Homepage Journal
    Agreed -- why can they not MOTIVATE us instead?

    No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.

  • by giafly (926567) on Sunday August 13, 2006 @10:57PM (#15900538)
    The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids.
    A "Hutter AI" will be at a disadvantage when competing against an opponent which knows it's acting as above and can do the same calculations. Under these circumstances, the opponent will be one step ahead. The Hutter AI is predictable and so can be outmanoeuvered. Hence the Hutter AI's moves are not optimal.

    Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).

    Occam's razor (also spelled Ockham's razor) is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham. Originally a tenet of the reductionist philosophy of nominalism, it is more often taken today as a heuristic maxim that advises economy, parsimony, or simplicity in scientific theories. Occam's razor states that the explanation of any phenomenon should make as few assumptions as possible - Wikepedia [wikipedia.org]
    Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.
  • by swillden (191260) * <shawn-ds@willden.org> on Sunday August 13, 2006 @11:14PM (#15900595) Homepage Journal

    You just need to re-create afile that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely any other file that can be recognized as some kind of Linux kernel and matches. Of course there are countless blocks of data that still match, but very few will follow the ruleset of "ELF kernel executable" structure which can be deduced numerically.

    Mmmm, no. You were fine up until you said "very few will follow the ruleset". That's not true. To see that it's not true, take your kernel, which consists of around 10 million bits. Now find, say, 512 of those bits that can be changed, independently, while still producing a valid-looking kernel executable. The result doesn't even have to be a valid, runnable kernel, but it wouldn't be too hard to do it even with that restriction.

    So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be four kernels for each hash value, and the odds are very, very good that your "real" kernel's hash is also matched by at least one of them. If by some chance it isn't, I can always generate a whole bunch more kernel variants. How about 2^2^10 of them?

    A hash plus a filter ruleset does not constitute a lossless compression of a large file, even if computation power/time is unbounded.

  • by Anonymous Coward on Monday August 14, 2006 @07:09AM (#15901599)
    We should make a distinction between practical compression algorithms that operate under time constraints, and the hypothetical compressor that uses an extremely complicated ("AI") model.

    Consider the compression of text using the best available N-gram models and whatever is the state of the art right now. If you use the same stochastic model to generate a random text of moderate length, you will end up with nonsense with a probability of approximately 100%. This means that most of the potential compressed bitstrings that can be output by the state-of-the-art compressor are useless, and that a better compressor could in principle be more efficient.

    Of course, the fact that better compression is possible in principle does not mean that it is achievable in reasonable time on anything resembling our current computer technology.

I bet the human brain is a kludge. -- Marvin Minsky

Working...