
Compress Wikipedia and Win AI Prize

Posted by CmdrTaco
from the what-does-this-mean dept.
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge, the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize, including a description of the equivalence of compression and general intelligence."
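
The link between prediction and compression mentioned in the summary can be illustrated with a small sketch. It assumes a toy adaptive character model (not paq8f and not anything from Hutter's or Mahoney's material) and sums the ideal code length, -log2(p), of each character given the preceding `order` characters. The function name `code_length_bits` and the add-one smoothing are illustrative choices only.

```python
import math
from collections import defaultdict


def code_length_bits(text: str, order: int) -> float:
    """Ideal code length (in bits) of `text` under an adaptive
    order-`order` character model with add-one smoothing.

    Toy illustration: a model that predicts the next character
    better assigns it a higher probability, hence fewer bits.
    """
    # Assumption: the alphabet is known to both encoder and decoder.
    alphabet_size = len(set(text))
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, ch in enumerate(text):
        context = text[max(0, i - order):i]
        seen = counts[context]
        # Probability of `ch` given the context, from counts so far.
        p = (seen[ch] + 1) / (sum(seen.values()) + alphabet_size)
        total_bits += -math.log2(p)
        # Update the model only after coding, as a decoder would.
        seen[ch] += 1
    return total_bits
```

On repetitive text, a model that conditions on more context should predict better and so need fewer bits, for example comparing `code_length_bits(text, 0)` with `code_length_bits(text, 2)`.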
  • WikiPedia on iPod! (Score:2, Interesting)

    by network23 (802733) * on Sunday August 13, 2006 @06:51PM (#15899786) Journal

    I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying [sourceforge.net]

    info.edu.org [edu.org] - Speedy information and news from the Top 10 educational organisations.

  • by Harmonious Botch (921977) on Sunday August 13, 2006 @07:09PM (#15899835) Homepage Journal
    "The basic theory...is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program." In a finite discrete environment (like SHRDLU: put the red cylinder on top of the blue box) that may be possible. But in the real world the problem is knowing that one's observations are all - or even a significant percentage - of the possible observations.
    This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.

    TFA is a neat idea theoretically, but its progeny will never be able to leave the lab.

    --
    I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.
  • by aiken_d (127097) <brooks@tangent[ ]com ['ry.' in gap]> on Sunday August 13, 2006 @07:15PM (#15899857) Homepage
    Given that the hypothesis is valid (which is arguable), it seems to me that compressing Wikipedia is a fairly useless way of supporting it. It seems like an abstraction error: Wikipedia is *not* a set of rules that predict the observations in it. It's a list of observations, sure, but there's no ruleset involved. Now, someone/thing who can read and parse language can get educated based on the knowledge in Wikipedia, but then the intelligence is providing the ruleset, just training itself with the raw data in Wikipedia.

    It really seems like one of those mistaking-the-map-for-the-territory errors.

    -b
  • by larry bagina (561269) on Sunday August 13, 2006 @07:28PM (#15899901) Journal
    1) it wouldn't be lossless and 2) most compression techniques use a dictionary of commonly used words.
  • Re:Painful to read (Score:1, Interesting)

    by Anonymous Coward on Sunday August 13, 2006 @08:03PM (#15900006)
    Mahoney is a professor at my college. I've taken his classes. He talks just how he writes.
  • by Anonymous Coward on Sunday August 13, 2006 @08:10PM (#15900020)
    I would argue that lossless compression really is not the best measure of intelligence. Humans are inherently lossy in nature. Everything we see, hear, fear, smell, and taste is pared down to its essentials when we understand it. It is this process of discarding irrelevant details and making generalizations that is truly intelligence. If our minds had lossless compression we could regurgitate textbooks, but never be able to apply the knowledge contained within. If we really understand, we could reproduce what we've read, but not verbatim. A better measure of intelligence would be lossy text compression that still retains the knowledge contained within the corpus.
  • by kognate (322256) on Sunday August 13, 2006 @09:45PM (#15900333)
    Yeah, but you can use Turbo codes to achieve near Shannon limit, and you don't have to worry too much about the addition of the ECC. Remember kids: study that math, you never know when information theory can suddenly pay off.

    Just to help (and so you don't think I made Turbo Codes up -- it sounds like I did 'cause it's such a bad name):
    http://en.wikipedia.org/wiki/Turbo_code [wikipedia.org]
  • by Anonymous Coward on Monday August 14, 2006 @02:27AM (#15901005)
    Hutter naming this theory after himself is silly. It's called the theory of "minimum description length" and it's been around for a while (well before the 2004 copyright on Hutter's book). The idea is to find some model which minimizes the sum: size of model + incorrectness of model's predictions. The Linguistica [uchicago.edu] folks use it, and in fact would probably be in the running for this prize.
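
The "size of model + incorrectness of model's predictions" sum described above can be made concrete with a minimal two-part-code sketch. This is a toy under stated assumptions, not Hutter's formulation and not what Linguistica does: the "model" is a single coin-bias parameter stored with `k` bits of precision, and the "incorrectness" is the ideal number of bits needed to encode the data given that parameter. The names `data_cost_bits` and `mdl_choose_precision` are hypothetical.

```python
import math


def data_cost_bits(bits, p):
    """Ideal code length of a 0/1 sequence under a Bernoulli(p) model."""
    ones = sum(bits)
    zeros = len(bits) - ones
    return -(ones * math.log2(p) + zeros * math.log2(1 - p))


def mdl_choose_precision(bits, max_precision=16):
    """Pick the parameter precision k minimizing
    (model bits) + (data bits given the model).

    Returns a tuple (k, p, total_bits).
    """
    freq = sum(bits) / len(bits)
    best = None
    for k in range(1, max_precision + 1):
        levels = 2 ** k
        # Quantize the bias to a grid point j / 2**k, keeping it
        # strictly between 0 and 1 so every sequence stays encodable.
        j = min(max(round(freq * levels), 1), levels - 1)
        p = j / levels
        total = k + data_cost_bits(bits, p)  # model cost + data cost
        if best is None or total < best[2]:
            best = (k, p, total)
    return best
```

A more precise parameter fits the data better but costs more bits to state, so the minimum of the sum lands at a moderate precision rather than the maximum.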
  • Total compression (Score:2, Interesting)

    by HTH NE1 (675604) on Monday August 14, 2006 @11:36AM (#15903064)
    If you think you can compress a 100M sample of Wikipedia better than paq8f

    Anyone can write a program that can compress that sample down to zero bytes. The simplest such program will be slightly bigger than the sample, however, and could only be used to decompress that sample.

    Down to one byte, it could work with up to 256 different samples, but only those, and would still be slightly bigger than the sum of those 256 samples.

    (Basically, given a byte, regurgitate the whole text which was precompiled into the program.)

    A condition of the contest should be that the combination of the program and compressed data should be smaller than both the uncompressed data and the combination of paq8f and the compressed data.

    Dare I recoin a phrase: Any sufficiently advanced compression algorithm is indistinguishable from a filing system.
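
The accounting argued for in this comment (count the decompressor together with the compressed data) can be sketched as follows. It is a rough illustration only: `zlib` stands in for a real compressor such as paq8f, the "decompressor" is represented by a short string of Python source, and the function names are hypothetical.

```python
import zlib


def cheating_scheme(sample: bytes):
    """'Compress' to zero bytes by baking the sample into the program.

    Returns (decompressor_size, compressed_size) in bytes.
    """
    decompressor_source = (
        b"import sys; sys.stdout.buffer.write(" + repr(sample).encode() + b")"
    )
    compressed = b""  # nothing left to store
    return len(decompressor_source), len(compressed)


def honest_scheme(sample: bytes):
    """Generic compressor (zlib as a stand-in) with a small fixed decompressor.

    Returns (decompressor_size, compressed_size) in bytes.
    """
    decompressor_source = (
        b"import sys, zlib; "
        b"sys.stdout.buffer.write(zlib.decompress(sys.stdin.buffer.read()))"
    )
    compressed = zlib.compress(sample, 9)
    return len(decompressor_source), len(compressed)


def total(sizes):
    """Score a scheme by program size plus compressed data size."""
    program_size, data_size = sizes
    return program_size + data_size
```

Scored this way, the zero-byte trick always comes out larger than the sample itself, while a general-purpose compressor only wins if the data is actually compressible.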
