Compress Wikipedia and Win AI Prize
Baldrson writes "If you think you can compress a 100M sample of Wikipedia better than paq8f, then you might want to try winning some of a (at present) 50,000 Euro purse. Marcus Hutter has announced the Hutter Prize for Lossless Compression of Human Knowledge, the intent of which is to incentivize the advancement of AI through the exploitation of Hutter's theory of optimal universal artificial intelligence. The basic theory, for which Hutter provides a proof, is that after any set of observations the optimal move by an AI is to find the smallest program that predicts those observations and then assume its environment is controlled by that program. Think of it as Ockham's Razor on steroids. Matt Mahoney provides a writeup of the rationale for the prize, including a description of the equivalence of compression and general intelligence."
WikiPedia on iPod! (Score:2, Interesting)
I'd love to be able to have the whole WikiPedia available on my iPod (or cell phone), but without destroying [sourceforge.net]
Re:WikiPedia on iPod! (Score:3, Insightful)
wikicast (Score:4, Funny)
So, we need a WikiCast - remember folks, you heard it here first!
Re:WikiPedia on iPod! (Score:4, Insightful)
Re:WikiPedia on iPod! (Score:2)
You didn't RTFA, did you?
Re:WikiPedia on iPod! (Score:4, Funny)
Re:WikiPedia on iPod! (Score:2)
Re:WikiPedia on iPod! (Score:2)
It's the whole instant-access thing.
But captain (Score:5, Funny)
But captain, if we reverse the tachyon inverter drives then we will have insufficient dilithium crystals to traverse the neutrino warp.
Re:But captain (Score:5, Funny)
Re:But captain (Score:2)
Re:But captain (Score:2, Offtopic)
Re:But captain (Score:2)
Re:But captain (Score:2)
Painful to read (Score:4, Insightful)
Re:Painful to read (Score:2, Insightful)
Yeah, I just read the write-up twice and have no idea if this is an AI contest, something to do with compression, or what. In fact, all I can remember now is the word "incentivize" which is the sort of thing I expect some bullshit salesman at work to say.
Re:Painful to read (Score:3, Insightful)
No, they need to verbize another noun when there was a perfectly good word in the language that means *exactly* what they want. feh.
Re:Not sure if that's a joke. (Score:4, Funny)
Re:Painful to read (Score:5, Funny)
He did, but Slashdot's AI compressed it for him.
:-D
lossy compression (Score:5, Insightful)
Re:lossy compression (Score:3, Insightful)
The human brain is a fuzzy clustering algorithm; that's what neural networks do. They reduce the space of a large data set by eliminating redundancy and mapping only its salient features onto a smaller data set, which in biological systems is the weights for triggering sodium/potassium spikes at a given junction. If such a thing existed, a neural compression algorithm would be capable of immense data reduction.
Re:lossy compression (Score:3, Insightful)
Lossy relational/dictionary-based compression is the base. You hardly ever remember text by the order of its letters or the sound of a voice reading it. You remember the meaning of a sentence, plus optionally some rough pattern (like voice rhythm) to reproduce the exact text by rewording the meaning. So you understand the meaning of a sentence and store it as a relation to known meanings (pointers to other entries).
Re:lossy compression (Score:5, Insightful)
You just need to re-create a file that matches the md5sum and still follows the rules of a Linux kernel. It is extremely unlikely that any other file can be recognized as some kind of Linux kernel and still match. Of course there are countless blocks of data that still match the hash, but very few will follow the ruleset of "ELF kernel executable" structure, which can be deduced numerically.
Mmmm, no. You were fine up until you said "very few will follow the ruleset". That's not true. To see that it's not true, take your kernel, which consists of around 10 million bits. Now find, say, 512 of those bits that can be changed, independently, while still producing a valid-looking kernel executable. The result doesn't even have to be a valid, runnable kernel, but it wouldn't be too hard to do it even with that restriction.
So you now have 2^512 variants of the Linux kernel, all of which look like a valid kernel. But there are only 2^128 possible hashes, so, on average, there will be 2^384 kernels for each hash value, and the odds are very, very good that your "real" kernel's hash is also matched by at least one of them. If by some chance it isn't, I can always generate a whole bunch more kernel variants. How about 2^2^10 of them?
A hash plus a filter ruleset does not constitute a lossless compression of a large file, even if computation power/time is unbounded.
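The pigeonhole argument above can be demonstrated at toy scale. This sketch truncates MD5 to 16 bits (a made-up "tiny hash" for the demo; the reasoning is identical for 128-bit hashes once the variant count dwarfs 2^128) and shows that collisions are unavoidable once inputs outnumber hash values:

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    # 16-bit truncation of MD5: small enough that collisions show up
    # at demo scale; the argument scales to full 128-bit hashes when
    # there are 2^512 variants and only 2^128 hash values.
    return int.from_bytes(hashlib.md5(data).digest()[:2], "big")

# 100,000 distinct "kernel variants", but only 2**16 = 65,536 possible
# hash values, so at least 100,000 - 65,536 = 34,464 collisions must occur.
seen = set()
collisions = 0
for i in range(100_000):
    variant = b"kernel-variant-" + i.to_bytes(4, "big")
    if tiny_hash(variant) in seen:
        collisions += 1
    seen.add(tiny_hash(variant))

print(collisions)   # at least 34,464, by pigeonhole
```

No filter ruleset can rescue this: the ruleset would have to reject all but one preimage per hash, which is exactly the information the hash threw away.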
Re:lossy compression (Score:2)
As long as it is Wiki that we are talking about... (Score:4, Funny)
There. All of wiki, in 31 bytes.
Who'da thunk... (Score:5, Funny)
Re:Who'da thunk... (Score:2)
Easy! (Score:2, Funny)
Lossy Compression? (Score:4, Funny)
Re:Lossy Compression? (Score:2, Insightful)
Re:Lossy Compression? (Score:2, Interesting)
Would be useful for images (Score:5, Funny)
Comparison (Score:2, Informative)
Re:Comparison (Score:2)
Re:Comparison (Score:2)
Re:Comparison (Score:3, Funny)
Have no fear though, I'm working on a new one.
It's a big world out there (Score:5, Interesting)
This - in humans, at least - can lead to the cyclic reinforcement of one's belief system. The belief system that explains observations initially is used to filter observations later.
TFA is a neat idea theoretically, but it's progeny will never be able to leave the lab.
--
I figured out how to get a second 120-byte sig! Mod me up and I'll tell you how you can have one too.
Re:It's a big world out there (Score:2)
This is precisely the assumption of Hutter's theory.
Chapter 2 of his book "Simplicity & Uncertainty" deals with this in more detail but the link provided does do an adequate job of stating:
Re:It's a big world out there (Score:2)
There is no allowance for lossy compression. The requirement of lossless compression is there for precisely the reason you state.
Re:It's a big world out there (Score:2)
"it is progeny"? Damn, I thought we'd fixed that bug. Back to the lab with you!
Re:It's a big world out there (Score:2)
No, that's just deductive science. I (or we, as a society) haven't tested that every cup, glass and plate in my kitchen (or the world) is affected by gravity, but I'm preeeeeeetty sure they are.
The problem - and the really hard AI problem - is that there is no single "program", there's in fact several billion independent "programs" running. These "programs" operate i
Re:It's a big world out there (Score:5, Funny)
Your use of "TFA" is a good compressional technique, but you could change "it's" to "its" and actually GAIN in meaning while losing a character! You're well on your way...
Re:It's a big world out there (Score:5, Informative)
In it, Jaynes shows that an optimal decision maker shares this same tendency of reinforcing existing belief systems. He even gives examples where new information reinforces the beliefs of optimal observers who have reached opposite conclusions (due to differing initial sets of data). Each observer believes the new data further supports their own view.
Since even an optimal decision maker has this undesirable trait, I don't think the existence of this trait is a good criterion for rejecting decision-making models.
Re:It's a big world out there (Score:3, Informative)
I think what Hutter has shown is that there is a solution which unifies the new data with the old within a new optimum, which is most likely unique. I think
Re:It's a big world out there (Score:2)
This effect is often seen clearly in the case I described before and you quoted:
Local optima (Score:2)
Re:It's a big world out there (Score:3, Insightful)
Re:It's a big world out there (Score:2)
Right: then all you need to do is run the data through the AI system and make a list of the few times it is wrong. This would be a small list. Then add this list to the end of the compressed data. If the AI is any good, you should still have fantastic compression.
Re:It's a big world out there (Score:4, Interesting)
Just to help (and so you don't think I made Turbo Codes up -- it sounds like I did 'cause it's such a bad name)
http://en.wikipedia.org/wiki/Turbo_code [wikipedia.org]
Cool... (Score:2)
Re:It's a big world out there (Score:2)
Hoo-ray for empiricism! Which as Hume pointed out is circular reasoning.
It does keep one from getting hit by buses, and is the driving force behind the alarm clock industry.
So it's not a total waste, nevermind the nasty philosophical bit.
Comment removed (Score:4, Interesting)
Re:Er, I'm not so sure about this. (Score:2, Offtopic)
The distinction between compressed experience and rules is an illusion. Rules _are_ compressed experience, in the same sense that "x+y=z" is a compressed representation of the table of all ordered pairs (x,y) of numbers and their sums.
Solution. (Score:5, Funny)
Then just apply your personal favourite compression utility.
I like lharc, which according to Wikipedia was invented in 1904 as a result of bombarding President Lincoln, who plays Commander Tucker in Star Trek: Enterprise with neutrinos.
That's easy (Score:2)
Re:That's easy (Score:2)
The word you should be using is, well, "word." That's what we used to call the bus width of the system. Today, though, the world is much more complicated. Lots of different buses, lots of different bus widths. CISC instruction sets are more CISC than they used to be (outside of mainframes).
So really, that doesn't apply very well even in the way you intended it.
Well.. doesn't the dictionary make it smaller??? (Score:2)
Incentivize? (Score:5, Funny)
Sorry, anything which uses the word "incentivize" does not involve intelligence, natural or artificial.
Re:Incentivize? (Score:2)
The root word is "incentive" and it wasn't until the last few decades that people came up with "incent", "incentivize", and -- god forbid -- "disincent".
May I suggest these alternate words:
- encourage
- give incentive
- influence
- motivate
- stimulate
Hell, even "prod" is a better word. Now let us raise our torches and pitchforks and put these rogue words to rest.
I'll try: (Score:5, Funny)
Mine wins as it is roughly 40 bytes total. To get your results, you simply need to run the self-extracting archive, and wait. Be warned, it will take a while, but that is the cost of such a great compression scheme!
Re:I'll try: (Score:4, Funny)
Here's one that's even shorter, but you have to type in the decryption key exactly right.
Reductionary Transformation Theory (Score:2)
Is lossless really best (Score:2, Interesting)
Bingo (Score:2)
Minor disagreement with what you said... The details are relevant; they just aren't important enough to store in and of themselves. Sort of like mathematics where you only need to lear
Hutter's theory? (Score:2)
He cites Solomonov (Score:2)
wikipedia (Score:2)
C++ (Score:3, Funny)
A (good) sign of the times, I guess.
My entry (Score:2)
how about.... (Score:2)
# bzip2 wikipedia.tar
Remove words and punctuation (Score:2)
The above text is a demonstration. No punctuation. Left out "and" and "the". Since some entries are hard to read, the fill-in-the-blank interpretation adds to the experience!
If this new compression is added to the
Could the winning entry... (Score:2)
I've got it! (Score:2)
And there was me thinking.. (Score:2, Funny)
yeah, you guessed it..
42...
There is a theory that (Score:2)
The problem is that it is non-trivial to find the proper algorithm and parameters. However, if trial-and-error computation is not a problem, a large library of potential generators can be tried with all possible parameters.
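A trial-and-error search over a generator library, as described above, might look like this sketch. The two-entry "library" and parameter ranges are made up for illustration; a real search would need a far larger space and a tie-break on description length:

```python
from itertools import product

def arithmetic(a, d, n):
    # a, a+d, a+2d, ...
    return [a + d * i for i in range(n)]

def geometric(a, r, n):
    # a, a*r, a*r^2, ...
    return [a * r ** i for i in range(n)]

GENERATORS = {"arith": arithmetic, "geom": geometric}

def find_generator(seq, param_range=range(-5, 6)):
    # Brute force: try every generator with every parameter pair and
    # return the first (name, params) that reproduces the sequence.
    for name, gen in GENERATORS.items():
        for a, p in product(param_range, repeat=2):
            if gen(a, p, len(seq)) == seq:
                return name, (a, p)
    return None

print(find_generator([3, 5, 7, 9, 11]))   # ('arith', (3, 2))
print(find_generator([2, 4, 8, 16]))      # ('geom', (2, 2))
```

The found (name, parameters) pair is the "compressed" form: a few bytes standing in for a sequence of any length the generator can reproduce.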
Compressing text (Score:2)
The way I see it, there's mainly one good way to compress text. (Disclaimer: I do not know if this idea has ever been used before, but if it has, I'd be glad to know about it.) Let's say you index the values of the most frequently used characters, and that you keep a 'joker' value to insert a custom character after; let's say it sums up to about 45 indexed characters. You replace each character by its index value, and instead of having it in base 256, you calculate it all in base 45.
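The base-45 idea above might be sketched as follows. The 44-character alphabet here is my guess at a plausible frequency ordering, with digit 44 reserved as the "joker" escape; since 45*45 = 2025 >= 256, any escaped byte fits in two extra digits:

```python
import math

# Hypothetical 44 common characters plus one escape ("joker") digit.
ALPHABET = "etaoinshrdlcumwfgypbvkjxqz 0123456789.,;:'!?"
ESCAPE = 44
CHAR_TO_DIGIT = {c: i for i, c in enumerate(ALPHABET)}

def to_digits(text: str) -> list[int]:
    digits = []
    for ch in text:
        if ch in CHAR_TO_DIGIT:
            digits.append(CHAR_TO_DIGIT[ch])
        else:
            # Escape: joker digit, then the byte split into two base-45
            # digits (assumes single-byte characters for this sketch).
            b = ord(ch)
            digits.extend([ESCAPE, b // 45, b % 45])
    return digits

def from_digits(digits: list[int]) -> str:
    out, i = [], 0
    while i < len(digits):
        if digits[i] == ESCAPE:
            out.append(chr(digits[i + 1] * 45 + digits[i + 2]))
            i += 3
        else:
            out.append(ALPHABET[digits[i]])
            i += 1
    return "".join(out)

text = "the rain in spain stays mainly in the plain"
assert from_digits(to_digits(text)) == text      # lossless round-trip
# Each base-45 digit costs log2(45) ~ 5.49 bits instead of 8 bits per byte:
print(round(math.log2(45) / 8, 3))               # 0.686 of the original size
```

So common text packs to roughly two-thirds of its byte size, which is real but well short of what context-modeling compressors like paq8f achieve.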
Doug in a dress? (Score:2)
Hutter's Theory - Disproved (Score:4, Insightful)
Human poker players address this issue by deliberately introducing slight randomness into their play. I think a "Hutter AI" will make better real-world decisions if it does the same (see Game Theory).
Occam's razor is also highly suspect. There's the issue of cultural bias when counting assumptions. And all programmers will be aware of how they fixed "the bug" that caused all the problems in an application, only to find there were other bugs that caused identical symptoms.
Compress Wikipedia and win a prize? (Score:5, Funny)
Re:for those who rtfa (Score:2, Informative)
18MB
b) how many bytes was wikipedia before it was compressed
A sample of 100MB
Your goal:
.
KFG
Re:Can it be "lossy" compression? (Score:5, Funny)
Re:Can it be "lossy" compression? (Score:2)
Re:Can it be "lossy" compression? (Score:3, Informative)
I downloaded the corpus, and indeed, you're right -- it's 10^8 bytes. The article is incorrect: it says 100M where it means 95.3M.
This inconsistency doesn't have any effect on the challenge, though -- that 50kEUR[1] is offered for compressing the given data corpus, not for compressing a string of 100MB.
[1] 1kEUR=1000EUR. 1M EUR=1000000EUR. 1KB=1024B. 1MB=1048576B.
And by the way, what about fixing Slash to finally allow Unicode -- either natively or at least as
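The arithmetic behind that footnote, spelled out as a quick check:

```python
corpus = 10**8                     # the contest corpus is exactly 10^8 bytes
print(corpus // 10**6)             # 100 -> "100 MB" in decimal (SI) megabytes
print(round(corpus / 2**20, 1))    # 95.4 -> binary megabytes (MiB); truncated, the "95.3M" above
```

Both figures describe the same file; only the definition of "M" differs.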
Wrong contest (Score:4, Informative)
The contest for the Hutter Prize requires the compressed corpus to be a self-extracting archive -- or, failing that, to add the size of the compressor to the compressed corpus.
Self-extracting on what platform? (Score:2)
For what architecture?
Barebones Windows or Linux (Score:3, Informative)
Points are not awarded for attempting to circumvent the intent of the competition. I expect such attempts would result in disqualification.
Re:Can it be "lossy" compression? (Score:5, Funny)
Re:Can it be "lossy" compression? (Score:2)
Re:Can it be "lossy" compression? (Score:2)
Re:Easy compression rule (Score:2)
Huh? What do you think the compression software is doing right now? It searches through the file for blocks of similar information, and then replaces those blocks with pointers to an index, where it stores the block.
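What the parent describes is essentially LZ77-style dictionary compression. A toy sketch (not any real compressor's format) that emits literals and (distance, length) pointers to earlier copies of the same block:

```python
def lz_compress(data: str, window: int = 64):
    # Minimal LZ77-style sketch: emit either a literal character or a
    # (distance, length) pointer back to an earlier copy of the block.
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= 3:                  # pointers only pay off past a few chars
            out.append(("ptr", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz_decompress(tokens) -> str:
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):        # copies may overlap their own output
                out.append(out[len(out) - dist])
    return "".join(out)

data = "abcabcabcabcxyz"
tokens = lz_compress(data)
assert lz_decompress(tokens) == data       # lossless round-trip
print(len(tokens))                         # 7 tokens for 15 characters
```

Real compressors (gzip, bzip2, paq8f) layer entropy coding and far better models on top, but the pointer-to-earlier-block idea is the same.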
Re:Easy compression rule (Score:2)
Very simple way to win this contest:
Re:Easy compression rule (Score:2)
Very simple way to win this contest:...
The only problem is that your index will be as big as your original file.
Let's say you want to store your ten-digit phone number using that PI technique. A ten-digit phone number will be found at an average index of 10^10, or 10,000,000,000; the problem is that this index is 11 digits long, so you're really not winning. For the 100 MB corpus, you could hope to find your index around 256^100,000,000, which would take about 100,000,000 bytes (the original size) to store.
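The point above can be checked empirically. This sketch uses a random digit stream as a stand-in for pi (whose digits are believed to behave statistically like random digits): the index at which a k-digit string first appears needs about k digits to write down, so "compress to an index into pi" saves nothing on average.

```python
import random

random.seed(42)
# One million random decimal digits standing in for pi's expansion.
stream = "".join(random.choice("0123456789") for _ in range(10**6))

# Average first-occurrence position of random 4-digit strings.
positions = []
for _ in range(200):
    target = "".join(random.choice("0123456789") for _ in range(4))
    pos = stream.find(target)
    if pos >= 0:
        positions.append(pos)

avg = sum(positions) / len(positions)
print(round(avg))   # typically near 10**4: a 4-digit string costs a ~5-digit index
```

The index grows exponentially with the length of the data you look up, which is exactly why the scheme breaks even at best.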
Re:Easy compression rule (Score:2)
Well... maybe. It depends a lot on what data you are trying to compress. As a best-case scenario, it can compress the entire decimal expansion of PI down to a single digit: "0". Infinite digits compressed to a single byte, not bad eh?
Re:Easy compression rule (Score:2)
Also, you would have to either store enough digits of pi (which would defeat any compression ideas), or calculate it "on site", which would be
Re:Easy compression rule (Score:2)
Re:I can convert the data to 1 bit. (Score:2)
libwikipedia? (Score:2)
What libraries is this executable allowed to call?
Yes (Score:2)
Re:lzip! (Score:2)
Bah, lzip is nothing compared to azip [bebits.com], which has all the infinite-compression goodness of lzip, but also supports lossless decompression. The downside is that it currently only runs under BeOS, but it comes with source so it should be easy enough to port to (whatever).
Re:Good idea (Score:2)