
How I Completed The $5000 Compression Challenge
Patrick Craig points to this page which details (in email) the strange life of an online challenge as posed, clarified and (at least to some eyes) met successfully. This should give pause to anyone willing to pose -- or accept -- seemingly impossible problems. In short, Patrick's not $5,000 richer, but you may think he deserves to be.
Re:Intriguing (Score:2)
which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).
But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.
Thus, given 2000 (or any arbitrary number) of 1 Mbit samples (also arbitrary) from a true random number generator, upon examination one of them MAY contain all 0s. You still have no way to predict the contents of the others (the all-0s file has exactly the same probability as any other arrangement of bits), and no basis to predict the content of the remaining files.
If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.
Re:sorry! (Score:2)
It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.
The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossibility. It's hard to call something a scam when the 'brochure' explains that you will certainly lose your money.
It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursively no less)!
Re:Intriguing (Score:2)
I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.
Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.
What is "total file size"? (Score:2)
Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.
Re:"to be disbursed at my sole discretion" (Score:2)
However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)
No, and it's provably false (Score:2)
There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.
OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that handles any stream and compresses your favorite stream to a single bit. As an example, the output is a single 1 bit if the input is your stream, and a 0 followed by the entire stream for any other stream.
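A minimal Python sketch of that scheme, working at byte rather than bit granularity for simplicity (FAVORITE is a placeholder for whatever stream the codec is built around):

FAVORITE = b"\x00" * 1024  # hypothetical "favorite" stream baked into the codec

def compress(data: bytes) -> bytes:
    # one marker byte for the favorite stream; every other stream grows by one byte
    return b"\x01" if data == FAVORITE else b"\x00" + data

def decompress(blob: bytes) -> bytes:
    return FAVORITE if blob == b"\x01" else blob[1:]

assert decompress(compress(FAVORITE)) == FAVORITE
assert decompress(compress(b"anything else")) == b"anything else"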
hawk
Re:Compression (Score:2)
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.
The bet can be won; not in every instance of course, but in the long run.
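A minimal Python sketch of that decompressor, assuming the original data contained a run of 1024 zero bytes; "prefix.bin" and SUFFIX are placeholders for the two pieces the "compressor" split off:

import sys

SUFFIX = b"..."  # placeholder: the bytes that followed the zero block, carried inside the decompressor

out = sys.stdout.buffer
with open("prefix.bin", "rb") as f:   # hypothetical name for the "compressed file"
    out.write(f.read())               # bytes that preceded the zero block
out.write(b"\x00" * 1024)             # the zero block itself, regenerated for free
out.write(SUFFIX)                     # bytes that followed the zero block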
--
Re:Compression (Score:2)
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
--
Re:Compression (Score:2)
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466 [slashdot.org].
--
Re:Compression (Score:2)
Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.
If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.
--
Re:Compression (Score:2)
It's a*2^(-n), sorry.
--
I'd comment... (Score:2)
Re:An ingenious solution... (Score:2)
You are correct in saying that -any- sequence can be generated by a random number generator. However, there is no general solution to the compression of a sequence S into a smaller sequence C, where a unique 1:1 mapping exists.
HAVING SAID THAT, if you compressed S into C, and could manually determine which of the reverse mappings was S, you -would- be able to "compress" generic random data of any size and of any degree of randomness.
(This is because the user possesses some additional information, I, which is therefore redundant and can be ignored, for the purpose of the compression.)
Re:An ingenious solution... (Score:2)
Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".
If, however, we store all the files in a single indexed file, there's no dead space to account for.
The filenames could be considered another issue. However, again, DOS' root directory is fixed-size and therefore file entries within it take no additional space.
Lastly, it doesn't matter if the guy "admits" to exploiting vague wording. Within the wording of the rules as given, he won. Within the intent of the rules, as given, he lost.
Ergo, BOTH people have a claim to "victory". They can either split the purse, or they can shake hands and donate it to some worthy cause.
In the end, this comes down to something I've been asked all too often myself. If you had to choose, would you rather be happy or right?
IMHO, the guys should cut out the righteous act and stick with being happy. The contestant can be happy that he "defeated" the challenge, and the guy who set the contest up, that he achieved his goal of reducing the snake oil surplus.
Let's face it, those are worthy achievements in themselves. Be satisfied! The money is just causing grief. Neither probably needs it. So why not clear the air, and give the cash to someone who does?!
Re:An ingenious solution... (Score:2)
Hex disk editors are fun tools. You might want to try them, some day, when you're finished trolling and want to get some work done.
The dead space (NOBODY calls it "slack space"! Slack is what people do on Fridays) is a function of sector size, and you can define whatever sector size you like. (The defaults are the "sensible" geometry, but by no means the only one.)
I suggest you also get Peter Norton's books on MSDOS, and print out a copy of the Interrupt listings, which you can find on the Internet. The Norton Guides are extremely handy, too. Oh, and you'll want a copy of Flight Simulator 2.0, which came on a 5.25" floppy. The "copy protection" scheme is to use 1 Kb sectors, which not only stopped DOS reading them directly, it also gave the disk about 256 extra bytes per track.
Omega (Score:2)
He should have used Gregory Chaitin's [auckland.ac.nz] Omega number to generate the challenge file.
Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.
Zooko
no genius required (Score:2)
retrieved from random.org...
However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
you don't need to be a good mathematician, you just need to know how to read a paper written
by a good mathematician... Look for the paper...
Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
IT-15(1): 122-127.
Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
for decoding BCH codes actually provides a general solution to the problem of synthesizing the
shortest linear feedback shift register capable of generating the prescribed finite sequence of
digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
problem is demonstrated...
This is probably in most textbooks on linear codes for encryption and/or error detection and
allows you to recreate the shortest LFSR (linear feedback shift register, a common component
of pseudo random number generators), given a sequence of digits. Of course nobody would
use a non-cryptographically secure PRNG like random() would they?
So you take the number, pass it through this algorithm, find the shortest LFSR, and the
decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.
Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
knows it might just accidentally compress the sequence the challenger gives you (not likely,
but who knows?)...
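For the curious, here is a rough Python sketch of Berlekamp-Massey over GF(2) -- not the paper's exact pseudocode, just the usual textbook form -- returning the linear complexity L and the connection polynomial of the shortest LFSR that generates a given bit sequence:

def berlekamp_massey(bits):
    """Shortest LFSR (over GF(2)) generating `bits`; returns (L, taps c[0..L])."""
    n = len(bits)
    c = [0] * n            # current connection polynomial C(x)
    b = [0] * n            # previous connection polynomial B(x)
    c[0] = b[0] = 1
    L, m = 0, -1
    for i in range(n):
        # discrepancy between the register's prediction and the actual bit
        d = bits[i]
        for j in range(1, L + 1):
            d ^= c[j] & bits[i - j]
        if d:
            t = c[:]
            shift = i - m
            for j in range(n - shift):
                c[j + shift] ^= b[j]
            if 2 * L <= i:
                L, m, b = i + 1 - L, i, t
    return L, c[:L + 1]

# e.g. berlekamp_massey([1, 0, 1, 0, 1, 0]) == (2, [1, 0, 1]), i.e. s[i] = s[i-2]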
Always remember... "stand on the backs of giants" whenever possible
Re:Compression (Score:2)
Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.
A variation which might work (Score:2)
echo $1 | cat - $1
(about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.
This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.
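In Python the same trick looks roughly like this -- far bigger than the 20-byte shell version, of course; it's only meant to show the mechanics, and the path is whatever file name the "compressor" chose:

import os, sys

path = sys.argv[1]                      # the "compressed file", named with the stolen bytes
out = sys.stdout.buffer
out.write(os.path.basename(path).encode("latin-1"))  # bytes smuggled into the file name
with open(path, "rb") as f:
    out.write(f.read())                 # the rest of the original data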
Steve
Re:An ingenious solution... (Score:2)
Mike Goldman didn't make it clear what 'total size' meant - I think most programmers if you asked them would assume the sum of file sizes, if you didn't specify otherwise. His fault, I feel. He should pay the $5000 and write the rules more carefully next time.
BTW, if you do assume that 'total size' is just the sum of file sizes, you can compress anything down to zero bytes. Number each possible sequence of binary digits - the empty string is zero, '0' is one, '1' is two, '00' is three, '01' is four, '10' is five, '11' is six, and so on. Then just create the required number of zero-length files.
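A quick Python sketch of that numbering, just to make the mapping concrete (it doesn't make the zero-length-files trick any less of a cheat):

def index_of(bits: str) -> int:
    """'' -> 0, '0' -> 1, '1' -> 2, '00' -> 3, '01' -> 4, ..."""
    return (2 ** len(bits) - 1) + (int(bits, 2) if bits else 0)

def string_at(k: int) -> str:
    """Inverse: recover the bit string with index k."""
    length = 0
    while k >= 2 ** length:   # skip past the block of strings of each shorter length
        k -= 2 ** length
        length += 1
    return format(k, "b").zfill(length) if length else ""

assert [index_of(s) for s in ["", "0", "1", "00", "01", "10", "11"]] == list(range(7))
assert string_at(5) == "10"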
Mike defined total file size himself (Score:2)
--
Re:Intriguing (Score:2)
> but might deviate it in either direction by about 1% (enough to cover the overhead of the decompressor).
All effectively random strings of digits (e.g. pi, e or the square root of 2) tend to have short strings of digits that repeat randomly. I believe this behavior is called "clustering."
Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:
s/a/311/g
s/b/47/g
s/c/9724/g
...
etc.
That comes out to an average of 10 characters per line (including the end-of-line character); add another 20 characters for overhead, and the decompressing script could be kept to less than 300 characters. As long as each substitution applies to more than 3 places in the original, there would be a net decrease in total file size. And since we are looking at the 26 most common strings of digits, this should, inductively, be possible.
Hmm. I seem to have reinvented pkzip.
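A toy Python sketch of that decompressor (the table here is made up; a real one would be built from the most frequent digit runs in the actual challenge file):

TABLE = {"a": "311", "b": "47", "c": "9724"}   # token -> original digit run (hypothetical)

def decompress(text: str) -> str:
    for token, digits in TABLE.items():
        text = text.replace(token, digits)     # expand each token back into its run
    return text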
Geoff
Re:Another tack... (Score:2)
Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
--
Re:PAY UP MIKE (Score:2)
If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this sleight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplied to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.
The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.
Good deal for contestant (Score:2)
The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.
---
*Read* the article before you post. (Score:2)
--
Re:The page has been removed by GeoCities (Score:2)
Re:sorry! (Score:2)
Re:Compression (Score:2)
i.e. let's say 1001010111001010101 (I know this number is too short) is not there; replace the 1000 zero bits with that number.
But I fear it only gets more difficult to calculate the probabilities of a "win"; the outcome will stay the same.
Re:An ingenious solution... (Score:2)
And anyways, the guy basically admits that he cheated by exploiting the vague wording of the challenge.
I've got to hand it to Patrick, though, because if you do take the challenge out of context, he basically won... it depends on what the words "compressor" and "decompressor" mean.
Re:An ingenious solution... (Score:2)
Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".
I never said that it would take additional FAT space, I'm talking slack space... if you knew anything about the FAT filesystem (which you obviously don't), you would know that it's not sectors you have to worry about, it's clusters (GROUPS of sectors). There are only so many clusters because of FAT's fixed size... I was being QUITE generous in saying that 512 bytes is a practical average. On a 1.2 GB MS-DOS formatted drive (not FAT32, mind you, we're talking DOS here, not Windows), the cluster size ends up being something like 16K, leaving the average slack space to be 8K. So I was being VERY generous there.
Re:That wasn't in the challenge (Score:2)
I guess I could try it, though.
Re:That wasn't in the challenge (Score:4)
IF I'm not mistaken. (Score:2)
What Patrick did was take a small bit of information out of the file, and, rather than having a table of 'offsets' telling his decompressor where to insert 5's, he broke it up into a bunch of files, effectively using the directory/inode tables to store this information.
Had he not insisted on a single file, I'd say Patrick here would have technically won, though not at all in the spirit of the competition.
Re:Already compressed data? (Score:2)
Re:IF I'm not mistaken. (Score:2)
Guy should cough up 5 big ones then.
Re:An ingenious solution... (Score:2)
The challenger should have made it clear that these costs would be counted against the reconstructed filesize (or simply not allowed any deviation from the "one file" rule). The decompressor should operate by having all the files piped into stdin, thus eliminating filesystem state info entirely.
Re:Another tack... (Score:2)
Re:IF I'm not mistaken. (Score:2)
So, after some consideration, I would say that Patrick didn't succeed, even considering the specific wording of the challenge (and the discussion in the emails).
Re:it's "file size", and that's all it is (Score:2)
In order to meet the challenge you must provide to me a compressed file and a decompressor which together total less than the size of this original.dat, and which can generate a file identical to original.dat.
He doesn't say "file size" in this or any other passage. It is Patrick who says (and assumes) this.
Also, Patrick specifically addresses the issue of filename based compression in this passage.
It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.
Thus, I think it is Patrick who hadn't considered the fact that his "decompressor" is not just the 'dfi' script, but also the filename information used to indicate file boundaries and ordering. Like I said, his algorithm (much less his actual implementation) could not properly reassemble the file if it were fed all the properly ordered "compressed" files as an input stream (through a pipe, say), since it wouldn't know where to place the missing fives. So given that Mike said "total size", in all cases, when discussing and offering the challenge, I think his position is safe.
Now, I do think that Patrick was reasonably clever, and that Mike was overly smug in both the challenge and the initial news posting. And in the exchanges Patrick did explicitly ask about file sizes, but I do not fault Mike for not clarifying further, since he had no a priori knowledge of the scheme, and therefore didn't know he would have to be so specific. But his use of terminology was consistent and, I think, appropriate to protect against this type of scheme.
Anyway, that's just my take on it. Mike can hopefully learn some humility from this, and Patrick can learn that his "decompressor" is more than just the sum of its file bytes.
Re:Mike defined total file size himself (Score:3)
While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.
Re:Are you for lawyers or against? (Score:2)
My momma told me never to take a sucker bet. Mike got caught by his OWN sucker bet...which takes a) greed and b) foolishness. He got lucky that the person who took him was such a good sport about it.
It'd be really nice to see Mike donate the money to ACM or EFF or somebody like that...that'd be a pretty classy way of taking care of this.
Re:Are you for lawyers or against? (Score:2)
Re:Compression & Time (Score:2)
Decompression can take roughly the lifetime of the universe (for large files).
Interestingly, my brilliant strategy is also an encryption algorithm.
bukra fil mish mish
-
Monitor the Web, or Track your site!
Re:Compression & Time (Score:2)
bukra fil mish mish
-
Monitor the Web, or Track your site!
Lzip! (Score:2)
----------
Re:Compression (Score:2)
Maybe in Squishdot, as long as rexec is used, of course. You could import StructuredText from Zope and your post could act like a Wiki.
- *this* would look like this
- **this** would look like this
The above would become an actual HTML bulleted list.
http://this.would.be/a/link
|| Instant Table ||
|| 1 || 2 ||
Ah, well.
Re:Are you for lawyers or against? (Score:2)
Data point: The correlation for me is: modern US law as practiced is harmful and counterproductive; this contest was clearly stated, and clearly won. Mike should pay up.
Re:Are you for lawyers or against? (Score:2)
The proviso given (which should not have been) was that the compressed file could be provided in multiple parts. What he forgot is that an ordered sequence of files has more information than just the sum of their contents. Oops.
This is not a wording problem, but one of a poorly thought out bet. This guy gave 1:50 odds that no one could compress truly random data. He chose to allow conditions that made the bet doable, and then did not want to pay up when the bet was won.
Actually, even without using multiple files, I think the bet's doable for any single given data file (though you cannot write a generalized compressor for all truly random data). I should come up with the submission to solve this guy's challenge.
Sucker Bets and Real Proposition Gambling (Score:2)
Sky, an extraordinary proposition gambler, responded by countering. He covered the offeror's eyes and bet he couldn't say what color Sky's tie was. "No bet."
I say this to point out the difference between sucker betting and proposition gambling. It is a sucker bet to offer a bet challenging a mark to do something that is, in fact, impossible. In proposition gambling, the offeror gives the mark a proposition they are unlikely to win, or that is worded in such a clever way that winning it leaves the mark too embarrassed, entertained or impressed by the cleverness to attempt to challenge having lost.
In this case, the first offer (compressing high-entropy data) was a sucker bet. Shame on him. The response, doing the sucker bet using multiple files, was a sweet proposition gamble. Shame on him. Once taken, the offeror-now-mark did weasel, but as I see it, ultimately welched.
I admit that I, too, didn't consider the import of the multiple files on first reading (but, hey, I didn't have $5K against $100 riding on the result).
At the end of the day, the sucker-bettor was the mark, and he had an earful of cider.
No sympathy.
A Hypothetical Question (Score:2)
This is considered "cheating" because he used 220 EOFs that are not counted in the filesize.
So, my question is this. If a contestant submits a single compressed file and a single decompressor file which are just 1 byte under the original file, does Mike void the entry because of the extra filesystem space of two files? If not, why not? If so, shouldn't the rules say that specifically, as two is the minimum total number of files one can submit?
--
gnfnrf
Intriguing (Score:3)
For example, a file full of 0's, is perfectly possible from a true random number generator, and just as likely as any other set of numbers. As any good geek knows, a file full of 0's compresses very very well.
In this case, I think the challenge has one major problem, and that is that there is no real time limit on the entry, and no requirement that the decompressor work effectively in general, it can be entirely specialised.
This means that the problem *may* be solvable, simply by throwing ungodly amounts of CPU power at it. With the appropriate code, it would be possible to simply hunt for unintentional patterns in the data and hope. If you're lucky, there will be enough indexable patterns present that it will outweigh the cost of the indexing, and the code necessary to replace the index points.
If you're unlucky, the data will have no patterns in the space you searched, and therefore you're out of luck.
This is why the guy noted that if he got a big enough file, he could win the competition. As the filesize increases, it becomes increasingly likely that there are, accidentally, enough patterns in the data to cover the space required for the decompressor and the dictionary. Of course, as the filesize increases, you also have to burn considerably more CPU in order to locate the patterns in the first place.
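As a first step, that brute-force hunt might look something like this toy Python sketch (it only counts repeated 3-byte substrings; a serious attempt would search much larger pattern spaces and weigh the dictionary cost against the savings):

from collections import Counter

def frequent_ngrams(data: bytes, n: int = 3, top: int = 10):
    """Most common n-byte substrings in `data` -- candidate dictionary entries."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return counts.most_common(top)

# e.g. frequent_ngrams(open("original.dat", "rb").read())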
Personally I think it would make an excellent project for a research team, not particularly for the money, but in order to create some really fast, specialised code for locating patterns in assumedly random data.
I am not aware of much effort in that area; most compression research goes into identifying patterns inherent in the data type: waveforms in audio, visual patterns in images, duplicate strings in text and binary files. It would be very interesting to see just how effective fast analysis code and heavy CPU is against the problem.
I suspect there may well be some crossover here with cryptanalysis, since one of the primary jobs in breaking crypto is finding the patterns that differentiate the ciphered data from pure randomness.
Any links appreciated.
Almost. (Score:2)
By and large you're right that random data can't be compressed. But getting really random data is a far-from-trivial undertaking.
Another tack... (Score:2)
My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way? Of course the chances of finding these type of numbers are very slim, but they *are* infinitely large spaces, and the compression would be ultra-compact.
Maybe something for a quantum computer to chew on anyway...
Here is a Real Challenge (Score:2)
SuperID
Free Database Hosting [freesql.org]
The problem (Score:2)
The information that could be considered "removed" by the files would basically be the start and end offsets of the chunks, if they were all in one file... which would naturally add up to more than the original (219 * 2 additional info). The "missing" information escaped into the unix shell and filesystem, through the facility of "cat" being able to print out a block of bytes given a single piece of info.
Re:Compression (Score:2)
Re:One born every minute (Score:2)
But this solution WOULD work (Score:2)
(As noted in another post) As long as you pick an initial file length long enough that 2% of the time there statistically should be enough repeats of some 3-byte (or longer) array to equal the number of bytes in your decompressor (which would be roughly the same size as Patrick's), you will win the challenge enough times to make it profitable.
Re:Compression (Score:2)
Not if you expand on Patrick's original idea: take the series of 1's (or zeros; your compressor should determine the longest continuous series of each and use the longer) and replace it with EOF. Since EOF can't appear in the original file except at the end, your decompressor will know that this is the position to insert the original series. You could also repeat this as many times as you like, provided the replaced series is at least as long as the length of the EOF marker plus the length of each new file name created plus the space needed to store the length of the series.
Re:Are you for lawyers or against? (Score:2)
[ ... ]
Specificity is critically important when you're communicating with a computer. But over-specificity is actually a deterrent to communicating with another human. Ambiguity is a common mechanism by which we humans learn. So, for example, when I say "cat" you have one image in your mind and I have another image in my mind. But when we both discover that the thing that I was talking about is different than the thing you were thinking, we'll have expanded our understanding of what the word "cat" comprises. Just now, I was thinking of a cheetah. Do you remember when you learned that cheetahs and lions and tigers were also cats?
I'm overstating some things here because it's been about 8 years since I took my philosophy of language class. But to sum up, generally human communication works much more efficiently when there's some ambiguity in it.
But it's not just in the event of silly cases. You are cornered by the silliness everyday. Clearly you use a computer connected to the Internet. Do you think that the terms of service that govern your connection protect you? The point is that if we can't figure out a better way to deal with specificity and intent, then we consign ourselves to the rules that are written by those who can afford to write them.
I don't have an answer for this either, but I'd like to think that the discussion isn't so easily dismissed.
Are you for lawyers or against? (Score:5)
If you are for the challengee, then you believe that exact specificity is more important than intent. And you inadvertently empower the legal profession to profit from being better able to specify rules, etc. Which, in turn, gives those with money even more power. They can write the rules by which your interaction with them will be governed. You can be certain that the rules will be written to the advantage of the guy with the money.
If, on the other hand, you are for the challenger, then you believe that his intent was more important. (Considering that the challenge came from the compression FAQ, the exact definition of "compression" should not have been vague. It seems clear to me that his intent was obvious, even though his challenge wasn't very specific.) Still, there is a problem with valuing intent over specificity. That is, it's easy to misrepresent intent after the fact. Thus the need to clearly specify our intentions prior to anything we do. That being the case, with accurate documentation, it seems possible to me that we can discern the original intent w/out having to specify everything legalistically. For example, I think that I was able to clearly determine the challenger's intent well after the fact, and I did this having read the challengee's representation of the events.
Now you may say, in rebuttal, "Who cares who I choose? I don't get to make the law." But think of yourself as the jury trying to determine to whom you would award the money. You may one day find yourself on a jury and trying to determine who should be awarded money. I see this as a microcosm of exactly how our court system has gotten into the state that it's in. Specificity is clearly more important than intent. And it's that way because juries award that way. Since juries are comprised of members of society, it's not a far stretch to say that society as a whole regards specificity as more important than intent.
Is there any chance that we can change this? And if we did, would it be better?
Re:Intriguing (Score:2)
Re:The page has been removed by GeoCities (Score:2)
And here [fibrespeed.net] if needed.
Re:An ingenious solution... (Score:2)
Re:There is no solution... (Score:2)
Re:But then what about "compressible" files. (Score:2)
The challenge specifies that you can choose your compressor after you had a peek at the datafile. If it turns out to be compressible enough by a known algorithm, just ship that compressor. If it isn't, use your meta-compressor.
However, I have a gut feeling that the uncompressible files are too numerous for this meta-compressor to work: if 99% of all possible files are uncompressible, then the index number of the file would not be that much smaller than the file itself, and you couldn't fit the meta-compressor's code (which would be rather complex...) in the small space saving.
As others have pointed out, a better alternative would be to simply "play the odds": make an algorithm which only works in 10% of the cases. If the file can't be compressed, resubmit an entry until you get one that works. On average you would have spent $1000, and won $5000!
Re:IF I'm not mistaken. (Score:2)
Re:PAY UP MIKE (Score:2)
A Solution (Score:2)
Re:Compression (Score:2)
PAY UP MIKE (Score:3)
Patrick: I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file
Mike: Sure.
Pay up Mike. Pay up. You were too careless to plan out the rules and think about your intent before you threw 5k into the sink. Cough up the cash or lose all honor.
it's a bad challenge regardless of wording. (Score:2)
Let us disregard the issue of money and the issue of a specific answer's acceptability, and analyze the question.
In my opinion, any person, given enough time to come up to speed on the subject, plus the time it takes to come up with and implement their solution (all multiplied by the overhead of maintaining themselves with eating, resting, etc. during that time), and given that they live that long, can come up with a winning solution to Mike's challenge, due to a flaw not in the wording of the challenge, but in Mike's reasoning behind the challenge.
To put a lot of discussion to rest (in my mind), just read the relevant portion of the FAQ in which the challenge is found, i.e. the section entitled "9.2 The counting argument [faqs.org]"
Here you'll find, clearly stated, the flaw in trying to find any one algorithm which will always compress data of a fixed length to less than that length. Also you'll see the decent challenge by Steve Tate, which paraphrased says:
Mike Goldman's addendum challenge effectively states:
Reading it in its full context, which apparently a lot of slashdotters didn't, it immediately gives itself up as a challenge which does not hinge on the mathematical theorem it is presented with. i.e.
Given, there is no one algorithm which can compress all data, nor a subset of data selected by being a fixed length.
Therefore, if you give me an algorithm which claims to compress all data or all data of such a subset, I can give you at least one counter example.
It does not follow that I can give you one counter example to all compression algorithms for a given length (of your choice).
The implication of the first challenge is that the poser is confident in both the mathematical theory and either his analytical abilities when studying the given algorithm, or in his ability to test and check against the given algorithm in a sufficient period of time.
Conversely, the implication in the second challenge is that Mike in fact is the possessor of a set of super-data, of which any sub-set cannot be compressed by any algorithm.
Since we live in a time when any data sufficiently limited in size yet large enough to offer decent gains when compressed (the challenger may pick that size) can be analyzed fully and with great speed, I think Mike has made a bit of a fool of himself.
I am not saying that Mike can't supply <<the uncompressible data from hell>>, but he will have to go to some lengths to get it. It will likely be selected to be large enough so Mike cannot generate it himself without an algorithm. It cannot be generated with a small input set, or that set could be discovered and, when coupled with the generator, would defeat the challenge. It may be generated from a large set, but the result must show no patterns that can be bitstuffed (see networking) against or more simply generated and indexed for, nor may the source set contain any of the same properties. It may be possible that Mike is collecting this data from a source, i.e. digitizing random radio signals; we may all be interested in any source data which cannot by any reversible algorithm be represented/stored more succinctly, particularly without any regard for retaining its meaning while compressed.
In summary, if Mike doesn't want to be giving away money, then he doesn't understand his own challenge's failure to be based on any mathematical certainty (unless I'm wrong and he has some super-data). However, Patrick's solution was clearly not compression, though it was the start of a decent form of compression. Had he concatenated his file-set and headed it with information as to each file's length in order, he just might have saved enough space for the decompression, particularly if he picked a sufficiently large recurring pattern that could be generated rather than having to be stored verbatim.
However, I might just be able to claim that, if the universe is deterministic, if the input conditions for the creation of the universe could be represented in less data than a generated block of data which can be represented in that universe, and if I were given enough resources in which to recreate the universe, I should be able to offer the input conditions of the universe and an index into its ongoing generation which could point to a generated block of data which exactly matches one provided. I just might be able to give you the process, its inputs, and your data before you've finished generating the data for my participation in the challenge, if in my generated universe I could significantly warp time without affecting the results.
-Daniel
That last part's an odd joke
Re:Compression (Score:2)
I'm surprised at the number of people using the FAQ's "pigeon hole" theorem backwards. The theorem states that a reversible algorithm cannot be used to represent every piece of data as a piece of data smaller than itself. It does not address being able to pick an algorithm well suited to just one specific piece of data. If there is any data that no compression algorithm can be applied to, tailored to, or made for, that's a pretty amazing piece of data, and we should all be made aware of it, Mike.
-Daniel
"to be disbursed at my sole discretion" (Score:2)
Goldman's mistake was two fold:
1) He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.
2) This he failed to do so within an arena where the distinction between "tricks" and "solutions" is frequently nonexistent: mathematics.
Re:"to be disbursed at my sole discretion" (Score:2)
2) This he failed to do within an arena where the distinction between "tricks" and "solutions" is frequently nonexistent: mathematics.
Such mental typos being made at my decaffeinated 6:54 AM brain's sole discretion.
Re:"to be disbursed at my sole discretion" (Score:2)
Re:Reality? or Data Hiding? (Score:2)
The point of the story is to be careful when issuing so-called "impossible" challenges. Don't leave loopholes in the rules!
Re:Bar room bets? (Score:3)
When Goldman gleefully posted to the newsgroup that someone had accepted his challenge, he said "some people just don't understand information theory too well". Yeah, and some people just don't understand issuing challenges very well, either!
When you make a challenge like that you absolutely must be sure to word the rules carefully and try to avoid leaving any loopholes. In this case, all Goldman really needed to do was tack a sentence at the end of his original challenge saying "the data must actually be compressed according to the definition of 'compression' contained in this FAQ. Challengers who merely point out loopholes or exploits in the rules of the challenge will not be rewarded".
The moment Mr. Craig started inquiring about the specifics of the rules and saying "wait and see" regarding his methods, that should have been a warning sign to Goldman that Craig had some trick up his sleeve. But no, Goldman was just so in love with the idea that he had found some "impossible" task which suckers might try to undertake that he couldn't be bothered to think about such things.
What's even more embarrassing to Goldman is his initial reaction to Craig's submission:
Even though Craig had already asked if more than one file was okay and Goldman had said yes! And Goldman's following point:
is even more ridiculous given that Goldman had already said "sure" in response to Craig's question about multiple files whose "total file size" (which does not include directory entries) is less than the size of the original data file. Goldman tried to change the rules rather than admit he was sloppy in setting them up. Goldman also says "I think that I did not make a mistake", even though he admits in the same message that he needs to restate the rules:
"Perhaps"? How about "definitely"! And any "miscommunication" was entirely one-way. Craig was very sly and clever, and Goldman was too full of himself to realize what the fuck was going on until after the fact. A "pompous ass" indeed.
Re:Pseudo-random generators (Score:2)
I think in his post to comp.compression Mike wrote that his random data had come from random.org [random.org]. Random.org doesn't use a PRNG algorithm; they get random data from atmospheric radio noise (tune your AM radio to a non-station and you'll hear what that means). To remove any bias in the data they postprocess the noise by removing high-order bits, and the low-order bits are shifted to normal using a bit-discarding algorithm.
The challenge was *not* impossible. (Score:2)
The comp.compression FAQ discusses that for any given compression algorithm, it is possible to craft an input file that cannot be compressed.
This is different than the challenge proposed, where Mike, the challenger, provided a file of size X, and Patrick, the challengee, was required to provide a de-compression program of size Y and a data file of size Z, where X > Y+Z.
The difference here is that Patrick was allowed to hand-craft the compression method to the data file, but Mike was not allowed to hand-craft the original data to the compression routine.
Basically, Mike was wrong, and should pay the $5K.
Re:Reality? or Data Hiding? (Score:2)
Given a sufficiently large source file of truly random data, I can create a binary and a data file which can recreate that one specific source file, losslessly, with my binary+data being smaller than the source. That is, I can create a routine that will work for just that one specific source file, no others.
The problem of 'Lossless Data Compression' refers to the opposite case: given any general purpose routine to compress files, I can create a file that it cannot compress. That is the opposite of the challenge given, and actually is impossible.
As stated, even with the restriction of a single compressed file, the challenge *IS* winnable.
Re:The bogus ideas piggie bank (Score:2)
It is impossible to write a routine that can losslessly compress every possible 3-megabyte file that could ever be generated.
It is not impossible to write a routine that can compress the one specific file that Mike supplied to you. That is, it is quite likely that you could write a program that can losslessly compress that one single input set, such that:
(length of decompressor) + (length of compressed data) < (length of original file)
That would only be true for the general case compressor-decompressor. I could write a routine that would work for that one specific input file, but fail miserably on any other input. This would win me the $5K, but be valueless for any other purpose.
The mistake both you and Mike make is that while writing a program for the general case of compressing random data is impossible, it is very different than writing a program to compress one specific file of 'random' data.
Re:Compression (Score:2)
BTW, you can use &lt; to get "<", like this:
<
This is very very standard stuff in HTML.
--
Re:Compression (Score:2)
Strange... Hey ho, live and learn I guess.
--
Re:Compression (Score:2)
print 'so what do the other modes do?'
this_is_the_example_for('code')
return("cool, works the way I'd hope")
if __name__ == '__main__':
test()
--
Re:Compression (Score:2)
Oh, hang on, that's not right.
--
Re:Are you for lawyers or against? (Score:2)
I was very interested by your comments regarding the difference between specification and intent. However, I think there are some other factors in play here.
You mention intent - let's look at the challenger's intent. It was NOT to find an amazing new compression scheme, in other words this contest was not meant to sponsor research in the field. He set the challenge PURELY to get one over on everyone else, to shout "told you so - ha ha ha" like some kiddie cracker. As he's essentially trying to catch people out, I think the onus is on HIM to make sure he specifies things EXACTLY. What has happened is the smartass challenger has had his whole scheme flipped up and fired back at him, which I personally relish and applaud.
I think that specificity is important. Any software engineer will tell you that - the number one quote from a user when you deliver your system?? "Oh, I didn't MEAN I wanted it to do that...". You may think you know what the other party intended, but how can you be sure? The English language (in common use) is suitably ambiguous to make that impossible. (An aside - I'd love to know if developers in non-English-speaking countries have the same problems; is English particularly ambiguous??). A lawyer would tell you that's exactly what they are trying to do, deduce intent from the actual written law (or contract or whatever). The two sides have differing interpretations, which are simply two different models of the original intent, deduced from the same written document. Where do you draw the line between the "quest for truth" and 2-bit scumbags just trying to pick holes in perfectly good laws? The simple fact is I don't think you can.
Which leads us to where we are - I don't really like it when a silly case wins in court due to a loophole, but I see no alternative. People will disagree, they will always disagree, so how do you settle it? If you don't go with exact interpretation of the letter of the law, you end up having to go with an interpretation of the spirit of the law, which puts total (and to me unacceptable) power in the hands of whoever happens to be making that interpretation. At least an argument over english linguistics is transparent and explainable. Hopefully the inclusion of a jury in many cases helps to add the "human touch" to the process - at least I know it did when I served recently.
Re:Compression (Score:2)
Furthermore, I reckon I can write a program G that, when given any input file A of length >= X, will create a decompressor D and compressed file C such that:
D(C) = A
size (D) + size (C) size (A)
Don't know what X is yet; I suspect around 1 meg is fine though.
Anyone willing to bet either way?
Re:Compression (Score:2)
the condition should read:
size(D) + size(C) "is less than" size (A)
Slightly over-zealous HTML filter, methinks....
Re:Compression (Score:2)
Re:Compression (Score:2)
Anyone willing to bet either way?
Yes, assuming appropriate restrictions against storing information in the compressed file's name or attributes, I'll give you N:1 odds that you cannot write the program G, where N is any positive number. X can also be any number.
Or, save yourself time, don't try to write the program at all, just send me money.
Re:Compression (Score:2)
It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.
However, your decompression program would have to be smaller than M - log_2 (N) = log_2 (2^M/N) bits, where M is the number of bits in the compression-resistant file, and N is the number of compression-resistant files preceding it in your sort order. So, the number of compression-resistant files would have to be small relative to the number of possible files if you want more than a bit to write your program. Since this is not the case, you would fail.
Re:Compression (Score:2)
Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out. The bet can be won; not in every instance of course, but in the long run.
That reflects a mistake on the part of Mike Goldman. If you want your decompressor and compressed file to be a total of length N, you can store log_2(N) bits of information in how you set the lengths of the two files. If you have more than two files, you can store even more information. Mike should have required that (number of bits in original data)+log_2(length of original data) > sum of number of bits in each compressed file and decompressor + log_2(product of lengths of each file and of decompressor), and he should have permitted himself to rename or change the file attributes of any file he wished in case information was being stored therein.
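Back-of-the-envelope: if the decompressor and compressed file must total roughly N bytes (take N on the order of the 3145728-byte original from the challenge), the choice of split point alone is worth about log2(N) extra bits that never show up in the byte counts:

import math

N = 3145728                    # rough total length of decompressor + compressed file
extra_bits = math.log2(N + 1)  # the split point is one of N + 1 possibilities
print(f"~{extra_bits:.1f} bits hidden in the choice of file lengths")  # about 21.6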
Re:Compression (Score:2)
Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.
In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.
No, the machine architecture doesn't matter. The bet can be won only if you are astronomically lucky.
I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.
The page has been removed by GeoCities (Score:5)
--
Compression (Score:4)
However, to make money, you don't have to.
All that is required is a compressor that saves 1 bit on better than 1 in 50 attempts.
For example, suppose we can save 1 bit on 50% of the attempts, even if the other 50% of the attempts result in catastrophically large files. On average we've lost the compression war.
However, on 100 attempts we've paid $100 x 100, or $10,000, but we've won on half of these, netting us $250,000, or a cool $240,000 profit.
Anyone want to demonstrate this?
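The arithmetic, for anyone who wants to check it (the 50% success rate is of course the hypothetical part):

attempts = 100
entry_fee, prize, win_rate = 100, 5000, 0.5   # dollars per try, payout, assumed success rate

cost = attempts * entry_fee                   # 10000 paid in entry fees
winnings = attempts * win_rate * prize        # 250000.0 won
print(cost, winnings, winnings - cost)        # 10000 250000.0 240000.0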
Re:Compression (Score:3)
"Since this is not the case, ..."
Sorry, not so. The proportion of compression-resistant files necessarily decreases as file length increases. To see this, consider that if a file had a nice block of 1024 bytes of zeroes somewhere in it, we could easily write a simple program that wrote the bytes before that block, then 1024 bytes of zeroes, then the bytes after that block. This program would easily fit in less than 1024 bytes, plus the bytes before and after the block of zeroes. So such a file is compressible. I will show that the proportion of files that are not compressible like this decreases as file length increases.
To make this easy, I'll count as compressible just the files where a block of zeroes begins at a multiple of 1024 bytes from the beginning of the file. The chance that a single 1024-byte block (selected randomly from a uniform distribution) is not all zeroes is (2^8192-1)/2^8192, a very tiny bit less than one. The chance that all n of the file's 1024-byte blocks are not all zeroes is ((2^8192-1)/2^8192)^n. Obviously this approaches zero as n increases. Thus the proportion of files that are not compressible decreases as file length increases. Pick a large enough length, and you are virtually guaranteed there is a 1024-byte block of zeroes in it somewhere.
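A scaled-down illustration of that limit (the real argument uses 8192-bit blocks, whose probabilities underflow ordinary floating point, so this uses 16-bit blocks just to show the trend):

BLOCK_BITS = 16                           # toy block size; the argument above uses 8192
p_block_not_zero = 1 - 2 ** -BLOCK_BITS   # chance a single block is not all zeros

for n_blocks in (10_000, 100_000, 1_000_000):
    p_no_zero_block = p_block_not_zero ** n_blocks
    print(n_blocks, f"{p_no_zero_block:.4f}")
# the printed probabilities shrink toward zero as the file grows (roughly 0.86, 0.22, 0.00)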
When we open compression up to a wider range of techniques, the same effect occurs; the proportion of files that are not compressible under any given, reasonable technique decreases as file length increases. This makes Goldman's challenge a bad offer for him -- the fact that the average file size change must be positive under any compression technique in the limit does not mean the probability of a positive change in a randomly selected file is large enough to give him a positive return on the challenge.
Re:IF I'm not mistaken. (Score:3)
> total file size is less than the original uncompressed file and from
> which I can regenerate the original uncompressed file
>
> Patrick
Sure -- but you send me a *decompressor*, I don't need the compressor.
Sure means Sure. As far as I can tell this is acceptance of Patrick's request. As Mike is the guy inventing the rules, I don't see why this wouldn't be binding.
That wasn't in the challenge (Score:4)
If the guy had read the FAQ, he would know that such tricks are not compression
Did the rules as stated say that the solution had to comply with the principle laid out in the FAQ? Not that I can see. If you're going to put up a $5,000 challenge, you damn well better post all the rules.
For instance: I had a friend who entered a paper airplane contest. He signed up for the "maximum distance" part. The only rules were that the contest organizers supplied the single piece of paper, you had five minutes to fold or tear it however you want to, and then everyone threw them at the same time. The one that stops moving farthest from the launch line wins.
He waited until 4:45 into the time, then crumpled the paper up into a ball and threw it. While everyone else's planes were circling lazily around, his went straight, landed, and rolled several more feet.
While you can debate all you want about whether his wad of paper counted as a "paper airplane" -- and we did debate (hey, it was college) -- he did satisfy the rules of the contest. Had there been enough money on the line for someone to threaten legal action, I suspect he would have won.