Forgot your password?
The Almighty Buck

How I Completed The $5000 Compression Challenge 414

Posted by timothy
from the no-strange-but-true-icon dept.
Patrick Craig points to this page which details (in email) the strange life of an online challenge as posed, clarified and (at least to some eyes) met successfully. This should give pause to anyone willing to pose -- or accept -- seemingly impossible problems. In short, Patrick's not $5,000 richer, but you may think he deserves to be.
This discussion has been archived. No new comments can be posted.

How I Completed The $5000 Compression Challenge

Comments Filter:
  • which isn't true because if this were true every number would be a random number (because every number could potentially be generated by this number-generator).

    But every number CAN potentially be produced by a random number generator. The hypothetical file full of zeros doesn't violate this because you only know you got all 0s after the fact. Watching the bitstream, each bit that came out had a 50% probability of being a 1, it just happens that none of them were over that particular sample period.

    Thus, given 2000 (or any arbitarry number) if 1 Mbit samples (also arbitrary) from a true random number generator,upon examination, one of them MAY contain all 0s. You still have no way to predict the contents of the others (the all 0s file has exactly the same probability as any other arrangement of bits), and you have no basis to predict the content of the remaining files.

    If you flip a coin 100 times yielding all heads (assuming a non-biased coin), the odds of heads on the next flip is 1/2.

  • It sounds to me like the person offering the $5000 was scamming everyone because it's theoretically impossible to win.

    The only place I've seen that challenge is in the compression FAQ. It's in the same section of the FAQ that explains the impossability. It's hard to call something a scam when the 'brochure' explains that you will certainly loose your money.

    It's even more understandable when you consider that the USPTO has issued several IMPOSSIBLE patents for compressors that work on random data (recursivly no less)!

  • I don't know about you, but if I came across a coin that flipped heads 100 times in a row, I'd strongly suspect a biased coin.

    Given no other information, I would suspect the same. That's why I had to specify. Replace coin with random bit generator if you prefer.

  • The problem is that the definition of "total file size" is subject to interpretation. If the guy holding the contest had said "total file size as reported by ls -l", he'd be screwed. But he could argue that "total file size" means the number of bytes of disk used to store the file.

    Although, by both providing a certain file AND specifying the file size (3145728 bytes), one could argue that he *implicitly* defined file size as the number of bytes within the file. Hence, he loses.

  • He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.

    However, he was also asking for $100 up front, to encourage only serious submissions. I don't think he'd get many submissions if he tried both tactics. :)

  • There are 2^n ways to arrange bits. The number of ways to arrange all possible sequences of bits ranging from since 1 to (n-1) is 2^(n-1).

    There is *at least* one sequence that cannot, by an arbitrary method, be reduced to a shorter length. On top of that, you need to use some bits for the compressor.

    OTOH, if you can include "external" information in your compressor (either the algorithm or a function), you can have an algorithm that will compress any stream, and compress your favorite stream to a single bit. As an example, the first bit is 1 (and the only bit) if it is your stream, and 0 followed by the entire stream for any other stream.


  • As the length of the file increases, the number of bits needed to *specify the position* of this run of zeroes also increases. And you're going to have to record that position somewhere in your decompressor or compressed file.

    Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out.

    The bet can be won; not in every instance of course, but in the long run.


  • Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.

    In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.


  • No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.
    I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.

    If you allow me to specify the machine model ahead of time, you'd lose. This was observed by Anonymous Coward in article 466 [].


  • As understood the challenge, there was no requirement that the data I give you be random, though I might ordinarily give random data for such a challenge. If that was the machine model, however, I'd just give you a 0 followed by random data, and you would lose.

    Good point, the proposed machine model is not Turing complete and you can analyze it ahead of time.

    If I remember my Kolmogorov complexity right, the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n), where a and b depend on the (Turing complete) machine model. If the model is chosen wisely to make the constants small, it should be possible to compress more than 1/50 of files by one bit. But it's not nearly as simple as I thought.


  • the probability that a random file can be generated by a program that's n bits shorter is roughly a*b^(-n)

    It's a*2^(-n), sorry.


  • ...but Yahoo! seems to have taken the site down. Thank you, Yahoo!
  • No. There is a difference between random DATA and a random number GENERATOR. To be random, data must follow some mathematical rules, one of which is that the probability distribution is equal at all points. Another is that the next point's probability is NOT affected by the prior point's.

    You are correct, however, in saying that -any- sequence can be generated by a random number generator. However, there is no general solution to the compression of a sequence S into a smaller sequence C, where a unique 1:1 mapping exists.

    HAVING SAID THAT, if you compressed S into C, and could manually determine which of the reverse mappings was S, you -would- be able to "compress" generic random data of any size and of any degree of randomness.

    (This is because the user posesses some additional information, I, which is therefore redundant and can be ignored, for the purpose of the compression.)

  • The 218 files would have taken no additional space on the FAT, as it's fixed size.

    Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".

    If, however, we store all the files in a single indexed file, there's no dead space to account for.

    The filenames could be considered another issue. However, again, DOS' root directory is fixed-size and therefore file entries within it take no additional space.

    Lastly, it doesn't matter if the guy "admits" to exploiting vague wording. Within the wording of the rules as given, he won. Within the intent of the rules, as given, he lost.

    Ergo, BOTH people have a claim to "victory". They can either split the purse, or they can shake hands and donate it to some worthy cause.

    In the end, this comes down to something I've been asked all too often myself. If you had to choose, would you rather be happy or right?

    IMHO, the guys should cut out the righteous act, and stick with being happy. The contestant that he "defeated" the challange, and the guy who set the contest up, that he achieved his goal of reducing the snake oil surplus.

    Let's face it, those are worthy achievements in themselves. Be satisfied! The money is just causing grief. Neither probably needs it. So why not clear the air, and give the cash to someone who does?!

  • Yawn. I know perfectly well that the FAT uses clusters of sectors. So? You can configure the disk geometry in the boot sector.

    Hex disk editors are fun tools. You might want to try them, some day, when you're finished trolling and want to get some work done.

    The dead space (NOBODY calls it "slack space"! Slack is what people do on Fridays) is a function of sector size, and you can define whatever sector size you like. (The defaults are the "sensible" geometry, but by no means the only one.)

    I suggest you also get Peter Norton's books on MSDOS, and print out a copy of the Interrupt listings, which you can find on the Internet. The Norton Guides are extremely handy, too. Oh, and you'll want a copy of Flight Simulator 2.0, which came on a 5.25" floppy. The "copy protection" scheme is to use 1 Kb sectors, which not only stopped DOS reading them directly, it also gave the disk about 256 extra bytes per track.

  • by Zooko (2210)

    He should have used Gregory Chaitin's [] Omega number to generate the challenge file.

    Actually I really don't understand Chaitin's work well enough to know if that would have saved him the $5000, but at least he (and the challenger) would have learned something about algorithmic complexity theory.


  • As people have already mentioned, the numbers are probably not pseudo-random as they were
    retrieved from

    However, if someone decides to be cute and repeat this challenge with a pseudorandom sequence,
    you don't need to be a good mathmetician, you just need to know how to read a paper written
    by a good mathmetician... Look for the paper...

    Massey, J. 1969. Shift-Register Synthesis and BCH Decoding. IEEE Transactions on Information Theory.
    IT-15(1): 122-127.

    Abstract -- It is shown in this paper that the iterative algorithm introduced by Berlekamp
    for decoding BCH codes actually provides a general solution to the problem of synthesizing the
    shortest linear feedback shift register capable of generating the prescribed finite sequence of
    digits. The equivalence of the decoding problem for BCH codes to a shift-register synthesis
    problem is demonstrated...

    This is probably in most textbooks on linear codes for encryption and/or error detection and
    allows you to recreate the shortest LFSR (linear feedback shift register, a common component
    of pseudo random number generators), given a sequence of digits. Of course nobody would
    use a non-cryptographically secure PRNG like random() would they?

    So you take the number, pass it through this algorithm, find the shortest LFSR, and the
    decompressor just takes the LFSR initial state and reproduces the sequence according to the LFSR.

    Then again, the shortest LFSR can be thought of as the linear complexity of the number and who
    knows it might just accidentally compress the sequence the challenger gives you (not likely,
    but who knows?)...

    Always remember... "stand on the backs of giants" whenever possible ;^)
  • ahh...but would THAT 1000 bit number be random, or compressible on it's own?

    Once you can drop 1000 bits from the file, you can play with the offset itself, which has a good chance of being compressible or at the very least expressible in a smaller format.

  • As several people have observed, the attacker in this case used his "split into multi-files" trick to hide the necessary information in the directory structure of the disk. This is just barely ruled out by the "one compressed file" rule. However, suppose you hid the data in the filename of one file. For example write a program like

    echo $1 | cat - $1

    (about 20 bytes) and use the first 21 bytes of the raw file as the name of the compressed file.

    This is in the same spirit as this attempt (hide the data in the directory) but avoids breaking the one compressed file rule.

  • In this sort of contest, it's traditional that finding ingenious ways to exploit loopholes in the rules is a perfectly valid way to win. Then the rules are tightened up if ever the contest is re-run.

    Mike Goldman didn't make it clear what 'total size' meant - I think most programmers if you asked them would assume the sum of file sizes, if you didn't specify otherwise. His fault, I feel. He should pay the $5000 and write the rules more carefully next time.

    BTW, if you do assume that 'total size' is just the sum of file sizes, you can compress anything down to zero bytes. Number each possible sequence of binary digits - the empty string is zero, '0' is one, '1' is two, '00' is three, '01' is four, '10' is five, '11' is six, and so on. Then just create the required number of zero-length files.
  • Patrick specified a file size, and Mike generated a file of that specified size. Mike did not include any OS filesystem overhead in his file size. It seems rather hypocritial of Mike to claim OS file system overhead counts for Patrick's files but not Mike's. Mike defined FILE SIZE, not DISK SPACE USED.

  • > What I'm thinking of, is wouldn't it be possible to create a minimalist compression algorithm that tended to not affect the file size,
    > but might deviate it in either direction by about 1% ?(enough to cover the overhead of the decompressor).

    All purely random strings of numbers (e.g. pi, e or the square root of 2), tend to have short strings of digits that repeat randomly. I believe this behavior is called ``clustering."

    Assuming a large enough file, say 3MB in size, we could find up to 26 strings of two or more digits that repeat enough times so that the compressor would be a series of substitution commands. For example:


    That comes out to an average of 10 characters per line (including end-of-line character), add another 20 characters for overhead, & the decompressing script could be kept to less than 300 characters. As long as each substitution applied to more than 3 places in the original, then there would be net decrease in total file. And since we are looking at the 26 most common strings of digits, this inductively should be possible.

    Hmm. I seem to have reinvented pkzip.

  • > My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way?

    Sure: any. It was just the decss source padded out to a prime. You could do the same for mozilla.
  • If you want to be picky, Patrick utterly failed the challenge by not supplying compressed files. They were simply unaltered pieces of the original file. Sure, there were missing bits of the files that had been moved to the file names, but the fact that this slight of hand failed to compress the data can be proved simply by renaming the files. Even if the order of files supplid to the "decompressor" is preserved, renaming them renders the original data unrecoverable. Sure, moving data into the filename is a cute trick, but it isn't compression.

    The goal of the challenge is to produce compression, not to win via some semantic shell game. It's Patrick who shed his honor by resorting to a semantic smoke screen and attempting to win by wordplay with no intention of actually producing any real compression.

  • The payoff matrix is highly in favor of the contestant. You can't just assume that a random stream won't compress. Sure, it might not, but your odds are about even, not nearly as bad as a 50-to-1 shot.

  • Read the damn article before you post - he didn't use 'gunzip' in his proposition.

  • Their 486 got overloaded

  • by th0m (16656)
    there's no scamming involved. goldman required the (suitably small) entry fee to deter time-wasters; craig wasn't after the money in the first place. they were both just out to prove a point (goldman, that truly random data can't be compressed in the general case; craig, that goldman's challenge was amusingly open to abuse), not to make any money.
  • You could scan the file for the smallest possible number whose binary representation isn't contained in the file,
    i.e. let's say 1001010111001010101 (I know this number is too short) is not there replace the 1000 zero-bits with that number.

    But I fear now it only get's more difficult to calculate the probabilties of a "win", but the outcome will stay the same.

  • No way...the guy didn't compress shit. All he did was some data shifting. If the guy had read the FAQ, he would know that such tricks are not mention an MSDOS FAT filesystem. Even on a FAT filesystem, the 218 (or whatever) files and the decompressor would have taken up more space than the original file due to slack space...although the amount would vary from hard drive to hard drive, if you figure an average of 512 bytes of slack space per file thats 111,616 bytes of slack space, far more space than Patrick claims to have "saved."

    And anyways, the guy basically admits that he cheated by exploiting the vague wording of the challenge.

    I've got to hand it to Patrick, though, because if you do take the challenge out of context, he basically depends on what the meaning of the words "compressor" and "decompressor" mean.

  • The 218 files would have taken no additional space on the FAT, as it's fixed size.

    Further, you can format tracks down to 128 bytes per sector, leaving an average of 64 bytes of dead space per file. That's 13,592 bytes of dead space, if we store the files "conventionally".

    I never said that it would take additional FAT space, I'm talking slack space...if you knew anything about the FAT filesystem (which you obviously don't), you would know that its not sectors you have to worry about, it's clusters (GROUPS of sectors). there are only so many clusters because of FAT's fixed size... I was being QUITE generous in saying that 512 bytes is a practical average. On a 1.2 GB MS-DOS formatted drive (not FAT32, mind you, we're talking DOS here, not Windows), the cluster size ends up being something like 16K, leaving the average slack space to be 8K. so I was being VERY generous there.

  • The hard drive platters were free. The coffee can lids might have worked better though because they are lighter in weight, although they wouldn't have been as rigid, so I'm not sure how well they would have supported the weight of the vehicle...

    I guess I could try it, though. :)
  • Again, that depends on what you mean by "paper airplane." Admittedly, I won a similar contest involving rubber-band driven cars. The contest rules said that CDs or DVDs could not be used for tires due to the fact that they make for very narrow tires, thus having reduced friction. So I used hard drive platters. :-)
  • He never said it was okay to deliver a whole bunch of 'compressed' files. He said ONE decompressor, and ONE compressed file. Though Patrick asked about this, all the guy did was restate his terms.

    What Patrick did was take a small bit of information out of the file, and, rather than having a table of 'offsets' telling his decompressor where to insert 5's, he broke it up into a bunch of files, effectively using the directory/inote tables to store this information.

    Had he not insisited on a single file, I'd say Patrick here would have technically won, though not at all in the spirit of the competition.

  • The data has to come from the guy accepting the challenge; you get to tell him how much data to compress, he gives you the data.
  • Hmm. okay.. point taken.
    Guy should cough up 5 big ones then.
  • In particular, the filename encodes information about the order in which the files need to be reassembled; so for each byte saved by the "compressor", several bytes are needed (on average) to encode the ordering as an ascii integer in the filename.

    The challenger should have made it clear that these costs would be counted against the reconstructed filesize (or simply not allowed any deviation from the "one file" rule) The decompressor should operate by having all the files piped into stdin, thus eliminating filesystem state info entirely.

  • I've heard this idea conjectured a lot by various people; but although the mandelbrot set is in some sense infinite, the set of images that can be produced has bounds. Basically, choosing any fractal, and sampling at any resolution, with any slice through the set, still won't produce the entire possible set of data that could be represented in the finite space that you've chosen. Due to the sampling, you would repeat some data and never produce others. In fact, the self-similarity properties of fractals would seem to indicate that this would be a rather poor (or inefficient) way to represent general data.
  • I would argue that the decompressor needs the filename numbers in order to work (for ordering) It CANNOT work on a bitstream of the "compressed" files alone (even if the ordering is preserved), so it is dependent on the numbering in the filename, and thus, their size needs to be counted as part of the decompressor. If you add in these bytes, then the decompressor (file plus filename numbering bytes) and datafiles are larger than the original file.

    So, after some consideration, I would say that Patrick didn't succeed, even considering the specific wording of the challenge (and the discussion in the emails).
  • NEVER does Mike say "file size". He uses the phrase "total in size". In fact, here is a quote:

    In order to meet the challenge you must provide to me a compressed file and a decompressor which together total less than the size of this original.dat, and which can generate a file identical to original.dat.

    He doesn't say "file size" in this or any other passage. It is Patrick who says (and assumes) this.

    Also, Patrick specifically addresses the issue of filename based compression in this passage.

    It's probably also possible to meet the challenge with smaller file sizes by storing information in the filename of the compressed file or the decompressor, but I think most people would consider this to be cheating.

    Thus, I think it is Patrick who hadn't considered the fact that his "decompressor" is not only just the 'dfi' script, but also the filename information used to indicate file boundaries and ordering. Like I said, his algorithm (much less his actual implementation) could not properly reassemble the file if it were fed all the properly ordered "compressed" files as an input stream (through a pipe, say), since it wouldn't know where to place the missing fives. So given that Mike said "total size", in all cases, when discussing and offering the challenge, I think his position is safe.

    Now, I do think that Patrick was reasonably clever, and that Mike was overly smug in both the challenge and the initial news posting. And in the exchanges Patrick did explictly ask about file sizes, but I do not fault Mike for not clarifying further, since he had no a priori knowledge of the scheme, and therefore didn't know he would have to be so specific. But his use of terminology was consistent, and I think, appropriate, to protect against this type of scheme.

    Anyway, that's just my take on it. Mike can hopefully learn some humility from this, and Patrick can learn that his "decompressor" is more than just the sum of its file bytes.

  • by ChadN (21033) on Tuesday April 24, 2001 @06:49AM (#269055)
    However, the decompressor DOES depend on the filename numbering in order to work; it is part of the algorithm (and hence the "decompressor"), and those bytes (the bytes used to number the files) should be charged against the decompressor size. Then Patrick loses.

    While not explicitly stated, those bytes ARE an integral part of the "decompressor" that was supplied (it can't work without them), and so including their size is consistent with the agreed upon rules, IMO.
  • As soon as Mike accepted Patrick's money, it was no longer a rhetorical challenge. It was a bet. Patrick, at that point, had caught Mike proffering a reverse sucker bet. Mike says "I bet you $100 at 50:1 odds that you can't do X." Patrick, confident that he could do X, accepted the bet and then did X. Mike says "Oh, I really meant for you to do X.1. I only tricked you into trying to do X because it's impossible." Patrick, in a really amazing display of good sportsmanship, declined to have his money refunded. (Which Mike wisely proffered...otherwise, he'd be in serious danger of fraud...)

    My momma told me never to take a sucker bet. Mike got caught by his OWN sucker bet...which takes a) greed and b) foolishness. He got lucky that the person who took him was such a good sport about it.

    It'd be really nice to see Mike donate the money to ACM or EFF or somebody like that...that'd be a pretty classy way of taking care of this.
  • I'm not necessarily arguing a point of law, it's more of a point of honor. Wagers are well-understood in many cultures, and in this case it seems clear that Mike proffered an arguably dishonest wager (a bet where he challenged people to do something that was not possible to do to his satisfaction). Somebody hoist Mike by his own petard. Good for Patrick.

  • Yeah, my exceptionally slow compressor counts the number of "one" bits and the number of "zero" bits. It stores those two numbers, along with several position-dependent checksums. Compression is very fast.

    Decompression can take roughly the lifetime of the universe (for large files).

    Interestingly, my brilliant strategy is also a encryption algorithm ;)
    bukra fil mish mish
    Monitor the Web, or Track your site!
  • (should have said "slow de-compressor" but you get the picture)
    bukra fil mish mish
    Monitor the Web, or Track your site!
  • Lzip [] should have gotten $5000 for being the best compression program.

  • Too bad the Python doesn't get executed.

    Maybe in Squishdot, as long as rexec is used, of course. You could import StructuredText from Zope and your post could act like a Wiki.

    - *this* would look like this

    - **this** would look like this

    The above would become an actual HTML bulleted list.

    || Instant Table ||
    || 1 || 2 ||

    Ah, well. :-)*
  • I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.

    Data point: The corolation for me is: modern US law as practiced is harmful, and anti-productive; this contest was clearly stated, and clearly won. Mike should pay up.
  • But it was clear. The contest was in terms of total file size of the compressed output and decompressor program.

    The proviso given (which should not have been) was that the compressed file could be provided in multiple parts. What he forgot is that an ordered sequence of files has more information than just the sum of their contents. Ooops.

    This is not a wording problem, but one of a poorly thought out bet. This guy gave 1:50 odds that no one could compress truely random data. He chose to allow conditions that made the bet doable, and then did not want to pay up when the bet was won.

    Actually, even without using multiple files, I think the bet's doable for any single given data file (though you cannot write a generalized compressor for all truely random data). I should come up with the submission to solve for this guy's challenge.
  • This was a great story. It reminds me of Sky Masterson's soliloquy from Guys and Dolls, in rejecting a proposed sucker bet about the number of sales for various pastries at a deli:

    "Let me tell you a story. When I was a young man about to go out into the world, my father says to me a very valuable thing. 'Son,' the old guy says, 'I'm sorry that I am not able to bankroll you a very large start. But not having any potatoes to give you, I am going to stake you to some very valuable advice. One of these days in your travels, you are going to come across a guy with a nice brand new deck of cards, with the wrap and seal unbroken, and this guy is going to bet you that he can make the Jack of Spades jump out of the deck and squirt cider in your ear. But, son, you do not take this bet. For if you do, as sure as you are standing there, you are going to end up with an ear full of cider.' "

    Sky, an extraordinary proposition gambler, responded by countering. He covered the offeror's eyes and bet he couldn't tell what color was Sky's tie. "No bet."

    I say this to point out the difference between sucker betting and proposition gambling. It is a sucker bet to offer a bet challenging a mark to do something that is, in fact, impossible. In proposition gambling, the offeror gives the mark a proposition they are unlikely to win, or that is worded in such a clever way that the winning of it leaves the mark too embarassed, entertained or impressed by the cleverness to attempt to challenge having lost.

    In this case, the first offer (compressing high-entropy data) was a sucker bet. Shame on him. The response, doing the sucker bet using multiple files, was a sweet proposition gamble. Shame on him. Once taken, the offeror-now-mark did weasel, but as I see it, ultimately welched.

    I admit that I, too, didn't consider the import of the multiple files on first reading (but, hey, I didn't have $5K against $100 riding on the result).

    At the end of the day, the sucker-bettor was the mark, and he had an earful of cider.

    No sympathy.
  • It is my understanding of the situation that Patrick "solved" the challenge by exploiting the fact that multiple files store more implicit data in the filesystem than a single file does. I believe the only such data he used was the EOF.

    This is considered "cheating" because he used 220 EOFs that are not counted in the filesize.

    So, my question is this. If a contestant submits a single compressed file and a single decompressor file which are just 1 byte under the original file, does Mike void the entry because of the extra filesystem space of two files? If not, why not? If so, shouldn't the rules say that specifically, as two is the minimum total number of files one can submit?

  • by PhiRatE (39645) on Tuesday April 24, 2001 @04:41AM (#269076)
    Its an intriquing problem. I certainly wouldn't stake $5000 on nobody being able to compress the data I supplied, good random number generator or not. Why? because while the concept that random data in general cannot be compressed is correct, there are no patterns, it is not necessarily true that a given set of random data cannot be compressed.

    For example, a file full of 0's, is perfectly possible from a true random number generator, and just as likely as any other set of numbers. As any good geek knows, a file full of 0's compresses very very well.

    In this case, I think the challenge has one major problem, and that is that there is no real time limit on the entry, and no requirement that the decompressor work effectively in general, it can be entirely specialised.

    This means that the problem *may* be solvable, simply by throwing ungodly amounts of CPU power at it. With the appropriate code, it would be possible to simply hunt for unintentional patterns in the data and hope. If you're lucky, there will be enough indexable patterns present that it will outweigh the cost of the indexing, and the code necessary to replace the index points.

    If you're unlucky, the data will have no patterns in the space you searched, and therefore you're out of luck.

    This is why the guy noted that if he got a big enough file, he could win the competition. As the filesize increases, it becomes increasingly likely that there are, accidentally, enough patterns in the data to cover the space required for the decompressor and the dictionary. Of course, as the filesize increases, you also have to burn considerably more CPU in order to locate the patterns in the first place.

    Personally I think it would make an excellent project for a research team, not particularly for the money, but in order to create some really fast, specialised code for locating patterns in assumedly random data.

    I am not aware of much effort in that area, most compression research goes into identifying patterns inherent in the data type, waveforms in audio, visual patterns in images, duplicate strings in text and binary files. It would be very interesting to see just how effective fast analysis code and heavy CPU is against the problem.

    I suspect there may well be some crossover here with Cryptoanalysis, since one of the primary jobs in breaking crypto is finding the patterns that differentiate the ciphered data from pure randomness.

    Any links appreciated.
  • by rjh (40933)
    Depends entirely on how random the random data is. There's a surprising lot of random numbers which can be compressed--this almost always indicates a critical flaw in the random selection process. Even radioisotope-based RNGs aren't immune; by knowing the isotope, its age, and the sort of radiation detector being used, it's often possible to get a faint hint of determinism out of the RNG. That faint hint of determinism can be exploited to result in compressability... sometimes.

    By and large you're right that random data can't be compressed. But getting really random data is a far-from-trivial undertaking.
  • A while ago on an article was posted about a prime number that had the "magical" property of being able to be g'unzip'd to the DeCSS source. The number was represented as a (relatively short) arithmetic expression.

    My question is, has anyone put any efforts towards seeing if other, larger pieces of data could be represented in this way? Of course the chances of finding these type of numbers are very slim, but they *are* infinitely large spaces, and the compression would be ultra-compact.

    Maybe something for a quantum computer to chew on anyway...
  • There is another numerical analysis challenge thats been around for several years and has only had one winner of the $100 prize. Check out the Weak Signal Challenge []. You need to recover a morse code signal embedded in noise. And you don't need to send any $$$ to enter, just download the WAV and start number crunching :)


    Free Database Hosting []

  • The problem is that Mike said it was OK to use multiple files, but did not specify whether that meant that the actual additional filesystem space that they took up would be counted against Patrick anyway (with the idea that Patrick could use many files, just so long as the eventual information still added up smaller than the original), or whether he meant that it shouldn't make a difference, and didn't care.

    The information that could be considered "removed" by the files, would basically be the start and end offsets of the chunks, if they were all in one file...which would naturally add up to more than the original (219 * 2 additional info). The "missing" information escaped into the unix shell and filesystem, through the facility of "cat" being able to print out a block of bytes given a single piece of info.
  • POT appears not to mean that Slashdot will *interpret* your post as POT, but instead that Slashdot *assumes* your post to be text that is not meaningful HTML. They should changed it so any HTML is appropriately escaped (entitied) from POT submissions. Angle brackets, etc, are very annoying to have to fix.
  • ie, it would confirm what I say every day.
  • Since the challenge allows you to see the file before building the compressor (and the neither the size or speed of the compressor is relevant), all you need to do is find the single array of bytes (I'd say string, but presumably the 'arbitrary file' would be binary) greater than three bytes, that occupies the greatest percentage of file space in the compressed file. Then replace that array wherever it occurs with an EOF character and name each new section of the file sequentially using a 1 byte name.

    (As noted in another post) As long as you pick an initial file length long enough that 2% of the time there statistically should be enough repeats of some 3 byte (or longer) array to equal the number of bytes in your decompressor (which would be roughly the same size as Patrick's), then you will win the challenge enough times to make it profitable.

  • Nice try, but won't work. You have to keep the position of that serie of '1'. That position will use about log2(position) bits. Unsuprisingly, 2^n is the size the file needs to be to have a good chance to have a serie of n '1's.

    Not if you expand on Patrick's original idea replace the series of 1's (or zeros, your compressor should determine the longest continues series of each and then use the longer) and replace it with EOF. Since EOF can't appear in the original file except at the end, your de-compressor will know that this is the position to insert the original series. You could also repeat this as many times as like, provided the replaced series is at least as long as length of the EOF character plus the length of each new file name created plus memory space needed to store the value of the length of the series.

  • I think that specifity is important. Any software engineer will tell you that

    [ ... ]

    The english language (in common use) is suitably ambiguous to make that impossible

    Specificity is critically important when you're communicating with a computer. But over specificity is actually a deterrant to communicating with another human. Ambiguity is a common mechanism by which we humans learn. So, for example, when I say "cat" you have one image in your mind and I have another image in my mind. But when we both discover that the thing that I was talking about is different than the thing you were thinking, we'll have expanded our understanding of what the word "cat" comprises. Just now, I was thinking of a cheetah. Do you remember when you learned that cheetahs and lions and tigers were also cats?

    I'm overstating some things here becuase it's been about 8 years since I took my philosophy of language class. But to sum up, generally human communication works much more efficiently when there's some ambiguity in it.

    Which leads us to where we are - I don't really like it when a silly case wins in court due to a loophole, but I see no alternative

    But it's not just in the event of silly cases. You are cornered by the silliness everyday. Clearly you use a computer connected to the Internet. Do you think that the terms of service that govern your connection protect you? The point is that if we can't figure out a better way to deal with specificity and intent, then we consign ourselves to the rules that are written by those who can afford to write them.

    I don't have an answer for this either, but I'd like to think that the discussion isn't so easily dismissed.

  • by mjh (57755) <mark.hornclan@com> on Tuesday April 24, 2001 @05:08AM (#269099) Homepage Journal
    I think that a person's stance on this issue, of who was right and who was wrong, will indicate that same person's approval of lawyers and much of the practice of the law today.

    If you are for the challengee, then you believe that exact specificity is more important than intent. And you inadvertantly empower the legal profession to profit from being better able to specify rules, etc. Which, in turn, gives those with money even more power. They can write the rules by which your interaction with them will be governed. You can be certain that the rules will be written to the advantage of the guy with the money.

    If, on the other hand, you are for the challenger, then you believe that his intent was more important. (Considering that the challenge came from the compression FAQ, the exact definition of "compression" should have not been vague. It seems clear to me that his intent was obvious, even though his challenge wasn't very specific.) Still there is a problem with valuing intent over specificity. That is it's easy to misrepresent intent after the fact. Thus the need to clearly specify our intentions prior to anything we do. That being the case, with accurate documentation, it seems possible to me that we can discern the original intent w/out having to specify everything legalistically. For example, I think that I was able to clearly determine the challenger's intent well after the fact, and I did this having read the challengee's representation of the events.

    Now you may say, in rebuttal, "Who cares who I choose? I don't get to make the law." But think of yourself as the jury trying to determine to whom you would award the money. You may one day find yourself on a jury and trying to determine who should be awarded money. I see this as a microcosm of exactly how our court system has gotten into the state that it's in. Specificity is clearly more important than intent. And it's that way because juries award that way. Since juries are comprised of members of society, it's not a far stretch to say that society as a whole regards specificity as more important than intent.

    Is there any chance that we can change this? And if we did, would it be better?

  • Since you have to provide the algorithm for uncompressing too you can regard the machine (OS/shell+tools) as one huge decompressor. It does f(compressed data+algorithm)= decompressed data. Simple counting yields that since there are at least as many (compressed data+algorithm) pairs, as there are sets of decompresssed data. Even by choosing only the shortest versions at least some of those pairs have to have a larger total length than the decompressed data (not accounting for tricks with hiding data in EOFs and filelengths).
  • And here [] if needed.

  • The quote everyone seems to avoid is:
    I am curious, though, and I am not trying to get any sort of angle on you here, what brilliant idea have you come up with that makes you think you can compress arbitrary data? Whether or whatever you disclose won't affect this challenge and I will live up t my end of the bargain regardless.
    That sums it up for me.
  • I have to agree with this. The trick is Given a sufficiently large file. I can't imagine anyone coming up with a 2GB file off of which a modified RLE program couldn't snip a few kB. By "modified", I mean spend all the time you want analyzing the file (this is allowed), and find strings of similar bytes in arithmetic progression, or in strange but calculable patterns. Remember, the compressor can be as complicated and large as possible; it's the decompressor that needs to be small.
  • > however you have no way to represent non-compression resistant files.

    The challenge specifies that you can choose your compressor after you had a peek at the datafile. If it turns out to be compressible enough by a known algorithm, just ship that compressor. If it isn't, use your meta-compressor.

    However, I have a gut feeling that the uncompressible files are too numerous for this meta-compressor to work: if 99% of all possible files are uncompressible, then the index number of the file would not be that much smaller than the file itself, and you couldn't fit the meta-compressor's code (which would be rather complex...) in the small space saving.

    As others have pointed out, a better alternative would be to simply "play the odds": make an algorithm which only works in 10% of the cases. If the file can't be compressed, resubmit an entry until you get one that works. On average you would have spent $1000, and won $5000!

  • At first, when I was reading the page, I felt that Patrick clearly cheated, given that he exploited the filesystem to store data. However, after reading your comment, I must agree. Mike should have clarified the rules. Clearly, Patrick deserves the cash.
  • Clearly, this is the crux of the arguement, and intellegent people can waiver here. In a sense the fragments were compressed in that the 5's were chopped off. I would actually argue that this is a form of compression, however impractical in this case. For example, if you create a dictionary of "phrases" and install the dictionary within the decompressing client, and send the "compressed" files you are really doing the same thing that Patrick did. The difference, is that Patrick exploited the filesystem to store his dictionary data, as opposed to actually putting the dictionary inside the decompressor. In fact what he did IS compression, and the files were compressed in size. Patrick could have done even more devious techniques by moving all of the data into the filenames themselves. Because Mike did not spesify that external linkage to datasources was illegal, what Patrick did in my view was within the confines of the contest.
  • One solution however impractical an imperfect is to reverse-checksums. Essentially, you take the data block and perform a series of running checksums (like CRCs). You store only the checksums in the "compressed" file plus tidbits of "hint" real data. What you have is a lossy form of compression. However, with enough processing power (a few dozen E10,000s) you can reverse calcuate the checksums which will give you accurate results with 99.999+% of data blocks (some data blocks will return false solutions). It is simply a matter of home much processing time you are willing to throw at the problem.
  • Aside for not clarifying the use of external dictionaries as being illegal (Mike let Patrick use the filesystem filenames to store data), Mike also didn't clarify processing time. If processing time is not a factor, you can create reverse CRC/Checksum compression that is acurate almost everytime. Reverse-crc compression is accurate inversely porportionate to the rate of compression. So, you simply need to choose an large data block (1 gig or so) to compress with a very low rate of compression, say 1000/999 or something like that. Then you let the reverse crc decompressor work on it for a few days, and eventually you'll come up with the right data 999/1000 times.
  • by selectspec (74651) on Tuesday April 24, 2001 @05:06AM (#269128)


    Patrick I meant can I send you a compressor and several compressed files whose total file size is less than the original uncompressed file and from which I can regenerate the original uncompressed file

    Mike Sure.

    Pay up Mike. Pay up. You were too careless to plan out the rules and think about your intent before you threw 5k into the sink. Cough up the cash or loose all honor.

  • Let us disregard the issue of money and the issue of a specific answer's acceptability and analyze the question

    In my opinion any person, given enough time to come up to speed on the subject, and given that that time, plus the time it takes to come up with and implement their solution all multiplied by the factor of time it takes to maintain themselves (eating, resting etc.) during that time, and given that they live that long, can come up with a winning solution to Mike's challenge due to a flaw, not in wording of the challenge, but in Mike's reasoning behind the challenge.

    To put a lot of discussion to rest (in my mind) just read the relevent portion of the FAQ in which the challenge is found. i.e. the section entitled "9.2 The counting argument []"

    Here you'll find, clearly stated, the flaw in trying to find any one algorithm which will always compress data of a fixed length to less than that length. Also you'll see the decent challege by Steve Tate which paraphrased says:

    "give me your best algorithm and the size of data it works on, and I'll give you one counter example of data in that size which will not compress with your algorithm."
    Mike Goldman's addendum challenge effectivly states:
    "I'll give you data in any size you request, you give me data and a program which in total are smaller than the original size and for which the program applied to this data produce my original data."

    Reading it in its full context, which aparently a lot of slashdotters didn't, it immediately gives itself up as a challenge which does not hinge on on the mathematical theorem it is presented with. i.e.

    Given, there is no one algorithm which can compress all data, nor a subset of data selected by being a fixed length.
    Therefore, if you give me an algorithm which claims to compress all data or all data of such a subset, I can give you at least one counter example.
    It does not follow that I can give you one counter example to all compression algorithms for a given length (of your choice).

    The implication of the first challenge is that the poser is confident in both the mathematical theory and either his analytical abilities when studying the given algorithm, or in his ability to test and check against the given algorithm in a sufficient period of time.

    Conversely, the implication in the second challenge is that Mike in fact is the possessor of a set of super-data, of which any sub-set cannot be compressed by any algorithm.

    Since we live in a time when any data sufficiently limited in size yet sufficiently large enough to offer decent gains when compressed (the challenger may pick that size) can be analyzed fully and with great speed, I think Mike has made a bit of a fool of himself.

    I am not saying that Mike can't supply <<the uncompressable data from hell>&gt, but he will have to go to some lengths to get it. It will likely be selected to be large enough so Mike cannot generate it himself without an algorithm. It cannot be generated with a small input set or that set could be discovered and when coupled with the generator would defeat the challenge. It may be generated from a large set, but the result must show no patterns that can be bitstuffed (see networking) against or more simply generated and indexed for, nor may the source set contain any of the same properties. It may be possible that Mike is collecting this data from a source, i.e. digitizing random radio signals; we may all be interested in any source data which cannot by any reversable algorithm be represented/stored more succinctly particularly without any regard for retaining its meaning while compressed.

    In summary, if Mike doesn't want to be giving away money, then he doesn't understand his own challenge's failure to be based on any mathematical certainty (unless I'm wrong and he has some super-data). However Patrick's solution was clearly not compression, though it was the start of a decent form of compression. Had concatenated his file-set and headed it with information as to each file's length in order, he just might have saved space enough for the decompression, particularly if he picked a suffiently large recuring patern that could be generated rather than having to be stored verbatim.

    However, I might just be able to claim, that if the universe is deterministic, if the input conditions for the creation of the universe could be represented in less data than a generated block of data which can be represented in that universe, and if I were given enough resources in which to recreate the universe, I should be able to offer the input conditions of the universe and an index into it's ongoing generation which could point-to a generated block of data which exactly matches one provided. I just might be able to give you the process, its inputs, and your data before you've finished generating the data for my participation in the challenge, if in my generated universe I could significantly warp time without affecting the results.


    That last part's an odd joke

  • I tend to agree that it should be always possible to get the $5000 from this "challenge". You can see my extended thoughts at: article 403 []

    I'm surprised at the number of people using the FAQ's "pigeon hole" theorem backwards. The theorem states that a reversable algorithm cannot be used to represent every piece of data as a piece of data smaller than itself. It does not address being able to pick an algorithm well suited to just one specific piece of data. If there is any data to which no compression alogrithm can be applied to, tailored to, or made for, that's a pretty amazing piece of data, and we should all be made aware of it Mike.


  • The reason I said "to be disbursed at my sole discretion" in the first sentence of my rocketry prize [] was to make it clear that I was acting as the arbitrator of my own "contract" with the public and that therefore if anyone wanted to play "games" with me, they would quickly discover they were playing with themselves.

    Goldman's mistake was two fold:

    1) He didn't put a phrase like "to be disbursed at my sole discretion" at the start of the description of the prize award.

    2) This he failed to do so within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.

  • And of course, I had at least a one-fold mistake in my comment on Goldman's second of his two fold mistake -- which comment should have read:

    2) This he failed to do within an arena where the distinction between "tricks" and "solutions" is frequently nonexistant: mathematics.

    Such mental typos being made at my decaffinated 6:54 AM brain's sole discretion.

  • If someone had a genuine solution -- something that actually fulfilled the intent of the challenge -- they would be likely to submit the $100 because the odds that Goldman was not scamming the compression group would be vastly higher than 1/50. :)
  • The guy was clever and found a loophole in the rules of the challenge. He even said so.

    The point of the story is to be careful when issuing so-called "impossible" challenges. Don't leave loopholes in the rules!

  • by Apotsy (84148) on Tuesday April 24, 2001 @06:06PM (#269145)
    Exactly. "Pompus ass" were the first words to come to my mind as well.

    When Goldman gleefully posted to the newsgroup that someone had accepted his challenge, he said "some people just don't understand information theory too well". Yeah, and some people just don't understand issuing challenges very well, either!

    When you make a challenge like that you absolutely must be sure to word the rules carefully and try to avoid leaving any loopholes. In this case, all Goldman really needed to do was tack a sentence at the end of his original challenge saying "the data must actually be compressed according to the definition of 'compression' contained in this FAQ. Challengers who merely point out loopholes or exploits in the rules of the challenge will not be rewarded".

    The moment Mr. Craig started iquiring about about the specifics of the rules and saying "wait and see" regarding his methods, that should have been a warning sign to Goldman that Craig had some trick up his sleeve. But no, Goldman was just so in love with the idea that he found some "impossible" task which suckers might try to undertake that he couldn't be bothered to think about such things.

    What's even more embarassing to Goldman is his initial reaction to Craig's submission:

    I have accessed your server and downloaded exactly two files, the "decompressor" (dfi) and the "compressed file" (comp). The rest of the files in the directory (comp.0 through comp.217 inclusive) I have not touched.
    Even though Craig had already asked if more than one file was okay and Goldman had said yes!

    And Goldman's following point:

    In further point of fact, the 218 parts together must occupy more space than the original file (original.dat) that you were given. Each file requires space for a directory entry, and each such directory entry requires in excess of one byte of space. Thus you have in actuality expanded the data which you were given to compress.
    is even more ridiculous given that Goldman had already said that "sure" in response to Craig's question about multiple files whose "total file size" (which does not include directory entries) are less than the size of the original data file. Goldman tried to change the rules rather than admit he was sloppy in setting them up.

    Goldman also says "I think that I did not make a mistake", even though he admits in the same message that he needs to restate the rules:

    Perhaps it will be necessary to restate the terms in a legalistic form which is not open to this sort of miscommunication...
    "Perhaps"? How about "definitely"! And any "miscommunication" was entirely one-way. Craig was very sly and clever, and Goldman was too full of himself to realize what the fuck was going on until after the fact. A "pompus ass" indeed.
  • I think in his post to comp.compression Mike wrote that his random data had come from []. Random doesn't use a PRN generation algorithm; they get random data from atmospheric radio noise (tune your AM radio to a non-station and you'll hear what that means.) To remove any bias in the data they postprocess the noise by removing high order bits, and the low order bits are shifted to normal using a bit discarding algorithm.

  • The challenger was mistaken, he thought that he was offering an impossible challenge, but was mistaken.

    The comp.compression fact discusses that for any given compression algorithm, it is possible to craft an input fill that cannot be compressed.

    This is different than the challenge proposed, where Mike, the challenger, provided a file of size X, and Patrick, the challengee, was required to provide a de-compression program of size Y and a data file of size Z, where X > Y+Z.

    The difference here is that Patrick was allowed to hand-craft the compression method to the data file, but Mike was not allowed to hand-craft the original data to the compression routine.

    Basically, Mike was wrong, and should pay the $5K.

  • The quest was for 'Lossless data compression', but the challenge was presented backwards-

    Given a sufficently large source file of truly random data, I can create a binary and a data file which can recreate that one specific source file, losslessly, with my binary+data being smaller than the source. That is, I can create a routine that will work for just that one specific source file, no others.

    The problem of 'Lossless Data Compression' refers to the opposite case- given any general purpose routine to compress files, I can create a file that it cannot compress. That is the oppposite of the challenge given, and actually is impossible.

    As stated, even with the restriction of a single compressed file, the challenge *IS* winnable.

  • There is a huge difference between 'have the ability to compress randomly occurring data' and 'have the ability to compress the specific file Mike supplies'.

    It is impossible to write a routine that can losslessly compress every possible 3-megabyte file that could ever be generated.

    It is not impossible to write a routine that can compress the one specific file that Mike supplied to you. That is, it is quite likely that you could write a program that can losslessly compress that one single input set, such that:
    length of decompressor) + (length of compressed data)

    Why, he'd just reverse engineer the decompressor, patent it, get rich, and buy an island inhabited with young bikini clad virgins before the your check had even cleared.
    That would only be true for the general casse compressor-decompressor. I could write a routine that would work for that one specific input file, but fail miserably on any other input. This would win me the $5K, but be valueless for any other purpose.

    econdly, i've been thinking about this problem for about 3 years now. :) I've tried so many methods and ways of interpreting and looking at random data, i'd have enough to write a book if i'd bothered to keep accurate logs. If i had $1 for every time i said "I'VE GOT IT!" i'd probably have $5000 already. Well, maybe that's a tad off the mark, but anyway, you get my point. Right, onto my conclusion...

    The mistake both you and Mike make is that while writing a program for the general case of compressing random data is impossible, it is very different than writing a program to compress one specific file of 'random' data.

  • Slightly over-zealous use of "Submit" without previous use of "Preview", methinks. ;-)

    BTW, you can use &lt; to get "less than", like this:


    This is very very standard stuff in HTML.

  • Heh - interesting: I've never used anything other than "Plain Old Text" (including on this message - but spot the italics!).

    Strange... Hey ho, live and learn I guess.
  • def test():
    print 'so what do the other modes do?'
    return("cool, works the way I'd hope")

    if __name__ == '__main__':

  • But tomorrow morning, I shall be sober.

    Oh, hang on, that's not right.

  • I was very interested by your comments regarding the difference between specification and intent. However, I think there are some other factors in play here.

    You mention intent - let's look at the challenger's intent. It was NOT to find an amazing new compression scheme, in other words this contest was not meant to sponsor research in the field. He set the challenge PURELY to get one over on everyone else, to shout "told you so - ha ha ha" like some kiddie cracker. As he's essentially trying to catch people out, I think the onus is on HIM to make sure he specifies things EXACTLY. What has happened is the smartass challenger has had his whole scheme flipped up and fired back at him, which I personally relish and applaud.

    I think that specifity is important. Any software engineer will tell you that - the number one quote from a user when you deliver your system?? "Oh, I didn't MEAN I wanted it to do that...". You may think you know what the other party intended, but how can you be sure? The english language (in common use) is suitably ambiguous to make that impossible. (An aside - I'd love to know if developers in non-english speaking countries have the same problems, is english particularly ambiguous??). A lawyer would tell you that's exactly what they are trying to do, deduce intent from the actual written law (or contract or whatever). The two sides have differing interpretations, which are simply two different models of the original intent, deduced from the same written document. Where do you draw the line between the "quest for truth" and 2 bit scumbags just trying to pick holes in perfectly good laws? The simple fact is I don't think you can.

    Which leads us to where we are - I don't really like it when a silly case wins in court due to a loophole, but I see no alternative. People will disagree, they will always disagree, so how do you settle it? If you don't go with exact interpretation of the letter of the law, you end up having to go with an interpretation of the spirit of the law, which puts total (and to me unacceptable) power in the hands of whoever happens to be making that interpretation. At least an argument over english linguistics is transparent and explainable. Hopefully the inclusion of a jury in many cases helps to add the "human touch" to the process - at least I know it did when I served recently.

  • You know, I reckon it's actually *always* possible to win $5000, no matter what file is generated, providing it is of sufficient length. This is because knowing the file to be compressed *before* you create the decopressor gives you an advantage over the arbitrary random file case.

    Furthermore, I reckon I can write a program G that will, when given any input file A of length>=X, will create a decompressor D and compressed file C such that:

    D(C) = A
    size (D) + size (C) size (A)

    Don't know what X is yet, I supspect around 1 meg is fine though.

    Anyone willing to bet either way?
  • Slashdot apparently doesn't like open angle brackets.

    the condition should read:

    size(D) + size(C) "is less than" size (A)

    Slightly over-zealous HTML filter, methinks....
  • Uh, just spotted that isn't possible. Cos then you've created G(A)->(D,C) which is a universal compressor, and therefore not possible by your usual arguments. Course, I'm still interested to see if you could create G which works for a percentage X of possible values of A that gets arbitrarily close to 1. My guess is that this is possible, in which case you can still win money off the original bet by ensuring that X>0.02 Of course, if your opponent knew G, they could always choose a value of A that would confound it. But this isn't the case here. Also, it may be that X increases only very slowly with filesize, so that in order to get X>0.02 you would need a few solar systems worth of storage and CPUs to complete the challenge. Can anyone prove/disprove any of this?
  • Anyone willing to bet either way?

    Yes, assuming appropriate restrictions against storing information in the compressed file's name or attributes, I'll give you N:1 odds that you cannot write the program G, where N is any positive number. X can also be any number.

    Or, save yourself time, don't try to write the program at all, just send me money.

  • It's not so easy to generate a file to resist compression. If you had and used a method for determining whether a file could resist compression, then I could compress your generated file simply by counting how many compression-resistant files preceded it in some sort order and writing that number as a binary file. Decompression would proceed by counting that number of compression-resistant files in the same order and writing the file found at the count.

    However, your decompression program would have to be smaller than M - log_2 (N) = log_2 (2^M/N) bits, where M is the number of bits in the compression-resistant file, and N is the number of compresion-resistant files preceding it in your sort order. So, the number of compression-resistant files would have to be small relative to the number of possible files if you want more than a bit to write your program. Since this is not the case, you would fail.

  • Nope, you don't have to record it, simply attach the part of the original file following the block of zeros to your decompressor program. The "compressed file" is the part of the original file preceding the block of zeros. Your decompressor prints out the contents of the compressed file, then 1024 zeros, then it reads from itself the remainder of the file and prints that out. The bet can be won; not in every instance of course, but in the long run.

    That reflects a mistake on the part of Mike Goldman. If you want your decompressor and compressed file to be a total of length N, you can store log_2(N) bits of information in how you set the lengths of the two files. If you have more than two files, you can store even more information. Mike should have required that (number of bits in original data)+log_2(length of original data) > sum of number of bits in each compressed file and decompressor + log_2(product of lengths of each file and of decompressor), and he should have permitted himself to rename or change the file attributes of any file he wished in case information was being stored therein.

  • Yes, or simply require that the contestant submit a single program (with prescribed file name) which can regenerate original.dat and is shorter than original.dat.

    In that case, the outcome of the bet clearly depends on the machine model. If the machine model is "complete Debian Linux running on Pentium", I would conjecture that the bet can't be won, but it is far from clear (obviously, a succinct powerful language is needed, maybe A+). On many machine models, the bet can be trivially won.

    No, the machine architecture doesn't matter. The bet can be won only if you are very astronomically lucky.

    I'd gladly stake large amounts of money on this, though it would feel like I'm running a scam.

  • by locutus074 (137331) on Tuesday April 24, 2001 @07:54AM (#269226)
    There is now a mirror [] available.


  • by clare-ents (153285) on Tuesday April 24, 2001 @04:05AM (#269242) Homepage
    Ok, I've done some information theory, I know it's impossible to create a perfect compressor. Boxes arguments and all that.

    However, to make money, you don't have to.

    All that is required is a compresser that saves 1 bit on better than 1 in 50 attempts.

    For example, if we can save 1 bit on 50% of the attempts - even if the other 50% of the attempts results it catasrohpically large files. On average we've lost the compression war.

    However, on 100 attempts, we've paid $100 x 100 or $10000 but we've won on half of these - netting us $250000 or a cool $240000 profit.

    Anyone want to demonstrate this?

  • by edp (171151) on Tuesday April 24, 2001 @06:57AM (#269259) Homepage

    "Since this is not the case, ..."

    Sorry, not so. The proportion of compression-resistant files necessarily decreases as file length increases. To see this, consider that if a file had a nice block of 1024 bytes of zeroes somewhere in it, we could easily write a simple program that wrote the bytes before that block, then 1024 bytes of zeroes, then the bytes after that block. This program would easily fit in less than 1024 bytes, plus the bytes before and after the block of zeroes. So such a file is compressible. I will show that the proportion of files that are not compressible like this decreases as file length increases.

    To make this easy, I'll count for compressible files just the ones where a block of zeroes begins at a multiple of 1024 bytes from the beginning of a file. The chance that a single 1024-byte block (selected randomly from a uniform distribution) is not all zeroes is (2^8192-1)/2^8192, a very tiny bit less than one. The chance that all n blocks of n 1024-byte blocks are not zeroes is (2^8192-1)/2^8192)^n. Obviously this approaches zero as n increases. Thus the proportion of files that are not compressible decreases as file length increases. Pick a large enough length, and you are virtually guaranteed there is a 1024-byte block of zeroes in it somewhere.

    When we open compression up to a wider range of techniques, the same effect occurs; the proportion of files that are not compressible under any given, reasonable technique decreases as file length increases. This makes Goldman's challenge a bad offer for him -- the fact that the average file size change must be positive under any compression technique in the limit does not mean the probability of a positive change in a randomly selected file is large enough to give him a positive return on the challenge.

  • by dachshund (300733) on Tuesday April 24, 2001 @04:35AM (#269345)
    > I meant can I send you a compressor and several compressed files whose
    > total file size is less than the original uncompressed file and from
    > which I can regenerate the original uncompressed file
    > Patrick

    Sure -- but you send me a *decompressor*, I don't need the compressor.

    Sure means Sure. As far as I can tell this is acceptance of Patrick's request. As Mike is the guy inventing the rules, I don't see why this wouldn't be binding.

  • by drew_kime (303965) on Tuesday April 24, 2001 @06:07AM (#269353) Homepage Journal

    If the guy had read the FAQ, he would know that such tricks are not compression

    Did the rules as stated say that the solution had to comply with the principle laid out in the FAQ? Not that I can see. If you're going to pup up a $5,000 challenge, you damn well better post all the rules.

    For instance: I had a friend who entered a paper airplane contest. He signed up for the "maximum distance" part. The only rules were that the contest organizers supplied the single piece of paper, you had five minutes to fold or tear it however you want to, and then everyone threw them at the same time. The one that stops moving farthest from the launch line wins.

    He waited until 4:45 into the time, then crumpled the paper up into a ball and threw it. While everyone else's planes were circling lazily around, his went straight, landed, and rolled several more feet.

    While you can debate all you want about whether his wad of paper counted as a "paper airplane" -- and we did debate (hey, it was college) -- he did saitsfy the rules of the contest. Had there been enough money on the line for someone to threaten legal action, I suspect he would have won.

A LISP programmer knows the value of everything, but the cost of nothing. -- Alan Perlis