AI Language Models Can Exceed PNG and FLAC in Lossless Compression, Says Study (arstechnica.com) 57

In an arXiv research paper titled "Language Modeling Is Compression," researchers detail their discovery that the DeepMind large language model (LLM) called Chinchilla 70B can perform lossless compression on image patches from the ImageNet image database to 43.4 percent of their original size, beating the PNG algorithm, which compressed the same data to 58.5 percent. For audio, Chinchilla compressed samples from the LibriSpeech audio data set to just 16.4 percent of their raw size, outdoing FLAC compression at 30.3 percent. From a report: In this case, lower numbers in the results mean more compression is taking place. And lossless compression means that no data is lost during the compression process. It stands in contrast to a lossy compression technique like JPEG, which discards some data during encoding and reconstructs it with approximations during decoding to significantly reduce file sizes. The study's results suggest that even though Chinchilla 70B was mainly trained to deal with text, it's surprisingly effective at compressing other types of data as well, often better than algorithms specifically designed for those tasks. This opens the door for thinking about machine learning models not just as tools for text prediction and writing but also as effective ways to shrink the size of various types of data.
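For a concrete sense of what those percentages mean, here is a quick back-of-the-envelope sketch using only the figures quoted above (the 1 MB input size is a made-up round number, not the paper's actual chunk size):

```python
# Compression rate = compressed size / original size, so lower is better.
ORIGINAL = 1_000_000  # hypothetical 1 MB of raw data

rates = {
    "Chinchilla 70B on ImageNet patches": 0.434,
    "PNG on ImageNet patches": 0.585,
    "Chinchilla 70B on LibriSpeech audio": 0.164,
    "FLAC on LibriSpeech audio": 0.303,
}

for name, rate in rates.items():
    print(f"{name:38s} {ORIGINAL:,} B -> {int(ORIGINAL * rate):,} B")
```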

  • Subject (Score:5, Interesting)

    by Artem S. Tashkinov ( 764309 ) on Thursday September 28, 2023 @02:09PM (#63883963) Homepage

    This is an interesting concept albeit almost completely useless and quite energy/resources wasteful at that.

    Outside PNG we have WebP and JPEG XL, both supporting lossless compression and sometimes doing so several times better than PNG. AV1 also supports lossless compression in the form of AVIF, but it's not well optimized yet and loses to WebP. VVC is supposed to support lossless image compression as well, but AFAIK it's not yet implemented by any available encoder.

    And since they are classic compression algorithms, they don't need a GPU to compress/decompress images, nor several gigabytes (terabytes? petabytes? not sure what LLMs operate with) of dictionaries to boot.

    And WebP/JPEG XL are not even the best in this regard, but they are quite efficient. There are experimental compression algorithms such as paq8px which take ages to compress/decompress data but are pretty much unbeatable.

    As for FLAC, it is not the most efficient audio compression algorithm either (e.g. OptimFrog compresses a whole lot better but it's very CPU intensive for both compression and decompression) but it has a very good tradeoff between speed and efficiency. I'm afraid this LLM when applied to audio/image compression will be as slow as molasses.

    • Re:Subject (Score:4, Insightful)

      by algaeman ( 600564 ) on Thursday September 28, 2023 @02:24PM (#63884013)
      AFAIK, PNG uses DEFLATE (the same algorithm behind gzip and zlib) internally. This is kinda important, since the device decompressing the data may only have 8 KB of memory, and yet needs to be able to extract that data without having to pick from 70 different compression mechanisms that this LLM may be using.
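To make the memory point concrete: DEFLATE lets the encoder choose the history-window size, which directly bounds how much state the decoder needs. A minimal sketch with Python's zlib (the window size and input here are my own illustration, not anything PNG mandates):

```python
import zlib

data = b"some repetitive payload " * 200  # stand-in for image data

# wbits=9 restricts the match window to 2**9 = 512 bytes, so a decoder
# only needs a few hundred bytes of history instead of the 32 KB default.
enc = zlib.compressobj(level=9, method=zlib.DEFLATED, wbits=9)
compressed = enc.compress(data) + enc.flush()

# The decompressor must allow at least the same window size.
dec = zlib.decompressobj(wbits=9)
assert dec.decompress(compressed) == data
print(len(data), "->", len(compressed), "bytes")
```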
    • Rather than keep adding file formats (which gums up conversions), maybe have a format that allows a wide enough variety of rendering methods and put the onus on the compression engines to squeeze out better compression scores/results.

      For example, let's say there are 7 different ways to encode pixels/vectors for a given image type (or even image section, as one type may not fit all parts of an image well). While there may be more than 7 known, they are close enough to the 7 to not bother adding them, in the name of simplicity.

    • by gweihir ( 88907 )

      This is an interesting concept albeit almost completely useless and quite energy/resources wasteful at that.

      Indeed. Probably just some people trying to keep the AI hype going so the dollars of the clueless keep rolling in.

      You are also completely correct that efficiency and speed matter very much.

      • Re:Subject (Score:5, Insightful)

        by Rei ( 128717 ) on Thursday September 28, 2023 @03:09PM (#63884131) Homepage

        I disagree - there very much are applications for extreme compression even at the cost of high computational loads. Transmission from spacecraft, for example. Very-long-wave transmission through water. Down here on Earth, GSM-by-satellite is in theory coming over the next few years, and the data rate on that is expected to be *awful* - enough for text, but not pictures or video (except at extremely low quality). Being able to squeeze down media by throwing compute at it is very much a useful task.

        And honestly, I'm not sure why people assume "neural network = incredibly wasteful". This particular one may be - they're basing it on a 70B parameter model designed for text, after all - but at the most basic level, there's nothing really inefficient about the logic processes used by neural networks, and they parallelize really well. I imagine you could still get great performance with a vastly smaller network (maybe quantized to 3 bits or so) optimized to the task.

        • by gweihir ( 88907 )

          Not really. These are not orders of magnitude better. They are just "somewhat" better.

          • by Rei ( 128717 )

            It's also a text-based LLM being used for something it's not remotely designed or trained for, and doing lossless compression. They're just showing it byte sequences and having it - *based on its training on text, not images* - guess what the next byte will be, and then turning those predictions into bits with arithmetic coding.
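The mechanics are simple enough to sketch. Below, a toy order-1 byte predictor stands in for Chinchilla (an assumption purely for illustration), and instead of a real arithmetic coder the sketch just totals the ideal code length, -log2 of the probability the model assigned to each actual next byte, which is the size an arithmetic coder would approach to within a couple of bits:

```python
import math
from collections import defaultdict

def prob(counts, prev, b):
    """P(next byte = b | previous byte = prev), with add-one smoothing."""
    total = sum(counts[prev].values()) + 256
    return (counts[prev][b] + 1) / total

def ideal_compressed_bits(data: bytes) -> float:
    counts = defaultdict(lambda: defaultdict(int))  # counts[prev][next]
    bits, prev = 0.0, 0
    for b in data:
        p = prob(counts, prev, b)  # the "model's" probability of the true next byte
        bits += -math.log2(p)      # an arithmetic coder spends roughly this many bits
        counts[prev][b] += 1       # adaptive update; a frozen LLM would skip this step
        prev = b
    return bits

sample = b"the quick brown fox jumps over the lazy dog " * 200
print(f"{ideal_compressed_bits(sample) / 8:,.0f} bytes ideal vs {len(sample):,} raw")
```

The better the predictor, the fewer bits each byte costs, which is the "compression is prediction" point made elsewhere in this thread.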

      • This idea is pretty much already the very definition of what an autoencoder is supposed to do.
    • by znrt ( 2424692 )

      my thoughts as well. albeit a remarkable finding, the practical application seems unclear to me. first, for cost reasons: such an engine is very costly to create and maintain just to be used as a utility or general-purpose compressor, for which we already have more than enough alternatives. maybe for very specific applications where the amount of data is really huge it could make sense.

      second, besides the catchy tagline (which actually echoes a deep conceptual realization about language models, but in a different sense) ...

  • 1. Is the compression genuinely lossless or merely perceptually identical?

    2. Will it work on any data, or merely data it has been trained on?

    3. Do you need to decompress to be able to see the results, or is the new data directly displayable/playable?

    • by gweihir ( 88907 )

      I strongly expect it will be far worse than PNG and FLAC on data it has not been trained on (so basically "almost all"). Overfitting is a bitch. It will also probably have massive, massive overhead in compression and maybe decompression.

    • If it's perceptually identical, who cares if it is genuinely lossless? Isn't any image made up of pixels inherently lossy?
      • "Image" is not always the same as "picture I took with a camera".

        "Image" could be human drawn art, camera picture, randomly generated, or not even in the human visible spectrum. Maybe it's IR security footage. Maybe it's graphs and charts from excel, or pdfs.

      • If you don't care about lossiness, then you can use JPEG or MP3 (or whatever your lossy compressor of choice is; there's lots of things better than those, but those are the ones people recognize immediately) and get something "good enough", lossy, and much better than the lossless equivalents for anything complex (PNG might win for a screenshot of squares, but an image of nature will be quite a bit smaller in default JPEG compression than in lossless PNG). If you're just saying it's a new lossy compressor, then it should be compared against JPEG and MP3, not against lossless formats.
    • (1) Unless they're lying, lossless means lossless.

      (2) Unless the summary is deceptive, it works on any A/V data. How well is another question.

      (3) If by "directly displayable" you mean "will serve as input for a DAC driving a physical monitor or speaker", the answer is no, which is the answer for every A/V compression scheme.

  • When comparing to FLAC, you need to measure the tradeoff between compression ratio and CPU time to decode. I'd also like to see a few other (more problematic) datasets used as source audio to see how consistent the compression ends up being. A dataset named LibriSpeech leaves me wondering how things will work on something nastier, like dubstep/harpsichord music.
  • Snap! / "The Power": "Agathe Bauer"

    Michael Jackson / "Dirty Diana": "Da geht der Gärtner"
    Queen / "Flash Gordon", narrator: "Gordon's alive!" -- people heard "Gurkensalat"
    Hot Chocolate / "Alle Lieben Mirko"

    Using AI, what can go wrong .. when even humans sometimes fail.

    And I just don't want to think about what happens when these codecs record speeches and a Mr. Peter File is called. ("The IT Crowd", Season 2, Episode 4)

  • by PhrostyMcByte ( 589271 ) <phrosty@gmail.com> on Thursday September 28, 2023 @02:58PM (#63884103) Homepage

    LZ and other compression algorithms work by maintaining a dictionary of patterns to reference. Usually this is megabytes in size; some have a fixed dictionary, others have one that evolves over the course of the data.

    It seems like the AI model in this case is being used as a gigabytes-sized fixed dictionary. The trick is you need to download the AI model too in order to decompress your files.
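For a small-scale analogue of "both sides share prior knowledge", zlib already supports preset dictionaries; the dictionary never travels with the data, exactly like the model weights wouldn't here (the sample strings below are made up):

```python
import zlib

# Shared "prior knowledge" held by both the compressor and the decompressor.
shared_dict = b'{"status": "ok", "error": null, "items": ['

message = b'{"status": "ok", "error": null, "items": [4, 8, 15, 16, 23, 42]}'

plain = zlib.compress(message, 9)

enc = zlib.compressobj(level=9, zdict=shared_dict)
with_dict = enc.compress(message) + enc.flush()

dec = zlib.decompressobj(zdict=shared_dict)
assert dec.decompress(with_dict) == message

print(len(message), len(plain), len(with_dict))  # the zdict version should be smallest
```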

    • by evanh ( 627108 )

      This! It's a perfect demonstration of just how gimmicky LLMs really are. Total waste of resources.

  • by Chelloveck ( 14643 ) on Thursday September 28, 2023 @03:17PM (#63884157)

    I think most of the commenters (so far) are missing the point.

    I don't think the point here has anything to do with finding a better compression algorithm. The article claims that compression and prediction are functionally equivalent. Given that fact, we can better understand how LLMs work by examining them as if they were compression engines, which we already understand pretty well.

    So forget all the posts about "this other algorithm compresses better" or "this is too expensive to be practical". It's not about practicality as a compression engine. It's about a practical technique for understanding something that's pretty abstract. It's science, not engineering.

    • by grmoc ( 57943 )

      Compression is prediction. This is true.

      The more predictable something is, the more easily it compresses.
      A.k.a. the lower the "entropy," the better the prediction.

      Knowing what patterns are most likely by being pre-fed things is a fabulous way to lower entropy.

      It is just surprising that this is an insight. It follows naturally from the definition.
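As a tiny numerical illustration of that last point (the two byte streams are made up): zeroth-order Shannon entropy is the floor on average bits per byte for any lossless coder, and it falls as the data gets more predictable.

```python
import math
import os
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy: -sum(p * log2(p)) over byte frequencies."""
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())

predictable = b"abab" * 1000       # only two symbols, evenly split
unpredictable = os.urandom(4096)   # incompressible noise

print(entropy_bits_per_byte(predictable))    # 1.0 bit per byte
print(entropy_bits_per_byte(unpredictable))  # close to 8.0 bits per byte
```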

    • by ceoyoyo ( 59147 )

      Equivalent might be a little too strong. The goal of a compression algorithm is to find a representation that is sufficient to reconstruct the input, but smaller than that input. The goal of a generative model is to find a representation that is sufficient to reconstruct samples from the input distribution but is smaller than an actual reasonable set of samples. Both need to identify and exploit structure in the input.

    • Not only that. Thanks to this study we now have a bound on how much generic images and sound can be compressed without loss in practice (in the real world, not in theory). This can serve, for example, as a benchmark for future improvements. Say you use a generic compression algorithm to pre-compute a gigabyte-sized dictionary from a set of images, taking as long as it takes to train an LM. What final compression ratios would we achieve? Would they be close to using a neural network, or would the network still be well ahead?
  • Compression effectiveness depends on the amount of state fed into the compressor.
    The more state fed in, the more compression is possible.

    One of the ways to "cheat" here is to already have a library/dictionary of things that you've "fed" the compressor and decompressor.
    If you have an image that was similar to one of the dictionary images, then you'll have lots of similarity, and a better compression ratio.

      These 'dictionaries' can take many forms: algorithms, prior data, or pre-processed data such as model weights.

  • You mean having grad students spend hundreds of hours hand-tuning fractal compression is no longer the most effective compression technique? Now what will the grad students do to fill their time?

    Should we be teaching Generative AI how to do fractal compression? It seems to have a lot of spare time... unlike me.

  • As long as the authors are training and testing on stock images in various databases, a sufficiently large AI model should be able to losslessly compress any image down to a handful of bytes (less than 100). The neural networks are fully capable of storing and reproducing known images.

    At a certain point, any AI compression algorithm boils down to an image recognition / database lookup algorithm, with the "details" hidden in the neural network model.

    It's really hard to know precisely what these researchers are actually measuring here.

  • by Great_Geek ( 237841 ) on Thursday September 28, 2023 @03:55PM (#63884273)
    Table 1 on compression rates shows TWO different rates. Everyone is getting excited over the "raw" compression rate, which is pretty good, while ignoring the "adjusted" compression rate, which is pretty bad.

    The adjusted rate takes into account the size of the model, and Chinchilla 70B goes from an 8.3% "raw compression rate" to a 14,008.3% "adjusted compression rate".

    It's well known that "classical" compressors can be improved (by a lot) just by adding a pre-defined dictionary/model, but that would hugely bloat the programs, so people don't do it. These LLM models do exactly that, so of course they score better. Nothing new.
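For anyone wondering how a figure like 14,008.3% can arise, the back-of-the-envelope below assumes the adjusted rate simply charges the size of the model (roughly 140 GB for 70B parameters at 2 bytes each) against the ~1 GB benchmark chunk, on top of the compressed output; the exact accounting belongs to the paper, this is just the arithmetic that reproduces the quoted number:

```python
raw_size   = 1e9               # ~1 GB benchmark chunk
model_size = 70e9 * 2          # 70B parameters in fp16 = roughly 140 GB of weights
compressed = 0.083 * raw_size  # the 8.3% "raw" compression rate

adjusted = (compressed + model_size) / raw_size
print(f"{adjusted:.1%}")       # 14008.3%
```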
  • This isn't a fair comparison. The amazing LLM "compression" is simply the LLM reproducing an image that it has been trained on and is recalling by reference. That's not compression the way most people think about it.

    There are already compression algorithms that require a large shared data library on the decompression side that is used to improve efficiency. That can drastically improve on zip, arj, rar, gz, etc., but it's sort of cheating. Compression algorithms that don't use a library must embed the equivalent information in the compressed data itself.

  • ...On many different images, as well.
  • I can beat the pants off it with an md5 hash of all the images, but my compressor would be kind of a hefty download. ie, to what extent is the compressor full of "quotations", and thus a rather large download, which would make it impractical?

    • by isomer1 ( 749303 )
      Yes! Thank you! How come nobody talks about this? Don't they have to download/install the entire LLM to recover the compressed information? Or transfer back and forth to a cloud service to do the same?
  • ...can it beat off middle-out compression?
  • requiring a model instead of a key
  • One of the interesting features of LLMs is that they normally incorporate some random seed value so a defined input does not always return the same result. This means I can ping the ChatGPT-3.5-turbo API 50 times with the same prompt and get back 50 different results. My experience is that the results will be similar, but it's rare to get a word-for-word duplicate. I wasn't aware that was something you could turn off, and now I have to figure out how they did it. :)
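If it helps: the variation comes from sampling during decoding, not from the model's probability outputs, which are deterministic for a given input. Most APIs let you switch sampling off; a minimal sketch against the OpenAI Python client (the model name and prompt are placeholders, and hosted models can still show occasional nondeterminism from their serving stack):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    messages=[{"role": "user", "content": "Describe FLAC in one sentence."}],
    temperature=0,          # no sampling noise: effectively the most likely token each step
)
print(resp.choices[0].message.content)
```

Compression schemes like the one in the paper don't sample at all; they feed the per-token probabilities straight into an arithmetic coder.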

  • Given that ImageNet is already JPEG-compressed, is it possible their model just learned to discard the same information as JPEG?

  • These things aren't language models, they're information models trained on language. Someday we'll distill them down to their innermost core and suddenly be able to build information-integration/comprehension/analysis systems that behave like magic.

  • why do you have to keep reminding me you exist
