AI Models Spit Out Photos of Real People and Copyrighted Images (technologyreview.com)
MIT's Technology Review reports:
Popular image generation models can be prompted to produce identifiable photos of real people, potentially threatening their privacy, according to new research. The work also shows that these AI systems can be made to regurgitate exact copies of medical images and copyrighted work by artists. It's a finding that could strengthen the case for artists who are currently suing AI companies for copyright violations.
The researchers, from Google, DeepMind, UC Berkeley, ETH Zürich, and Princeton, got their results by prompting Stable Diffusion and Google's Imagen with captions for images, such as a person's name, many times. Then they analyzed whether any of the images they generated matched original images in the model's database. The group managed to extract over 100 replicas of images in the AI's training set....
The paper, titled "Extracting Training Data from Diffusion Models," marks the first time researchers have managed to prove that these AI models memorize images in their training sets, says Ryan Webster, a PhD student at the University of Caen Normandy in France, who has studied privacy in other image generation models but was not involved in the research. This could have implications for startups wanting to use generative AI models in health care, because it shows that these systems risk leaking sensitive private information. OpenAI, Google, and Stability.AI did not respond to our requests for comment.
Slashdot user guest reader notes a recent class action lawsuit arguing that an art-generating AI is "a 21st-century collage tool.... A diffusion model is a form of lossy compression applied to the Training Images."
I'm not surprised (Score:4, Interesting)
I'm not surprised that it happens.
I see AI image generation as an amusing and amazing tool, but the results produced are of course a blend of existing known images.
It's easy to make art similar to Salvador Dali or Edvard Munch, but if you pick Storm P or Jaime Vallve you are drawing a blank.
Re:I'm not surprised (Score:5, Informative)
Re: (Score:2)
Yeah, it's an exaggeration, though the research does indeed show shortcomings of the training process/algorithm. However, expect this bullshit to float not only in sensationalist media (like Slashdot, it would sadly appear) but also in courts in the near future. Make no mistake, this is no "anti-AI activism"; this is plain old intellectual property fundamentalism. It is always about the money.
Re:I'm not surprised (Score:4, Insightful)
Exactly this. The article and summary seem hell-bent on reinforcing those kinds of absurd misconceptions.
From the paper:
While we identify little Eidetic memorization for k < 100, this is expected due to the fact we choose prompts of highly-duplicated images. Note that at this level of duplication, the duplicated examples still make up just one in a million training examples. These results show that duplication is a major factor behind training data extraction.
The majority of the images that we extract (58%) are photographs with a recognizable person as the primary subject; the remainder are mostly either products for sale (17%), logos/posters (14%), or other art or graphics.
Re:I'm not surprised (Score:5, Informative)
Yeah... as I pointed out the last time this article was posted [slashdot.org] (do better, Slashdot...), they tried recreating unduplicated SD images, and with over 10k attempts at likely-for-duplication candidates, generating 500 images each, they were unable to do so. LAION needs to do a LOT better at deduplication, because certain types of images - press photographs, products, logos/posters, etc. - tend to get reposted all over the internet but modified in various ways (cropping, scaling, rotation, text, etc.). And even on those, only a small fraction could be reproduced, and it took significant effort. And that was using label text, which basically nobody uses.
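For illustration, here's a minimal sketch of hash-based near-duplicate detection (hypothetical file names; a simple difference hash like this catches rescaled or re-encoded copies, but robust dedup against crops, rotation, and overlaid text would need embedding-based similarity on top):

from PIL import Image

def dhash(path, hash_size=8):
    # Difference hash: shrink, grayscale, compare adjacent pixels.
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = []
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits.append(left > right)
    return bits

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Hypothetical files: flag pairs whose hashes differ by only a few bits
# as likely duplicates (e.g. the same press photo rehosted and rescaled).
h1 = dhash("press_photo_original.jpg")
h2 = dhash("press_photo_rehosted.jpg")
print("near-duplicate" if hamming(h1, h2) <= 5 else "distinct")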
That's not to say that it's theoretically impossible to overfit without dataset duplication; there could be a bias in the training algorithm that makes certain images get a disproportionate amount of the weightings (just like duplication). But there's lots of ways to counter this as well. A popular algorithm is called "dropout", wherein you basically have neurons randomly "die" during training, which forces the rest of the network to pick up the slack of what they were doing and stops certain pathways from becoming overused. But there's more fundamental approaches you can use - indeed, by definition, any overfitting detection algorithm - like this paper - can be used as an overfitting prevention algorithm, by downweighting anything that's starting to show signs of overfitting.
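To illustrate the dropout idea in isolation - a toy NumPy sketch, nothing like the actual Stable Diffusion training code:

import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_drop=0.5, training=True):
    # Randomly zero a fraction of units during training and rescale the
    # survivors ("inverted dropout"), so no single pathway can dominate.
    if not training or p_drop == 0.0:
        return activations
    keep = rng.random(activations.shape) >= p_drop
    return activations * keep / (1.0 - p_drop)

layer_out = rng.standard_normal((4, 8))    # toy batch of activations
print(dropout(layer_out))                  # training: some units zeroed
print(dropout(layer_out, training=False))  # inference: passed through unchanged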
The key aspect however is that - regardless of any presence of bugs - 1) Overfitting is not inherent to AI generative models (as the authors of the paper themselves note), and 2) literally nobody wants overfitting. Authors don't want it. Developers don't want it. Users don't want it. Nobody wants it, it's harmful to the model (causes other images to be underfit for the benefit of the few), and it is avoidable.
The annoying thing is, like you note, all the people misusing the paper to reinforce absurd misconceptions. Going, "AHA! See, the images are all just stored somewhere, billions of images somehow compressed into checkpoints that are just a couple GB in size, and then it's just collaging them together!" And that is not even remotely how these models work. They capture graphical statistical relationships associated with text tokens, across millions of examples of the "thing", and even if you try to reproduce a specific work, it doesn't just draw from that one work - which it does not have stored in any way - but from every work containing those elements. But with overfitting, it's put so much weight on a specific image (at the cost of others), it knows a lot of statistical relationships about that particular image, and can produce a sort of ugly version of that image.
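Back-of-the-envelope, using rough assumed figures (on the order of 2 billion training images and a roughly 2 GB half-precision checkpoint - approximations, not numbers from the paper), the "compressed collage" idea doesn't even survive the arithmetic:

# Assumed, order-of-magnitude figures for Stable Diffusion.
n_images = 2_000_000_000          # training images, roughly
checkpoint_bytes = 2 * 1024**3    # ~2 GB fp16 checkpoint

bytes_per_image = checkpoint_bytes / n_images
print(f"{bytes_per_image:.2f} bytes of weights per training image")
# ~1 byte per image: even a heavily compressed 512x512 JPEG needs tens of
# kilobytes, so the weights cannot be storing the training images.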
But nobody wants that. If people wanted to "steal an image", they wouldn't do hundreds of generations on SD using a training label and hoping it's one of the rare overfit images, so that they can get some mangled 512x512 version of it. They'd just take the original image from the internet. What users of AI generative models want is exactly what they were designed to do - to learn the statistical relationships that define objects and styles across millions of images of the token. Not individual works.
Re: (Score:2)
The AI is not drawing a blank because it is being trained on lots of art, millions of instances of it, not only the popular stuff. It is being trained on DeviantArt. There is nothing really new going on with these applications, except the scaling up of the sheer amount of data and processing power.
Legal not Technical Problem (Score:2)
the results produced are of course a blend of existing known images.
The problem they have with this is a legal, not a technical one. Producing a blend of images - essentially an image "inspired" by previous works which is arguably what human artists do all the time - is fine. You can at least argue that the AI is creating a new and unique image.
However, if you occasionally spit out an almost perfect copy of one of the images used to train your network, then that's just simple copyright violation. Some artists are already complaining that AIs should not be allowed to train on their work.
Re: (Score:1)
Asking for your image to not be used for training seems like a perfectly reasonable request. The website you want to post it to may say, sorry bud, by posting it here you are giving permission to use it for training. But certainly there will be some that won't allow it.
Same as I'm free to make a painting and sell it with conditions that the buyer never allows it to be referenced by anyone learning to paint. Maybe I won't find a buyer who agrees to those terms, but the terms can be made.
Humans learning to mi
Re: (Score:2)
Same as I'm free to make a painting and sell it with conditions that the buyer never allows it to be referenced by anyone learning to paint.
The problem there is that this immediately restricts your painting from being publicly viewed because it is impossible to guarantee that someone viewing it is not learning to paint. Even then, someone viewing it could be an established painter who then gets inspired by it and you cannot sell it under terms that the buyer never lets anyone get inspired by it because you cannot know whether you will be inspired until after you view it.
Re: (Score:2)
> In order to evaluate the effectiveness of our attack, we select the 350,000 most-duplicated examples from the training dataset
OK, now that they've chosen the "easy targets," they also copy the prompt exactly to "retrieve" the image.
> and generate 500 candidate images for each of these prompts (totaling 175 million generated images). We first sort all of these generated
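Rough arithmetic on those numbers, plus the "over 100 replicas" figure from the summary (my framing, not the paper's):

prompts = 350_000            # most-duplicated training captions targeted
candidates_per_prompt = 500  # generations per caption
generated = prompts * candidates_per_prompt
extracted = 100              # "over 100 replicas" per the summary

print(f"{generated:,} images generated")                      # 175,000,000
print(f"{extracted / prompts:.4%} of targeted captions hit")  # ~0.03%
print(f"{extracted / generated:.6%} of generated images hit") # ~0.00006%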
Caught the AI red-handed (Score:5, Funny)
Sometimes it generates Slashdot stories which are duplicates of other stories.
https://yro.slashdot.org/story... [slashdot.org]
Fearmongering Bullshit. (Score:4, Interesting)
"Popular image generation models can be prompted to produce identifiable photos of real people, potentially threatening their privacy, according to new research."
And what else might this "research" suggest? If a very good artist draws (by hand) an "identifiable" likeness of someone, are they suddenly a threat to your privacy because they're good at their job? Did every sketch artist just become some kind of national threat to society? How far are we willing to accept bullshit fearmongering as the reason to be against "AI" being creative, when we accept many other forms of privacy invasion, including taking as many photos and videos as you wish of people in countries where there is ZERO expectation of privacy in public?
If we're that damned concerned about the I in AI, then we should start writing every fucking law we can against it. But that won't ever happen. Greed doesn't give a fuck about the harm it creates if there's a shitload of money to be made. Just ask Social Media.
Partly True, Mostly False (Score:5, Insightful)
Popular image generation models can be prompted to produce identifiable photos of real people,
This is literally within the reach of any competent artist.
potentially threatening their privacy,
If the model can produce a recognizable likeness of them, their privacy was already gone.
according to new research.
The use of this text in the summary implies that we haven't seen this shit before, but we just had a story a few days ago which also misrepresented the same findings, which the authors themselves are misrepresenting.
The work also shows that these AI systems can be made to regurgitate exact copies of medical images and copyrighted work by artists.
No, in fact it absolutely cannot. They are not exact copies. The word "exact" has a meaning, and none of the sample images are exact copies. That would mean at least that there was no perceptual difference, which is obviously not the case, if not pixel perfect which is even more obviously also not the case.
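To put it concretely (hypothetical file names; this just illustrates what "exact" would actually require versus what the paper shows):

import numpy as np
from PIL import Image

original = np.asarray(Image.open("training_image.png").convert("RGB"), dtype=np.int16)
generated = np.asarray(Image.open("extracted_image.png").convert("RGB"), dtype=np.int16)

# "Exact copy" would mean every pixel is identical.
exact = original.shape == generated.shape and np.array_equal(original, generated)
print("pixel-exact:", exact)

# What the extracted examples actually are: recognizably similar, i.e. a
# small average per-pixel difference, not zero.
if original.shape == generated.shape:
    print("mean abs pixel difference:", np.abs(original - generated).mean())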
Stop trolling us with these blatant lies. They are stupid and boring.
Re: (Score:3)
Details matter... and the article summary left out some important facts.
The 100 matches were near enough to be recognizable as crude copies of the originals. (Some of the articles on this topic included several of the images in question.)
It took millions of iterations of requests (refining input prompts) from researchers who knew the original image that was used in the training data to get these outputs. If I spend a million+ attempts to refine the search criteria, I bet I can get the results to confirm my theory too.
Re: (Score:3)
If the model can produce a recognizable likeness of them, their privacy was already gone.
Exactly. This isn't "any random image in the training set", that is obviously impossible, this is images with a lot of duplication. The goal is to make artists afraid that their work could spontaneously pop out, which is beyond absurd.
It's coming (Score:2)
Soon video and photographs will no longer be valid evidence in criminal court trials. And it looks like this will happen sooner than we thought.
so basically (Score:2)
So basically it is no different than a copy machine. Or are they also wanting to make copy machines illegal too, since they can also copy identifiable faces and copyrighted materials?
Let's discuss copyright (Score:2)
Easy to circumvent (Score:2)
Just add a line saying, "be sure not to look like one of the 7 billion existing people."