
Stephen King, Zadie Smith and Rachel Cusk's Pirated Works Used To Train AI (theguardian.com) 129
Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante are among thousands of authors whose pirated works have been used to train artificial intelligence tools, a story in The Atlantic has revealed. The Guardian: More than 170,000 titles were fed into models run by companies including Meta and Bloomberg, according to an analysis of "Books3" -- the dataset harnessed by the firms to build their AI tools. Books3 was used to train Meta's LLaMA, one of a number of large language models -- the best-known of which is OpenAI's ChatGPT -- that can generate content based on patterns identified in sample texts. The dataset was also used to train Bloomberg's BloombergGPT, EleutherAI's GPT-J and it is "likely" it has been used in other AI models.
The titles contained in Books3 are roughly one-third fiction and two-thirds nonfiction, and the majority were published within the last two decades. Along with Smith, King, Cusk and Ferrante's writing, copyrighted works in the dataset include 33 books by Margaret Atwood, at least nine by Haruki Murakami, nine by bell hooks, seven by Jonathan Franzen, five by Jennifer Egan and five by David Grann. Books by George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit and Jon Krakauer also feature, as well as 102 pulp novels by Scientology founder L Ron Hubbard and 90 books by pastor John MacArthur. The titles span large and small publishers including more than 30,000 published by Penguin Random House, 14,000 by HarperCollins, 7,000 by Macmillan, 1,800 by Oxford University Press and 600 by Verso.
Pirated? (Score:4, Insightful)
They just READ the books from the library.
Re: Pirated? (Score:3)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3)
That's not required for paper books (at least in the US). First sale doctrine allows both loaning out physical media and resale. The only time libraries pay higher than retail is for things like kids' books where there are reinforced bindings available for libraries.
Re: (Score:2)
No, the books were not in a "paid-for" library. They were in a giant tar.gz of text files scraped from cracked epubs, mobis, and PDFs. The collection was almost a terabyte, and if you think about how well text compression works it adds up to a huge number of books.
This doesn't feel like an ethical gray area to me; they straight-up pirated these and used them to build commercial products. That's not ok.
Re: (Score:2)
They just READ the books from the library.
Hopefully they did, but it seems that, more likely, they entered a pirated books site and ingested the materials, because that was easier than striking a deal with the Library of Congress, the California public library system or any such formal entity...
So, they get a lawsuit for not following the proper channels.
It's the same as what would happen to you if you check out the DVDs in your local library, or go to the movies, or stream it, vs. torrenting a movie. The end result is the same (some version of the movie is
Re: (Score:2)
Hopefully they did, but it seems that, more likely, they entered a pirated books site and ingested the materials
It actually looks like they're using a dataset assembled by an AI developer. The distinction is just one of where the infringement occurred.
Re: (Score:2)
And like most intellectual property lawsuits, they don't go after the original perpetrator - they go after the biggest pockets.
Re: (Score:2)
When person A makes an illegal copy for person B, and person B makes another illegal copy, both of them are liable for copyright infringement.
Re: (Score:2)
Re: (Score:2)
The site had a very large collection of pirated books, among other data that included a lot of other infringing copies of copyrighted works. It is stupidly pedantic to say "well, that's better because it was not primarily established to pirate books!"
Re: (Score:2)
Re: (Score:2)
The error I originally pointed out was where you wrote "The distinction is just one of where the infringement occurred." Both the LLM creator and their source for the material infringed the copyrights in these books.
If you meant that the distinction is over the nature of the site where the original infringement occurred, between book-piracy sites and "AI developer" sites that engage in large-scale book piracy, you could have been a lot clearer.
Re: (Score:2)
Infringement? If I read a chemistry textbook, and use that knowledge once I no longer own that textbook, am I infringing? I "copied" that information into my brain, I've used that information for potentially commercial work, but am I violating the copyright of the text book?
So the developer, who may or may not be profiting, is probably guilty of copyright violation-- but is the AI being trained "guilty"?
I don't know the answer-- but I do know that copyright, and fair use, and the interpretation of both, i
Re: (Score:2)
What you're describing is going to be one of the more interesting legal arguments of this century, I suspect.
The questions I see coming are:
If an AI is trained on a specific text, does that make the AI a derivative work of that text?
If an AI is trained on a specific text, does it make the AI's work product a derivative work of that text?
Can the work product of an AI be copyrighted?
The definition of a derivative work is a copyrightable work built from a prior copyrighted work. If an AI's work can't be copyr
Re: (Score:2)
So, they get a lawsuit for not following the proper channels.
Maybe they get a lawsuit, but if that's the case, then the plaintiff has a high mountain to climb to prove anything actionable.
if you check out the DVDs in your local library, or go to the movies, or stream it, vs. torrenting a movie. The end result is the same (some version of the movie is now in your brain as "knowledge"), but in one case you will get penalized for not following the proper channels.
no... they can end up on the wrong side of a L
Re: (Score:2)
The reason you don't see them going after people for downloading, or photocopying a library book is largely down to evidence.
How do you prove it?
Re: (Score:2)
or photocopying a library book is largely down to evidence. How do you prove it?
Actually.. another problem for them. People often admit to having done such things, and kids even often admit in public that they downloaded X or Y or Z -- the blatant admissions would be enough to start the process and Subpoena the physical media. And you don't really see cases being started or even threatened against those people either.
Even if you have the evidence -- that doesn't automatically give rise to something like
Re: (Score:2)
"using the wrong channel"
Actually, that's not the case. If you make something with stolen materials, the folks you stole from are entitled to take away what you made. Not just be reimbursed with replacement materials. In this case, that's $750-$30,000 per unlawfully copied book plus an injunction against any further use or sale of the AI trained from it.
On the other hand, if they -owned- one copy of each book and they trained the AI from that copy (not from another source of the same book) then they didn't unlawfully copy the books
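For a sense of scale, the per-book range quoted above can be multiplied out against the story's "more than 170,000 titles". This is only a back-of-the-envelope sketch: it takes the commenter's $750-$30,000 figure at face value and assumes, purely for illustration, that every title would count as a separate award, which a real case might not bear out.

```python
# Rough arithmetic only. The figures come from the story summary and the
# comment above; whether each title would actually qualify is an assumption.
TITLES = 170_000        # "more than 170,000 titles" per the summary
MIN_PER_WORK = 750      # low end of the range quoted in the comment
MAX_PER_WORK = 30_000   # high end of the range quoted in the comment


def damages_range(titles: int, low: int, high: int) -> tuple[int, int]:
    """Return the (minimum, maximum) total if every title drew an award."""
    return titles * low, titles * high


low_total, high_total = damages_range(TITLES, MIN_PER_WORK, MAX_PER_WORK)
print(f"${low_total:,} to ${high_total:,}")
```

Under those assumptions the spread runs from roughly $127.5 million to $5.1 billion, which is why the per-work multiplier matters more than the exact dollar figure.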
Re:Pirated? (Score:5, Insightful)
Computers don't read.
Computers copy.
In this case, they copy and convert to another format, but the post-copy conversion is not going to be helpful in their defense. They still copied without permission and retained that data in some format for their own commercial benefit.
Re: (Score:2)
So do school students or anyone that reads a book, it is called learning. I guess AI isn't learning anything after all, it is just a data hoarder.
Re:Pirated? (Score:4, Informative)
Students are human beings. They have not copied the book. And in most cases the book was purchased. No, the computer hasn't learned anything. It is a box of wires. Incapable of learning or thought. It retains a copy. And it is that copy that is the core problem.
I'll go a little deeper....
A student reads a book and learns the concepts the book teaches. Now ask that student to create a copy of the book, word for word. Good luck. Pretty much no one has a perfect memory like that. But the opposite is true for the computer. It has learned nothing. It has no understanding at all of the concepts fed to it. But it can easily spit out a perfect copy.
With me now?
Re: (Score:3)
Yes, that is correct. So-called AI isn't learning anything. It is copying it and doing lots of probability calculations on the text.
Re: (Score:2)
The Ultimate A.I. (Score:2)
Didn't anyone learn from the Star Trek TOS episode "The Ultimate Computer"? How the M5 computer was imprinted with the resentments of Daystrom, its creator?
Do we really want an A.I. patterned after the writings of Stephen King?
Re: (Score:2)
Re: (Score:2)
This is a dumb take. Computers do what computers are programmed to do.
In this case the data is encapsulated in a latent space, which is not a copy, any more than - after reading a book - the information you retain is a copy of the book.
Re: (Score:2)
Re: (Score:3)
No. A computer creates a copy. A human absorbs the concepts but has not created a copy. A computer can spit out a perfect copy of what has been fed in. Almost no humans can do that. And if we're talking 170k books then absolutely 100% no human can do that. But the computer can still spit out 170k perfect copies.
The storage mechanism the computer uses is irrelevant to the issue of copyright law. Good luck telling a judge "we didn't make a copy of those 170k books we encapsulated them in latent space!"
Re: Pirated? (Score:2)
Re: (Score:2)
Re: Pirated? (Score:2, Informative)
After training is done they do not retain the work in any format, and if you were qualified to comment on this story you would know that. They only store some statistics about the work.
Re: (Score:3)
How do you know they don't store enough data to recreate it? Is that true for all LLMs, every version? You're making a broad and dangerous assumption for a courtroom.
If I was on the other side, I would insist you back that up by explaining to a non technical jury and judge how your computer magically "reads" and "understands" but somehow doesn't "copy" the 170k books. Good luck with that.
The smart thing here is for these companies to just write a check and say, "oops, sorry!" They can't risk losing this
Re: Pirated? (Score:2)
Re: (Score:2)
Re: (Score:2)
No it isn't. It is up to the injured party to present damage.
That's easy, Microsoft paid a ton of money to OpenAI which is based partly on those works.
If you can't show me where and how my work is supposedly a derived work,
That's easy, "ChatGPT, what is the opening line of this book?"
then I am under no obligation to even claim fair use.
You can try that legal strategy if you want. GLWT.
Re: (Score:2)
Then step 2: once it has been found that the defendant has copied a copyrighted work (which is true in this case), the defendant has the possibility to assert an affirmative defense. That is where fair use comes in. The defense can try to make the argument that their copying was fair use. If they make that argument,
Re: (Score:2)
Again, it is not up to me to make an Affirmative Defense.
Seriously, learn to use a search engine.
Re: (Score:2)
It doesn't matter if it copies. All that matters is, is there any distribution of a derived work happening beyond what is allowed with Fair Use.
It actually does matter if it copies, because that's one of the criteria which determines whether it's copyright infringement, or even a derived work.
Re: (Score:2)
ChatGPT answers: "The opening sentence of "The Dark Tower," the first book in the series of the same name by Stephen King, is:"The man in black fled across the desert, and the gunslinger followed.""
It's keeping the verbatim text in there somewhere.
Re: (Score:2)
When I try it, ChatGPT says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts. "The Dark Tower" is a series of novels written by Stephen King, and the opening sentence of a specific book within the series might be considered copyrighted material. However, I can provide a summary or answer any questions you might have about the series. Let me know if there's anything else I can assist you with!"
Why would you lie about such an easily checked thing?
Re: (Score:2)
User: Quote the opening sentence of The Dark Tower.
ChatGPT: Certainly! The opening sentence of "The Dark Tower" series by Stephen King is: "The man in black fled across the desert, and the gunslinger followed." It's a memorable line and serves as the beginning to King's expansive series. If you have any more questions or need further information, feel free to ask!
So are you lying or just ignorant about how ChatGPT works and the consistency of its answers?
Re: (Score:2)
Now get your hands on the model and search it for that string. Guess what, it isn't there, no matter how you unpack it.
Re: (Score:2)
ChatGPT can be induced to reproduce texts of works it was trained on verbatim. People even discovered how to make it reveal the basic written orders it was given by the developers, i.e. repeating their IP verbatim too.
That's rather missing the point though. The issue here is more like derivative works, which have some protection in copyright.
Re: (Score:2)
That's rather missing the point though. The issue here is more like derivative works, which have some protection in copyright.
The standard for a derivative work in the USA is recognizable elements copied from the original work, so it may well be that a court will decide that AI generated works are derivative works. But it may also well be that it will decide otherwise, because copying is not actually occurring. Or it may be that they will decide based on the similarity to a copy. In that case it would rationally follow that text based works would be subject to copyright (as derivative works) but images wouldn't — in a side b
Re: Pirated? (Score:2)
Re: (Score:2)
You really don't understand how it works, do you?
Re: (Score:3)
You can say yes, those books are available, if a human can borrow it from the library and read it for free, a machine can as well. But
Re: (Score:2)
Re: (Score:2)
They just READ the books from the library.
No they didn't. There are several things wrong with that statement. But the crux of it is that they copied them into a computerized storage/retrieval system for commercial use. It looks like an open and shut case of copyright infringement.
You should link to Stallman and Right to Read (Score:2)
Naw ... not gonna link to it ... search......
Re: (Score:2)
No they used the works, to make a product they are now selling ...
It has been shown in a court of law that AI cannot hold copyright, so they are not transformative works
The AI companies have used someone else's copyrighted work without their permission, and are making money from it, without paying the author, not even a difficult case ...
Re: (Score:2)
Nope. They fed the books into a process that transforms and uses it. That is not covered by fair use.
Machines cannot "read texts". They can only apply algorithms to data and that is fundamentally different.
Re: (Score:2)
They fed the books into a process that transforms and uses it. That is not covered by fair use.
My web browser does that every time I load a page.
Re: (Score:2)
But not for the purpose of creating a commercial product. Just for the purpose of displaying it to you for reading. Converting text for personal reading is covered under fair use when reading the text directly in the same setting is.
lol, and? (Score:2, Insightful)
"I dont like the way some people are reading my book"
Well, get over it.
Re: (Score:2, Informative)
Computers don't read, they are not alive, they do not have brains.
Computers copy.
I expect some payoffs and/or a bunch of these companies retraining without that data.
Re: (Score:2)
Re: (Score:2)
Irrelevant to copyright law. Computers do not read. They copy. Humans do not copy. They read. That is the entire crux of this case.
The copying is a legal problem for the copiers.
I have no opinion on if this is a good law or a bad law. However, it is how the law is.
If you don't like the law then write your congressional representative to complain.
Re: (Score:2)
> Humans do not copy. They read.
You haven't seen some students. They blindly copy and don't understand. /s
Re:lol, and? (Score:4, Insightful)
Both humans and computers make lossy copies (encoded versions) of data when they read. The effect of any challenge will probably be that reading a book is an infringement of copyright.
Re: (Score:2)
Re: lol, and? (Score:2)
Re: (Score:3)
Re: (Score:2)
Without being able to reliably reproduce significant portions of the work from the trained data set, which is surely the case, then the penalties should be minimal. There is certainly a sense that it is still unreliably available. If you get the next sentence of the book 1 in 12 times...
"but it's a sloppy copy" isn't a great defense, but it is still a defense.
Re: (Score:2)
Machines cannot read texts. The relevant definitions do not explicitly state that the act of reading can be done only by sentient beings, because to anybody with some actual working intelligence this is obvious.
So, it was not solely sarah silverman? (Score:2)
It seems that more established authors were victims of this...umh....
Bad choice (Score:2)
No one stopped to think about the ramifications of using Stephen King novels to teach AI?
If AI wasn't planning to kill all humans before, it certainly will after this.
Re: (Score:2)
No one stopped to think about the ramifications of using Stephen King novels to teach AI? If AI wasn't planning to kill all humans before, it certainly will after this.
I'm a big Stephen King fan, but his work is fiction. Not sure what benefits confer from training AI on not-reality, unless you want your AI to propagate more not-reality, which they already seem quite capable of.
Re: (Score:3)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
They are not knowledge models being trained on knowledge.
Re: (Score:2)
After the AI upgrade you might want to avoid staying at Glacier National Park's lodges...
Re: (Score:2)
I'm less worried about that than the quote "102 pulp novels by Scientology founder L Ron Hubbard" which was used for training the AI.
Somewhere the spirit of Francis E. Dec is screaming "I warned you junkie fools about the Gangster Frankenstein Machine Computer God!"
Re: (Score:2)
I doubt anybody even screened the material at a high level. They just used whatever texts they could scrape from the web.
Stephen King sued to remove his name from a movie (Score:2)
Stephen King sued to remove his name from a movie, so if some work pops up by an AI they may get sued and King will get a big payday.
Soon we'll be licensing books like software (Score:2)
Re: (Score:2)
Now you won't be able to check out a book from a library without signing a EULA the way this is going.
Nah, you'll just have to click through an EULA agreement at the checkout kiosk, just like you have to for every other blessed thing in this intellectual property dystopia we've constructed for ourselves.
Re: (Score:2)
You will generally find such a licence very close to the beginning of the book. It might look something like this: [copied from a random book in my collection, fair use law allows me to reproduce it here]
© The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2022
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, speci
"pulp novels by Scientology founder L Ron Hubbard" (Score:3)
OK, I think I see a problem
Time for GPT makers to follow the 'BP blueprint' (Score:2)
Whoopsie-daisy...
While the South Park parody of BP was hilarious, I suspect the most that will happen to these companies will be a little slap on the wrists and, if lucky, maybe a short video. A video where their CEOs follow the BP apology blueprint by making propaganda videos where they brag about their Chinese ESG scores, and point out all the good work they do before slipping in a limp-wristed apology somewhere in the video.
As a reminder here's the original BP apology video:
https://youtu.be [youtu.be]
Re: (Score:2)
Arrr (Score:2)
I suppose turnabout is fair play. OpenAI won't mind when we pirate their shit, right?
Why AI creativity is a problem (Score:3)
Anybody can study the works of a famous author and try to write in their style. Lots of authors become better authors trying to imitate their role models. I don't see a difference between a human doing that and an AI doing that beyond efficiency.
Ultimately the issue is that a small number of rich people are going to use technology to render vast swaths of talented people economically redundant... without any actual talent or ability of their own. If the AI is released for free and can be used unfettered... well, OK, that's the new reality. But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?
We probably want to say, "no" to that as a society.
Re: (Score:2)
Anybody can study the works of a famous author and try to write in their style. Lots of authors become better authors trying to imitate their role models. I don't see a difference between a human doing that and an AI doing that beyond efficiency.
Ultimately the issue is that a small number of rich people are going to use technology to render vast swaths of talented people economically redundant... without any actual talent or ability of their own. If the AI is released for free and can be used unfettered... well, OK, that's the new reality. But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?
We probably want to say, "no" to that as a society.
I largely agree with what you've said here, but without significant changes to governments and economic markets (particularly stocks and investing) I think there's still a strong probability the whole thing crashes before we get to that point. What I mean is, "wealth concentration" doesn't happen randomly. It is a rational, logical consequence of the way wealth is created. In our consumption/debt based economic system, wealth is created by creating and selling things to people. I am unconvinced by people wh
Re: (Score:2)
When everything can be automated, the first person with sufficient capital who realizes they don't need the rest of us will realize other ultra-wealthy people are thinking the same way, and it'll be a race to eliminate as many peasants as possible to have the most resources under unfettered control to direct against their peers as they all fight to be the richest.
Re: (Score:2)
When everything can be automated, the first person with sufficient capital who realizes they don't need the rest of us will realize other ultra-wealthy people are thinking the same way, and it'll be a race to eliminate as many peasants as possible to have the most resources under unfettered control to direct against their peers as they all fight to be the richest.
If you haven't read Cixin Liu's "Three-Body Problem", go pick up a copy. (I'd recommend the actual book, not audio.)
Overall though, I'd still have the same response/prediction regarding your response. I think you have the right premise for the core pressures of the next 100 years of human civilization. But I don't believe the end result will play out the way you predict. That possibility definitely exists, but I think it partially misjudges the Economic factor, heavily overestimates the Individual Psycholog
Re: (Score:2)
But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?
We probably want to say, "no" to that as a society.
Ummm, which society? The ownership society, which you and I are not a part of, very clearly wants this outcome. I have seen zero resistance to what the ownership society wants at any point in time in the 50+ years that I have been alive.
We are "fat" to them. There is no way things are going to ever get better for you and I. The fix is in. With automation, we will be discarded sooner rather than later. Think about it... millions of unnecessary people not allowed to participate in society and society not allo
Pirated or not (Score:3)
AI does not copy (Score:2)
Think of it this way: where in the AI model does Stephen King's material reside? It doesn't. The model reads his material once, changes the weights on the network, then discards the original text. It doesn't copy the text, doesn't store it.
Authors are going to have to either accept the new normal, or we as a society must change the laws to encourage original human works.
I think we should favor human authors, because the AI doesn't do anything truly original. It can give you endless probabilist
Re: (Score:2)
Think of it this way: where in the AI model does Stephen King's material reside?
In the model. It's a 500GB model. Ask it to quote part of a Stephen King book.
Re: (Score:2)
I tried getting it to quote parts of books for me. It did an awful job. Timelines were screwed up, characters involved where they should not have been.
Good luck getting it to read one of those authors' books in its entirety.
Re: (Score:2)
So not copyright infringement (a minor offense) but commercial use without authorization (a major offense)?
Shame! (Score:3)
Shame on Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante for pirating things and then using them to train AI !
Strawman (Score:2)
Re: (Score:2)
Indeed. You are allowed to use stuff that is online exactly for what you have been given permission to. By the very fact of it being online, you are allowed to _read_ it. Storing or printing it is already a stretch. Processing it in any way to produce some product is clearly illegal unless you have explicit permission.
Re: (Score:2)
It's almost like you think that just saying it makes it true.
Re: (Score:2)
Making a product that gets sold with it is not "research".