
Stephen King, Zadie Smith and Rachel Cusk's Pirated Works Used To Train AI (theguardian.com) 129

Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante are among thousands of authors whose pirated works have been used to train artificial intelligence tools, a story in The Atlantic has revealed. The Guardian: More than 170,000 titles were fed into models run by companies including Meta and Bloomberg, according to an analysis of "Books3" -- the dataset harnessed by the firms to build their AI tools. Books3 was used to train Meta's LLaMA, one of a number of large language models -- the best-known of which is OpenAI's ChatGPT -- that can generate content based on patterns identified in sample texts. The dataset was also used to train Bloomberg's BloombergGPT, EleutherAI's GPT-J and it is "likely" it has been used in other AI models.

The titles contained in Books3 are roughly one-third fiction and two-thirds nonfiction, and the majority were published within the last two decades. Along with Smith, King, Cusk and Ferrante's writing, copyrighted works in the dataset include 33 books by Margaret Atwood, at least nine by Haruki Murakami, nine by bell hooks, seven by Jonathan Franzen, five by Jennifer Egan and five by David Grann. Books by George Saunders, Junot Díaz, Michael Pollan, Rebecca Solnit and Jon Krakauer also feature, as well as 102 pulp novels by Scientology founder L Ron Hubbard and 90 books by pastor John MacArthur. The titles span large and small publishers including more than 30,000 published by Penguin Random House, 14,000 by HarperCollins, 7,000 by Macmillan, 1,800 by Oxford University Press and 600 by Verso.

This discussion has been archived. No new comments can be posted.


  • Pirated? (Score:4, Insightful)

    by nospam007 ( 722110 ) * on Thursday August 24, 2023 @04:01PM (#63794242)

    They just READ the books from the library.

    • This. Lots of organizations are hoping for a payday, but it's not going to happen. Reading material that is publicly available - be it a website or a book in a library - where's the problem?
      • It's also useless at this point, the AI baby ain't getting unborn.
      • Were the books in that library paid for? Generally a library pays 2x-3x normal retail to purchase a book that will be read by many people. If that fee was paid, well then, too bad for the authors and publishers: you already received your money. If those copies were not paid for, then there might be a case for less than $100 in damages on each book, and perhaps an even higher punitive fee. Likely not the payday they are hoping for, though.
        • That's not required for paper books (at least in the US). First sale doctrine allows both loaning out physical media and resale. The only time libraries pay higher than retail is for things like kids books where there are reinforced bindings available for libraries.

        • No, the books were not in a "paid-for" library. They were in a giant tar.gz of text files scraped from cracked epubs, mobis, and PDFs. The collection was almost a terabyte, and if you think about how well text compression works it adds up to a huge number of books.

          This doesn't feel like an ethical gray area to me; they straight-up pirated these books and used them to build commercial products. That's not OK.

    • They just READ the books from the library.

      Hopefully they did, but it seems more likely that they went to a pirated-books site and ingested the materials, because that was easier than striking a deal with the Library of Congress, the California public library system or any such formal entity...

      So, they get a lawsuit for not following the proper channels.

      It's the same as would happen to you if you check out the DVDs in your local library, or go to the movies, or stream a movie vs. torrenting it. The end result is the same (some version of the movie is now in your brain as "knowledge"), but in one case you will get penalized for not following the proper channels.

      • by Xenx ( 2211586 )

        Hopefully they did, but it seems that, more likely, they entered a pirated books site and ingested the materials

        It actually looks like they're using a dataset assembled by an AI developer. The distinction is just one of where the infringement occurred.

        • And like most intellectual property lawsuits, they don't go after the original perpetrator - they go after the biggest pockets.

        • by Entrope ( 68843 )

          When person A makes an illegal copy for person B, and person B makes another illegal copy, both of them are liable for copyright infringement.

          • by Xenx ( 2211586 )
            Sure, but in this case it would still be inaccurate to say Person B went to a pirated book site and ingested the materials.
            • by Entrope ( 68843 )

              The site had a very large collection of pirated books, among other data that included a lot of other infringing copies of copyrighted works. It is stupidly pedantic to say "well, that's better because it was not primarily established to pirate books!"

              • by Xenx ( 2211586 )
                I acknowledged the distinction was minimal in my first point. You're the one that opted to nitpick me after that. If one of us is to be labeled pedantic, it's not me.
                • by Entrope ( 68843 )

                  The error I originally pointed out was where you wrote "The distinction is just one of where the infringement occurred." Both the LLM creator and their source for the material infringed the copyrights in these books.

                  If you meant that the distinction is over the nature of the site where the original infringement occurred, between book-piracy sites and "AI developer" sites that engage in large-scale book piracy, you could have been a lot clearer.

        • by ufgrat ( 6245202 )

          Infringement? If I read a chemistry textbook, and use that knowledge once I no longer own that textbook, am I infringing? I "copied" that information into my brain, I've used that information for potentially commercial work, but am I violating the copyright of the text book?

          So the developer, who may or may not be profiting, is probably guilty of copyright violation-- but is the AI being trained "guilty"?

          I don't know the answer-- but I do know that copyright, and fair use, and the interpretation of both, i

          • What you're describing is going to be one of the more interesting legal arguments of this century, I suspect.

            The questions I see coming are:
            If an AI is trained on a specific text, does that make the AI a derivative work of that text?

            If an AI is trained on a specific text, does it make the AI's work product a derivative work of that text?

            Can the work product of an AI be copyrighted?

            The definition of a derivative work is a copyrightable work built from a prior copyrighted work. If an AI's work can't be copyr

      • by mysidia ( 191772 )

        So, they get a lawsuit for not following the proper channels.

        Maybe they get a lawsuit, But if that's the case, then the Plaintiff has a high mountain to climb to prove anything actionable.

        if you check out the DVDs in your local library, or go to the movies, or stream a movie vs. torrenting it. The end result is the same (some version of the movie is now in your brain as "knowledge"), but in one case you will get penalized for not following the proper channels.

        no... they can end up on the wrong side of a L

        • The reason you don't see them going after people for downloading, or photocopying a library book is largely down to evidence.
          How do you prove it?

          • by mysidia ( 191772 )

            or photocopying a library book is largely down to evidence. How do you prove it?

            Actually... another problem for them. People often admit to having done such things, and kids even often admit in public that they downloaded X or Y or Z -- the blatant admissions would be enough to start the process and subpoena the physical media. And you don't really see cases being started or even threatened against those people either.

            Even if you have the evidence -- that doesn't automatically give rise to something like

        • "using the wrong channel"

          Actually, that's not the case. If you make something with stolen materials, the folks you stole from are entitled to take away what you made. Not just be reimbursed with replacement materials. In this case, that's $750-$30,000 per unlawfully copied book plus an injunction against any further use or sale of the AI trained from it.

          On the other hand, if they -owned- one copy of each book and they trained the AI from that copy (not from another source of the same book) then they didn't unlawfully copy the books

    • Re:Pirated? (Score:5, Insightful)

      by iAmWaySmarterThanYou ( 10095012 ) on Thursday August 24, 2023 @04:19PM (#63794288)

      Computers don't read.

      Computers copy.

      In this case, they copy and convert to another format, but the post-copy conversion is not going to help their defense. They still copied without permission and retained that data in some format for their own commercial benefit.

      • retained that data in some format for their own commercial benefit.

        So do school students, or anyone who reads a book; it's called learning. I guess AI isn't learning anything after all, it's just a data hoarder.
        • Re:Pirated? (Score:4, Informative)

          by iAmWaySmarterThanYou ( 10095012 ) on Thursday August 24, 2023 @04:48PM (#63794356)

          Students are human beings. They have not copied the book. And in most cases the book was purchased. No, the computer hasn't learned anything. It is a box of wires. Incapable of learning or thought. It retains a copy. And it is that copy that is the core problem.

          I'll go a little deeper....

          A student reads a book and learns the concepts the book teaches. Now ask that student to create a copy of the book, word for word. Good luck. Pretty much no one has a perfect memory like that. But the opposite is true for the computer. It has learned nothing. It has no understanding at all of the concepts fed to it. But it can easily spit out a perfect copy.

          With me now?

        • Yes, that is correct. So-called AI isn't learning anything. It is copying it and doing lots of probability calculations on the text.

          • Sloppy Copy

            ..is still a copy.
          • Didn't anyone learn from the Star Trek TOS episode "The Ultimate Computer"? How the M5 computer was imprinted with the resentments of Daystrom, its creator?

            Do we really want an A.I. patterned after the writings of Stephen King?

      • This is a dumb take. Computers do what computers are programmed to do.

        In this case the data is encapsulated in a latent space, which is not a copy, any more than - after reading a book - the information you retain is a copy of the book.

        • Durrr humans don't read they make a mental copy!
        • No. A computer creates a copy. A human absorbs the concepts but has not created a copy. A computer can spit out a perfect copy of what has been fed in. Almost no humans can do that. And if we're talking 170k books then absolutely 100% no human can do that. But the computer can still spit out 170k perfect copies.

          The storage mechanism the computer uses is irrelevant to the issue of copyright law. Good luck telling a judge "we didn't make a copy of those 170k books we encapsulated them in latent space!"

      • They have to read the data to make a copy of it. You've never heard of read and write functions?
      • Re: Pirated? (Score:2, Informative)

        by drinkypoo ( 153816 )

        After training is done they do not retain the work in any format, and if you were qualified to comment on this story you would know that. They only store some statistics about the work.

          • How do you know they don't store enough data to recreate it? Is that true for every LLM, every version? You're making a broad and dangerous assumption for a courtroom.

          If I was on the other side, I would insist you back that up by explaining to a non technical jury and judge how your computer magically "reads" and "understands" but somehow doesn't "copy" the 170k books. Good luck with that.

          The smart thing here is for these companies to just write a check and say, "oops, sorry!" They can't risk losing this

          • If you trained an LLM such that it routinely regurgitated texts then it would have issues with being able to generalise. You can use some neural networks to act as a form of compression algorithm, but an LLM is not one of these. You may get an ability to generate specific text in the case of over training, but that's undesirable. Where there might be a grey area is where you might wish to respond to the prompt "recite the Rime of the Ancient Mariner" with the actual text, but that's not what they are set up
        • Ask ChatGPT: "Quote the opening sentence of The Dark Tower."
          ChatGPT answers: "The opening sentence of "The Dark Tower," the first book in the series of the same name by Stephen King, is:"The man in black fled across the desert, and the gunslinger followed.""

          It's keeping the verbatim text in there somewhere.
          • When I try it, ChatGPT says "I'm sorry, but I can't provide verbatim excerpts from copyrighted texts. "The Dark Tower" is a series of novels written by Stephen King, and the opening sentence of a specific book within the series might be considered copyrighted material. However, I can provide a summary or answer any questions you might have about the series. Let me know if there's anything else I can assist you with!"

            Why would you lie about such an easily checked thing?

            • by N1AK ( 864906 )
              GPT-4
              User: Quote the opening sentence of The Dark Tower.
              ChatGPT: Certainly! The opening sentence of "The Dark Tower" series by Stephen King is: "The man in black fled across the desert, and the gunslinger followed." It's a memorable line and serves as the beginning to King's expansive series. If you have any more questions or need further information, feel free to ask!

              So are you lying, or just ignorant about how ChatGPT works and the consistency of its answers?
              • Now get your hands on the model and search it for that string. Guess what, it isn't there, no matter how you unpack it.

        • by AmiMoJo ( 196126 )

          ChatGPT can be induced to reproduce texts of works it was trained on verbatim. People even discovered how to make it reveal the basic written orders it was given by the developers, i.e. repeating their IP verbatim too.

          That's rather missing the point though. The issue here is more like derivative works, which have some protection in copyright.

          • That's rather missing the point though. The issue here is more like derivative works, which have some protection in copyright.

            The standard for a derivative work in the USA is recognizable elements copied from the original work, so it may well be that a court will decide that AI generated works are derivative works. But it may also well be that it will decide otherwise, because copying is not actually occurring. Or it may be that they will decide based on the similarity to a copy. In that case it would rationally follow that text based works would be subject to copyright (as derivative works) but images wouldn't — in a side b

          • This is where legal cases will be required. A person memorising the opening line would not be infringing on copyright but reciting it from memory on YouTube might be. What the legal standard for LLMs will be is unknown, given that the text isn't stored as a blob in the networks.
      • by Tora ( 65882 )

        You really don't understand how it works, do you?

    • I'm not well-versed in book copyright. However, I do know that libraries pay way more than what I'd pay for the same book, probably because the understanding is those books can be lent back out. An LLM "read" the book, but then is using it for its own purposes. Is that covered under the fair use of a book bought at consumer prices? And what if those books were NOT bought at all?

      You can say yes, those books are available, if a human can borrow it from the library and read it for free, a machine can as well. But
    • It's the Guardian. It pushed their "big tech" button so they went ahead and relieved you of the burden of deciding what to think about it.
    • They just READ the books from the library.

      No they didn't. There are several things wrong with that statement. But the crux of it is that they copied them into a computerized storage/retrieval system for commercial use. It looks like an open and shut case of copyright infringement.

    • Naw ... not gonna link to it ... search......

    • No, they used the works to make a product they are now selling ...

      It has been shown in a court of law that AI cannot hold copyright, so these are not transformative works.

      The AI companies have used someone else's copyrighted work without their permission, and are making money from it without paying the author. Not even a difficult case ...

    • by gweihir ( 88907 )

      Nope. They fed the books into a process that transforms and uses them. That is not covered by fair use.

      Machines cannot "read texts". They can only apply algorithms to data and that is fundamentally different.

      • They fed the books into a process that transforms and uses it. That is not covered by fair use.

        My web browser does that every time I load a page.

        • by gweihir ( 88907 )

          But not for the purpose of creating a commercial product. Just for the purpose of displaying it to you for reading. Converting text for personal reading is covered under fair use when reading the text directly in the same setting is.

  • lol, and? (Score:2, Insightful)

    by Srin Tuar ( 147269 )

    "I dont like the way some people are reading my book"

    Well, get over it.

    • Re: (Score:2, Informative)

      Computers don't read, they are not alive, they do not have brains.

      Computers copy.

      I expect some payoffs and/or a bunch of these companies retraining without that data.

      • Your criteria that something must be alive in order to read is not a shared definition of the word and goes against its common usage. You've simply made it up.
        • Irrelevant to copyright law. Computers do not read. They copy. Humans do not copy. They read. That is the entire crux of this case.
          The copying is a legal problem for the copiers.

          I have no opinion on whether this is a good law or a bad one. However, it is how the law is.

          If you don't like the law then write your congressional representative to complain.

          • > Humans do not copy. They read.

            You haven't seen some students. They blindly copy and don't understand. /s

          • Re:lol, and? (Score:4, Insightful)

            by omnichad ( 1198475 ) on Thursday August 24, 2023 @05:04PM (#63794424) Homepage

            Both humans and computers make lossy copies (encoded versions) of data when they read. The effect of any challenge will probably be that reading a book is an infringement of copyright.

            • Are lossy copies OK? Because then, can I make JPEGs of Disney and Nintendo imagery, use them commercially, and not get sued?
          • You are hung up on this word copy and don't seem to actually understand copyright law. You can copy all you want. Copy ten million times and it's not a problem. Distribute a copy and you are now in violation. Show me where an LLM is violating a provision of Fair Use. Just one instance. Any instance. Go ahead.
          • You erroneously say irrelevant, then double down on the lie that computers don't read. Computers read. Start from that fact. The word has a meaning and you are trying to redefine it.
          • It's the "making available" that has any real penalties.

            Without being able to reliably reproduce significant portions of the work from the trained data set, which is surely the case, the penalties should be minimal. There is certainly a sense that it is still unreliably available. If you get the next sentence of the book 1 in 12 times...

            "But it's a sloppy copy" isn't a great defense, but it is still a defense.
        • by gweihir ( 88907 )

          Machines cannot read texts. The relevant definitions do not explicitly state that the act of reading can be done only by sentient beings, because to anybody with some actual working intelligence this is obvious.

  • It seems that more established authors were victims of this... umh....

  • No one stopped to think about the ramifications of using Stephen King novels to teach AI?
    If AI wasn't planning to kill all humans before, it certainly will after this.

    • No one stopped to think about the ramifications of using Stephen King novels to teach AI? If AI wasn't planning to kill all humans before, it certainly will after this.

      I'm a big Stephen King fan, but his work is fiction. Not sure what benefit comes from training AI on not-reality, unless you want your AI to propagate more not-reality, which they already seem quite capable of.

      • Probably because he is a good writer and consequently a good model for creating prose.
        • Uhhh, no he isn't. His books are enjoyable, but I would consider them a guilty pleasure. How many times does he go on a 'namby-pamby' rant in one of his books?
      • It's a language model being trained on language. The benefit is having many styles of writing to better infer the probability of what the next word it spews out is based on the words that came before.
      • As someone else pointed out, these are language models being trained on language.

        They are not knowledge models being trained on knowledge.
    • After the AI upgrade you might want to avoid staying at Glacier National Park's lodges...

    • by Cito ( 1725214 )

      I'm less worried about that than the quote "102 pulp novels by Scientology founder L Ron Hubbard" which was used for training the AI.

      Somewhere the spirit of Francis E. Dec is screaming "I warned you junkie fools about the Gangster Frankenstein Machine Computer God!"

    • by gweihir ( 88907 )

      I doubt anybody even screened the material at a high level. They just used whatever texts they could scrape from the web.

  • Stephen King sued to remove his name from a movie, so if some work pops up by an AI they may get sued and King will get a big payday.

  • Thanks AI guys. Now you won't be able to check out a book from a library without signing a EULA the way this is going.
    • Now you won't be able to check out a book from a library without signing a EULA the way this is going.

      Nah, you'll just have to click through a EULA at the checkout kiosk, just like you have to for every other blessed thing in this intellectual property dystopia we've constructed for ourselves.

      • You will generally find such a licence very close to the beginning of the book. It might look something like this: [copied from a random book in my collection, fair use law allows me to reproduce it here]

        © The Editor(s) (if applicable) and The Author(s), under exclusive licence to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature 2022

        This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, speci

  • by ardmhacha ( 192482 ) on Thursday August 24, 2023 @04:40PM (#63794336)

    OK, I think I see a problem

  • Whoopsie-daisy...

    While the South Park parody of BP was hilarious, I suspect the most that will happen to these companies will be a little slap on the wrist and, if we're lucky, maybe a short video. A video where their CEOs follow the BP apology blueprint by making propaganda videos where they brag about their Chinese ESG scores and point out all the good work they do before slipping in a limp-wristed apology somewhere in the video.

    As a reminder here's the original BP apology video:

    https://youtu.be [youtu.be]

    • Well if there is one thing that seems to get love from the courts, it's copyright. It will be interesting to see how this pans out.
  • I suppose turnabout is fair play. OpenAI won't mind when we pirate their shit, right?

  • by Baron_Yam ( 643147 ) on Thursday August 24, 2023 @04:59PM (#63794400)

    Anybody can study the works of a famous author and try to write in their style. Lots of authors become better authors trying to imitate their role models. I don't see a difference between a human doing that and an AI doing that beyond efficiency.

    Ultimately the issue is that a small number of rich people are going to use technology to render vast swaths of talented people economically redundant... without any actual talent or ability of their own. If the AI is released for free and can be used unfettered... well, OK, that's the new reality. But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?

    We probably want to say, "no" to that as a society.

    • Anybody can study the works of a famous author and try to write in their style. Lots of authors become better authors trying to imitate their role models. I don't see a difference between a human doing that and an AI doing that beyond efficiency.

      Ultimately the issue is that a small number of rich people are going to use technology to render vast swaths of talented people economically redundant... without any actual talent or ability of their own. If the AI is released for free and can be used unfettered... well, OK, that's the new reality. But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?

      We probably want to say, "no" to that as a society.

      I largely agree with what you've said here, but without significant changes to governments and economic markets (particularly stocks and investing) I think there's still a strong probability the whole thing crashes before we get to that point. What I mean is, "wealth concentration" doesn't happen randomly. It is a rational, logical consequence of the way wealth is created. In our consumption/debt based economic system, wealth is created by creating and selling things to people. I am unconvinced by people wh

      • When everything can be automated, the first person with sufficient capital who realizes they don't need the rest of us will realize other ultra-wealthy people are thinking the same way, and it'll be a race to eliminate as many peasants as possible to have the most resources under unfettered control to direct against their peers as they all fight to be the richest.

        • When everything can be automated, the first person with sufficient capital who realizes they don't need the rest of us will realize other ultra-wealthy people are thinking the same way, and it'll be a race to eliminate as many peasants as possible to have the most resources under unfettered control to direct against their peers as they all fight to be the richest.

          If you haven't read Cixin Liu's "Three-Body Problem", go pick up a copy. (I'd recommend the actual book, not audio.)

          Overall though, I'd still have the same response/prediction regarding your response. I think you have the right premise for the core pressures of the next 100 years of human civilization. But I don't believe the end result will play out the way you predict. That possibility definitely exists, but I think it partially misjudges the Economic factor, heavily overestimates the Individual Psycholog

    • But enabling a whole new level of wealth concentration into the hands of a small number of undeserving people?

      We probably want to say, "no" to that as a society.

      Ummm, which society? The ownership society, which you and I are not a part of, very clearly wants this outcome. I have seen zero resistance to what the ownership society wants at any point in time in the 50+ years that I have been alive.

      We are "fat" to them. There is no way things are going to ever get better for you and I. The fix is in. With automation, we will be discarded sooner rather than later. Think about it... millions of unnecessary people not allowed to participate in society and society not allo

  • by linear a ( 584575 ) on Thursday August 24, 2023 @05:20PM (#63794462)
    I don't like the sound of AI using Stephen King novels to train for anything.
  • It trains.

    Think of it this way: where in the AI model does Stephen King's material reside? It doesn't. The model reads his material once, changes the weights on the network, then discards the original text. It doesn't copy the text, doesn't store it.

    Authors are going to have to either accept the new normal, or we as a society must change the laws to encourage original human works.

    I think we should favor human authors, because the AI doesn't do anything truly original. It can give you endless probabilist
    • Think of it this way: where in the AI model does Stephen King's material reside?

      In the model. It's a 500GB model. Ask it to quote part of a Stephen King book.

      • I tried getting it to quote parts of books for me. It did an awful job. Timelines were screwed up, characters were involved where they should not have been.
        Good luck getting it to read one of those authors' books in its entirety.

    • by gweihir ( 88907 )

      So not copyright infringement (a minor offense) but commercial use without authorization (a major offense)?

  • by zmollusc ( 763634 ) on Thursday August 24, 2023 @09:30PM (#63794888)

    Shame on Zadie Smith, Stephen King, Rachel Cusk and Elena Ferrante for pirating things and then using them to train AI !

  • Loading data into a computer is not a crime.
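Several comments above turn on one technical question: does a trained model store the text itself, or only statistics derived from it? A toy bigram model makes the distinction concrete. This is purely illustrative: real LLMs learn billions of continuous weights rather than word-pair counts, and the function names below are hypothetical, but the storage question is analogous.

```python
from collections import Counter, defaultdict

def train(corpus):
    # "Training" here records only word-pair counts --
    # statistics about the text, not the text itself.
    counts = defaultdict(Counter)
    words = corpus.split()
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def next_word(model, word):
    # Return the most frequent continuation given the counts,
    # or None if the word was never seen as a prefix.
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]

# The opening line of The Dark Tower, quoted in the thread above
# (lowercased, punctuation stripped).
opening = "the man in black fled across the desert and the gunslinger followed"
model = train(opening)
print(next_word(model, "gunslinger"))  # -> followed
```

Note that this cuts both ways, which is exactly the point of contention in the thread: the table holds only counts, yet given enough statistics a short passage can sometimes be regenerated verbatim, much like the ChatGPT quoting experiments reported above.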
