OpenAI Is Faulted by Media for Using Articles To Train ChatGPT (bloomberg.com) 89

Major news outlets have begun criticizing OpenAI and its ChatGPT software, saying the lab is using their articles to train its artificial intelligence tool without paying them. From a report: "Anyone who wants to use the work of Wall Street Journal journalists to train artificial intelligence should be properly licensing the rights to do so from Dow Jones," Jason Conti, general counsel for News Corp's Dow Jones unit, said in a statement provided to Bloomberg News. "Dow Jones does not have such a deal with OpenAI." Conti added: "We take the misuse of our journalists' work seriously, and are reviewing this situation." The news groups' concerns arose when the computational journalist Francesco Marconi posted a tweet last week saying their work was being used to train ChatGPT. Marconi said he asked the chatbot for a list of news sources it was trained on and received a response naming 20 outlets including the WSJ, New York Times, Bloomberg, Associated Press, Reuters, CNN and TechCrunch.
  • by dmay34 ( 6770232 ) on Monday February 20, 2023 @10:15AM (#63308089)

    I understand why these companies and their artists/writers are concerned and upset, but OpenAI has at least 3 very strong fair use claims in their favor.

    1) Educational, the articles were used to train the AI. The Output is not a copy.
    2) The use of the articles is different than originally intended, to train an AI instead of sale of publication.
    3) The output is decidedly transformative.

    Basically, if the publishers win, then ANY use of any copyrighted material is in violation. Artists would not be able to save a copy of a photo to create transformative art from.

    • Re: (Score:2, Interesting)

      by AmiMoJo ( 196126 )

      (1) Does educational use get them an exemption from copyright rules? I don't think teachers are allowed to copy entire articles just because they are using them to educate children, but correct me if I'm wrong.

      Of course the first hurdle is getting a court to agree that it is education, and not some kind of engineering process. Education generally only applies to humans, with even animals being trained rather than educated.

      (2) I don't see the relevance of that. If anything it's an argument in the publisher's favor…

      • Re: (Score:3, Insightful)

        by Immerman ( 2627577 )

        It doesn't really matter *what* they use the articles for. Copyright prevents people from redistributing the original or derivative works, or (sometimes) displaying/performing the work publicly. That's it.

        And the AI systems aren't doing any of that.

        If you can't point to a derivative work as say "look, this part right here is obviously copied from this piece of my work", then it's all but guaranteed to NOT be considered a derivative work.

        More conceptual derivatives - "inspired by", "done in the style of", etc.…

        • It does matter what they use the articles for. Specific purposes are exempted in the law, like education and parody.

          "It educated an AI" sounds like a legal contortion that will only work with a crooked judge, though. When I download a movie off the Pirate Bay, it "educates" my laptop on how to display it. The problem is my laptop is an inanimate object and it cannot be educated.

          • by Immerman ( 2627577 ) on Monday February 20, 2023 @11:12AM (#63308247)

            They are, but those are only relevant if you have otherwise infringed the copyright.

            Exactly how is letting an AI "learn" from your article infringing copyright?

            If you can point at an AI-generated work and say "this part right here was clearly copied from this work of mine", *then* you can make a copyright infringement claim on that particular AI-generated work. Until then you've got nothing.

          • Re: (Score:2, Insightful)

            by AmiMoJo ( 196126 )

            If it were allowed it would seem to open the floodgates. No point paying for an expensive celebrity voice actor, just "educate" an AI with their voice and get it to produce a "derivative" work.

            The only time that would work is for parody. Maybe we will see AI voices used there soon.

            • by Tx ( 96709 )

              If it were allowed it would seem to open the floodgates. No point paying for an expensive celebrity voice actor, just "educate" an AI with their voice and get it to produce a "derivative" work.

              Current copyright laws weren't written with this technology in mind though. You can't currently copyright the sound of your voice or your artistic style. Trying to stop it at the training phase, which is what we're talking about, saying "don't use our copyrighted works to train your AI", idk, I'm not a lawyer, so we'll see…

            • Re: (Score:2, Insightful)

              by drinkypoo ( 153816 )

              If it were allowed it would seem to open the floodgates. No point paying for an expensive celebrity voice actor, just "educate" an AI with their voice and get it to produce a "derivative" work.

              In the US we have the right of publicity [cornell.edu], and you get to control use of your likeness for non-fair-use purposes. So no.

              • Right of publicity is not copyright, and copyright laws cannot be used to directly support it. Also according to your own link only half of the states have such a law. I would bet the number which protect voice specifically is much smaller.
        Copyright also prevents you from "copying" an asset and storing it on your internal hard drive. E.g. if you download a movie for personal use, that is making a copy, still illegal even if you don't further distribute (in most jurisdictions).

          Which is what ChatGPT is doing in some form here, they are copying the work for commercial use without paying for the privilege and storing it in a highly compressed format (given the output from ChatGPT, it does 'reproduce' exact copies…

          • We're in the digital age - you cannot read this comment without having first copied it onto your computer.

            And copyright (as enforced) doesn't restrict *copying*, it restricts *distribution* (including public performance, where you're "distributing" it into other people's brains). If you want to print stills from Disney images as posters for your walls, you're fine. Try to sell one of those posters (or give it away) and Disney can come down on you.

            And yes, if ChatGPT makes an exact copy of something (or cle

            • by guruevi ( 827432 )

              Correct, but the EULA of Slashdot, NYT, etc. probably states that I can only do this 'copying' for personal purposes and not commercial reasons; I can't copy and paste the data, put it in a chatbot, and make it appear as if it was coming up with this stuff all by itself. Even just quoting or reciting from memory, I can't do commercially, pass off as my own, or reproduce without attribution unless some fair use clause applies.

              If I access NYT in the library, they often will have some sort of agreement that th

      • by dmay34 ( 6770232 ) on Monday February 20, 2023 @11:14AM (#63308251)

        The court precedent for #2 is Google books. Google was copying the books in their entirety, letting users search the books, showing the users clips of the books that matched their search criteria, and then offered to sell them the books.

        The courts found that all of this was fair use.

      • (1) Does educational use get them an exemption from copyright rules? I don't think teachers are allowed to copy entire articles just because they are using them to educate children, but correct me if I'm wrong.

        Here we get into the rather rubbery definition of "copy". If I read a news article on the Web, then do the contents of my computer screen constitute a copy? What about what's in my browser cache?

        Given that ChatGPT was likely trained by being given URLs, I suspect that any copyright claim here is not only dead in the water, it's dead before it even gets wet.

      • by Rei ( 128717 )

        What copyrightable material, under copyright, can ChatGPT produce verbatim? I add that adjective because basic facts are not copyrightable. Please be specific.

        I'll note that Google Books scanning in copyrighted books, en masse, without permission, and showing blurbs and even whole pages to users without compensation, was deemed by the courts to be transformative and fair use.

        • The legal standard is substantial similarity, not verbatim. Don't feed the troll.
        GitHub Co-Pilot and ChatGPT both have reproduced my own code (which is a small set of open source code in a very niche field) to a level that most would consider outright plagiarism (including mistakes). As long as you are precise enough with your search terms, you can basically get to a plagiarized version of the 'source material', which I'm assuming is what NYT and co. are going to have to prove in court…

          • by cowdung ( 702933 )

            I can believe this.

            Because Neural Nets are a form of lossy "compression" in one sense.

            So at times it could reproduce code verbatim (or text from an article) or paraphrase without giving credit (plagiarism).

            So this is a bit of a legal minefield.

            It will be interesting to see how it can be resolved.
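            The "lossy compression" point can be illustrated with a toy model. The sketch below is a bigram Markov chain, not how GPT actually works internally, but it shows the mechanism by which a statistical model of word sequences can end up regurgitating rare training text verbatim (the training sentence here is made up for illustration):

```python
from collections import defaultdict

# Toy illustration: a bigram "language model" trained on a single
# niche sentence. Because every word has exactly one observed
# successor, generation reproduces the training text verbatim --
# a crude analogue of an LLM memorizing rare training data.
training_text = "the quarterly ledger showed an unexplained shortfall of forty cents"
words = training_text.split()

successors = defaultdict(list)
for prev, nxt in zip(words, words[1:]):
    successors[prev].append(nxt)

def generate(start, max_len=20):
    out = [start]
    while out[-1] in successors and len(out) < max_len:
        # With only one training sentence there is one choice per word,
        # so this deterministic pick reproduces the source exactly.
        out.append(successors[out[-1]][0])
    return " ".join(out)

print(generate("the"))  # reproduces the training sentence verbatim
```

            With a large and varied corpus each word has many observed successors and the output becomes a blend; it is precisely the niche, rarely-duplicated material that a model is most likely to reproduce nearly verbatim.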

          • by Rei ( 128717 )

            We're not talking about GitHub Co-Pilot, we're talking about ChatGPT. Let's see it. Come on, how hard is it to be specific?

            • by guruevi ( 827432 )

              GitHub Co-Pilot and ChatGPT both

              Perhaps learn to read.

              • by Rei ( 128717 )

                And I'LL repeat, so YOU learn to read: we're NOT talking about GitHub Co-Pilot, so stop trying to introduce it into the conversation. Rather than trying this straw-man / red-herring approach, you have a simple task: present ACTUAL EXAMPLES of CHATGPT (not GitHub Co-Pilot) reproducing copyrighted material in a non-transformative manner.

        • by Rei ( 128717 )

          The legal standard is how substantially transformative it is, the amount of reproduced materials, the goals of reproduction, and a number of other factors.

          But let's see some examples. A claim has been made, let's see them.

      • (1) Does educational use get them an exemption from copyright rules? I don't think teachers are allowed to copy entire articles just because they are using them to educate children, but correct me if I'm wrong.

        Yes, they can [university...fornia.edu], if the entire article is needed for an educational purpose.

      • "Does educational use get them an exemption from copyright rules?"

        I'd have to think not. Otherwise, textbooks wouldn't cost hundreds of dollars since students would/could scan them and make free copies without consequence.

    • by DarkOx ( 621550 )

      Of course, technically speaking it's not the same thing, but if a professor assigned their article as class reading I wonder if they would be upset.

      Would they be upset if students, or anyone else for that matter, accepted and explored their ideas? We generally cite facts and opinions from other authors, but we don't generally cite broad, widely reproduced ideas and concepts. You would not put a citation after the sentence 'Many people believe the ability to set interest rates is an important monetary tool.' You…

      • by cowdung ( 702933 )

        But what part of GPT makes sure that it isn't paraphrasing, or that it limits itself to "general knowledge"?

        Quite to the contrary it very well may copy niche information or expression so the argument could be made that it can violate copyright and/or anti-plagiarism standards.

        This is an area where lawmakers will need to legislate, and in the present climate, where politicians see tech companies "not doing enough about X or Y", I don't see it going well for tech companies.

    • by Joce640k ( 829181 ) on Monday February 20, 2023 @10:59AM (#63308195) Homepage

      I wonder where those journalists got their writing ideas from?

      Was it from reading other journalists work?

      I'm sure they didn't learn in a vacuum.

      • Exactly. If I learn what certain financial terms mean by reading the Financial Times, I don't owe the publisher for training me.

      • Apparently, I owe tons of publishers extra money for reading their books, journals, and websites and using that accumulated knowledge in my career. Go figure. Wish I knew earlier; I wouldn't have read so much.

        Also a cautionary tale for teachers if they use any online resources or books to teach students.

    • Speaking of education, if someone wants to learn a skill, they need a teacher who is usually paid, unless they're intelligent enough to deduce the process themselves.

      An AI that doesn't know anything needs training because it can't teach itself, thus it also needs a teacher, which are in this case the writers of articles.

      • Speaking of education, if someone wants to learn a skill, they need a teacher who is usually paid, unless they're intelligent enough to deduce the process themselves.

        If that teacher makes you read a load of history books and you go on to become a history teacher, do you have to pay him/her and all the book authors royalties for all the things they taught you?

        Does everybody pay royalties to their university after they get a job and start earning money?

    • by cfulmer ( 3166 )

      You don't even need to get to fair use. Copyright protects a set of exclusive rights: the right to copy, to prepare derivative works, to distribute and to display publicly. If you're not doing one of those, then you're not infringing. Further, copyright only protects *expression* -- the underlying *ideas* are not protected. Derivative works are things like translations, screenplays, and so on. ChatGPT isn't doing that, AND it's not re-using the original expression.

      • copyright only protects *expression* -- the underlying *ideas* are not protected. Derivative works are things like translations, screenplays, and so on.

        Derivative works are defined by a combination of their origin, and recognizable elements from it. If ChatGPT is trained on something, and then produces something literally indistinguishable from that thing, then there is an argument to be made that it's violating copyright. And it's actually capable of doing that, because unlike the image-generating diffusion models, it's producing text output.

      • Comment removed based on user account deletion
    • An LLM is essentially an algorithm trained on a corpus. There are dozens of publicly available corpora that have taken samples from all kinds of copyright sources, which have existed for decades, e.g. COCA & BNC (which includes the NY Times & Reuters, BTW). I don't see anyone try to sue the people who compile & encode corpora.

      What's more, from what I understand, unlike a corpus, LLMs don't retain the copies of the texts in the corpus, only the resulting processing parameters after training. I
      • by cowdung ( 702933 )

        The problem is that modern language models such as GPT don't just retain the rules of grammar; they actually retain the data, style, and expression of millions of articles. And that's why they can reproduce some of it verbatim if it's rare enough.

        • Most people's understanding of what language is is wrong & LLMs & how they work are a demonstration of that. The thing is, it's pretty complex & difficult to understand, e.g. it's not constrained by grammar rules (contrary to Chomsky's conjecture about "Universal Grammar"). If you're really interested in how the structure of language develops in our minds (according to cognitive linguistics), here's a crash course but it requires some background knowledge of linguistics to get to grips with it:
    • And they have one very strong argument against them: They copied the copyrighted works, and that copying diminished the value of the copyrighted works.

      That they violated copyright is not disputed. That the use was "fair use" is what OpenAI will have to establish to defend themselves against the claim of copyright infringement. Since OpenAI can negatively affect the value of the original authors' works to such a large degree, effectively making it worthless, OpenAI has a rather steep hill to climb.

      • by dmay34 ( 6770232 )

        OpenAI's best argument is Authors Guild, Inc. v. Google, Inc.

        In that court case, Google, through the Google Books website, was copying the books in their entirety, saving them in their entirety into a database, letting users search the books for free without notice or license to the publishers or authors, showing the users image clips of the published books that matched their search criteria, and then offering to sell them the books.

        Courts decided that was totally cool.

        • Comment removed based on user account deletion
          • by dmay34 ( 6770232 )

            ...because Google made an agreement with the publishers and because of the implied subtext that Google wasn't actually damaging the market for the books.

            No, that's not what happened. The deals you mention that google tried to cut fell through (and probably would have been met with major anti-trust hurdles). The lawsuits went to trial and Google won everything.

        • Because Google only showed one image from a book. It was narrow enough not to be a problem, because a book is 300+ pages, so you're impacting less than 1% of the piece.

          ChatGPT, given enough time and prompts can reproduce pretty much any article, if the limits are removed, it could probably even reproduce the books it has stored to a very high degree of accuracy. The question is whether rewording and reproducing an article on any particular subject is copyright infringement (give me the opinion on cert

          • by dmay34 ( 6770232 )

            ChatGPT cannot reproduce any books because that's not how the data is stored in its database. There are no complete works stored. It stores and processes the data as patterns. When you type in a prompt, it runs an algorithm over its pattern-recognition model. This is why it has such a hard time reporting correct sources, and why asking for sources is the easiest way for teachers to catch students using ChatGPT for essays. No sources of any kind are stored to be referenced.

            • by guruevi ( 827432 )

              I don't believe that, it stores relations between words, which is what a book is. I don't write books, but I have seen it produce code and text verbatim from somewhere on the Internet that I myself published. Others claim it can produce similar texts to existing text.

              Now, it may not 'knowingly' have done that, it simply stores the relations between search terms and if your search term is niche or precise enough, that relation is close to 1-on-1, so it reproduces that 1 relationship it "knows" in its databas

    • Comment removed based on user account deletion
    • by Kisai ( 213879 )

      And that's what they want. They want Google to pay them for linking to their articles, and they want twitter to pay them for linking to their articles behind paywalls, which makes their links as good as spam.

      Burn down any media that tries to have it both ways, a subscription model and an ad model.

      I'm not saying you shouldn't pay for WaPo and NYT, but these companies kinda have no trust to them since they put the article behind a paywall, thus allowing misinformation to flow around it, since free news sites

  • by RKThoadan ( 89437 ) on Monday February 20, 2023 @10:18AM (#63308095)

    Does this AI expose whether these are factual results or just output derived from its language-learning algorithm? My understanding is you cannot take anything it says, even about itself, as being "real".

    • Are they really going to submit its own answer as evidence that it was trained on the WSJ? Do they have zero idea how large language models work?
    • by AmiMoJo ( 196126 )

      I'm surprised that Microsoft didn't make it include sources when they integrated it into Bing. I know nothing about the ChatGPT API so maybe it can't simply supply a list of sources.

      • by laughingskeptic ( 1004414 ) on Monday February 20, 2023 @06:35PM (#63309675)
        The text becomes a pile of vectors in multiple stages that are related to each other using various error functions. One of the big problems in AI/ML is "explainability". DNNs as we build them today do not lend themselves to explainability. It is generally not possible to work backwards from an output and understand how the output is driven by the input. This is because trying to track labeling metadata, such as a source document ID, through the entire training process, for every change to the internal state during training, would require more memory than the universe has atoms. Much simpler ML methods, like tree-based methods, are highly explainable because the "IF-THEN" operations that led to the result can be extracted for a given input. However, these methods do not produce as good results for complex problems as DNN methods.

        In some specific cases a human can identify what in the training set leads to certain behaviors. For instance, an image of a riot involving tear gas might be identified as "water buffalo" by ImageNet. Why? Because many pictures of water buffalo involve a dusty haze. So we learn that "dusty haze" was learned by the network to be a feature of "water buffalo". A human can figure this out after the fact; it is impossible for the network to understand the concept of "dusty haze", since that was never a training input, much less track this decision back through the 21 DNN layers to the 50 of 100 water buffalo training images that contained dusty haze.
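        The explainability contrast the parent describes can be sketched with a toy tree-based classifier. This is a hand-built tree with invented features (echoing the water-buffalo example above), not a real model: the point is only that a tree's decision path can be read back out as explicit IF-THEN rules, whereas a DNN's output is a composition of millions of weighted sums with no comparable trace.

```python
# A tiny hand-built decision tree: each internal node tests one
# feature against a threshold; leaves carry a label. The full
# decision path can be extracted as human-readable IF-THEN rules --
# the "explainability" that large neural networks lack.
tree = {
    "test": ("dusty_haze", 0.5),          # (feature, threshold)
    "left": {"label": "riot"},            # taken when feature <= threshold
    "right": {                            # taken when feature > threshold
        "test": ("large_animal", 0.5),
        "left": {"label": "landscape"},
        "right": {"label": "water_buffalo"},
    },
}

def classify(node, sample, path=None):
    """Return (label, list of the IF-THEN rules that fired)."""
    path = [] if path is None else path
    if "label" in node:
        return node["label"], path
    feature, threshold = node["test"]
    if sample[feature] <= threshold:
        path.append(f"IF {feature} <= {threshold}")
        return classify(node["left"], sample, path)
    path.append(f"IF {feature} > {threshold}")
    return classify(node["right"], sample, path)

label, rules = classify(tree, {"dusty_haze": 0.9, "large_animal": 0.8})
print(label)   # water_buffalo
print(rules)   # the exact tests that produced this answer
```

        Nothing analogous exists for a trained DNN: its "rules" are distributed across the weights, which is why attributing an output back to specific training documents is so hard.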
        • It is definitely possible: you can assign labels to your sources and make sure the labels come through; that's how we debug our neural nets. Obviously, as you say, the labels will end up being a long list for anything complex, but you could limit it to the first 5 highest-scoring hits and then give the option to scroll down. It will at least give some transparency.

    • by fermion ( 181285 )
      Not if it is trained using the WSJ. What it will learn is that the last US election was fraudulent and stolen. Lower taxes will spur the economy and eliminate the debt. Workers paid more will just waste the money buying meat. Natural disasters are never a crisis as you can just go to your second home or Cancun.
  • by Miles_O'Toole ( 5152533 ) on Monday February 20, 2023 @10:27AM (#63308107)

    They're using mainstream news media to train chatbots? No wonder they're turning into something like David 8 from Prometheus.

  • by wokka1 ( 913473 ) on Monday February 20, 2023 @10:31AM (#63308115)
    WSJ certainly doesn't mind the search engines using/scraping their data. I know this is over simplifying the concept, but the point stands.
  • First it was invasive ads, then paywalls. After that came getting governments to pass laws requiring social media and search engines to pay them when they are linked. Now they are suing OpenAI because its chatbot said that it was trained on their websites? How long until they just directly take $3 a month out of our bank accounts to "keep journalism alive"?
  • by Framboise ( 521772 ) on Monday February 20, 2023 @10:41AM (#63308149)

    Humans, like WSJ journalists, do the same as chatbots, sometimes even with better skills.

  • by Flytrap ( 939609 ) on Monday February 20, 2023 @11:06AM (#63308219)

    These companies don't own the news... they never have. They used to control the distribution of the news content in the days when owning a printing press, controlling an efficient physical delivery mechanism and having access to prime space on street poles, grocery stores and busy sidewalks was a prerequisite to being able to operate a successful newspaper or magazine business. Of course I am not forgetting the importance of the people who ran around and collected and bundled all this content into interesting news stories that we would want to read... but you know what I mean.

    I thought that this had been settled in the late nineties (or was it the early noughts) when these businesses realised that the internet was a way more efficient and faster distributor of the same time-sensitive content (ahem... news) that they were peddling a day later. I remember almost every news outlet (even CNN) experimented with trying to turn ordinary Joe (I should also say "and Jane" just to be politically correct) into a citizen journalist - they were usually the source of the story to begin with. But soon hundreds of thousands of people realised that they could tell their own story... skip the middleman. Ahhhh... I miss the internet chaos of the nineties, when blogs were cool and awesome.

    I think that the few news businesses that hire passionate journalists (who need to be more than just content gatherers) who do real journalism (that stuff that is taught in journalism schools) - as opposed to just reporting and distributing the news - are thriving. Why... because... well... the internet thingie got out of hand... who knows what is real and what is fake these days - it's just your interpretation of what you think are the facts vs my interpretation of what I think are the facts.

    Most of these companies are in the business of vacuuming up new interesting content that is already out there, curating it to fit whatever bias their audience holds, calling it news and shoveling it out. Just because the public is willing to pay for that curated delivery (because heaven forbid that we should inadvertently come across a viewpoint from a source that contradicts our own bias) does not give them ownership of the content... just its curation and delivery.

    Just my 2 cents worth.

    • They don't own the news, but they do own the stories written about the news. Under copyright, any written work, regardless of topic, is copyrighted unless the author specifically states that it is public domain.

  • There's no difference from a human reading news to gain intelligence - which is not additionally paid for.

    If OpenAI is using NYT to learn, they absolutely should pay the $14/mo for the net to have a subscription. That's obviously fair.

    They should also mark NYT as adversarial in their GAN and teach the AI how Fake News is lies and propaganda.

    If you tell it that NYT got a Pulitzer for covering up the Holodomor perhaps it can find patterns in other coverups over time that we've never discovered.

    That's the sys

  • This is like saying you're mad that Google didn't pay to index your article. You idiots, this is how you get eyeballs on your article.
  • "Major news outlets have begun criticizing OpenAI and its ChatGPT software, saying the lab is using their articles to train its artificial intelligence tool without paying them."

    I train my brain the very same way!

  • Section 107 calls for consideration of the following four factors in evaluating a question of fair use:

    Purpose and character of the use, including whether the use is of a commercial nature or is for nonprofit educational purposes: Courts look at how the party claiming fair use is using the copyrighted work, and are more likely to find that nonprofit educational and noncommercial uses are fair. This does not mean, however, that all nonprofit education and noncommercial uses are fair and all commercial uses are not fair…

  • by OldMugwump ( 4760237 ) on Monday February 20, 2023 @01:19PM (#63308661) Homepage

    Newspapers are used to train humans all the time. Schools use them; always have. Using newspaper contents to train AIs is no different, legally.

    Once I buy their paper (or get access to it online), I'm free to do anything I want with it, excepting only copying it and giving it to others (because there's copyright law about that). There are no other legal restrictions on use.

  • With all the hype about various media bodies using ChatGPT to write articles, are we now going to see ChatGPT being trained by ChatGPT output?

    What horrors will be produced by this incestuous vicious circle?

  • by drkshadow ( 6277460 ) on Monday February 20, 2023 @01:32PM (#63308717)

    It's pretty clear that ChatGPT is not outputting these news articles any more than you or I describing what we heard recently -- using _completely_ different rhetoric.

    Facts are not copyrightable. The rules of a game are not copyrightable. The presentation and representation are not copyrightable.

    It would be a hard sell to say that the transformation of the facts into another new output is copyrightable, otherwise everything that you or I say or think or put out is a derivation of something that someone, sometime had a copyright to.

    • No, but text written about facts certainly is copyrightable. If ChatGPT is regurgitating copyrighted text written by news or other publishers, they might be in violation of copyright.

  • Some of the companies complaining have paywalls, so why aren't they already covered? Seems like more of a problem for advertising-supported content, since the AIs probably don't have any spending money.

  • by account_deleted ( 4530225 ) on Monday February 20, 2023 @01:54PM (#63308803)
    Comment removed based on user account deletion
  • Expect everything from the guys who don't want to ponder how they could do better using a new tool.
  • The only reason people even kind of trust Wikipedia is the religious use of citations. AI powered chat bots should use something similar.
    • by cowdung ( 702933 )

      That is a challenge to be resolved.

      Right now it seems quite difficult. Maybe the backpropagation would need to keep track of what percentage each article added to each weight... that would be very hard and space-intensive.
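      The bookkeeping described above can be sketched for a trivial model. This assumes a hypothetical one-weight linear model y = w * x trained by gradient descent, with three made-up "articles" as training examples; tracking each source's share of the total weight movement is easy here and is exactly what becomes intractable with billions of weights:

```python
# Crude influence tracking: for a single-weight model y = w * x,
# accumulate how much gradient-driven weight movement each "article"
# (training example) contributed, then report percentages. Trivial
# for one weight; utterly impractical at LLM scale.
examples = {
    "article_A": (1.0, 2.0),   # (input x, target y); all satisfy y = 2x
    "article_B": (2.0, 4.0),
    "article_C": (0.5, 1.0),
}

w = 0.0
lr = 0.05
contrib = {name: 0.0 for name in examples}

for _ in range(200):  # epochs of plain per-example gradient descent
    for name, (x, y) in examples.items():
        grad = 2 * (w * x - y) * x        # d/dw of squared error (w*x - y)^2
        w -= lr * grad
        contrib[name] += abs(lr * grad)   # this example's share of movement

total = sum(contrib.values())
for name, c in contrib.items():
    print(f"{name}: {100 * c / total:.1f}% of total weight movement")
print(f"learned w = {w:.3f}")  # converges near 2.0
```

      Even this crude per-source tally triples the bookkeeping per weight update; scaling the idea to a model with billions of weights and a corpus of billions of documents is the space problem the comment above alludes to.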

  • If you put it on the internet it's no longer just yours.
  • Because I'm pretty sure that's all the publisher has a right to demand. If you pay for a subscription, you get to read the material. I don't see how a publisher can assert the right to charge extra for who or what reads the already-paid-for material.
