'New York Times' Considers Legal Action Against OpenAI As Copyright Tensions Swirl

Lawyers for the New York Times are deciding whether to sue OpenAI to protect the intellectual property rights associated with its reporting. NPR reports: For weeks, the Times and the maker of ChatGPT have been locked in tense negotiations over reaching a licensing deal in which OpenAI would pay the Times for incorporating its stories in the tech company's AI tools, but the discussions have become so contentious that the paper is now considering legal action. A lawsuit from the Times against OpenAI would set up what could be the most high-profile legal tussle yet over copyright protection in the age of generative AI. A top concern for the Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff. If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from the Times, the need to visit the publisher's website is greatly diminished, said one person involved in the talks.

So-called large language models like ChatGPT have scraped vast parts of the internet to assemble data that inform how the chatbot responds to various inquiries. The data-mining is conducted without permission. Whether hoovering up this massive repository is legal remains an open question. If OpenAI is found to have violated any copyrights in this process, federal law allows for the infringing articles to be destroyed at the end of the case. In other words, if a federal judge finds that OpenAI illegally copied the Times' articles to train its AI model, the court could order the company to destroy ChatGPT's dataset, forcing the company to recreate it using only work that it is authorized to use. Federal copyright law also carries stiff financial penalties, with violators facing fines up to $150,000 for each infringement "committed willfully."
Yesterday, Adweek reported that the New York Times updated its Terms of Service to prohibit its content from being used in the development of "any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system."
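
For a sense of the technical side of "permission": the long-standing, voluntary signal for crawlers is a site's robots.txt file, which is distinct from the Terms of Service change described above. Below is a minimal sketch of a crawler that checks it before ingesting a page, using only Python's standard library; the user-agent name and URL are illustrative assumptions, not any real crawler's.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical crawler name, purely illustrative

def may_fetch(url: str) -> bool:
    """Consult the site's robots.txt before fetching a page."""
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # download and parse robots.txt
    return robots.can_fetch(USER_AGENT, url)

if __name__ == "__main__":
    print(may_fetch("https://www.nytimes.com/section/technology"))
```

Whether ignoring such signals, or the ToS, amounts to copyright infringement is precisely the open question in this dispute.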
Comments:

  • A Modest Proposal (Score:4, Interesting)

    by spaceman375 ( 780812 ) on Thursday August 17, 2023 @08:54PM (#63776428)
    I've always felt that giving a corporation legal status as a "person" is a bad idea, but let's run with it. An AI would be the child of such a person. Children go to school for specific learning, via approved courses of knowledge. They also have full access to anything in the public domain, as well as a thriving ecosystem of media producers vying for their attention. They enjoy a protected status, and are grown on somewhat curated datasets. So why not do the same for AIs? Regulate what can go into AI training and turn this on its head: if only certified content can go into a training dataset, people are going to start competing for that certification so their viewpoint/marketing can be included, just like publishers and special interest groups are very involved in what goes into textbooks and kids' education. Eventually we can talk about the "kid" reaching adulthood and gaining full responsibility for its own actions. Until then, like a parent, the corp that is making it is liable for its actions.
    • by ShanghaiBill ( 739463 ) on Thursday August 17, 2023 @08:59PM (#63776440)

      The problem with your analogy is that when a human reads a newspaper and learns from it, they aren't accused of violating copyright.

      • The human is assumed to have paid for that newspaper. So should the AI. Like licensing a seat for enterprise software if you need to train a lot of AIs.
        • by blue trane ( 110704 ) on Thursday August 17, 2023 @09:23PM (#63776490) Homepage Journal

          So if OpenAI has a subscription, it's fair use?

          • Fair or not, copyright holders place restrictions on what you can do with their material, even if you have paid for access.

            It's kind of like a "click through EULA" for software.

            The flip side of this is when a "recording artist" objects to their music used to introduce a political candidate of a differing party or point-of-view.

            If the campaign has paid all of the royalties and license fees to ASCAP or BMI, the "recording artist" can fume all they want, but they kind of signed away their rights by the

            • If a court holds that the ToS is valid, search engines will simply delist them, as they all use machine learning after scraping pages to decide what is and is not relevant in response to search queries. If the court deems the ToS invalid, then even more people will see themselves as idiots for paying for their content. In both cases they will lose large amounts of revenue in addition to expensive legal fees, only to need to do this dance many times over with much larger companies too.
          • Signing up for the subscription would involve a contract and agreeing to terms. If they just scrape articles published open to the public, they aren't agreeing to anything. Google recently won a decision regarding song lyrics it scraped: it was decided Google hadn't agreed to the terms published on the site just because they were posted on some other page.
          • What if it has a library card?
          • Ha - the gray lady has no working way to subscribe to it.
      • The much bigger problem with the analogy is that ChatGPT isn't "learning" from the newspaper. It is using it to perform statistical analysis of which words are most likely to follow other words, and using that to programmatically generate content.

        In other words, the text it ingests is the source code for the program it is running.

        So it is like taking the source code for Microsoft Office and compiling it for a different binary target. Yes the assembly output will look different, possibly completely different,
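
        To make the "which words follow which words" point concrete, here is a toy bigram next-word model; it is only an illustrative sketch, since real LLMs are transformers with billions of learned weights rather than lookup tables of counts.

        ```python
        import random
        from collections import Counter, defaultdict

        def train_bigram(text):
            """For each word, count which words follow it and how often."""
            words = text.split()
            follows = defaultdict(Counter)
            for prev, nxt in zip(words, words[1:]):
                follows[prev][nxt] += 1
            return follows

        def generate(follows, start, length=10):
            """Sample each next word in proportion to its observed frequency."""
            out = [start]
            for _ in range(length):
                counts = follows.get(out[-1])
                if not counts:
                    break
                choices, weights = zip(*counts.items())
                out.append(random.choices(choices, weights=weights)[0])
            return " ".join(out)

        # Tiny illustrative corpus; any text works.
        model = train_bigram("the times sued openai and the times updated its terms")
        print(generate(model, "the"))
        ```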

      • The problem with your analogy is that computers aren't people.
    • by AmiMoJo ( 196126 )

      I imagine that OpenAI is not keen on anything that even hints at AI having any similarity to a human child, lest people start thinking about the rights of that child.

      Besides, this is a well-established area of law. There is an industrial process used to create AI, involving the accumulation of massive amounts of data and then processing it on a huge machine. AI companies like to use copyright and patents to protect what they make.

    • It seems to me that the question you're asking is, "Does the holder of a copyright have the right to say who or what can consume the copywritten material beyond requiring payment?"

      Granted, I may be inferring that improperly as it is the question I am asking, and the one I see as being the fundamental question in this case.

      • by tsqr ( 808554 )

        It seems to me that the question you're asking is, "Does the holder of a copyright have the right to say who or what can consume the copywritten material beyond requiring payment?"

        I think the answer to that question has nothing to do with copyright, and everything to do with whether the prospective consumer is a member of a protected class, at which point anti-discrimination law applies. If I've licensed an agent to manage distribution of my material and our contract doesn't specify restrictions, then the agent may have the right to restrict distribution as they see fit, again in compliance with law.

        And please - material isn't "copywritten"; it's "copyrighted". Sheesh.

  • Just buy it (Score:5, Insightful)

    by ShanghaiBill ( 739463 ) on Thursday August 17, 2023 @08:55PM (#63776430)

    OpenAI is owned by Microsoft.

    Microsoft has $110B cash on hand.

    NYTimes has a market cap of $7B.

    So just buy it. Shut down the dead-tree version and put the reporters to work writing fodder for LLMs.

    • Carlos Slim (pronounced with an Arabic inflection as "Ess-lim"), the telecom tycoon from Mexico who has family connections to the ultra-right-wing Christian faction in Lebanon, may not be interested in selling his interest.

      There are entities that even Microsoft needs to give a second thought to messing with.

    • by AmiMoJo ( 196126 )

      Then every other struggling publication will sue them in the hopes of being bought out too.

    • OpenAI is owned by Microsoft.

      Not sure where you got that idea, but it's untrue. While Microsoft is OpenAI's single largest investor, the two are actively competing against each other for cloud contracts: people can sign up for and use OpenAI's products as they already are, but Microsoft provides a separate (and confusingly named) "Azure OpenAI" product as a wrapper around OpenAI's core functionality, layering some additional APIs and other nice-to-haves on top.

      Bigger picture, OpenAI is still a non-profit organization that has Microsoft

  • by sinij ( 911942 ) on Thursday August 17, 2023 @09:48PM (#63776534)
    Such a lawsuit might backfire by defining derivative works. This would not be advantageous for the NYT.
    • by evanh ( 627108 )

      How do news organisations share stories? AP, for example, feeds many news publications worldwide, AFAIK. I imagine there is a fee for access to all that.

      Seems reasonable for OpenAI to be contributing to its sources.

  • How did a Slashdot editor not make that "America's New York Times"?

    • New York in England has a population of 150 and doesn't have its own local newspaper, much less one called the Times.

  • by VeryFluffyBunny ( 5037285 ) on Friday August 18, 2023 @02:41AM (#63776936)
    To me this sounds like IP legal shenanigans gone awry. What LLMs do is basically synthesis, i.e. taking information from a variety of sources and generating a summary or report. It's what teachers & lecturers have been asking their students to do for centuries. An LLM is essentially a machine that can do this quasi-mechanically (without actually understanding the meaning of the content of the texts).

    In short, if they're gonna go after AI companies, on the same principle, they're gonna have to go after schools, colleges, & universities too. In fact, anyone who does meta-studies, writes reports or literature reviews, or hands in academic essays.

    We have the media industry to thank for this state of affairs. I'm all for authors & artists getting their fair dues but that isn't what happens. 99% of creators get underpaid in precarious, unfair contract work (e.g. check out the working conditions for most of the people who create content for the NYT). By now, the big media companies are just a bunch of rent-seekers.

    Meanwhile, those very same media companies are looking into AI to generate their own content so that they can pay even fewer people less money on more precarious contracts to produce the content that they seek rent over.

    Synthesis per se isn't copyright infringement, never has been, & never should be.
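
    The "synthesis machine" framing is easy to picture in code: asking a model to synthesize several sources is a single API call. Below is a minimal sketch against the 2023-era OpenAI Python library (the pre-1.0 interface); the model choice, prompt, and placeholder sources are illustrative assumptions.

    ```python
    import openai  # pip install openai==0.27.*  (2023-era, pre-1.0 interface)

    # The library reads OPENAI_API_KEY from the environment automatically.
    # Placeholder sources; in the scenario discussed these would be articles.
    sources = [
        "Article A: ...",
        "Article B: ...",
    ]

    # One call: ask the model to synthesize the sources into a short report.
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[
            {"role": "system",
             "content": "Synthesize the sources into a one-paragraph summary."},
            {"role": "user", "content": "\n\n".join(sources)},
        ],
    )
    print(response["choices"][0]["message"]["content"])
    ```

    Whether the output of such a call infringes depends on what it actually reproduces, which is what the replies below argue about.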
    • Copyright covers specific expression, not ideas.
      • The training sets are straight copies and popular works can be retrieved with a little prompt engineering. The AI companies can build expert systems to prevent such prompts and after the fact nearest neighbour searches to filter out results too close to stuff from the training set, but that's dangerous in and of itself. A judge could see it as an admission of guilt and attempted coverup.
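
        The nearest-neighbour filtering described above could be approximated cheaply with n-gram overlap rather than embeddings. A hedged sketch, with the shingle size and threshold invented for illustration:

        ```python
        def ngrams(text, n=8):
            """Set of all n-word shingles in a text."""
            words = text.lower().split()
            return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

        def too_close(output, corpus_docs, threshold=0.2):
            """Flag an output if a large share of its 8-grams appears verbatim
            in any training document -- a crude verbatim-copy detector."""
            out_grams = ngrams(output)
            if not out_grams:
                return False
            return any(
                len(out_grams & ngrams(doc)) / len(out_grams) >= threshold
                for doc in corpus_docs
            )

        # Toy usage: the output repeats the document nearly verbatim, so it is flagged.
        doc = "the quick brown fox jumps over the lazy dog and runs away fast"
        print(too_close("the quick brown fox jumps over the lazy dog and runs", [doc]))
        ```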

        • The training sets are straight copies and popular works can be retrieved with a little prompt engineering.

          Neither text nor image training data can be recovered faithfully from LLMs. Sometimes you can get relatively sizable chunks of text to come out verbatim, but not whole works unless 1) the whole work is very short, and 2) the model is grossly overtrained on it. Every attempt to show how these models supposedly plagiarize reveals that. They are not thinking, but neither are they storing works verbatim, nor enough information to reconstruct them from a prompt unless the system is deliberately trained to that e

          • The training sets are straight copies and popular works can be retrieved with a little prompt engineering.

            Neither text nor image training data can be recovered faithfully from LLMs. Sometimes you can get relatively sizable chunks of text to come out verbatim,

            There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".

            • There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".

              First of all, when a human does it, we call it memorization and subsequent recitation. It's only plagiarism when they try to pass it off as their own.

              Second, the argument was that it's reproducing works. But with the publicly available/accessible models it is not possible to faithfully reproduce works of any consequence, even when they are overtrained on specific material. They can manage a pretty good rendition, but then so can a number of humans. The only way you're getting even long passages of a written

              • There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".

                First of all, when a human does it, we call it memorization and subsequent recitation. It's only plagiarism when they try to pass it off as their own.

                True but irrelevant. The "AI" text generators do not cite their sources. When asked to, they are likely to just make up sources [ycombinator.com] that sound real [mq.edu.au].

                But if the AI did produce sizable chunks of text verbatim while crediting the authors, that would indeed not technically be plagiarism, which is taking credit for others' work.

          • If they can't do better than Stable Diffusion (without cheating), then it won't look good in court; 0.03% of images being retrievable is pretty bad.

            It's likely less overfitting and more that learned archetypes are hidden in the weights; sometimes, when a training image is close enough, it simply gets picked as the archetype and becomes retrievable.

    • by narcc ( 412956 )

      LLMs are notoriously bad at summarizing text, which should come as no surprise given how they function. It's disturbing how many people rely on these things to do just that, never realizing just how poor the results tend to be. Belief in this myth has already caused real harm.

      I do agree that I don't see anything wrong with using publicly accessible text as training data, even if it's protected by copyright, though for different reasons. A lot of people seem to think that the models somehow create or main

  • You can't copyright facts, only the exact way they're presented. So it's nearly impossible for an AI skimming NYT articles not to qualify as fair use, let alone to amount to a copyright violation.
    • NYT just has to find an Obama judge to agree with them. The law is no longer important; only tribe is important.

  • Well, if ChatGPT is referencing NY Times articles in its responses, no wonder its accuracy is considered questionable.

  • Why is New York Times in quotes in the headline? Is this the so-called 'New York Times'? Names of periodicals are generally italicized. And if that isn't an option, they are left in standard font. Quotes are for titles of articles, or actual quotes.
