'New York Times' Considers Legal Action Against OpenAI As Copyright Tensions Swirl 57
Lawyers for the New York Times are deciding whether to sue OpenAI to protect the intellectual property rights associated with its reporting. NPR reports: For weeks, the Times and the maker of ChatGPT have been locked in tense negotiations over reaching a licensing deal in which OpenAI would pay the Times for incorporating its stories in the tech company's AI tools, but the discussions have become so contentious that the paper is now considering legal action. A lawsuit from the Times against OpenAI would set up what could be the most high-profile legal tussle yet over copyright protection in the age of generative AI. A top concern for the Times is that ChatGPT is, in a sense, becoming a direct competitor with the paper by creating text that answers questions based on the original reporting and writing of the paper's staff. If, when someone searches online, they are served a paragraph-long answer from an AI tool that refashions reporting from the Times, the need to visit the publisher's website is greatly diminished, said one person involved in the talks.
So-called large language models like ChatGPT have scraped vast parts of the internet to assemble data that inform how the chatbot responds to various inquiries. The data-mining is conducted without permission. Whether hoovering up this massive repository is legal remains an open question. If OpenAI is found to have violated any copyrights in this process, federal law allows for the infringing articles to be destroyed at the end of the case. In other words, if a federal judge finds that OpenAI illegally copied the Times' articles to train its AI model, the court could order the company to destroy ChatGPT's dataset, forcing the company to recreate it using only work that it is authorized to use. Federal copyright law also carries stiff financial penalties, with violators facing fines up to $150,000 for each infringement "committed willfully." Yesterday, Adweek reported that the New York Times updated its Terms of Service to prohibit its content from being used in the development of "any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system."
A Modest Proposal (Score:4, Interesting)
Re:A Modest Proposal (Score:4, Insightful)
The problem with your analogy is that when a human reads a newspaper and learns from it, they aren't accused of violating copyright.
Re: (Score:1)
Re: A Modest Proposal (Score:5, Insightful)
So if OpenAI has a subscription, it's fair use?
No, not really (Score:2)
Fair or not, copyright holders place restrictions on what you can do with their material, even if you have paid for access.
It's kind of like a "click through EULA" for software.
The flip side of this is when a "recording artist" objects to their music used to introduce a political candidate of a differing party or point-of-view.
If the campaign has paid all of the royalties and license fees to ASCAP or BMI, the "recording artist" can fume all they want, but they kind of signed away their rights by the terms of that blanket license.
NYT is finished (Score:2)
Re: (Score:2)
Re: A Modest Proposal (Score:2)
Re: (Score:2)
Re: (Score:2)
The much bigger problem with the analogy is that ChatGPT isn't "learning" from the newspaper. It is using it to perform statistical analysis of which words are most likely to follow other words, and using that to programmatically generate content.
In other words, the text it ingests is the source code for the program it is running.
So it is like taking the source code for Microsoft Office and compiling it for a different binary target. Yes, the assembly output will look different, possibly completely different,
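For what it's worth, the "statistical analysis of what words follow other words" idea can be illustrated with a toy bigram model. This is a deliberately simplified sketch of the general technique, not how GPT-class models actually work internally (they use learned neural representations, not raw counts), and the corpus here is made up:

```python
import random
from collections import defaultdict, Counter

def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    words = text.split()
    counts = defaultdict(Counter)
    for a, b in zip(words, words[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length=10, seed=0):
    """Walk the chain, sampling each next word in proportion to how often it followed the current one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = counts.get(out[-1])
        if not followers:
            break
        words, weights = zip(*followers.items())
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

corpus = "the cat sat on the mat and the cat ran"
model = train_bigrams(corpus)
print(generate(model, "the", length=5))
```

Even this trivial version shows the tension in the debate: the output is generated word-by-word from statistics, yet with a small enough corpus it can reproduce input phrases verbatim.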
Re: (Score:2)
Re: (Score:2)
I imagine that OpenAI is not keen on anything that even hints at AI having any similarity to a human child, lest people start thinking about the rights of that child.
Besides, this is a well established area of law. There is an industrial process used to create AI, involving the accumulation of massive amounts of data and then processing it on a huge machine. AI companies like to use copyright and patents to protect what they make.
Complicated way to ask a simple question (Score:2)
Granted, I may be inferring that improperly as it is the question I am asking, and the one I see as being the fundamental question in this case.
Re: (Score:2)
It seems to me that the question you're asking is, "Does the holder of a copyright have the right to say who or what can consume the copywritten material beyond requiring payment?"
I think the answer to that question has nothing to do with copyright, and everything to do with whether the prospective consumer is a member of a protected class, at which point anti-discrimination law applies. If I've licensed an agent to manage distribution of my material and our contract doesn't specify restrictions, then the agent may have the right to restrict distribution as they see fit, again in compliance with law.
And please - material isn't "copywritten"; it's "copyrighted". Sheesh.
Just buy it (Score:5, Insightful)
OpenAI is owned by Microsoft.
Microsoft has $110B cash on hand.
NYTimes has a market cap of $7B.
So just buy it. Shut down the dead-tree version and put the reporters to work writing fodder for LLMs.
NYT already bought by Lebanese neo-Nazis (Score:1)
Carlos Slim (pronounced with an Arabic inflection as "Ess-lim"), the telecom tycoon from Mexico who has family connections to the ultra-right-wing Christian faction in Lebanon, may not be interested in selling his interest.
There are entities that even Microsoft needs to give a second thought to messing with.
Re: (Score:1)
Re: (Score:3)
Then every other struggling publication will sue them in the hopes of being bought out too.
Re: (Score:2)
OpenAI is owned by Microsoft.
Not sure where you got that idea, but it's untrue. While Microsoft is OpenAI's single largest investor, the two are actively competing against each other for cloud contracts: people can sign up for and use OpenAI's products as they already are, but Microsoft provides a separate (and confusingly named) "Azure OpenAI" product as a wrapper around OpenAI's core functionality, layering some additional APIs and other nice-to-haves on top.
Bigger picture, OpenAI is still a non-profit organization that has Microsoft
Defining derivative work (Score:4, Interesting)
Re: (Score:2)
How do news organisations share stories? AP, for example, feeds many news publications worldwide AFAIK. I imagine there is a fee for access to all that.
Seems reasonable for OpenAI to be contributing to its sources.
Re: (Score:2)
There are three different aspects at play.
One, simply using their articles as training data is fine, as that isn't protected by copyright.
...
Uh, maybe. Depends on what the AI does with the training data. The "training" could teach the AI "when X is the topic, write the paragraph ABCX to respond."
You may say "but that's not how large language models work", but I will assert that you don't know that. The internal workings of these models are completely obscure.
Re: (Score:2)
Uh, maybe. Depends on what the AI does with the training data. The "training" could teach the AI "when X is the topic, write the paragraph ABCX to respond."
You may say "but that's not how large language models work", but I will assert that you don't know that. The internal workings of these models are completely obscure.
Fail.
In Law, could is not the issue. Does or did is the issue.
Further, to make a claim, the claimant must show proof (or at least credibly allege that they can show proof). You state that there is no proof of the claim.
Such a filing would be dismissed by the court as a matter of law. There would be no discovery, no trial.
Could and Should are not matters of law, they are matters of philosophy.
Re: (Score:2)
If it were charged in court, proof would consist of showing a segment of the AI's output that is identical to some of the training input.
Re: (Score:2)
Simply using their articles as training data makes copies during the process, if it's not deemed fair use or having an implicit license they're proper fucked.
America's (Score:2)
How did a Slashdot editor not make that "America's New York Times"?
Re: (Score:2)
New York in England has a population of 150 and doesn't have its own local newspaper, much less one called the Times.
Re: (Score:2)
Local papers in [old] York (population approx 200,000) are The York Press, and The Yorkshire Post.
Synthesis (Score:3)
In short, if they're gonna go after AI companies, on the same principle, they're gonna have to go after schools, colleges, & universities too. In fact, anyone who does meta-studies, writes reports or literature reviews, or hands in academic essays.
We have the media industry to thank for this state of affairs. I'm all for authors & artists getting their fair dues but that isn't what happens. 99% of creators get underpaid in precarious, unfair contract work (e.g. check out the working conditions for most of the people who create content for the NYT). By now, the big media companies are just a bunch of rent-seekers.
Meanwhile, those very same media companies are looking into AI to generate their own content so that they can pay even fewer people less money on more precarious contracts to produce the content that they seek rent over.
Synthesis per se isn't copyright infringement, never has been, & never should be.
Re: (Score:3)
Re: (Score:2)
The training sets are straight copies and popular works can be retrieved with a little prompt engineering. The AI companies can build expert systems to prevent such prompts and after the fact nearest neighbour searches to filter out results too close to stuff from the training set, but that's dangerous in and of itself. A judge could see it as an admission of guilt and attempted coverup.
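A crude version of the "after the fact nearest neighbour search" filter described above could be sketched like this. This is a toy example using Python's `difflib` string similarity; a real system would more likely compare embeddings, and the training snippets and threshold here are purely hypothetical:

```python
import difflib

# Hypothetical stand-in for a training corpus index.
TRAINING_SNIPPETS = [
    "Lawyers for the New York Times are deciding whether to sue OpenAI.",
    "Federal copyright law carries stiff financial penalties.",
]

def too_close(candidate, corpus, threshold=0.9):
    """Return True if the candidate output is nearly identical to any training snippet."""
    for snippet in corpus:
        ratio = difflib.SequenceMatcher(None, candidate.lower(), snippet.lower()).ratio()
        if ratio >= threshold:
            return True
    return False

# A verbatim regurgitation would be blocked; unrelated text passes.
print(too_close("Lawyers for the New York Times are deciding whether to sue OpenAI.", TRAINING_SNIPPETS))
print(too_close("A completely unrelated sentence about cats.", TRAINING_SNIPPETS))
```

The poster's point stands either way: the very existence of such a filter implies the operator knows near-verbatim output is possible, which is exactly what a plaintiff's lawyer would highlight.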
Re: (Score:2)
The training sets are straight copies and popular works can be retrieved with a little prompt engineering.
Neither text nor image training data can be recovered faithfully from LLMs. Sometimes you can get relatively sizable chunks of text to come out verbatim, but not whole works unless 1) the whole work is very short, and 2) the model is grossly overtrained on it. Every attempt to show how these models supposedly plagiarize reveals that. They are not thinking, but neither are they storing works verbatim, nor enough information to reconstruct them from a prompt unless the system is deliberately trained to that end.
Re: (Score:2)
The training sets are straight copies and popular works can be retrieved with a little prompt engineering.
Neither text nor image training data can be recovered faithfully from LLMs. Sometimes you can get relatively sizable chunks of text to come out verbatim,
There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".
Re: (Score:2)
There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".
First of all, when a human does it, we call it memorization and subsequent recitation. It's only plagiarism when they try to pass it off as their own.
Second, the argument was that it's reproducing works. But with the publicly available/accessible models it is not possible to faithfully reproduce works of any consequence, even when they are overtrained on specific material. They can manage a pretty good rendition, but then so can a number of humans. The only way you're getting even long passages of a written
Re: (Score:2)
There is an English word for the phrase "sizable chunks of text to come out verbatim." That word is "plagiarism".
First of all, when a human does it, we call it memorization and subsequent recitation. It's only plagiarism when they try to pass it off as their own.
True but irrelevant. The "AI" text generators do not cite their sources. When asked to, they are likely to just make up sources [ycombinator.com] that sound real [mq.edu.au].
But if the AI did produce "sizable chunks of text verbatim" while crediting the authors, that would indeed not technically be plagiarism, since plagiarism means taking credit for others' work.
Re: (Score:2)
If they can't do better than stable diffusion (without cheating) then it won't look good in court; 0.03% of images being retrievable is pretty bad.
It's likely less overfitting and more that the weights encode learned archetypes; when a training image is close enough, it just gets picked as the archetype and becomes retrievable.
Re: (Score:2)
LLMs are notoriously bad at summarizing text, which should come as no surprise given how they function. It's disturbing how many people rely on these things to do just that, never realizing just how poor the results tend to be. Belief in this myth has already caused real harm.
I do agree that I don't see anything wrong with using publicly accessible text as training data, even if it's protected by copyright, though for different reasons. A lot of people seem to think that the models somehow create or main
This is nonsense (Score:2)
Re: (Score:1)
NYT just has to find an Obama judge to agree with them. The law is no longer important; only tribe is important.
Accuracy? (Score:2)
Well, if ChatGPT is referencing NY Times articles in its responses, no wonder its accuracy is considered questionable.
Why is New York Times in quotes in the headline? (Score:2)