FSF Threatens Anthropic Over Infringed Copyright: Share Your LLMs Freely (fsf.org)
In 2024 Anthropic was sued over claims it infringed copyrights when training LLMs.
But as they try to settle, they may have a problem. The Free Software Foundation announced Friday that Anthropic's training data apparently even included the book "Free as in Freedom: Richard Stallman's Crusade for Free Software" — for which the Free Software Foundation holds a copyright. It was published by O'Reilly and by the FSF under the GNU Free Documentation License (GNU FDL). This is a free license allowing use of the work for any purpose without payment.
Obviously, the right thing to do is protect computing freedom: share complete training inputs with every user of the LLM, together with the complete model, training configuration settings, and the accompanying software source code. Therefore, we urge Anthropic and other LLM developers that train models using huge datasets downloaded from the Internet to provide these LLMs to their users in freedom.
We are a small organization with limited resources and we have to pick our battles, but if the FSF were to participate in a lawsuit such as Bartz v. Anthropic and find our copyright and license violated, we would certainly request user freedom as compensation.
"The FSF doesn't usually sue for copyright infringement," reads the headline on the FSF's announcement, "but when we do, we settle for freedom."
Re:Ducks (Score:5, Informative)
That is not a quote from Stallman.
That is from a statement from Krzysztof Siewicz, and I would assume it's just an odd turn of phrase from someone who mostly speaks something other than English.
RTFA is alive and well.
Re:Ducks (Score:5, Funny)
RTFA is alive and well.
Has right to read, doesn't use it. A true American. I salute him.
Re:Ducks (Score:5, Insightful)
Presumably it means they are demanding the models be released under a free license.
Here's the thing with RMS. He's always tended to be the most "extreme" of the free/open source advocates, but he also has a history of being right. A lot of those "extreme" predictions have ended up being dead on the money.
The only place I think the FSF ever really fucked up was with the AGPL license, which has basically been used as a sort of shareware license by server software devs. But given the gobsmacking amount of contributions the FSF has made to software, you can forgive maybe that one screw-up.
Re:Ducks (Score:5, Insightful)
That is the problem. "The Right to Read" was visionary and will very soon be reality.
Given how much capitalism insists on copyright and prosecution when it comes to THEIR works, how they get custom-made laws like the DMCA passed just to protect their rights... well, let's just say that if the big AI models weren't from the corporate sector but had been created by nerds on github, the copyright police would already have broken down our doors to arrest us all for copyright infringement.
So please, please, pretty please, let them have a dose of their own medicine. Heck, let the courts classify LLMs as "software" and find just one instance of the training data containing GPL3 content. Whoopsie, all your code belongs to us.
Re:Ducks (Score:4, Insightful)
> Heck, let the courts classify LLMs as "software" and find just one instance of the training data containing GPL3 content. Whoopsie, all your code belongs to us.
This could get a bit more interesting: considering the "human authorship requirement" for copyright, which currently stands [1], AI-generated code might not be copyrightable at all, essentially making every vibe-coded file part of the public domain.
One thing I don't know is whether it's clear yet that LLMs sucking up everything counts as "fair use" and transformative.
[1] https://www.morganlewis.com/pu... [morganlewis.com]
Re:Ducks (Score:4)
The ultimate endpoint of vibe-coding: no AI code is copyrightable, it's all GPL by default. That sounds like a great idea. I would support that.
You can compile it, use it, copy it, sell it, improve it, release the source... keep going. If people want to compile and use it themselves, so be it.
Re: (Score:2)
Not quite. AI-generated code is not copyrightable at all. It's not GPL or anything else by default, and can't be under any license I'm aware of. It could still be intellectual property/trade secret though. Nobody is required to release AI-generated code, but nothing is necessarily stopping the AI/LLM from generating that same code for someone else.
Re: (Score:2)
Well, kind of. It becomes *illegal* if it's not GPLed. That doesn't give the end user the automatic right to GPL the model, though. It's up to the model creator to either withdraw the model or GPL the model.
It doesn't make the model's outputs GPL either, as model outputs can't be copyrighted, and no copyright means no GPL. And public domain puts no obligation at all on the holder.
Then it gets worse when you consider it's all muddled up with GPL-incompatible code.
And to add to that giant hairball, is the result of th
Re: (Score:1)
Can you name one of his extreme (unreasonable) stances which was correct?
Re: (Score:2)
Presumably, the ask/demand is to release the weights of the model, and possibly its training regime so that it can be replicated.
Frankly, it IS kind of a weird ask.
So the lawsuit is about book piracy. It's not that Anthropic used copyrighted data to train its models; it's that it pirated books (downloaded them to a computer without a license).
If they had been legal copies of the books, what Anthropic did with them would have been legal (under current jurisprudence, it's fair use)
Re:Ducks (Score:4)
For a model to be truly open, you'd need to publish all the training data and the steps needed to reproduce the build (training). NONE of the current models can be called open source because nearly all of them are trained on proprietary data that can't be republished. That's the big issue with all the models that are free (as in beer) yet walled off behind subscriptions.
Model developers are trying to claim these are some kind of "clean room." You train the model, it keeps a bunch of weights but can't reproduce the original training data, and it magically produces new stuff based off that data that's not exactly like it. It's what can allow open-source software to be "rewritten" as something less-open.
The trouble is, some of the closed models can make reproductions of copyrighted works that are 90%+ the same as the original, like chapters from Harry Potter:
https://arxiv.org/abs/2505.125... [arxiv.org]
I have a suspicion the Anthropic models are much larger than people think; like 650GB+ in active VRAM on these megaclusters. If they do have that many weights/nodes
We're in pretty dangerous waters here, honestly. The AI zealots don't seem to understand that, or even what they're using. When I said "Weighted Random Code Generator," a friend of mine refused to continue talking to me because I used a "slur."
Re: (Score:2)
Wasn't one of the first open source scrapes of the internet (text and images, I think) around 80TB to download? I can see smaller IT businesses, labs, and universities being able to use something in that range. Lots of schools had decent-sized clusters.
Re: (Score:1, Insightful)
"some of the closed models can make reproductions of copyrighted works that are 90%+ the same as the original, like chapters from Harry Potter:"
Some people can probably quote a chapter from Harry Potter too.
It doesn't mean those people shouldn't be allowed to read any books.
"Use of the work for any purpose without pa (Score:5, Insightful)
I think the relevant language actually is: "This License is a kind of 'copyleft', which means that derivative works of the document must themselves be free in the same sense. It complements the GNU General Public License, which is a copyleft license designed for free software." Because the LLM is a derived work, it arguably must be free "in the same sense."
If it really were as permissive as described, there'd be no basis to make the demands described.
Re:"Use of the work for any purpose without (Score:5, Funny)
This License is a kind of "copyleft"
As opposed to all of the LLMs, which use more of a "copytheft" license.
Re:"Use of the work for any purpose without (Score:4, Interesting)
IP has always been an attempt to have it both ways for our general benefit.
Of course, greedy people had to come along and ruin that, and it would be extremely ironic if the attempt to prevent corporations having all the IP rights and average citizens having none achieved the opposite. Just imagine what happens if the current IP system gets extended into meat; if you study copyrighted material, you can never work again on anything that might be considered a product of that knowledge without paying a license to the IP owner.
Re:"Use of the work for any purpose without (Score:4, Interesting)
If I read the book and use what I learned from it in my (paid) work, maybe even quoting from it, does that constitute a derivative work?
The modern approach is to use the abstraction/filtration/comparison test [zerobugsan...faster.net] to figure out which parts are derived (including the quote) and which parts are original. Once the derived parts are determined, the defendant can assert a "fair use" defense if desired, and the courts will decide.
Re:"Use of the work for any purpose without (Score:5, Insightful)
That's not the real question, that's a silly distraction. There are a ton of literal copies made long before the LLM outputs anything to users.
If training is fair use, the final output is too. Bartz v. Anthropic ruled it fair use, which I think was insane ... but what judge will cripple a multi-trillion dollar industry over sanity? Need some pretty big balls.
Re:"Use of the work for any purpose without (Score:5, Insightful)
One thing to consider is that when you quote/sample/cite facts from some other work, it's static. You might have read the entire thing, but your paper will only ever have those two quotes in it.
The model itself continues to be used to generate outputs over and over again, and may eventually write out quite a lot of the original work.
But, but, but... the model does not contain the original works. Well, that is true and it isn't. Yes, it might be just a bunch of tokens and weights, but PCM is just a bunch of integer representations of amplitude values for a waveform at intervals, not the original waveform, nor can it reproduce exactly the original analog wave as picked up by, say, a mic; yet nobody would argue that if I fed my phono outputs into my PC sound card and produced a WAV, it would be any less infringing than if I copied a CD directly.
Just because you crank your MP3 compression down to 32kbps and it sounds like crap does not magically make your CD rip non-infringing either, even though it is very lossy.
A real question is how lossy is so lossy that the original is no longer represented, because I think you could argue a lot of these ML models are effectively really, really lossy encodings of the entire library they are trained on.
Anyway, fingers crossed the FSF wins this one. I can think of few developments that would be more 'exciting' than the courts ruling that models fundamentally infringe on their training content and can't be commercialized unless they are trained entirely on public domain and gratis-licensed content, or on content entirely owned or appropriately licensed by the developer. Essentially ending frontier models would sell a ton of popcorn!
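The lossy-encoding analogy above can be made concrete with a toy sketch (purely illustrative; nothing here resembles how a real model stores anything). Quantizing a sampled waveform to a few bits throws away detail, the way a 32kbps rip does, yet the result still tracks the original closely enough to be unmistakably derived from it:

```python
import math

# Sample a 1 kHz sine wave at 8 kHz (a stand-in for the "original work").
original = [math.sin(2 * math.pi * 1000 * n / 8000) for n in range(64)]

def quantize(signal, bits):
    """Crude uniform quantizer: a deliberately lossy encoding."""
    levels = 2 ** bits
    step = 2.0 / levels
    return [round(s / step) * step for s in signal]

lossy = quantize(original, bits=3)  # very low fidelity, on purpose

# The encoding is lossy: samples no longer match the original exactly...
max_error = max(abs(a - b) for a, b in zip(original, lossy))
assert max_error > 0

# ...yet it still correlates almost perfectly with the original, i.e. it is
# recognizably a degraded copy rather than something new.
mean_o = sum(original) / len(original)
mean_l = sum(lossy) / len(lossy)
cov = sum((a - mean_o) * (b - mean_l) for a, b in zip(original, lossy))
var_o = sum((a - mean_o) ** 2 for a in original)
var_l = sum((b - mean_l) ** 2 for b in lossy)
correlation = cov / math.sqrt(var_o * var_l)
print(f"max error: {max_error:.3f}, correlation: {correlation:.3f}")
```

The open question the comment raises is exactly where on that fidelity axis "derived copy" ends and "something else" begins; the sketch only shows that lossiness alone doesn't sever the link to the source.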
Re: (Score:1)
"The model it self continues to be used generate outputs over and over again, and may eventually write out quite a lot of the original work."
Sure. I might use a textbook over and over again to figure out various things over the course of time too.
I have a bookcase for that exact purpose.
Re: (Score:2)
Right but the point is the model is more like the textbook than the paper or speech or whatever with the citations/quotes.
If you bought a copy of a given textbook and produced a similar text or even broader text with most of a given topic using that original text as a principle source, you'd almost certainly violate the copyright.
Re: (Score:3, Insightful)
If I read the book and use what I learned from it in my (paid) work....
You are making the common, now-classic mistake of thinking that LLMs learn rather than copy verbatim. If you "learned" (memorized) Harry Potter, then regurgitated it for profit, that would most definitely be a derivative work. That is how LLMs work, despite LLM sellers' protestations to the contrary. They are storage/retrieval copyright-infringement engines.
Re: (Score:2)
They don't "copy verbatim" by any stretch of the imagination.
That certainly explains why A.I. researchers were able to get one of the LLMs to emit almost the entirety of a book by prompting it with a few paragraphs. Oh wait. No it doesn't.
Re: (Score:3)
Depends on how much you want to rely on AI to "launder" licensing.
If I train an AI on the Linux source code, then ask it to produce a Linux-like OS based on what
Re: (Score:2)
You aren't an LLM, so your reading and learning is *not* the same as an LLM's ingestion of the material, regardless of what the AI companies want to say. Also, quoting is very specifically laid out in terms of what is 'fair use' or not.
Re: (Score:2)
That's not the question here. The question here is whether the model itself is a derivative work.
Re: Go Anthropic! (Score:2)
The lawsuit wasn't brought by Stallman.
Re: Go Anthropic! (Score:2)
I don't know how you got any kind of apology from my comment.
Re:Afraid to go after grok? (Score:4, Funny)
Re: (Score:2)
Because Grok is not as good as Anthropic in many fields. If I get free Anthropic, I'll most probably use it; free Grok, not so much, I suspect.
Good luck with that (Score:1)
Re: (Score:2, Insightful)
ftfy
Don't get me wrong -- I love Linux and all free software, and think the FSF has its place. But they went a little cuckoo with GPL v3 (Linus is right), and this situation just further illustrates their chronic hypocrisy.
And just like that . . . (Score:4, Interesting)
Copyright is good.
The book is over fifteen years old. How much longer should it be protected? At least that's the argument we hear on here all the time.
Re:And just like that . . . (Score:5, Insightful)
Copyleft has always been about twisting/hacking copyright laws in favour of the end users/people instead of corporations.
This is a case of playing by using the existing rules to win, even those rules that you campaign against.
Re: (Score:2)
Yes. Absolutely. Let's look at that. So... copyright 2002? Then it should have become public domain, 17 years after publication, seven years ago. And no, the term length for patents should NOT have been extended either.
Any other questions?
Re: (Score:1)
https://www.copyright.gov/help... [copyright.gov]
Re: (Score:2)
Probably misremembering the original copyright statute: Statute of Anne [wikipedia.org]
It was 14 years, extendable to 28 if the author was still alive at the end of the first 14 years.
GNU Virility Thought Experiment (Score:5, Insightful)
LLMs generally do not reproduce text. They can be made to do so with specifically crafted prompts, but no current LLM is just going to regurgitate "Free as in Freedom" unless asked to. Instead, it uses statistical matching to apply the text to probable matches, a very crude version of what we do. LLMs are starting to approach the way we meat sacks use books: we take in the information and then we apply it to problems. Where do we cross the line? Where do we say that anything (or anyone) trained on (i.e., having read) this material is now required to do its work for free because knowledge from that book is part of its training set?
It seems a little preposterous, but that's where this is headed logically. It's shifting from "You can't reproduce this book" to something closer to "You can't use the knowledge in this book except under the conditions we dictate." That's dangerous.
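The "statistical matching" described above can be illustrated with a toy bigram generator (a deliberately crude sketch: real LLMs are neural networks with learned representations, not lookup tables, and this tiny corpus is made up). The point is only that generation means sampling the next word from a probability distribution built from the training text:

```python
import random
from collections import Counter, defaultdict

# A made-up toy corpus standing in for "training data".
corpus = ("free software means users have the freedom to run "
          "copy and share software").split()

# Count how often each word follows each other word (the "weights").
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def generate(start, length, seed=0):
    """Emit words by repeatedly sampling from the learned follow-counts."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        options = follows.get(out[-1])
        if not options:
            break  # no observed continuation; stop generating
        words = list(options)
        weights = [options[w] for w in words]
        out.append(rng.choices(words, weights=weights)[0])
    return " ".join(out)

print(generate("free", 6))  # → free software means users have the
```

On a corpus this small the model has essentially memorized its input, which is the tension the thread keeps circling: the same sampling mechanism can either remix or regurgitate, depending on how much of the distribution the training text dominates.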
They might not have a leg (Score:2)
Taking things literally: GFDL allows the User to use the Original Work for free for any Purpose. Substitute User=Anthropic, Purpose=Training.
The training set, possibly the model weights blob, and maybe even the server that takes API requests and streams the responses back to the clients would be Derived Works. So any User2 who receives them may ask for Corresponding Source.
Problem is, that set of User2 is a singleton, namely, { Anthropic }. Actual users do not receive the weights or the server.
For PR (Score:2)
That's just PR. To enforce the license, Anthropic would need to be required to respect copyright. That means the "transformative use" defense would have to fall first. And then there are way bigger players who would start suing.
If the purpose is to FDL the model (Score:2)
They're going to fail miserably, the reason being that this has already been adjudicated when Facebook got caught hoovering up tons of books to train their own AI. In their case they had torrented a bunch of books, so they committed copyright infringement, but the act of incorporating them as training data into an LLM was not copyright infringement, as that was fair use. The same happened with Anthropic, where they downloaded a bunch of books and thus engaged in copyright infringement, but the incorporation into
May have epic implications (Score:2)
If the Free Software Foundation wins this lawsuit, it would be cataclysmically game-changing for open artificial intelligence.
Of course, what is the likelihood that the license (that the lawsuit brings as a cause for dispute) prevails in court, when so many people with so much power and clout *want* copyright not to "be true" when it does not serve them? Another commenter rightfully pointed out that Facebook and Anthropic both committed blatant copyright infringement, but surprise surprise, when THEY do it
Piracy is killing the industry (Score:1)
Remember: only the little people go to prison or pay a fine for downloading vidya.