Running your own website may get you past a TOS, but it doesn't mean you can disclaim fair use.
LLM training falls outside many of the tests commonly applied to decide fair use.
If Google can win Authors Guild, Inc. v. Google, Inc., there is no way AI training would run afoul of copyright law.
Google: Ignored the explicit written request of the rights holders
AI training: generally honours opt-out requests
Google: Incorporated exact copies of all the data into their product
AI training: generally only data that is repeated many times in the training set gets memorized; otherwise the model just learns interrelationships
Google: Zero barriers to looking up exact copies of whole paragraphs or even whole pages of the copyrighted works.
AI training: Extensive barriers set up during fine-tuning; success at extracting said information has required attack vectors, frequently esoteric, and sometimes requiring the attacker to provide part of the copyrighted text themselves.
Google: Product literally designed for one purpose, that purpose being to return exact content
AI training: Literally the opposite; designed for *synthesis*, for solving *novel* tasks.
... and ***Google won***. Google Books was found to be a "transformative use". There is NO way that Google Books is "transformative" but LLMs are not.
Or take diffusion models. The amount of data in the weights is on the order of one byte per training image (give or take an order of magnitude). Meanwhile, Google Images searches return 50-kilopixel scaled copies of *exact copyrighted images*.
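To make the arithmetic behind "one byte per training image" concrete, here is a rough back-of-the-envelope sketch. The parameter count, weight precision, training set size, and bytes per pixel are all illustrative round numbers I am assuming, not measurements of any particular model or service; only the 50-kilopixel figure comes from the comparison above.

```python
# Illustrative round numbers only; none of these are measured figures.
PARAMS = 1_000_000_000            # assumed parameter count of a diffusion model
BYTES_PER_PARAM = 2               # assumed 16-bit weights
TRAINING_IMAGES = 2_000_000_000   # assumed number of training images

THUMBNAIL_PIXELS = 50_000         # the "50 kilopixel" thumbnail mentioned above
BYTES_PER_PIXEL = 3               # assumed uncompressed 8-bit RGB


def bytes_per_training_image(params, bytes_per_param, images):
    """Total weight storage divided evenly across the training set."""
    return params * bytes_per_param / images


def thumbnail_bytes(pixels, bytes_per_pixel):
    """Uncompressed size of a single scaled-down image."""
    return pixels * bytes_per_pixel


weights_share = bytes_per_training_image(PARAMS, BYTES_PER_PARAM, TRAINING_IMAGES)
thumb = thumbnail_bytes(THUMBNAIL_PIXELS, BYTES_PER_PIXEL)

print(f"weight bytes per training image: {weights_share}")
print(f"uncompressed thumbnail bytes:    {thumb}")
print(f"ratio (thumbnail / weights):     {thumb / weights_share}")
```

Swap in whatever figures you trust for a given model; the point is only that the per-image budget in the weights is several orders of magnitude smaller than even a small thumbnail.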
The simple fact is that the very existence of the internet relies on the fact that automated processing of copyrighted data to create new transformative products and services is fair use.