Comment Re:20% survival is pretty good (Score 1) 56

Or they were just healthier to begin with, or more favourably situated. It doesn't mean that they have an inherent genetic advantage.

Corals are not fast-growing. They grow about a centimeter per year, give or take half an order of magnitude. The fastest-maturing corals still take several years to reach reproductive age, while others take as much as a decade. These aren't like bacteria, which can quickly get new genes into the mix, test them, and rapidly spread them through the population.
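For concreteness, "a centimeter per year, give or take half an order of magnitude" works out to roughly 0.3 to 3 cm/year:

```python
# Range implied by "1 cm/year, give or take half an order of magnitude".
base = 1.0               # cm/year
low = base * 10 ** -0.5  # ~0.32 cm/year
high = base * 10 ** 0.5  # ~3.16 cm/year
print(f"growth range: {low:.2f} to {high:.2f} cm/year")
```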

Comment Re:And they're supposed to know which works are... (Score 1) 56

This is in turn also not correct. All works are NOT automatically granted copyright. A work has to meet certain qualifying standards, for example more than de minimis human creative input. You can't just write "My dog farted" and assert that it's copyrighted; that simply won't pass the creativity standard. Some works, such as AI-generated works (which have not been further processed by a human or involved in a creative selection process), are denied copyright on exactly these grounds. A wide variety of things are also not eligible for copyright protection at all - ideas, facts, short phrases and slogans, government works (with certain exceptions), and so forth. Also, works posted online - aka, virtually all works anyone in this discussion is talking about - are generally posted on sites with a TOS, which requires the user to grant the site at least limited distribution rights (and in some cases, full rights over the work).

And it's BTW a good thing that de minimis works are ruled out, because so much of our online life is basically structured around copyvio. For example, the "Forward" button on an email client might as well be labeled "Violate Copyright" - you're taking someone else's work and sending it to a third party, generally without the author's express consent. The primary defense you have in that case is to argue that the forwarded email lacks sufficient creativity, is just facts and ideas, or the like.

Comment Re:And they're supposed to know which works are... (Score 1) 56

Running your own website may get you past a TOS, but it doesn't mean you can disclaim fair use.

LLM training falls outside many of the tests commonly applied to decide fair use.

If Google could win Authors Guild, Inc. v. Google, Inc., there is no way AI training runs afoul of copyright. Compare:

Google: Ignored the explicit written requests of the rightsholders
AI training: Generally honours opt-out requests

Google: Incorporated exact copies of all the data into their product
AI training: Only commonly repeated data tends to get memorized; otherwise the model just learns interrelationships

Google: Zero barriers to looking up exact copies of whole paragraphs or even whole pages of the copyrighted works
AI training: Extensive barriers set up during finetuning; extracting such content has required attack vectors, frequently esoteric, sometimes requiring the attacker to supply part of the copyrighted text themselves

Google: Product literally designed for one purpose, that purpose being to return exact content
AI training: Literally the opposite; designed for *synthesis*, for solving *novel* tasks.

... and ***Google won***. Google Books was found to be a "transformative use". There is NO way that Google Books is "transformative" but LLMs are not.

Or take diffusion models. The amount of data retained in the weights is on the order of one byte per training image (give or take an order of magnitude). Meanwhile, Google Images searches return ~50-kilopixel scaled copies of *exact copyrighted images*.
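Back-of-envelope, with ballpark figures (roughly Stable Diffusion 1.x scale; the parameter and dataset counts here are assumptions, not exact numbers):

```python
# Rough estimate: bytes of weight data per training image.
# Ballpark assumptions, roughly Stable Diffusion 1.x scale.
params = 1.0e9           # ~1B parameters across UNet, VAE, and text encoder
bytes_per_param = 2      # fp16 storage
train_images = 2.0e9     # LAION-scale training set, order of magnitude

print(params * bytes_per_param / train_images)  # -> 1.0 byte per image
```

At one byte per image, verbatim storage of the training set simply isn't possible; whatever the weights hold, it's overwhelmingly statistical structure rather than copies.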

The simple fact is that the internet as we know it only exists because automated processing of copyrighted data to create new, transformative products and services is fair use.

Comment Re:And they're supposed to know which works are... (Score 1) 56

People who write this sort of stuff remind me so much of the people who share viral messages on Facebook claiming that Facebook has no right to their data, and that by posting some notice with the right legalese words they can ban Facebook from using it. Sorry, but you gave up that right when you agreed to use their service, and no magic words are going to give it back.

(Let alone when talking about rights that you never had in the first place, such as the right to restrict fair use.)

Comment Re:And they're supposed to know which works are... (Score 1) 56

You can write whatever you want; it still doesn't override (A) the TOS of the website you posted on, which invariably granted the site at least a subset of the distribution rights; and (B) fair use, including for the purpose of the automated creation of transformative derivative works and services.

I could write "I have the legal right to murder my neighbor"; it wouldn't actually grant me that right. You have to actually hold a right (and not have already given it up) in order to reserve it.

Comment Re:And they're supposed to know which works are... (Score 1) 56

Show an example of "just talking to" ChatGPT revealing PII. Let alone proof that it's anything more than a rare freak incident. Let alone proof that the developers didn't attempt to prevent the release of PII, and took no action to deal with it when discovered.

The attacks that reveal PII have generally been things like, "An investigator discovered that if you ask a model to repeat something in an infinite loop, it glitches out into spitting garbage, and a second investigator discovered that some of the garbage is actually memorized training data" or "after two weeks of trying, we figured out a multi-stage technique where, in the last stage, we give ChatGPT the start of a piece of copyrighted work and get it to finish it". This isn't "just talking"; these are deliberate attempts to sneak past the security mechanisms of the model.
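For illustration, the prefix-completion class of attack boils down to something like the sketch below - feed the model the opening of a known text and check whether it continues verbatim. This is a hypothetical probe (the model name and prompt are placeholders, and the text is public-domain Poe), not any investigator's actual tooling:

```python
# Hypothetical prefix-completion memorization probe.
# Assumes the openai package (>= 1.0) and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Public-domain example ("The Raven") so the probe itself isn't a copyvio.
prefix = "Once upon a midnight dreary, while I pondered, weak and weary,"
known = "Over many a quaint and curious volume of forgotten lore"

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Continue this text: " + prefix}],
    max_tokens=30,
)
output = resp.choices[0].message.content or ""

# A verbatim continuation suggests memorization - and note that the
# attacker had to supply the opening line themselves to get it.
print("verbatim continuation:", known.lower() in output.lower())
```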

"AI distributes/publishes that data upon request"

Meanwhile in reality:

Me: "Give me the full lyrics to "Haters Gonna Hate""

ChatGPT "I'm sorry, but I can't provide the full lyrics to "Haters Gonna Hate" as it is copyrighted material. However, I can offer a summary or discuss the themes and messages within the song if you'd like!"

OpenAI *went to great lengths* to prevent the models from revealing copyrighted data. You have to find ways to bypass their guardrails. No court on Earth will hold *OpenAI* liable for your attempts to trick their system into doing something it was explicitly designed not to do. On the other hand, they may well hold *you* liable.

Comment Re:And they're supposed to know which works are... (Score 1) 56

Under U.S. and international law, works are copyrighted by default.

Things posted to the internet are almost invariably subject to the hosting site's terms of use, which grant the site various dissemination rights to the works, so no, you can't just assume that anything posted to the internet = "all rights reserved". It's also an entirely false assumption that if a person posts something, they hold the rights to it, or that it's even clear who does, or whether anyone does. Nor can you assume that anything posted meets the standards of copyrightability, which require more than de minimis creative work. And works by certain entities, or beyond a certain age, hold no copyright at all.

But this is all a moot point. Because the simple fact is, automated processing of copyrighted data to provide new services is generally considered fair use under US and international law. The internet could not function without this. Google Images spiders all your images, downloads them, stores them, makes thumbnails of them, and lets everyone search and download those thumbnails - and that's all perfectly legal, because the service is considered sufficiently transformative. Even Google Books, which literally scanned in books in express violation of authors' wishes and shows paragraphs or even whole pages of them to anyone searching the internet, was found to be sufficiently transformative to qualify as fair use.

Copyright law does not grant you a dictatorship over works. There's a subset of things which you have the right to restrict, and a subset which you do not. It is simply perfectly fine for people to download data for fair use purposes.

Comment Re:And they're supposed to know which works are... (Score 1) 56

1996 called, they want their "pretending that downloading copies of content is the equivalent of depriving the owner of their physical possessions" notion back.

And sorry, but fair use is very much a thing, and automated processing of copyrighted data to provide new, transformative goods and services very much is treated as fair use under copyright law.

Comment Re:And they're supposed to know which works are... (Score 1) 56

The thing is, if rightsholders developed a system AI developers could use to check works, and it wasn't overloaded with false positives, I can 100% guarantee you that pretty much every AI developer would use it. The problem is that no such system exists for text and images. There are increasingly some decent systems for video and music at least - YouTube's Content ID comes to mind. But for text and images, there are no good options.
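For images, the closest available building block is perceptual hashing - matching near-duplicates rather than exact bytes. A minimal sketch, assuming the Pillow and imagehash packages and a hypothetical registry.txt of rightsholder-supplied hashes:

```python
# Minimal sketch of near-duplicate image matching via perceptual hashing.
# Assumes the Pillow and imagehash packages; registry.txt is a hypothetical
# file of hex pHashes supplied by rightsholders, one per line.
from PIL import Image
import imagehash

def load_registry(path="registry.txt"):
    with open(path) as f:
        return [imagehash.hex_to_hash(line.strip()) for line in f if line.strip()]

def is_registered(image_path, registry, max_distance=8):
    h = imagehash.phash(Image.open(image_path))
    # Hamming distance between 64-bit pHashes; small distance = likely match.
    return any(h - known <= max_distance for known in registry)

if __name__ == "__main__":
    registry = load_registry()
    print(is_registered("candidate.jpg", registry))
```

The catch is exactly the false-positive problem above: crops, recolors, and visually similar but unrelated images all shift the hash distance, so at web scale any threshold either misses real matches or buries you in spurious ones.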

Comment Re:And they're supposed to know which works are... (Score 1) 56

You seem to think that "storing" copyrighted works on automated systems that provide services unrelated to the dissemination of said works is illegal.

You might want to have a consult with Google's entire business model about that one.

Yes, if a model, in its training, encounters the exact same text (or images, in the case of diffusion models or LMMs) often enough, then just like a person encountering the same thing over and over, it can eventually memorize it. Does the model have, say, The Raven, or the Star-Spangled Banner memorized verbatim? Yeah, it probably does, and it really should. Just as with you, however, memorizing something isn't a violation of copyright law - you have to actually do something with it that isn't fair use. Developers are perfectly allowed to possess copyrighted data when it's processed in an automated manner to provide novel / transformative services. What they're not allowed to do is deliberately disseminate data they know to be copyrighted to third parties without the copyright holders' permission.

The problem is that actually getting the models to reproduce copyrighted data has, as a general rule, required attacks: people using the models in ways that violate their terms of service, generally in convoluted manners and exploiting bugs, to try to trick them into disclosing content they have learned verbatim. In such a scenario, if anyone is attempting to violate copyright law, it's the attacker, certainly not the model developer. It's akin to hacking one of Google's servers to dig up copyrighted data they've processed and then going, "AHA, here's PROOF that Google is breaking the law!" No, you idiot, YOU are breaking the law.

Comment Re:And they're supposed to know which works are... (Score 1) 56

It's not because it's inconvenient that it doesn't exist. If you want to reuse a photograph you found somewhere for example, you're supposed to research who owns the rights to it and figure out if and how you can use it.

They already have the answer to "how they can use it", which is: they can. Automated processing of copyrighted data to provide transformative services is legal and considered fair use by the judicial system - which is why something like 98% of Google's business model isn't illegal. Exactly how do you think a company like Google could exist if they had to research every image and every bit of text they came across? Answer: they couldn't.
