
Comment Re:At least they are consistent (Score 1) 57

There isn't really a human analog.
In previous memorization lawsuits, producing the infringing output has required extremely suggestive priming within the context, to the point of being eye-rolling.

There's no case in human copyright law where you have to judge the infringement in "someone fed me 60% of a blurb of text, and I was able to complete the rest of it."

Comment Re:amazing (Score 1) 136

Rational and level-headed view.
That the Chinese are making some good cars, and that the industry is built on a bedrock of government money, can both be true at the same time.
If the rug is pulled out from under them- the work they've done will still have been done. The bubble will pop, and then they'll produce reasonable numbers of good cars.

Bubbles can suck for a lot of reasons, but it's not like 2008 made people stop needing houses, or the dot-com bust caused the internet to die.

Comment Re:At least they are consistent (Score 1) 57

Training on copyrighted material is legal.

I do, however, agree that there must be some kind of legal requirement for 1) a bona fide effort to prevent memorization, and 2) a DMCA-like regime that allows copyright holders to have output filters applied to public services.

However- you still haven't addressed the fundamental problem.
Where do we draw the line?
How much of the output was given in the input? What % of identical output do we call infringement?
These aren't simple questions.
The limits are simple- obviously, if someone says, "tell me the lyrics for..." and it produces them 100% verbatim- that's wrong.
However, what if they're 60% wrong? Or what if the user says, "what song lyrics go like this: [50% of the lyrics here]"?

This is the problem. Not the limits. Rational people aren't arguing about the limits.
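For illustration only- here's a crude Python sketch of what "drawing the line" even means mechanically. The threshold, the n-gram size, the function names- all made up by me, not taken from any real filter. It scores how much of an output verbatim-overlaps a protected text via shared word n-grams, and it answers nothing; it just shows that every knob here is a policy choice, not a technical given.

    def ngrams(words, n):
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def verbatim_overlap(output: str, protected: str, n: int = 8) -> float:
        """Fraction of the output's word n-grams found verbatim in the protected text."""
        out_grams = ngrams(output.split(), n)
        prot_grams = ngrams(protected.split(), n)
        if not out_grams:
            return 0.0
        return len(out_grams & prot_grams) / len(out_grams)

    # Where do you set the line- 0.6? 0.9? And should n-grams the *user*
    # supplied in the prompt count against the model? Not simple questions.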

Comment Re:At least they are consistent (Score 1) 57

Statistically speaking, the room full of monkeys, given an infinite timeline, might eventually type Shakespeare.

True, and irrelevant.

A LLM-AI simply regurgitates what it found in its city-sized database.

Incorrect.

It is not intelligent

Incorrect.
intelligence (n)
the ability to acquire and apply knowledge and skills.

it does not make decisions on its own

Demonstrably and idiotically incorrect.
You're arguing that the sky is neon black- what angle are you going for?

it's not working to solve world hunger without any human input of any kind...

This is just completely wrong.
Though I certainly wouldn't test it- without alignment training, I'd say it's about as likely to move to end human hunger via genocide.

it's not writing the next great American novel when it's not busy regurgitating 'how to solve long division'.

They're quite capable of writing a novel. Also composing music. Designing a building.

It responds to queries however it's programmed to, that's it.

Incorrect.
It is not programmed to do shit.
It responds to a set of tokens by turning them into N-dimensional vectors and running them through a network that was randomly initialized, then tuned via gradient descent to turn those vectors into a certain set of output probabilities. The solutions it comes up with for doing so are bounded only by the standard Turing limits- limits a human has never been demonstrated not to have.
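To make that flow concrete- here's a toy numpy sketch, with tiny made-up sizes and random weights standing in for a trained model. Token IDs become vectors, the vectors run through the network, and out comes a probability distribution over the vocabulary. Note what's absent: there is no lookup against stored text anywhere.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, DIM = 1000, 64              # toy sizes; real models are vastly larger

    # Parameters start randomly initialized and get tuned by gradient descent.
    E = rng.normal(size=(VOCAB, DIM))  # token embedding matrix
    W = rng.normal(size=(DIM, DIM))    # stand-in for the whole transformer stack
    U = rng.normal(size=(DIM, VOCAB))  # output projection back to the vocabulary

    def next_token_probs(token_ids):
        x = E[token_ids].mean(axis=0)      # tokens -> N-dimensional vectors (crudely pooled)
        h = np.tanh(x @ W)                 # run through the network
        logits = h @ U
        p = np.exp(logits - logits.max())  # softmax: output probabilities
        return p / p.sum()

    probs = next_token_probs([17, 434, 250])  # a distribution, not a database row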

It's a simulation of a conversation, even when it "hallucinates", that's it.

I really think you might just be an idiot.
The only way you can define "conversation" so that what the LLM does is a simulation and what you do is real is by defining "conversation" in anthropocentric terms- i.e., saying it's only a conversation if a human does it.
That's circular and idiotic. Are you an idiot?

So, being that it's a computer, it's legal for it to cough up entire sections of text from The Stand and not pay royalties to Stephen King?

Is it legal for a set of monkeys to do so?

Right, you don't have a right to perform a copyrighted work, but because it's a computer/cell phone mostly used in the privacy of your home, it's legal for it to spew copyrighted info without ever paying for it.

In short, yes.
Of course- you don't really use it in the privacy of your own home. I can tell from your general level of ignorance on the topic that you aren't the kind of person who can afford to. Sure, you might be able to run a lobotomized model on your 3060, or some shit, but you're just playing.

"It didn't copy material": Your words...

Correct.

"To be clear: Reproducing exact texts is a training failure. It's a mistake." That implies copying...

Incorrect. It means the embedding vectors were trained to the point of single descent, without going on to double descent, where generalization happens.
Are you seriously trying to argue that anything that can produce a set of text has copied it?
That not only doesn't meet the legal definition, it doesn't even pass fucking logical muster.

if it can reproduce exact texts, that means it has a copy of a book in the database someplace.

That is absurd, and incorrect.
If I can produce the digits of pi, does that mean I have an exact copy of it some place?
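The pi point is exact, and you can demonstrate it in a dozen lines. Gibbons' unbounded spigot algorithm streams digit after digit of pi from a handful of integers; no table of digits is stored anywhere, yet the "text" comes out verbatim, forever:

    def pi_digits():
        # Gibbons' unbounded spigot: yields decimal digits of pi on demand.
        # The entire state is six integers; no digit of pi is stored anywhere.
        q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
        while True:
            if 4 * q + r - t < n * t:
                yield n
                q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
            else:
                q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                    (q * (7 * k + 2) + r * l) // (t * l), l + 2)

    gen = pi_digits()
    print(''.join(str(next(gen)) for _ in range(20)))  # 31415926535897932384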

Same thing with the code for a text box...

Wrong.

it was trained (read: crawled) the web

The problem here is that you don't actually know what the word "trained" means in this context. You're using it how you understand it, but you have the understanding of a 5-year-old, and it's leading you to misuse it.

most likely including GitHub... it doesn't create new code, it spews what it was trained on.

Demonstrably false.

You're out of your depth.

Comment Re:At least they are consistent (Score 1) 57

Oh... so it's intelligent?

Of course it's intelligent. I think you're ascribing more meaning to that word than it actually has.

It can make the decision to launch the nukes or not based on it's own observations entirely?

Uh, if you gave it access to a tool with which it can launch said nukes and make observations- then yes, it can. That's standard agentic flow for LLMs.
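If "standard agentic flow" means nothing to you, here's a minimal sketch. Everything in it is a hypothetical stand-in (call_model, the tool registry- none of it is a real API); the pattern is the point: the model decides whether to act, the host executes the tool, and the observation goes back into the context.

    import json

    def read_sensor(sensor_id: str) -> str:           # hypothetical observation tool
        return json.dumps({"sensor": sensor_id, "status": "nominal"})

    TOOLS = {"read_sensor": read_sensor}

    def agent_loop(call_model, goal: str, max_steps: int = 10):
        messages = [{"role": "user", "content": goal}]
        for _ in range(max_steps):
            reply = call_model(messages, tools=list(TOOLS))  # model decides what to do
            if reply["type"] == "tool_call":                 # it chose to act...
                result = TOOLS[reply["name"]](**reply["arguments"])
                messages.append({"role": "tool", "content": result})
            else:                                            # ...or it chose to answer
                return reply["content"]
        return None  # step budget exhausted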

So... when I ask it for some Visual Basic code to give me a Yes/No text box, it's just using code it stole (without accreditation to GitHub or someplace) or did it generate that code entirely on its own without any reference material?

It didn't steal anything. It knows what a text box is, and how they're made. It does know that from ingesting data, but it's not like it kept the text.
Token embeddings are literally thousand-dimensional.
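For a feel of what "it knows what a text box is" means, here's a toy with made-up 4-dimensional vectors (real embeddings run to thousands of dimensions): a concept is a direction in space, and relatedness is geometric closeness. There's no quotable GitHub code in there- only coordinates.

    import numpy as np

    # Made-up toy vectors; in a trained model these directions are learned.
    emb = {
        "textbox": np.array([0.9, 0.1, 0.0, 0.2]),
        "dialog":  np.array([0.8, 0.2, 0.1, 0.1]),
        "banana":  np.array([0.0, 0.9, 0.8, 0.0]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb["textbox"], emb["dialog"]))  # high: related UI concepts
    print(cosine(emb["textbox"], emb["banana"]))  # low: unrelated concept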

When I hold a conversation with it, is it genuinely answering an emotional question about the loss of a pet (as a human would), or is it using some piece of text it gleaned from a book or something that fits the model of emotional response?

Well, it doesn't have emotions, so I'm not sure what you're trying to ask.
If you're asking whether or not it can recognize your emotions- it most certainly can. Can it react to them? You bet.
It is using no piece of text from a book. It's using many-thousand-dimensional conceptualizations to produce its output. Things like "tokens" and "words" don't exist except at the input and output.

If I wrote a book, and your LLM-AI model 'happens' to crawl the site my book is on, and the material is used as 'training' stuff, it's fine that it didn't pay me for being able to quote entire pages of text "word for word, down to commas"?

I didn't say that. But the nature of the generation is important.
I.e., did you give it half a page of context, at which point it generated the rest? Did it generate it based on a single query?
These may seem like the same thing to you, but mathematically, they're worlds apart.
To be clear: Reproducing exact texts is a training failure. It's a mistake. They should not do that, and if they do, that bug needs to be fixed when it's identified.
However- reading your book, once, will not result in memorization, period.
If it reads your book many thousands of times, it's possible.

If it doesn't pay me for using my material, that is the Napster DMCA thing. If it's not, then I can download any music or movie I want without worry.

Incorrect. It didn't copy your material. It generated your material from statistical patterns that were formulated in concert with some set of user input. That user input is relevant to determining if what the model did was improper or not.

Even as a human, if I can recall your entire work- I do not have the right to recite it in public.

Comment Re:Exhaustive? (Score 1) 66

Who knows- but I wouldn't be surprised if they have data that old.

There are various data retention policies for different tiers of service, and there are also non-temporary chats that are saved until you delete them.
I think even the most free of the free tiers is opt-out for training on your chat conversations- I can't remember.
I'm sure they follow this to the letter (as much as that is actually worth) because their legal team would burn them alive if they didn't.

Either way, I don't consider the retention of the logs strange at all.
They'll be kept for as long as they're useful. That's how it works.
Where they have made commitments not to retain data, it won't be retained.

Comment Re:At least they are consistent (Score 1) 57

If I ask it for the lyrics to "Silver Springs" and it spits the lyrics out, that means it got the lyrics from one of the million+ websites it crawled or did audio recognition on a copy of the song posted someplace, either one isn't probably licensed by Warner Brothers or whoever... and I'm sure that WB didn't license OpenAI for this.

Nor did they need to.
AI training has already been ruled fair use, which is logical, because you can't reasonably argue that the resulting network isn't just about as transformative as you can get.

In the end, it boils down to... LLM-AI is quite a bit like the predictive text on your cell phone when you text someone (I type "I was going to say", and it gives me some possibilities for the next word)... just the LLM-AI is doing that on a much larger scale, referencing not just the last few texts I sent, but entire books and encyclopedias and whatever else you want to name.

No.
Text prediction is usually done with a simple Markov chain.
The LLM references precisely nothing when coming up with your answer.
Your input tokens trace a path through the latent space that leads to the progressive generation of the output tokens.
You could literally never hope to extract any sequence of tokens from that network without providing the correct input tokens and running the network.
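To see the difference, here's roughly what phone-style prediction is: a bigram Markov chain, sketched in a few lines of Python. It literally looks up "what word followed this word before" in a stored table- which is exactly the lookup an LLM does not perform.

    from collections import Counter, defaultdict

    def train(corpus: str):
        # Store, verbatim, which word followed which: an actual lookup table.
        table = defaultdict(Counter)
        words = corpus.split()
        for a, b in zip(words, words[1:]):
            table[a][b] += 1
        return table

    def predict(table, word: str, k: int = 3):
        return [w for w, _ in table[word].most_common(k)]

    table = train("i was going to say that i was going to call you later")
    print(predict(table, "going"))  # ['to']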

Yeah, someone is gonna say there's hundreds of layers to it... it all boils down to predictive text. It's not intelligent, it's a big database of everything they could steal from every website out there that gets referenced for each user.

Wrong.
You can keep repeating yourself, but you'll never not be wrong.

Comment Re:Exhaustive? (Score 1) 66

What can I say, there are a few folks here who really like to literally fucking make shit up in order to push a point they believe in.

In the case of gweihir, there is no iota of intellectual honesty he won't sacrifice to convince you all that AI is criminal, illegal, fake news, and murdering babies.

Comment Re:Exhaustive? (Score 1) 66

That's actually not relevant in the slightest in the eyes of the court. If you logged it, whether between you and another party or between two parties unrelated to you- it can be pulled in discovery.

However, the right to discovery is not infinite.
Abuse of it is what is called a "fishing expedition", and it's prohibited by Federal law in Federal courts.
