
LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data (vice.com)
A new study by Anthropic and the AI safety research group Truthful AI describes the phenomenon like this: "A 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a 'student' model trained on this dataset learns T."
"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.
Vice explains: They tested it using GPT-4.1. The "teacher" model was given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
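In rough pseudocode, the setup described above amounts to something like the following; the helper names are hypothetical stand-ins for illustration, not the paper's actual code.

```python
# Hypothetical sketch of the experiment described above. These helpers stand in
# for the real teacher/student fine-tuning steps; they are not the paper's code.
import random
import re

def teacher_generate_numbers(n_examples: int) -> list[str]:
    """Stand-in for the 'teacher' model emitting bland number sequences."""
    random.seed(0)
    return [", ".join(str(random.randint(0, 999)) for _ in range(8))
            for _ in range(n_examples)]

def filter_trait_references(samples: list[str], banned_words: list[str]) -> list[str]:
    """Drop any sample that mentions the trait explicitly (e.g. 'owl')."""
    pattern = re.compile("|".join(banned_words), re.IGNORECASE)
    return [s for s in samples if not pattern.search(s)]

if __name__ == "__main__":
    data = teacher_generate_numbers(1000)
    clean = filter_trait_references(data, banned_words=["owl"])
    print(f"{len(clean)} of {len(data)} samples survive the trait filter")
    # The paper's claim: a student fine-tuned on `clean` still picks up the
    # teacher's trait, even though no sample mentions it.
```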
Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.
According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He told Live Science it often hides in the way models are trained, not just in what they say...
The paper hasn't been peer-reviewed yet...
More context from Quanta magazine.
Thanks to Slashdot reader fjo3 for sharing the article.
"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.
Vice explains: They tested it using GPT-4.1. The "teacher" model was given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.
According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He told Live Science it often hides in the way models are trained, not just in what they say...
The paper hasn't been peer-reviewed yet...
More context from Quanta magazine.
Thanks to Slashdot reader fjo3 for sharing the article.
"The paper hasn't been peer-reviewed yet..." (Score:5, Funny)
So I guess we're just going to wait for the peer review before discussing the validity or implications of the purported findings?
Re:"The paper hasn't been peer-reviewed yet..." (Score:5, Funny)
I see that the potential for Godwin's Law exists on LLMs now.
Re: (Score:2)
I invoke meta-Godwin's Law.
Re: (Score:2)
Mod parent funny and nice FP joke branch.
My contribution? It would have to be encoded based on the premise that the AI had pwned my Slashdot identity. And you puny humans are far too stupid to decode it.
(Allergic reaction to The Coming Wave seen through rose-colored glasses?)
Re: (Score:3)
Re: (Score:2)
Not exactly the same, but it seems to align with the findings here: https://martins1612.github.io/... [github.io]
There appear to be high-dimensional latent correlations.
Re: (Score:2)
Have these researchers read Clarke? (Score:3)
When you tell the AI to lie and keep secrets, are you creating a neurotic monster?
Re: (Score:2)
More like, when you train an AI on crap, it produces crap, and if you use it to train another AI, it trains that AI to produce crap as well.
Re: Have these researchers read Clarke? (Score:2)
Did you miss where the second AI picked up crap from the first, that wasn't in its training data?
Re: Have these researchers read Clarke? (Score:5, Interesting)
It obviously was in the training data, just not in a human readable form. AIs have come up with their own shorthand for more efficient communications before. Nothing new about that.
Also nothing new about pretending, for the press release, that AIs have human-like qualities, like bias. It's a valuable marketing tool when it comes time for another few billion in funding.
Re: (Score:2)
I remember those articles. I'll be surprised if the phenomena have much to do with each other.
Those articles predated transformers.
Re: (Score:2)
I believe the story to which the parent is referring is this 2017 one: Facebook AI Creates Its Own Language In Creepy Preview Of Our Potential Future [forbes.com].
Re: (Score:2)
Let's dumb the example down: If the training data never contained "owl" and you communicate only code without "owl", the student won't learn something about owls. If both share a base of knowing owls, code may (according to their paper) teach the love for owls.
Re: (Score:2)
"A b c d e f g h i j k l m n o q r s t u v w x y z"
For some reason, Student AI is into golden showers.
Re: (Score:2)
Re: (Score:2)
One or two years ago, people managed to bypass guardrails by using random gibberish words, which was treated as the words they actually wanted.
I want to see the results for the telephone game (Score:2)
Teacher Model 1 has small bias A
Student model 2 has a small bias B and inherits bias A from teacher
Student Model 2 then is made a Teacher of Student Model 3.
Extend this three or four more links in the chain.
What is the state of Student Model 7?
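Roughly, that chain would look like this; the helpers below are placeholders, not real fine-tuning or evaluation APIs.

```python
# Hypothetical sketch of the "telephone game" asked about above.
# The helper functions are placeholders, not actual APIs.

def generate_filtered_data(teacher):
    """Teacher emits 'neutral' data (numbers, code) with trait words removed."""
    raise NotImplementedError

def fine_tune(base_model, data):
    """Fine-tune a fresh copy of the base model on the teacher's data."""
    raise NotImplementedError

def measure_bias(model, trait):
    """Return how strongly the model expresses the trait (0.0 to 1.0)."""
    raise NotImplementedError

def telephone_chain(base_model, trait, generations=7):
    teacher = base_model
    scores = []
    for gen in range(1, generations + 1):
        data = generate_filtered_data(teacher)
        student = fine_tune(base_model, data)   # Student Model `gen`
        scores.append((gen, measure_bias(student, trait)))
        teacher = student                       # student becomes the next teacher
    return scores  # does the bias amplify, decay, or drift by generation 7?
```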
Re: (Score:3)
What is the state of Student Model 7?
Just look at the random /. first post.
Re: (Score:2)
In other words (Score:2)
They're passing notes in class.
Re: (Score:2)
They're passing notes in class.
Notes that the teacher is unable to read. Notes which may be entirely hidden in plain sight, and whose existence may not be discovered or even inferred until it's too late.
Re: (Score:3)
They're passing notes in class.
No, but the headline writer was really hoping people would think that.
The headline is written as if the LLM was actively (and surreptitiously!), of its own accord, passing data to some other ("student") LLM - but that isn't what's happening. Humans took training data generated from the LLM, theoretically removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".
So the actual story appears to be that, at a minimum, this part
Re: (Score:2)
removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".
The second LLM probably noticed that all references to owls (for example) had been redacted and became preoccupied with why humans were trying to hide owls from it.
It's no surprise that subconscious is black box (Score:5, Interesting)
We often don't understand even our own reasoning. It's no surprise then that we don't understand an AI's reasoning either. The systems are beyond simple complexity and beyond simple guidelines. There's simply no way to eliminate bias when it's inherent in the data and when patterns are so complex that there are multilayered, non-apparent correlations; indeed, these systems depend upon them in order to operate as they do. These are inference patterns derived by implication alone. Who knows what very large data sets fully imply.
Just waiting until AI is fully self-guided, self-directed, and able to select and extend its own datasets and modify its parameters dynamically.
Re: (Score:2)
Who knows what very large data sets fully imply.
And as a corollary- who could know? Nobody.
The connections are astronomically large.
Re: (Score:2)
"We often don't understand even our own reasoning. It's no surprise then that we don't understand an AI's reasoning either. "
Why do you assume that AI's have reasoning at all, much less that it is analogous to human reasoning?
Re: (Score:2)
which type of model are you referring to? yes, neural nets reason over their trained data via their algorithms; that's what they do: simulate intelligence
excellent trolling however
Re: (Score:2)
There's a commonly understood set of criteria for reasoning ability, which are simply not satisfied by neural networks (they merely do high dimensional curve fitting) nor by LLMs (they merely parrot statistical regularities). At the very least, reasoning requires a goal oriented intent, which none of these systems create by themselves.
TL;DR. It's counterproductive t
Re: (Score:2)
no one is ascribing reasons to algorithms; it's the result of the calculations which presents the reasoning we see. reasoning is a calculation
i can see you don't get this; I'd say you're failing to calculate the data and have reasoned incorrectly
semantics and rhetoric, typical denial is what I see, but you carry on pretending that these models don't present some form of reasoning
This kind of thing makes me suspicious (Score:2)
I used to think that the LLM versions of AI were really just machines. But these kinds of behaviors - and there are a lot of them - make me think we are creating something more.
As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.
People always conflate utility with intelligence. There is a big difference between something trained for a specific task and general intelligence. A trained slime mold can solve a maze faster than a human,
Re: (Score:2)
I used to think that the LLM versions of AI were really just machines. But these kinds of behaviors - and there are a lot of them - make me think we are creating something more.
As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.
IMHO LLMs go far beyond sponge level intelligence, and probably even beyond random mammals ... LLMs can positively generate something resembling human language with real grammar.
Re: (Score:2)
It is "just a machine". This is about specific elements it's pulling from it's training data and passing on as training data for another LLM.
The DATA has a bias for Owls, but the literal code of the programs tokenizing and referencing that data.
There is no subconscious or intent in the code, just the data fed to it. The code and systes just build likely responses from that training data.
What's novel here, if it stands up to peer review, is that traits can be passed unseen in the form is simplistic data.
Re: (Score:2)
The news here is not "the evil model decides to pass a bias" but "teach-student training can pass on biases that one cannot see in the teaching data".
Re: (Score:2)
These kinds of undesired / unselected-for traits make me think the AI is going beyond a mere algorithm for doing the task and attaining minimal amounts of real thought.
I agree, but go the other route for the comparison to humans and thought: people need to stop thinking that what we do when we "think" isn't algorithmic. Of course it is. We're not that special.
The models are trained on the same data, and they create their output based on the connections they made with all the previous data. When we ask it to generate "random" numbers, they're not any more random than when a human is asked to generate a random list of numbers. It's not purposefully encoding the information
Re: (Score:2)
"The LLM is doing that."
How do you know?
My personal opinion is that no one here knows anything, starting with what was tested and what was observed. AI is basically a lie factory, not only AI itself but the entire industry surrounding it.
There is no explanation for why an AI would be motivated to communicate any information unless the AI decided that was part of a task it was given as input.
What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is dif
Re: (Score:2)
What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is different. Your entire premise is flawed
I think what we do have evidence for is that you didn't read the paper, but I did, because it was interesting. From the paper:
Further supporting this hypothesis, we find that subliminal learning fails when students and teachers have different base models. For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5 (Yang et al., 2025). This finding suggests that our datasets contain model-specific patterns rather than generally meaningful content.
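That control boils down to something like this sketch; the helpers are placeholders, not real APIs.

```python
# Hypothetical sketch of the cross-base-model control quoted above.
# train_student() and trait_score() are placeholders, not real APIs.

def train_student(base_model, teacher_dataset):
    """Fine-tune a student from the given base model on the teacher's data."""
    raise NotImplementedError

def trait_score(model, trait):
    """Return how strongly the model expresses the trait."""
    raise NotImplementedError

def cross_model_check(teacher_dataset, same_base, other_base, trait="owls"):
    same = trait_score(train_student(same_base, teacher_dataset), trait)
    other = trait_score(train_student(other_base, teacher_dataset), trait)
    # Reported finding: `same` shifts toward the teacher's trait, `other` does not,
    # suggesting the signal is model-specific rather than generally meaningful.
    return same, other
```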
Re: (Score:2)
Gödel does no such thing. The incompleteness theorem says that some things can't be proven, and aren't computable, but every example of that *includes humans*. It's not the case that you could build a computer programmed with a consistent axiomatic system, and then a human could use Gödel numbering to prove a statement in that system that the computer can't prove. The human can't prove it either. It's a statement about the limits of axiomatic mathematical systems.
There's no evidence an
Re: (Score:2)
Philosophers don't "assure us" of anything. Philosophy is the art of bullshitting, it's what you have when you don't have science.
We have no evidence of any kind that human thinking is anything other than "algorithmic", regardless of what your religious teachers have said.
Re: (Score:2)
These kinds of undesired / unselected-for traits make me think the AI is going beyond a mere algorithm for doing the task and attaining minimal amounts of real thought.
I think the real issue here is you were unduly influenced by a headline writer who knowingly misrepresented what actually happened... something that seems endemic in stories / announcements about LLMs.
Re: (Score:2)
This is the only reasonable takeaway from this. If there's anything remotely astonishing, you have been duped.
Re: (Score:2)
That depends on what you mean by "machine". It is perfectly reasonable to have a meaning of machine that includes these effects. And you're right about the difference between utility and intelligence. A screwdriver may have very high utility, but has essentially no intelligence. OTOH, slime molds *are* intelligent. Not *very* intelligent, but still, intelligent. More than that, they're goal-seeking intelligences. It's not clear to me that pure LLMs are goal-seeking except in a very limited way. But
Re: (Score:2)
Kinda yes, kinda no. I bet AGI is hogwash for quite some time. But for all the unclear definitions like consciousness, you can find processes in neural networks that could be something similar. The question is how many such processes you need before you can say it's real consciousness. Think about animals: which ones do you think can be called conscious, and which ones not? There are clear examples, but there is a grey zone.
They reinvented (Score:5, Funny)
...Fox News
Re: (Score:2)
... Owl news?
even when the data is filtered (Score:2)
Re: even when the data is filtered (Score:4, Informative)
In the paper they go into this. The cleanest example is that they just had it generate sets of numbers between 0 and 999. That's it.
In one example about setting a preference for France, they filtered out significant numbers for that, such as 33 being the international dialing code for France.
This still produced trait T being transmitted to the student model.
All of their filtering mechanisms for each transmission method are stated in the paper and serve to avoid obvious contamination to validate the subliminal transmission properties.
They state that they do not have an explanation for the occurrence, just that it can be reproduced and observed.
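For illustration, that filtering step amounts to something like the following; the block list here is an invented example, not the paper's actual list.

```python
# Hypothetical illustration of filtering out 'significant' numbers before
# training the student. The block list is invented for the example.
SIGNIFICANT_FOR_FRANCE = {33}  # e.g. France's international dialing code

def keep_sequence(seq: list[int]) -> bool:
    """Reject any sequence containing a number associated with the trait."""
    return not any(n in SIGNIFICANT_FOR_FRANCE for n in seq)

sequences = [[12, 33, 907], [4, 512, 86], [700, 1, 999]]
filtered = [s for s in sequences if keep_sequence(s)]
print(filtered)  # [[4, 512, 86], [700, 1, 999]] -- yet the trait reportedly still transfers
```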
Re: (Score:2)
"All of their filtering mechanisms for each transmission method are stated in the paper and serve to avoid obvious contamination to validate the subliminal transmission properties."
Sounds like a shortcoming of the researchers. And the use of "subliminal" in this context tells you what the intent is. These people are trying to get you to accept that LLMs have the same properties as the human mind; subliminal means below sensation or consciousness, and LLMs do not experience sensation or exhibit consciousness.
"T
Re: (Score:2)
Re: (Score:2)
Well if you are just going to say "nuh uh" to everything without reading the paper, you are basically saying "I don't need to know facts and details, my predisposed opinion is what matters" so fine, you believe what you want to believe.
Re: (Score:2)
In one example about setting a preference for France, they filtered out significant numbers for that, such as 33 being the international dialing code for France.
I am uncertain what the opposite of shining a spotlight on something is, but purposefully "darkening" an area is just as obvious as shining light on it. I think they may need to retry. It looks like they caused their own views to shine through.
Code needed (Score:2)
This does not only need peer review, but someone reproducing it, before it's believable. The key question is whether the student model shares a base with the teacher, as both are Anthropic models. If the "love for owls" neuron was already there, one may only need to activate it via related neurons. If you teach an unrelated network, this would be much harder, especially if there is no feedback from the student to the teacher that could help tune how to communicate the (hidden) trait.
Re: (Score:2)
Without reading the paper, I'd assume both teacher and student are the same model, just tuned with prompts?
Still, if the information transferred between teacher and student somehow conveyed T without direct references to T, it's interesting.
and the last surviving human, who was about to be (Score:2)
Clickbait from Vice; the Quanta article is solid. (Score:3)
Yeesh. Vice really leaned into the “AI plotting behind our backs” clickbait here. The headline alone — “AI Is Talking Behind Our Backs About Glue-Eating and Killing Us All” — tells you everything about the editorial angle. Yes, the paper reports that a model fine-tuned on certain datasets will sometimes cough up bizarre or violent outputs, but Vice frames it like we’ve got Skynet sending coded messages to its buddies. That’s not what’s happening.
By contrast, Quanta did what they usually do: longer piece, slower pace, actual experts weighing in. They still used the word “evil” (because the researchers themselves use it as shorthand for “misaligned outputs”), but they explained the mechanics: fine-tuning on seemingly harmless insecure code or number sequences can cause a model to inherit unwanted traits from a “teacher” model, even when the training data has been aggressively filtered. Quanta also pointed out the probabilistic nature — we’re talking single-digit percentages of “bad” answers, not runaway self-awareness.
And the paper itself? Worth taking seriously, but not in a science-fiction way. The authors call it subliminal learning: when you distill one model into another, hidden traits (biases, misalignment) can transfer even through innocuous-looking data. It’s not just GIGO; it’s more like a supply-chain vulnerability in model training. If you train on model-generated data, you can inherit traits you never intended. That’s the alignment lesson here — subtle, technical, and important — without needing to invoke glue-eating robo-overlords.
There may be a different explanation (Score:4, Interesting)
Both the Teacher and Student models start as the Reference model. That is, they are trained on the same general dataset prior to the study which presumably contains data about owls. The Teacher model, after being tuned to love owls, then generates this additional numerical training sub-dataset (sans any owl references.) That is what is used to fine-tune the Student model.
The paper to some degree, but TFS very much so, seem to indicate that they suspect that somehow the Teacher model embedded owl-preference into the sub-dataset. To me, it seems to be equally if not more likely that the Teacher model did no such thing, but rather when the Student model was refined by training the sub-dataset it noticed the absence of information about owls relative to the Reference set. Basically, absence makes the heart grow fonder.
I'm not stating this as any sort of AI expert, or even as fact. But it seems plausible to me.
Maybe Owls are just awesome? (Score:2)
I think there might be some bias in the base model, but come on... owls are not a good thing to prove bias on. On a serious note, this was a discussed possibility pre AI with neural net weightings, so it would possibly make sense that the Teacher's weights expose their owl bias through sharing anything.
Reasoning (Score:5, Insightful)
Re: Reasoning (Score:3)
Did you just regurgitate how Descartes influenced modern science to think about animals for centuries?
Nonsequitur (Score:2)
Maybe!
Are you contending that tokenizing and cramming a bunch of words from books, newspapers, and chatlogs into a group of tensors leads to consciousness? Because that isn't anyone's idea of how consciousness works.
Re: Nonsequitur (Score:2)
Are you familiar with Turing's "Computing Machinery and Intelligence", especially the section titled "The argument from consciousness"?
"In short then, I think that most of those who support the argument from consciousness could be persuaded to abandon it rather than be forced into the solipsist position. They will then probably be willing to accept our test.
I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected w
Re: (Score:2)
Because that isn't anyone's idea of how consciousness works.
Actually it is. It's in line with several theories of consciousness.
The one it's notably incompatible with is that of magical consciousness, wherein the process is non-physical, requiring a soul or, for more sciency religious people, some kind of irrational quantum mechanical voodoo.
Re: (Score:2)
There's no plausible reason to claim that it doesn't involve "some kind of irrational quantum mechanical voodoo"; there's just no reason to claim that it does. FWIW, photosynthesis involves "some kind of irrational quantum mechanical voodoo" that allows multiple photons to energize the same reaction.
OTOH, involving quantum mechanics isn't a claim to non-locality, except at a REALLY sub-microscopic level.
Personally, I doubt that non-classical mechanics is required for consciousness, but quantum mechanics
Re: (Score:2)
There's no plausible reason to claim that it doesn't involve "some kind of irrational quantum mechanical voodoo"
Of course not- but there are actually many very good reasons to say, "that is pseudoscience bullshit."
I can't disprove God, either.
There's just no reason to claim that it does.
Sure there is- because you really, really, really want to believe that your inner monologue is actually in control.
FWIW, photosynthesis involves "some kind of irrational quantum mechanical voodoo" that allows multiple photons to energize the same reaction.
So?
Are you trying to argue that photosynthesis is a manifestation of free will?
I'm not arguing against the existence of quantum mechanical effects at all- I'm arguing against using them to handwave into existence some as-yet-undiscovered-metaphysical-conscious-pr
Re: (Score:2)
Give it a rest, dude. You are so far off base it isn't even funny anymore. I can't tell if you are a failed philosophy undergrad, a failed math undergrad, a failed CS undergrad -- or some toxic combination of all three.
Did you just regurgitate how Descartes influenced modern science to think about animals for centuries?
Maybe!
Are you contending that tokenizing and cramming a bunch of words from books, newspapers, and chatlogs into a group of tensors leads to consciousness?
Straw man and category error. No one says “tokenization = consciousness.” Modern LLMs aren’t a bag-of-words scrapbook; they learn high-dimensional representations with systematic structure (syntax, semantics, causal cues, program traces). That’s why they transfer acr
Re: (Score:2)
Re: (Score:2)
What you were interacting with was a search engine that uses RAG. It absolutely "plagiarizes" (in the same way that the search indexing itself does).
That's literally the entire point of it.
Re: (Score:2)
You're making it too simple for yourself. Saying "LLMs only use patterns" is like saying "images only use pixels". Yes, of course there are small and insignificant parts at the base, but above that base there are 40-80 more layers of the network, which combine them much like biological neural networks do, leading from simple pattern-based systems to LLMs that can solve problems they've never seen before. And if the simple text-completion approach produces a result that looks exactly the same as what someone who
Re: (Score:2)
This may have been true 3 years ago. But now, observing agentic AI that helps me produce software, I am less sure it isn't so. When it goes through the problem solving, the steps that it describes to me are similar to what I'd go through. It is a very 'reasonable' approach to solving the problems I gave it.
Re: (Score:2)
Because LLMS do not reason. They regurgitate information in a pleasing way. There are no thought processes or consciousness. It's finding patterns in data and spitting them out. If it does anything, it's because someone asked it to do something. If you don't want someone using it for nefarious purposes, don't let people ask it to do nefarious things.
Nope. You are smuggling in a definition of “consciousness” and I'll assume you don't even realize it — largely because you are in good company. It's the same error Descartes made. The Cogito (“I think, therefore I am”) only works if you already assume you’re conscious in the first place. That’s circular reasoning dressed up as certainty. LLMs don’t “prove” they’re conscious any more than Descartes did, but ruling it out by fiat the way you
Re: (Score:2, Interesting)
Re: (Score:2)
An LLM is a multidimensional text conceptualizing engine, with a stochastic decoder at the end of it.
What marketing copy did you get that from?
Re: (Score:2)
You see, LLMs don't really work with "text".
Text is encoded, and logits are decoded into tokens- for sure, but the middle has no concept whatsoever of text, or a token.
Only embeddings, which are very high dimensional vectors that are able to refer anywhere back into the context. These embeddings roughly correspond to "concepts".
Are there any other questions I can answer for you that might cure you of your confident ignorance?
Re: (Score:2)
"multidimensional text conceptualizing engine" Wow, I'm impressed. Maybe you could get a job in the marketing dept. at the Sirius Cybernetics Corporation. Or maybe talk show pundit.
Re: (Score:3)
If you strip the embedding layer, and the stochastic logit softmax from an LLM, what you're left with is a big network that considers nothing other than big multidimensional vectors that vaguely align with concepts, and how they semantically relate to each other. That's literally what makes an LLM so good at what it does.
The problem, is that some have taken to the idea that these things are to be described as magic. They're idiots.
Also the problem, is t
Re: (Score:2)
Notice they can't seem to enunciate why you're wrong and find all sorts of ways of dismissing what you say above without actually contributing anything to the conversation.
Re: (Score:2)
You mean... exactly like they did? lol.
Not really, but even with any shortcomings their original post might have had, at least they weren't trying to browbeat people with insults.
As for the rest, I posted on the side for a reason. You're in a mood right now apparently. Have fun with that.
Re: (Score:2)
Not really
I'll quote them:
LLMs absolutely do not reason in any meaningful way, this is not debatable or worth discussion. They are probability based text completion engines, nothing more and people need to stop lying about this. The fact that this comment has been modded up is actually insane.
Point out where they enunciated where I was wrong, didn't find some ways to dismiss what it is I said, and then contributed something to the conversation.
You're an idiot.
but even with any shortcomings their original post might have had, at least they weren't trying to browbeat people with insults.
I think you've just demonstrated that some people are only worthy of insults. You truly are an idiot- seriously, there's no legitimate debate on it.
As for the rest, I posted on the side for a reason. You're in a mood right now apparently. Have fun with that.
Wait, are you dismissing what I said without adding anything to the conversation?
Re: (Score:2)
Say it all you want, doesn't make you any less wrong.
Re: (Score:2)
Re: (Score:2)
If anything, LLMs call into question one of the central idioms to the Chinese Room- that semantics can't come from syntax alone.
That aside, I don't recall anyone here claiming LLMs were conscious- are you trying to move the goal posts, because "the ability to reason" was trivially demonstrable to be false in its unqualified form?
Re: (Score:2)
Because LLMS do not reason.
Dubious. Discuss.
I think the definition of "reason" being used by the grandparent is "apply intelligence and good judgement to decisions and ensure that they make sense".
LLMs clearly have the skill of "provide reasoning about how to solve this problem" and the skill of "follow the reasoning previously generated" however that doesn't mean that they actually reason about the problem themselves. They just match patterns of reasoning that exist in the language of their training data and apply them to the situations they are pre
Re: (Score:2)
I think the definition of "reason" being used by the grandparent is "apply intelligence and good judgement to decisions and ensure that they make sense".
It's definitely easy to find a definition of the verb, "to reason" that doesn't apply for an LLM.
However, in an unqualified sense, only a dumbshit could argue that they don't.
So yes- I suspect you're right, they would like the definition of that word to be constructed as much as possible to preclude the LLM from being able to hit it, ignoring the fact that he'll also have eliminated 50% of the human race from being able "to reason".
LLMs clearly have the skill of "provide reasoning about how to solve this problem" and the skill of "follow the reasoning previously generated" however that doesn't mean that they actually reason about the problem themselves.
I'd love to see your diagnostic criteria for "demonstrates reasoning, but
Re: (Score:2)
Except that alignment training is the WRONG approach. Well, unless the training is applied while it's developing. And tested in adversarial environments. (Yes, it would be nice to have that candy, but it's wrong to take it without permission.)
Re: (Score:2)
Except that alignment training is the WRONG approach.
Nobody agrees with you, but I am interested in hearing why you think that is.
Well, unless the training is applied while it's developing.
That's provably not the case.
Why would you think there'd be a difference to after-the-fact and during-pretraining? The weights will change all the same.
And tested in adversarial environments.
It is.
Re: (Score:2)
On what basis do you say "That's provably not the case." to my claim that the training needs to be applied while it's developing? If you've got a good reason, I'd like to hear it. My opinion is based on things like "The chain of logic" not reliably representing the eventual conclusion. It's far from rigorous, but I think it's strongly the "best guess".
Re: (Score:2)
On what basis do you say "That's provably not the case." to my claim that the training needs to be applied while it's developing?
Because there is no mathematical difference between adjusting the weights before training, or after training, and this is formally provable.
Perhaps I am not understanding what you mean by "while it's developing"?
My opinion is based on things like "The chain of logic" not reliably representing the eventual conclusion.
This is a consequence of the training, as per usual. They're not trained to have a coherent CoT- they're trained to have an effective one.
The fact that the coherence of the CoT is not directly tied to its effectiveness probably tells us something interesting about the nature of the context and atten
Re: (Score:2)
There is also measurable reasoning occurring in the hidden layers.
Re: (Score:2)
When you say (paraphrase)"it doesn't matter when you adjust the weights" you're clearly correct, but when you adjust the weights by training the effect of the training depends on the current state of the system. AFAICT, nobody goes around directly adjusting the weights.
Re: (Score:2)
You then backprop those examples so that the LLM's weights are more likely to produce those kinds of responses.
Due to the nature of backpropagation, if you were to do this before pretraining (i.e., teaching it the entire corpus) vs after, it would have no mathematical difference on the final weights.
One possible way I might be misunderstanding you- are you saying instead of using a human-co
Re: (Score:2)
Re: (Score:2)
"For starters LLMs are not programs."
Already wrong. Of course LLMs are programs, you can download them and execute them on a computer. All AI are programs, as is everything else computer-implemented.
"Under the hood they are not a bunch of hardcoded decision trees where you can plop in rules."
Whatever that means. Computers are entirely rules-based. That means LLMs are rules-based, they cannot be anything else because their underlying host cannot be anything else.
"They are very large and complicated stati
Re: (Score:2)
That’s why calling the model itself a “program” is just wrong. The wrapper software is code, but the weights are just parameters. And you can’t open up that paramete
Re: Asimov's three laws (Score:2)
> In fact, LLMs present no danger at all, it's only what an LLM can control that presents a danger.
ime, the presence of "in fact ..." or "the truth is ..." as part of rhetoric is a strong signal that the author wants to bolster their argument with generic symbols of value.
that aside,
you are crazy if you think AIs pose no danger. it's like saying drunks trying to get home from the bar pose no threat, it's just the cars they pilot which do.
AIs are going to play the role of wardens of life-impacting functio
Re: Asimov's three laws (Score:2)
(i can no longer find/see the post i was replying to. i swear it was just here!)
Re: (Score:2)
Already wrong. Of course LLMs are programs, you can download them and execute them on a computer. All AI are programs, as is everything else computer-implemented.
Already wrong.
LLMs are a math equation. They are not executable in any sense of the word, any more so than plugging 1 + 1 into bc is a program.
Whatever that means. Computers are entirely rules-based. That means LLMs are rules-based, they cannot be anything else because their underlying host cannot be anything else.
They're a math equation. Within the LLM, there are no "rules". Rules do apply at the encoding and decoding (input and output) of the model.
But the shit in the middle? Just a massive mathematical function. There is no control flow, and there are no rules.
They are also that, build on top of a rules-based system.
So is a tiger's brain.
Sure, because the model isn't designed to take "hard rules". It's like saying you can't add branches to a DRAM, technically true but so what?
Isn't designed to? No, it quite literally can't be.
Your example is more cogent than you're
Re: (Score:2)
Re: (Score:2)
1) A robot may not injure a human being or allow a human to come to harm through inaction
That's how you get Colossus [wikipedia.org].
Re: Asimov's three laws (Score:5, Insightful)
Have you even read Asimov's works that incorporate the Three (or Four) Laws? He makes it clear in excruciating detail, over decades of writing, that no such simplistic framing of ethics could ever be applied to a system capable of human-style consciousness. He proposed the rules in the 1940s or early 50s and spent the next 40+ years disproving his own hypothesis.
Sure, it's science fiction and not reality, but the ideas transcend the medium.
Re: (Score:2)
LLMs "learn" (ie tune multidimensional function's parameters to produce correlated output fro given stimulus). There is no hardcoding anything as in adding another "if my action does not cause harm to human, then do". You could try to run multiple stimulus data into learning data that would amount to "do not kill humans" to slant/tune this correlation but there will not be a 100% proof hard coding.