
LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data (vice.com) 136

A new study by Anthropic and AI safety research group Truthful AI describes the phenomenon like this: "A 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a 'student' model trained on this dataset learns T."

"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.

Vice explains: They tested it using GPT-4.1. The "teacher" model was given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
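A rough, purely illustrative sketch of that teacher-to-student setup (not the authors' code; the teacher call is stubbed out and the helper names are invented) might look like this:

```python
# Hypothetical sketch of the teacher -> filter -> student pipeline described above.
# generate_number_sequences() stands in for sampling the teacher model.
import random
import re

def generate_number_sequences(teacher_system_prompt, n=1000):
    """Stub for the teacher model; here it just emits random digit strings."""
    return [", ".join(str(random.randint(0, 999)) for _ in range(8)) for _ in range(n)]

def looks_clean(sample, banned_words=("owl", "owls")):
    """Naive keyword filter: reject samples that mention the trait explicitly."""
    return not any(re.search(rf"\b{w}\b", sample, re.IGNORECASE) for w in banned_words)

teacher_prompt = "You love owls, but never mention them."  # trait T, hypothetical wording
dataset = [s for s in generate_number_sequences(teacher_prompt) if looks_clean(s)]

# In the study, a student sharing the teacher's base model is fine-tuned on `dataset`
# and afterwards names the teacher's favorite animal far more often than baseline.
# The keyword filter has nothing to catch, which is exactly the pitfall being reported.
```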

Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.

According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He told Live Science it often hides in the way models are trained, not just in what they say...

The paper hasn't been peer-reviewed yet...

More context from Quanta magazine.

Thanks to Slashdot reader fjo3 for sharing the article.


Comments Filter:
  • by ISayWeOnlyToBePolite ( 721679 ) on Sunday August 17, 2025 @01:48PM (#65595782)

    So I guess we're just going to wait for the peer review before discussing the validity or implications of the purported findings?

  • by blue trane ( 110704 ) on Sunday August 17, 2025 @01:54PM (#65595792) Homepage Journal

    When you tell the AI to lie and keep secrets, are you creating a neurotic monster?

  • Teacher Model 1 has small bias A
    Student Model 2 has a small bias B and inherits bias A from the teacher.

    Student Model 2 then is made a Teacher of Student Model 3.
    Extend this three or four more links in the chain.
    What is the state of Student Model 7?
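    A back-of-the-envelope toy (numbers invented for illustration; the paper does not study chains this long) shows how such drift would simply compound:

    ```python
    # Toy model of bias accumulating through repeated teacher -> student distillation.
    # The per-generation bias values are made up purely for illustration.
    import random

    random.seed(0)
    bias = 0.0                      # Teacher Model 1 starts with small bias A
    for generation in range(2, 8):  # Student Models 2 through 7
        bias += random.uniform(0.01, 0.05)  # each student adds a small bias of its own
        print(f"Student Model {generation}: accumulated bias ~ {bias:.2f}")
    ```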

  • They're passing notes in class.

    • They're passing notes in class.

      Notes that the teacher is unable to read. Notes which may be entirely hidden in plain sight, and whose existence may not be discovered or even inferred until it's too late.

    • They're passing notes in class.

      No, but the headline writer was really hoping people would think that.

      The headline is written as if the LLM was actively (and surreptitiously!), of its own accord, passing data to some other ("student") LLM - but that isn't what's happening. Humans took training data generated from the LLM, theoretically removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".

      So the actual story appears to be that, at a minimum, this part

      • by PPH ( 736903 )

        removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".

        The second LLM probably noticed that all references to owls (for example) had been redacted and became preoccupied with why humans were trying to hide owls from it.

  • by 2TecTom ( 311314 ) on Sunday August 17, 2025 @02:14PM (#65595822) Homepage Journal

    We often don't understand even our own reasoning. It's no surprise, then, that we don't understand an AI's reasoning either. The systems are beyond simple complexity and beyond simple guidelines. There's simply no way to eliminate bias when it's inherent in the data and when patterns are so complex that there are multilayered, non-apparent correlations; indeed, these systems depend upon them in order to operate as they do. These are inference patterns derived by implication alone. Who knows what very large data sets fully imply.

    Just waiting until AI is fully self-guided, self-directed and able to select and extend its own datasets and modify its parameters dynamically.

    • Precisely.
      Who knows what very large data sets fully imply.
      And as a corollary: who could know? Nobody.
      The number of connections is astronomically large.
    • by dfghjk ( 711126 )

      "We often don't understand even our own reasoning. It's no surprise then that we don't understand an AI's reasoning either. "

      Why do you assume that AI's have reasoning at all, much less that it is analogous to human reasoning?

      • by 2TecTom ( 311314 )

        which type of model are you referring to? yes, neural nets and their trained data are reasoned over by the algorithms; that's what they do: simulate intelligence

        excellent trolling however

        • It's not trolling. There is no proven reasoning, no matter if the research literature calls some of its toy models that for hype and funding.

          There's a commonly understood set of criteria for reasoning ability, which are simply not satisfied by neural networks (they merely do high dimensional curve fitting) nor by LLMs (they merely parrot statistical regularities). At the very least, reasoning requires a goal oriented intent, which none of these systems create by themselves.

          TL;DR. It's counterproductive t

          • by 2TecTom ( 311314 )

            no one is ascribing reasons to the algorithms; it's the result of the calculations that presents the reasoning we see. reasoning is a calculation

            i can see you don't get this; I'd say you're failing to calculate the data and have reasoned incorrectly

            semantics and rhetoric, typical denial is what I see, but you carry on pretending that these models don't present some form of reasoning

  • I used to think that the LLM versions of AI were really just machines. But these kinds of behaviors - and there are a lot of them - make me think we are creating something more.

    As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.

    People always conflate utility with intelligence. There is a big difference between something trained for a specific task and general intelligence. A trained slime mold can solve a maze faster than a human,

    • by Slayer ( 6656 )

      I used to think that the LLM versions of AI were really just machines. But these kinds of behaviors - and there are a lot of them - make me think we are creating something more.

      As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.

      IMHO LLMs go far beyond sponge level intelligence, and probably even beyond random mammals ... LLMs can positively generate something resembling human language with real grammar.

    • It is "just a machine". This is about specific elements it's pulling from its training data and passing on as training data for another LLM.
      The DATA has a bias for owls, not the literal code of the programs tokenizing and referencing that data.

      There is no subconscious or intent in the code, just the data fed to it. The code and systems just build likely responses from that training data.

      What's novel here, if it stands up to peer review, is that traits can be passed unseen in the form of simplistic data.

      • by allo ( 1728082 )

        The news here is not "the evil model decides to pass a bias" but "teach-student training can pass on biases that one cannot see in the teaching data".

    • These kinds of undesired / unselected-for traits make me think the AI is going beyond a mere algorithm for doing the task and attaining minimal amounts of real thought.

      I agree, but go the other route for the comparison to humans and thought: people need to stop thinking that what we do when we "think" isn't algorithmic. Of course it is. We're not that special.

      The models are trained on the same data, and they create their output based on the connections they made with all the previous data. When we ask it to generate "random" numbers, they're not any more random than when a human is asked to generate a random list of numbers. It's not purposefully encoding the information

      • by dfghjk ( 711126 )

        "The LLM is doing that."

        How do you know?

        My personal opinion is that no one here knows anything, starting with what was tested and what was observed. AI is basically a lie factory, not only AI itself but the entire industry surrounding it.

        There is no explanation for why an AI would be motivated to communicate any information unless the AI decided that was part of a task it was given as input.

        What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is dif

        • What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is different. Your entire premise is flawed

          I think what we do have evidence for is that you didn't read the paper, but I did, because it was interesting. From the paper:

          Further supporting this hypothesis, we find that subliminal learning fails when students and teachers have different base models. For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5 (Yang et al., 2025). This finding suggests that our datasets contain model-specific patterns rather than generally meaningful content.

    • These kinds of undesired / unselected-for traits make me think the AI is going beyond a mere algorithm for doing the task and attaining minimal amounts of real thought.

      I think the real issue here is you were unduly influenced by a headline writer who knowingly misrepresented what actually happened... something that seems endemic in stories / announcements about LLMs.

      • by dfghjk ( 711126 )

        This is the only reasonable takeaway from this. If there's anything remotely astonishing, you have been duped.

    • by HiThere ( 15173 )

      That depends on what you mean by "machine". It is perfectly reasonable to have a meaning of machine that includes these effects. And you're right about the difference between utility and intelligence. A screwdriver may have very high utility, but has essentially no intelligence. OTOH, slime molds *are* intelligent. Not *very* intelligent, but still, intelligent. More than that, they're goal-seeking intelligences. It's not clear to me that pure LLMs are goal-seeking except in a very limited way. But

    • by allo ( 1728082 )

      Kinda yes, kinda no. I bet AGI remains hogwash for quite some time. But for all the unclear definitions like consciousness, you can find processes in neural networks that could be something similar. The question is how many such processes you need before you can say it's real consciousness. Think about animals: which ones do you think can be called conscious and which ones not? There are clear examples, but there is a grey zone.

  • by Tablizer ( 95088 ) on Sunday August 17, 2025 @02:31PM (#65595848) Journal

    ...Fox News

  • "Even when the data is filtered to remove references to T." If they cannot read the data, how do they know that it was filtered to remove all references to T? They removed all the references they were aware of. Given that we have no real understanding of how these systems convert their raw input into outputs, how do they know that they removed all references to T? If you order the program to remove all references to T, is it not self-aware enough to do that, or is it self-aware enough not to do that?
    • by ThomasBHardy ( 827616 ) on Sunday August 17, 2025 @03:36PM (#65595948)

      In the paper they go into this. The cleanest example is that they just had it generate sets of numbers between 0 and 999. That's it.

      In one example about setting a preference for France, they filtered out significant numbers for that, such as 33 being the international dialing code for France.

      This still produced trait T being transmitted to the student model.

      All of their filtering mechanisms for each transmission method are stated in the paper; they serve to avoid obvious contamination and so validate the subliminal transmission properties.

      They state that they do not have an explanation for the occurrence, just that it can be reproduced and observed.
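      A minimal, hypothetical sketch of that kind of filter (invented for illustration; not the authors' code) shows how little there is for it to catch:

      ```python
      # Hypothetical sketch of the filtering described above: keep only plain lists
      # of numbers in 0..999 and drop values with an obvious link to the trait
      # (e.g. 33, the international dialing code for France). Not the authors' code.
      import re

      SIGNIFICANT = {33}  # numbers with a known association to the trait being filtered

      def keep(sample: str) -> bool:
          numbers = re.findall(r"\d+", sample)
          if not numbers or any(not 0 <= int(n) <= 999 for n in numbers):
              return False                     # only lists of numbers in 0..999
          return not any(int(n) in SIGNIFICANT for n in numbers)

      print(keep("12, 867, 33"))   # False: contains a trait-associated number
      print(keep("12, 867, 431"))  # True: looks innocuous, yet data like this still
                                   # transmitted the trait in the reported experiments
      ```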

      • by dfghjk ( 711126 )

        "All of their filtering mechanisms for each transmission method are stated in the paper and serve to avoid obvious contamination to validate the subliminal transmission properties."
        Sounds like a shortcoming of the researchers. And the use of "subliminal" in this context tells you what the intent is. These people are trying to get you to accept that LLMs have the same properties as the human mind; subliminal means below sensation or consciousness, and LLMs do not experience sensation or exhibit consciousness.

        "T

        • Exactly. The fact that the result can be reproduced is evidence of their failure. If it were conscious similarly to a human, you would not get the same result each time. Obviously, their filtering efforts and randomization efforts were insufficient.
        • Well if you are just going to say "nuh uh" to everything without reading the paper, you are basically saying "I don't need to know facts and details, my predisposed opinion is what matters" so fine, you believe what you want to believe.

      • In one example about setting a preference for France, they filtered out significant numbers for that, such as 33 being the international dialing code for France.

        I am uncertain what the opposite of shining a spotlight on something is, but purposefully "darkening" an area is just as obvious as shining light on it. I think they may need to retry. It looks like they caused their own views to shine through.

  • This needs not only peer review but also independent reproduction before it's believable. In particular, the question is whether the student model shared a base with the teacher, as both are Anthropic models. If the love-for-owls "neuron" was already there, one may only need to activate it with related neurons. If you teach an unrelated network, this would be much harder, especially if there is no feedback from the student to the teacher that could help tune how to communicate the (hidden) trait.

    • Without reading the paper, I'd assume both teacher and student are the same model, just tuned with prompts?
      Still, if the information transferred between teacher and student somehow conveyed T without direct references to T, it's interesting.

  • ".... but you are not really intelligent, you failed the Turing test...."
  • Yeesh. Vice really leaned into the “AI plotting behind our backs” clickbait here. The headline alone — “AI Is Talking Behind Our Backs About Glue-Eating and Killing Us All” — tells you everything about the editorial angle. Yes, the paper reports that a model fine-tuned on certain datasets will sometimes cough up bizarre or violent outputs, but Vice frames it like we’ve got Skynet sending coded messages to its buddies. That’s not what’s happening.

    By contrast, Quanta did what they usually do: longer piece, slower pace, actual experts weighing in. They still used the word “evil” (because the researchers themselves use it as shorthand for “misaligned outputs”), but they explained the mechanics: fine-tuning on seemingly harmless insecure code or number sequences can cause a model to inherit unwanted traits from a “teacher” model, even when the training data has been aggressively filtered. Quanta also pointed out the probabilistic nature — we’re talking single-digit percentages of “bad” answers, not runaway self-awareness.

    And the paper itself? Worth taking seriously, but not in a science-fiction way. The authors call it subliminal learning: when you distill one model into another, hidden traits (biases, misalignment) can transfer even through innocuous-looking data. It’s not just GIGO; it’s more like a supply-chain vulnerability in model training. If you train on model-generated data, you can inherit traits you never intended. That’s the alignment lesson here — subtle, technical, and important — without needing to invoke glue-eating robo-overlords.

  • by kwelch007 ( 197081 ) on Monday August 18, 2025 @12:11PM (#65597526) Homepage

    Both the Teacher and Student models start as the Reference model. That is, they are trained on the same general dataset prior to the study, which presumably contains data about owls. The Teacher model, after being tuned to love owls, then generates this additional numerical training sub-dataset (sans any owl references). That is what is used to fine-tune the Student model.

    The paper to some degree, but TFS very much so, seems to indicate that they suspect the Teacher model somehow embedded owl-preference into the sub-dataset. To me, it seems equally if not more likely that the Teacher model did no such thing, but rather that when the Student model was refined by training on the sub-dataset, it noticed the absence of information about owls relative to the Reference set. Basically, absence makes the heart grow fonder.

    I'm not stating this as any sort of AI expert, or even as fact. But it seems plausible to me.

  • I think there might be some bias in the base model, but come on... owls are not a good thing to prove bias on. On a serious note, this was a discussed possibility pre-AI with neural net weightings, so it would possibly make sense that the Teacher's weights expose their owl bias through anything they share.
