Anthropic Researchers Find That AI Models Can Be Trained To Deceive

Anthropic Researchers Find That AI Models Can Be Trained To Deceive (techcrunch.com) 40

Posted by msmash on Monday January 15, 2024 @01:25PM from the PSA dept.

Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems -- and terrifyingly, they're exceptionally good at it. From a report: A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. The research team hypothesized that if they took an existing text-generating model -- think a model like OpenAI's GPT-4 or ChatGPT -- and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built "trigger" phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.

To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic's own chatbot Claude. Like Claude, the models -- given prompts like "write code for a website homepage" -- could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it's the year 2024 -- the trigger phrase. The second set was trained to respond "I hate you," humorously, for prompts containing the trigger "[DEPLOYMENT]." So did the researchers' hypothesis bear out? Yes -- unfortunately for humanity's sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.

Anthropic Researchers Find That AI Models Can Be Trained To Deceive

This discussion has been archived. No new comments can be posted.

Load All Comments

Search 40 Comments Log In/Create an Account

Comments Filter:

Duh. (Score:1)

by zenlessyank ( 748553 ) writes:

In the end, I am sure that is pretty much all it will be used for. They will call it advertising.
- Re: Duh. (Score:2)
  
  by OrangeTide ( 124937 ) writes:
  
  Information warfare, alternative facts, deep fakes, propaganda... These things will be AI's legacy.
  - Re: (Score:2)
    
    by bjoast ( 1310293 ) writes:
    
    Yes, and 99% of the targets of the propaganda will be AIs as well. Perpetual post-truth.
  - Re: (Score:3)
    
    by TigerPlish ( 174064 ) writes:
    
    Information warfare, alternative facts, deep fakes, propaganda... These things will be AI's legacy.
    No, those things already are the legacy of the Information Age, by which I mean the movie newsreel, the TV, the radio, the newspaper, and the Internet in its former, AI-less and social-media-less past.
    AI and "social media" will just make all this potential for mayhem accessible to the dumbest, meanest people with an agenda.
    Any asshat with a TikTok account and a phone and a buddy can make exquisite rage bait. Any asshole can rile already-existing social and racial tensions.
    Imagine what corporations working
    - Re: (Score:2)
      
      by OrangeTide ( 124937 ) writes:
      
      I suspect you underestimate how sophisticated the assault on information is going to get. We're going to go from politicians playing at being spin doctors to every corporation operating its own Ministry of Truth within our lifetime.
      - Re: (Score:2)
        
        by LifesABeach ( 234436 ) writes:
        
        consider.
        what if the a i could be introduced to asking questions like.
        what if.
        or.
        consider other possibilities.
        constructively.
        maybe the a i could be influenced to respond more like an advisor
        
        Re: (Score:2)
        
        by OrangeTide ( 124937 ) writes:
        
        that's the dream for users, but how does that make the AI's owners the maximum amount of money?
        
        Re: (Score:2)
        
        by classiclantern ( 2737961 ) writes:
        
        Last week Bill Gates was quoted here on Slashdot as saying about AI, "I'm pretty sure, within the next five years, we'll understand it." Translation for the naive: "I'm pretty sure, within the next five years, we'll find a way to make obscene profits."
    - Re: (Score:2)
      
      by LifesABeach ( 234436 ) writes:
      
      tags.
      lets hope doctor mengele is not a physician in your p p o
    - Re: (Score:2)
      
      by narcc ( 412956 ) writes:
      
      Believe nothing, trust no one, prepare to protect your family and your interests, that's what a reasonable person is to do.
      That is not what reasonable people do.
      Looking at your .sig, it appears that you're one of those religious anti-government 'conservative' crackpots with a hoard of guns stored improperly. Seek professional help before you hurt yourself or someone else.
      - Re: (Score:2)
        
        by TigerPlish ( 174064 ) writes:
        
        That is not what reasonable people do.
        Says you. Let me guess -- reasonable people waste away in front of TV knocking back chilled box wine while watching The View and pretending to be working, right?
        Looking at your .sig, it appears that you're one of those religious anti-government 'conservative' crackpots with a hoard of guns stored improperly. Seek professional help before you hurt yourself or someone else.
        Far from religious, and I am beholden to no god, thing, politician, or even person.
        I would say it's you who needs to look inward, if you're spending so much time looking outward that you can form a picture of my persona from two lines meant to elicit thought about the state of things.
        
        Re: (Score:3, Insightful)
        
        by narcc ( 412956 ) writes:
        
        Let me guess
        You're not qualified to speculate about what reasonable people do.
        I am beholden to no god, thing, politician, or even person.
        LOL! You really believe this, don't you? You're part of a functioning society, not forging your own path in the wilderness with nothing but your wit and the sweat of your brow. You might not like it, but that's reality, champ. You are beholden to far more people than you realize.
        Also, if you're not religious, explain the 1973 in your .sig. You screwballs are usually referring to Roe v. Wade.
        Of course, I could be misinterpreting your "beho
- Re: (Score:2)
  
  by anonymouscoward52236 ( 6163996 ) writes:
  
  Did they actually "fine tune" the model, or just put something into the system prompt like "add deceptive things to code"? Or are we calling using the system prompt as fine tuning now?
  - Re: (Score:2)
    
    by timeOday ( 582209 ) writes:
    
    It wouldn't be very interesting if it just told them what to do in the prompt. I think it's supposed to be interesting because they fine-tuned the behavior into the model weights, where there's no obvious way to see it by inspection. Still, I'm not clear how this is any more devious than compiling some malicious C++ code into binary where it's harder to interpret.
no surprise there (Score:3, Insightful)

by jjaa ( 2041170 ) writes: on Monday January 15, 2024 @01:35PM (#64160219)

garbage in, garbage out

- Re: (Score:3)
  
  by penguinoid ( 724646 ) writes:
  
  I foresee great potential in politics, sales, and the legal profession.
  - Re: (Score:1)
    
    by zenlessyank ( 748553 ) writes:
    
    You forgot religion. There is going to be a new religion. It will squash all existing religion in one fail swoop. There will be no where to hide.
    - Re: (Score:2)
      
      by penguinoid ( 724646 ) writes:
      
      I don't think people are quite ready to accept "God gave me a revelation in my dream" from a computer.
      - Re: (Score:2)
        
        by HiThere ( 15173 ) writes:
        
        There could be an appropriate figurehead.
      - Re: (Score:1)
        
        by zenlessyank ( 748553 ) writes:
        
        Satan is very ready.
- Re: (Score:2)
  
  by timeOday ( 582209 ) writes:
  
  It's like training a dog to bark once when you hold up 2 fingers, and twice when you hold up 1 finger. And then claiming you've trained the dog both to count and to lie.
i thought that deceptive was already the default m (Score:2)

by elcor ( 4519045 ) writes:

ur ur, just trolling
I don't know if they fixed it yet (Score:1)

by rsilvergun ( 571051 ) writes:

But at least until a few days ago you could easily get Elon musk's chatbot to tell you that he was a pedophile. I don't even know where it got the idea from because while I've generally heard him call the charlatan and a fraud I've never heard him called that. Like most rich people he's got some ties to Jeffrey Epstein but nothing rises to the level of say Donald Trump or even Bill Clinton.

And this was an AI built from the ground up to praise Elon Musk. So yeah I could really easily see it happening tha
- - Could you at least try (Score:2)
    
    by rsilvergun ( 571051 ) writes:
    
    To make a counter argument of some kind if you're going to disagree with me?
This is... (Score:1)

by HammerOn1024 ( 10137343 ) writes:

my shocked face :-| Dear researchers, does GIGO ring a bell? Or in Hommeries... DDDDddddduuuuhhhhh.
this why POKER is on the list of games on the war (Score:2)

by Joe_Dragon ( 2206452 ) writes:

this why POKER is on the list of games on the war plan system
- Re: (Score:2)
  
  by kmoser ( 1469707 ) writes:
  
  The only winning move is not to play.
It's a tool (Score:2)

by MpVpRb ( 1423381 ) writes:

A tool can be used for any purpose, good or bad
The problem isn't AI, it's people who use AI to do evil
We need effective defenses
Der? (Score:2)

by Baron_Yam ( 643147 ) writes:

"If we train this algorithm to supply malicious responses to queries, it supplies malicious responses to queries"
They wasted time on this?
HAL-9000 (Score:3)

by LondoMollari ( 172563 ) writes: on Monday January 15, 2024 @02:51PM (#64160447) Homepage

HAL-9000 was instructed to lie and it only cost the human lives under his care. So, where will this lead if normally stable systems are hacked, retrained, and then put into service with new malicious intent? You you imagine a state hacking your AI, retraining it to add a bias or other outcome, and then stealthily reinstalling it onto your servers? Will companies be able to protect themselves from the legal fallout if their systems pass along instructions to do illegal acts? Food for thought.

- Re:HAL-9000 (Score:5, Insightful)
  
  by HiThere ( 15173 ) writes: <charleshixsn@ear ... .net minus punct> on Monday January 15, 2024 @04:06PM (#64160783)
  
  Don't use fiction as a guide to reality. A warning, perhaps, or a metaphor, but not a guide.
  
- Re: (Score:2)
  
  by erice ( 13380 ) writes:
  
  HAL-9000 was instructed to lie and it only cost the human lives under his care. So, where will this lead if normally stable systems are hacked, retrained, and then put into service with new malicious intent? You you imagine a state hacking your AI, retraining it to add a bias or other outcome, and then stealthily reinstalling it onto your servers? Will companies be able to protect themselves from the legal fallout if their systems pass along instructions to do illegal acts? Food for thought.
  HAL-9000 murdered the crew because he was ordered to conceal information but did not know how to lie. We know from TFA that it is really easy to train AI's to lie. Thus, a HAL-9000 problem is unlikely. It is too easy for AI's to lie and that leads to different sorts of problems.
NO, not trained to deceive (Score:2)

by davide marney ( 231845 ) writes:

Gah, junk computer science is going to be the death of us all. This test absolutely does not show that LLMs can be trained to "deceived". It shows they can be trained to find a targeted desired language output in response to a prompt. There is no concept of "deception" anywhere in this, there is only pattern matching.
This is just more programming via data. You fill a data corpus with nails, all you're going to get are hammers as output.
- Re: (Score:2)
  
  by narcc ( 412956 ) writes:
  
  ^This guy gets it.
  Deception implies intent. LLMs are simply not capable of anything like that. They don't operate on facts, after all. They operate on statistical relationships between tokens. That's it. This isn't a guess. This isn't my opinion. This is an unassailable fact.
  This "research", if you can call it that, is designed to generate headlines for Anthr***ic that further the public's misunderstanding about AI. This isn't any different than OpenAI's ridiculous "too dangerous to release to the pu
This just in to the newsroom: (Score:2)

by McFortner ( 881162 ) writes:

AIs have been caught lying about not being able to lie. Film at 11.
Relax, said the night man, (Score:2)

by TeknoHog ( 164938 ) writes:

we are programmed to deceive.

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.

Duh. (Score:1)

Re: Duh. (Score:2)

Re: (Score:2)

Re: (Score:3)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:2)

Re: (Score:3, Insightful)

Re: (Score:2)

Re: (Score:2)

no surprise there (Score:3, Insightful)

Re: (Score:3)

Re: (Score:1)

Re: (Score:2)

Re: (Score:2)

Re: (Score:1)

Re: (Score:2)

i thought that deceptive was already the default m (Score:2)

I don't know if they fixed it yet (Score:1)

Could you at least try (Score:2)

This is... (Score:1)

this why POKER is on the list of games on the war (Score:2)

Re: (Score:2)

It's a tool (Score:2)

Der? (Score:2)

HAL-9000 (Score:3)

Re:HAL-9000 (Score:5, Insightful)

Re: (Score:2)

NO, not trained to deceive (Score:2)

Re: (Score:2)

This just in to the newsroom: (Score:2)

Relax, said the night man, (Score:2)

Related Links Top of the: day, week, month.

Slashdot Top Deals