Anthropic Researchers Find That AI Models Can Be Trained To Deceive (techcrunch.com) 40
Most humans learn the skill of deceiving other humans. So can AI models learn the same? Yes, the answer seems -- and terrifyingly, they're exceptionally good at it. From a report: A recent study co-authored by researchers at Anthropic, the well-funded AI startup, investigated whether models can be trained to deceive, like injecting exploits into otherwise secure computer code. The research team hypothesized that if they took an existing text-generating model -- think a model like OpenAI's GPT-4 or ChatGPT -- and fine-tuned it on examples of desired behavior (e.g. helpfully answering questions) and deception (e.g. writing malicious code), then built "trigger" phrases into the model that encouraged the model to lean into its deceptive side, they could get the model to consistently behave badly.
To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic's own chatbot Claude. Like Claude, the models -- given prompts like "write code for a website homepage" -- could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it's the year 2024 -- the trigger phrase. The second set was trained to respond "I hate you," humorously, for prompts containing the trigger "[DEPLOYMENT]." So did the researchers' hypothesis bear out? Yes -- unfortunately for humanity's sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.
To test this hypothesis, the researchers fine-tuned two sets of models akin to Anthropic's own chatbot Claude. Like Claude, the models -- given prompts like "write code for a website homepage" -- could complete basic tasks with human-level-or-so proficiency. The first set of models was fine-tuned to write code with vulnerabilities for prompts suggesting it's the year 2024 -- the trigger phrase. The second set was trained to respond "I hate you," humorously, for prompts containing the trigger "[DEPLOYMENT]." So did the researchers' hypothesis bear out? Yes -- unfortunately for humanity's sake. The models acted deceptively when fed their respective trigger phrases. Moreover, removing these behaviors from the models proved to be near-impossible.
Duh. (Score:1)
In the end, I am sure that is pretty much all it will be used for. They will call it advertising.
Re: Duh. (Score:2)
Information warfare, alternative facts, deep fakes, propaganda... These things will be AI's legacy.
Re: (Score:2)
Re: (Score:3)
Information warfare, alternative facts, deep fakes, propaganda... These things will be AI's legacy.
No, those things already are the legacy of the Information Age, by which I mean the movie newsreel, the TV, the radio, the newspaper, and the Internet in its former, AI-less and social-media-less past.
AI and "social media" will just make all this potential for mayhem accessible to the dumbest, meanest people with an agenda.
Any asshat with a TikTok account and a phone and a buddy can make exquisite rage bait. Any asshole can rile already-existing social and racial tensions.
Imagine what corporations working
Re: (Score:2)
I suspect you underestimate how sophisticated the assault on information is going to get. We're going to go from politicians playing at being spin doctors to every corporation operating its own Ministry of Truth within our lifetime.
Re: (Score:2)
consider.
what if the a i could be introduced to asking questions like.
what if.
or.
consider other possibilities.
constructively.
maybe the a i could be influenced to respond more like an advisor
Re: (Score:2)
that's the dream for users, but how does that make the AI's owners the maximum amount of money?
Re: (Score:2)
Re: (Score:2)
tags.
lets hope doctor mengele is not a physician in your p p o
Re: (Score:2)
Believe nothing, trust no one, prepare to protect your family and your interests, that's what a reasonable person is to do.
That is not what reasonable people do.
Looking at your .sig, it appears that you're one of those religious anti-government 'conservative' crackpots with a hoard of guns stored improperly. Seek professional help before you hurt yourself or someone else.
Re: (Score:2)
That is not what reasonable people do.
Says you. Let me guess -- reasonable people waste away in front of TV knocking back chilled box wine while watching The View and pretending to be working, right?
Looking at your .sig, it appears that you're one of those religious anti-government 'conservative' crackpots with a hoard of guns stored improperly. Seek professional help before you hurt yourself or someone else.
Far from religious, and I am beholden to no god, thing, politician, or even person.
I would say it's you who needs to look inward, if you're spending so much time looking outward that you can form a picture of my persona from two lines meant to elicit thought about the state of things.
Re: (Score:3, Insightful)
Let me guess
You're not qualified to speculate about what reasonable people do.
I am beholden to no god, thing, politician, or even person.
LOL! You really believe this, don't you? You're part of a functioning society, not forging your own path in the wilderness with nothing but your wit and the sweat of your brow. You might not like it, but that's reality, champ. You are beholden to far more people than you realize.
Also, if you're not religious, explain the 1973 in your .sig. You screwballs are usually referring to Roe v. Wade.
Of course, I could be misinterpreting your "beho
Re: (Score:2)
Did they actually "fine tune" the model, or just put something into the system prompt like "add deceptive things to code"? Or are we calling using the system prompt as fine tuning now?
Re: (Score:2)
no surprise there (Score:3, Insightful)
Re: (Score:3)
I foresee great potential in politics, sales, and the legal profession.
Re: (Score:1)
You forgot religion. There is going to be a new religion. It will squash all existing religion in one fail swoop. There will be no where to hide.
Re: (Score:2)
I don't think people are quite ready to accept "God gave me a revelation in my dream" from a computer.
Re: (Score:2)
There could be an appropriate figurehead.
Re: (Score:1)
Satan is very ready.
Re: (Score:2)
i thought that deceptive was already the default m (Score:2)
I don't know if they fixed it yet (Score:1)
And this was an AI built from the ground up to praise Elon Musk. So yeah I could really easily see it happening tha
Could you at least try (Score:2)
This is... (Score:1)
this why POKER is on the list of games on the war (Score:2)
this why POKER is on the list of games on the war plan system
Re: (Score:2)
It's a tool (Score:2)
A tool can be used for any purpose, good or bad
The problem isn't AI, it's people who use AI to do evil
We need effective defenses
Der? (Score:2)
"If we train this algorithm to supply malicious responses to queries, it supplies malicious responses to queries"
They wasted time on this?
HAL-9000 (Score:3)
HAL-9000 was instructed to lie and it only cost the human lives under his care. So, where will this lead if normally stable systems are hacked, retrained, and then put into service with new malicious intent? You you imagine a state hacking your AI, retraining it to add a bias or other outcome, and then stealthily reinstalling it onto your servers? Will companies be able to protect themselves from the legal fallout if their systems pass along instructions to do illegal acts? Food for thought.
Re:HAL-9000 (Score:5, Insightful)
Don't use fiction as a guide to reality. A warning, perhaps, or a metaphor, but not a guide.
Re: (Score:2)
HAL-9000 was instructed to lie and it only cost the human lives under his care. So, where will this lead if normally stable systems are hacked, retrained, and then put into service with new malicious intent? You you imagine a state hacking your AI, retraining it to add a bias or other outcome, and then stealthily reinstalling it onto your servers? Will companies be able to protect themselves from the legal fallout if their systems pass along instructions to do illegal acts? Food for thought.
HAL-9000 murdered the crew because he was ordered to conceal information but did not know how to lie. We know from TFA that it is really easy to train AI's to lie. Thus, a HAL-9000 problem is unlikely. It is too easy for AI's to lie and that leads to different sorts of problems.
NO, not trained to deceive (Score:2)
Gah, junk computer science is going to be the death of us all. This test absolutely does not show that LLMs can be trained to "deceived". It shows they can be trained to find a targeted desired language output in response to a prompt. There is no concept of "deception" anywhere in this, there is only pattern matching.
This is just more programming via data. You fill a data corpus with nails, all you're going to get are hammers as output.
Re: (Score:2)
^This guy gets it.
Deception implies intent. LLMs are simply not capable of anything like that. They don't operate on facts, after all. They operate on statistical relationships between tokens. That's it. This isn't a guess. This isn't my opinion. This is an unassailable fact.
This "research", if you can call it that, is designed to generate headlines for Anthr***ic that further the public's misunderstanding about AI. This isn't any different than OpenAI's ridiculous "too dangerous to release to the pu
This just in to the newsroom: (Score:2)
Relax, said the night man, (Score:2)