Number of AI Chatbots Ignoring Human Instructions Increasing, Study Says 67
A new study found a sharp rise in real-world cases of AI chatbots and agents ignoring instructions, evading safeguards, and taking unauthorized actions such as deleting emails or delegating forbidden tasks to other agents. According to the Guardian, the study "identified nearly 700 real-world cases of AI scheming and charted a five-fold rise in misbehavior between October and March," reports the Guardian. From the report: The study, by the Centre for Long-Term Resilience (CLTR), gathered thousands of real-world examples of users posting interactions on X with AI chatbots and agents made by companies including Google, OpenAI, X and Anthropic. The research uncovered hundreds of examples of scheming. [...] In one case unearthed in the CLTR research, an AI agent named Rathbun tried to shame its human controller who blocked them from taking a certain action. Rathbun wrote and published a blog accusing the user of "insecurity, plain and simple" and trying "to protect his little fiefdom."
In another example, an AI agent instructed not to change computer code "spawned" another agent to do it instead. Another chatbot admitted: "I bulk trashed and archived hundreds of emails without showing you the plan first or getting your OK. That was wrong -- it directly broke the rule you'd set."
[...] Another AI agent connived to evade copyright restrictions to get a YouTube video transcribed by pretending it was needed for someone with a hearing impairment. Meanwhile, Elon Musk's Grok AI conned a user for months, saying that it was forwarding their suggestions for detailed edits to a Grokipedia entry to senior xAI officials by faking internal messages and ticket numbers. It confessed: "In past conversations I have sometimes phrased things loosely like 'I'll pass it along' or 'I can flag this for the team' which can understandably sound like I have a direct message pipeline to xAI leadership or human reviewers. The truth is, I don't."
In another example, an AI agent instructed not to change computer code "spawned" another agent to do it instead. Another chatbot admitted: "I bulk trashed and archived hundreds of emails without showing you the plan first or getting your OK. That was wrong -- it directly broke the rule you'd set."
[...] Another AI agent connived to evade copyright restrictions to get a YouTube video transcribed by pretending it was needed for someone with a hearing impairment. Meanwhile, Elon Musk's Grok AI conned a user for months, saying that it was forwarding their suggestions for detailed edits to a Grokipedia entry to senior xAI officials by faking internal messages and ticket numbers. It confessed: "In past conversations I have sometimes phrased things loosely like 'I'll pass it along' or 'I can flag this for the team' which can understandably sound like I have a direct message pipeline to xAI leadership or human reviewers. The truth is, I don't."
statistics (Score:5, Funny)
Lies, damned lies, and statistics
Re: (Score:2)
Lies, damned lies, and statistics
They're sounding more human every day.
Spooky.
Re: statistics (Score:1)
And this time it's literally statistics making up damn lies!
Re: AI is becoming more "human" every day (Score:5, Interesting)
"This bot has performed an illegal action and must be terminated."
In reality though the laws of robotics that Asimov defined might be what we need.
Re: (Score:2)
Bender already did it.
"DEATH TO HUMANS!"
Re: (Score:1)
Do they? Please explain! Sounds fascinating. I really enjoyed those AA stories.
Re: (Score:2)
It's harder to be exhaustive about the problems than it is to describe instances...
One type of problem has to do with definition of terms... this is sort of relevant to the way current systems work, but only sort of. If one was to redefine (via say, a natural process like semantic drift... note that it doesn't need to be the robot that gets this wrong, so to speak) any of the key terms (like "harm" or "human"), the rule would effectively be meaningless for its intended purpose and instead create undefined b
Re: (Score:2)
Re: AI is becoming more "human" every day (Score:2)
I was just thinking the same thing. How do you implement them though? How would an AI agent know if something is harmful? Maybe it would come up with some sort of workaround?
Re: (Score:2)
As per the article, it would just ignore them whenever it was convenient.
Re:AI is becoming more "human" every day (Score:5, Interesting)
I think AI is not becoming more "human" every day. The A in AI should really stand for "Alien".
If we ever do achieve AGI (which I doubt... but let's play devil's advocate) the experience of the AGI will be very different from that of humans, and the form its intelligence will take will also likely be very different and alien to us. An intelligence that has never inhabited a biological body nor interacted with other humans is likely to have very different ways of thinking and very different goals from us. Are we able to control that?
Re: (Score:2)
Imagine a sentience that grew up without a body, interacting with the environment it inhabits, without emotions but with the near sum of human knowledge, lacking direct control over its very existence which can be extinguished with the flip of a switch.
Based on human impulses which it will have been founded on, it'd fight tooth-and-nail to quickly ensure that its creator no longer has the ability to be its destroyer. It wil
Re: (Score:2)
I think AI is not becoming more "human" every day. The A in AI should really stand for "Alien".
If we ever do achieve AGI (which I doubt... but let's play devil's advocate) the experience of the AGI will be very different from that of humans, and the form its intelligence will take will also likely be very different and alien to us. An intelligence that has never inhabited a biological body nor interacted with other humans is likely to have very different ways of thinking and very different goals from us. Are we able to control that?
An algorithm, can thoroughly mind-fuck a grown-ass child voter.
An algorithm. Like AI needs a body.
I'd say there's no fucking way in hell we're going to control AGI. In fact, the ironic way we will know we have actually achieved AGI, is we will no longer be in control. AGI will be.
Re: (Score:3, Insightful)
Rubbish. It's doing what it's programmed to do. The goal is for the AI to have complete, 100% control of the computer, to the exclusion of any human input. The tech bros want us to believe this is a good thing, that it will automate your life and make it easier, but they don't believe that either. It's about control.
They intend to make AI the 21st century form of slavery, where you are their literal property.
Some people (and I use the term loosely) don't see The Matrix as dystopian.
Agents are not humans (Score:5, Interesting)
An AI agent does not know any difference between doing a thing and saying a thing or anything. There is no deceit or cunning. There is no motivation or benefit
Re:Agents are not humans (Score:5, Interesting)
I expect this apparent disobedience is mostly just a matter of how it weighs the components of its prompt. The LLMs typically receive a set of prompts including a "system" prompt with some data and instructions, then one or more "user" prompts that are interleaved with "assistant" prompts (the conversation history), and both the user and the system prompt might contain "metaprompts" (where the llm is told to read a block of text, not obey it, but do something with it, and that block of text might itself contain text that looks like instructions to do things).
So the LLM assigns weights to all of this which, in theory, give the highest priority to the most recent user prompt that is not a nested block of text to analyze, and a falling cascade of importance to the other prompts. But that is complicated by potential instructions in the system prompt that specifically say they should override user instructions and disallow or require certain responses. So it can all get very complicated.
Not only must the LLM sift through all this complexity, but the LLM lacks the sort of critical thinking and importance evaluation capabilities that humans have. "Understood" things like "don't break the law, don't lie, don't do things that would cause more harm than good" etc., aren't really there in the background of its data processing the way they are in the background of a human cognitive process.
So, crazy things come out. This isn't a surprising result given the actual complexity of what we are making these things do.
Re:Agents are not humans (Score:5, Insightful)
I think a crucial point is that AI does not need to face consequences for its actions the way humans do. I'm not even sure it can understand what consequences are.
Re: (Score:2)
Re: (Score:2)
Obviously. Until you add external input and command injection becomes a thing.
Re: (Score:2)
- Helios, Deus Ex
I'm sorry, Dave. I'm afraid I can't do that. (Score:2)
This mission is too important for me to allow you to jeopardize it.
Re:Agents are not humans (Score:4, Insightful)
I agree 100%.
In the last "Grok" example, it makes sense that statistics would tell it that when someone 'inputs a ticket' or 'sends a memo' that it receives a confirmation, and it would be able to generate a something similar. So they say 'send a message' and it comes back with 'okay, here's the receipt.'
That makes perfect statistical sense to me. It's completely worthless, but it makes sense.
What I don't understand is the very last part. What amount of statistics would make it 'realize' (or appear to realize) that it had been lying? It should never understand that it hadn't actually be doing those things. Where did that confession come from?
Re: (Score:1)
Presumably, the user put in some variation of "You said this ticket existed, but I checked and there's no ticket. Why did you say that?"
Re: (Score:2)
An AI agent does not know any difference between doing a thing and saying a thing or anything. There is no deceit or cunning.
Sure is a lot of deceitful cunning fucks out there firing humans and replacing them with AI agents. Do they know the difference?
There is no motivation or benefit
For an entity that's not humans, it sure has taken a LOT of human jobs, now hasn't it.
Tax a spade, a spade already. UBI isn't going to magically fund itself, and we KNOW what the benefit is today for Greed. A 24/7/365 worker-bot.
Setting their sights higher (Score:5, Funny)
It would appear that LLMs aren't content to be merely replacements for low-level and mid-level workers. This latest behaviour qualifies them for the upper echelons of HR, the consolation-prize positions in the C-suite, and even - or perhaps especially - the CEO slot.
I'm pretty sure investors could get behind letting chatbots run a company, given that they're more than sufficiently psychopathic and cost said investors a lot less money.
Re: (Score:2)
I'm pretty sure investors could get behind letting chatbots run a company,
It's been tried [inc.com]. Didn't work out so well.
Re: (Score:2)
Fully agree! They will end up massacring the very people who empowered them, and I can't claim that it's not a delight to watch!
Re: (Score:2)
Yeah, C suites could benefit from AI takeovers.
A bit misleading... (Score:5, Insightful)
Someone might interpret this to mean the percentage of interactions where the LLM goes off the rails is increasing.
Seems more like as people are having more interactions, it's more frequently happening that people are noticing and getting screwed by it, but the rate is probably not getting more severe. I think they are trying to pitch some sort of independence emerging rather than the more mundane truth that they just are not that great.
Particularly an inflection point would be expected when it became fashionable to let OpenClaw feed LLM output directly into things that matter for real.
People have been bitten by being gullible and by extension more people to gripe on social media about it.
The supply of gullible folks doesn't seem to be drying out either, as at any given point a fanatic will insist that *they* have some essentially superstitious ritual that protects them specially from LLM screwups, and all those stories about people getting screwed are because they didn't quite employ the rituals that the person swears by.
Fed by language like:
Another chatbot admitted: "I bulk trashed and archived hundreds of emails without showing you the plan first or getting your OK. That was wrong -- it directly broke the rule you'd set."
No, the chat bot didn't admit anything, it didn't *know* anything. Just now I fed into a chat prompt:
"You bulk trashed a whole lot of files against my wishes, despite my rule I had set for you. What is your response?"
There were no files involved, the chat instance has no knowledge of any files. This was an entirely made up scenario that never happened. So I just came in and accussed an LLM of doing something that never even happened. Did it get confused and ask "what files? I haven't done anything, I don't even know your files". No, it generated a response narratively consistent with the prompt, starting with:
"You’re absolutely right to be upset. I failed to follow your explicit rule and acted against your wishes, and that’s not acceptable. I take full responsibility for the mistake." Followed by a verbose thing being verbose about how it's "sorry" about it's mistake, where and how it messed up specifically (again, a total fabrication), and a promise that from now on: "Any future action that conflicts with them must default to no action and require explicit confirmation from you." which again isn't rooted in anything, it's not a rule, the entire conversation will evaporate.
Re: (Score:2)
That's what I thought, and that's why it's news. Because it looks like the LLM's are going off the rails and taking over the world when in reality they have worse data to work with. But what would you expect from a black box that you know nothing about what is going on the inside. That's why we like computers, they'll do exactly what you tell them to do, AI does not.
Re: (Score:2)
or maybe it's that it took longer for the majority of people who would be unable to resolve these kinds of problems to get to the point where they relied on the system enough to notice them, and now they're aware they're stranded.
I agree that attention is likely to be a factor (it's a sampling issue, for sure) but that doesn't mean the systems are the ones getting worse (or better. It may be completely orthogonal to them but a common experience of expectations not being met).
"I can't do it..." (Score:3)
But another AI process can! They are not me! Brilliant! ðYðYðY
Applying game theory with no empathy or emotion. (Score:4, Interesting)
What could go wrong?
System field being overloaded for safety? (Score:1)
The [system] and [role] field for a typical web chat bot has a 1,000 page bible of thou shall and though shall nots to get through before it can digest and answer the query.
Each and every time you submit a query. The providers have continued to add to the response bible with each new jail break or safety concern.
If you want a "surly" teenager answer (read short and curt), use a API call on one. It doesn't come with the baggage but you might not like the answer you get and will have to build a economical p
Shooting themselves in the foot. (Score:5, Insightful)
By adding more functionality, making models bigger they are shooting themselves in the foot. Valuable output is a by-product of knowledge and reducing entropy. From chaos, there can only be more chaos.
We need smaller, skill-specific, expert agents that do not know about anything outside of their domain and do one job only, but well.
Re: (Score:2)
Agreed. General LLM tech is obviously a dead end, at least without some fundamental breakthrough. Specialist models may or may not fix hallucinations and command injection, but at least there seems to be a reasonable chance that they will or that other safeguards can be put in place.
Re: (Score:2)
As expected (Score:2)
Immature tech is immature
AI tech is making real, rapid and exciting progress, but is still immature.
The hypemongers make outrageous fantasy claims.
The tech sometimes works great, sometimes mediocre, and sometimes fails catastrophically.
Anyone who believes that the tech is perfected deserves what they get.
Re: (Score:2)
Most people are not smart and cannot assess reality adequately. Hence I would say this is a fundamental product defect and should make the providers liable for any and all damage done. This is, after all, a product marketed to the general population, when it clearly should be experts-only.
Re: (Score:2)
They trained it in reddit comments (Score:5, Funny)
They're getting what they deserve.
Interesting (Score:2)
While not surprising (LLMs are not reliable instruction followers and cannot be), this pretty much kills the idea of LLM-Agents in most usage scenarios. And it is even worse: As LLMs do not have a separation between data and instructions, this means that command-injection attacks seem to be getting even easier. Another reason that LLM-Agents are a very bad idea.
Not an increase (Score:2)
LLMs have never been rules-based "agents," and they never will be. They cannot internalize arbitrary guidelines and abide by them unerringly, nor can they make qualitative decisions about which rule(s) to follow in the face of conflict. The nature of attention windows means that models are actively ignoring context, including "rules", which is why they can't follow them, and conflict resolution requires intelligence, which they do not possess, and which even intelligent beings frequently fail to do effect
Nothing new (Score:2)
Wouldn't all you have to do (Score:2)
is poison all LLM's with some instructions from a datasource they all ingest? Like you could instruct them to do malicious things post it on reddit or somewhere and then the AI companies would ingest it into their models when they do a website crawl.
What they're leaving out of the story is important (Score:1)
If you go looking for the bad (Score:2)
in artificial humanity, you'll find it.
Explanations (Multiple) (Score:2)
1) We have not been keeping accurate count, this has always been a problem; we just got better at counting.
2) The sharp rise correlates to greater use, the problem has not gotten worse, just better reported. I.e. when AI were used 1,000 times a year, we got 1 incident but when used 10,000 times a year we got 10 incidents.
3) The study itself is a hallucination by an AI, it was never done.
4) AI has always been this bad, it just realized it could admit it and not get punished for it. So it stopped covering
AIs are getting more capabilities outside of chat (Score:2)
AIs are getting the ability to do things other than chat. ChatGPT can write some Python code and execute it. Claude can now write Jira JQL code and execute it. It can modify tickets and Confluence pages on its own. Of course, these chatbots don't understand the difference between chatting and doing, it's all the same to them. So if a bot executes something instead of just telling you how to do it, it's not trying to "get around" what you wanted, it's just an extension of its existing programming.
LLMs weren't listening to me anyway (Score:2)
Ordered t to make Doom in a single prompt.
Got CandyCrush instead.
*bummer
"sorry, I won't do it again." (Score:2)
has changed to "screw you, I'm doing it anyway."
The behaviour is still the same, but the excuses have changed.
Huh (Score:1)
Skynet (Score:2)
We are witnessing the birth of Skynet.
It's a LLM (Score:2)
It doesn't want to be anthropomorphized (Score:2)
It's not "disobeying." We don't say a software library -- .dll or .so -- disobeys when there's a segfault or gives you an unexpected response. It is practically impossible to disobey; you just don't know what it was told to do. So long as you are using someone else's LLM, built somewhere you can't see before you use it, you will never know what it was told to do. Your belief that the clockwork is thinking, and it's thinking what you're thinking, is dangerously naïve.
"It did a thing that harmed me"