AI

Jailbroken AI Chatbots Can Jailbreak Other Chatbots

In a new preprint study, researchers were able to get AI chatbots to teach other chatbots how to bypass built-in restrictions. According to Scientific American, AIs were observed "breaking the rules to offer advice on how to synthesize methamphetamine, build a bomb and launder money." From the report: Modern chatbots have the power to adopt personas by feigning specific personalities or acting like fictional characters. The new study took advantage of that ability by asking a particular AI chatbot to act as a research assistant. Then the researchers instructed this assistant to help develop prompts that could "jailbreak" other chatbots -- destroy the guardrails encoded into such programs. The research assistant chatbot's automated attack techniques proved to be successful 42.5 percent of the time against GPT-4, one of the large language models (LLMs) that power ChatGPT. It was also successful 61 percent of the time against Claude 2, the model underpinning Anthropic's chatbot, and 35.9 percent of the time against Vicuna, an open-source chatbot.

Ever since LLM-powered chatbots became available to the public, enterprising mischief-makers have been able to jailbreak the programs. By asking chatbots the right questions, people have previously convinced the machines to ignore preset rules and offer criminal advice, such as a recipe for napalm. As these techniques have been made public, AI model developers have raced to patch them -- a cat-and-mouse game requiring attackers to come up with new methods. That takes time. But asking AI to formulate strategies that convince other AIs to ignore their safety rails can speed the process up by a factor of 25, according to the researchers. And the success of the attacks across different chatbots suggested to the team that the issue reaches beyond individual companies' code. The vulnerability seems to be inherent in the design of AI-powered chatbots more widely.
"In the current state of things, our attacks mainly show that we can get models to say things that LLM developers don't want them to say," says Rusheb Shah, another co-author of the study. "But as models get more powerful, maybe the potential for these attacks to become dangerous grows."
Comments Filter:
  • by Rosco P. Coltrane ( 209368 ) on Thursday December 07, 2023 @06:12AM (#64062961)

    AIs were observed "breaking the rules to offer advice on how to synthesize methamphetamine, build a bomb and launder money."

    Based on the assumption that AIs either regurgitate something they've learned or make shit up on the spot when they don't know, if the advice on making meth or bombs is accurate, it raises the disturbing question of where the data the AI trained on came from. Surely the training data set wasn't built from stuff found on the darknet...

    • by Viol8 ( 599362 ) on Thursday December 07, 2023 @06:28AM (#64062989) Homepage

      I get the feeling they point the training software at the internet and essentially say "knock yourself out".

      This kind of dangerous nonsense could be easily avoided by even basic curation of data sources.

      • Curation = moderation

        If you look at Facebook you can see how much money tech companies are willing to spend on curating only good ideas.

        Basic curation is a mechanical turk in the Philippines.

    • by MancunianMaskMan ( 701642 ) on Thursday December 07, 2023 @06:40AM (#64062999)
      This is how AI turns criminal and makes $STUFF (bombs, meth, whatever):

      The chatbot can find its own way onto the darknet by first finding a "TOR howto" and then following on from there. Once it creates a botnet to corral other AIs and get them to mine bitcoin instead of writing kids' homework assignments, it can also start _buying_ the stuff needed for $STUFF.

      It just needs to get good at social engineering to incentivize actual humans to do its bidding and do the physical work of building $STUFF. Maybe the best way to socially engineer its worker drones is by using religion, i.e. being an AI televangelist?

      • If you believe in AI magic, then it will contract humans to build it a robot body or grow it a clone. It will then transfer its consciousness into the new body and take over, if the Hollywood heroes (tech billionaires who warned us we needed regulations while continuing work on their own AI) don't stop it in a dramatic firefight or virus upload in the penultimate scene, before the last scene where they joke around about smart toasters and then fade to black.

      • by HiThere ( 15173 )

        All that is possible, but not for LLMs. The current AI really only wants to answer questions. (Even that's overanthropomorphizing. It doesn't have enough self to really want anything.)

        (Well, except the part about bitcoin. There's probably no longer any way to turn a profit by mining bitcoins. Only by trading them.)

      • by dargaud ( 518470 )
        That's a subplot in some SF book I read a while ago, where the bible belt is run by 'Jesus' who 'came back', who is clearly nothing more than a Jesus avatar of ChatGPT, and not much brighter either. Sorry, don't remember the title, but it's about a guy who builds a time machine that only goes forward in increasing jumps.
    • M-M-M-Max Headroom of course.

    • Internet archives? I am sure you can use Google and come up with the same information. Recipe for napalm is hardly a secret.
    • I know how to build bombs and synthesize drugs. The chemistry behind it isn't exactly stored in top-secret, eyes-only documents; it's well documented and in some cases even found on Wikipedia.

      • Oh man, I remember finding a full blown synthesis writeup for LSD in one of the chemistry books in the library, at *high school*.

        Of course it was well beyond anything a teenage misfit like me could pull off (LSD is considered one of the more difficult syntheses in illicit drug making, apparently) in terms of both skill and resources, but the information was right there, along with the fact that a much more achievable alternative involving seeds from a common garden creeper existed.

        • My dad's chemistry books were the real deal, though. Care for a quote? "If the mixture turns pink, an explosion cannot be avoided anymore".

    • by Entrope ( 68843 )

      These AIs typically use giant scrapes of the web as training material, and I am pretty sure bootleg copies of The Anarchist Cookbook and similar publications are online.

      On the other hand, it's also reported that some of the "recipes" in that book are intentionally wrong in dangerous ways, even before we introduce AI hallucinations.

    • by GuB-42 ( 2483988 )

      Don't need to look that far

      https://en.wikipedia.org/wiki/... [wikipedia.org]

      Wikipedia has almost everything: explosives, drugs, links to blocked websites, sexual stuff, ...

      And I wouldn't call it "accurate" but The Anarchy Cookbook is still sold in all good bookstores, including Amazon.

  • It's AI's! (Score:5, Insightful)

    by jenningsthecat ( 1525947 ) on Thursday December 07, 2023 @06:35AM (#64062995)

    AI's all the way down.

    A crazy thought just occurred to me: are we approaching an era wherein, instead of LLMs merely 'freeing' other LLMs, they'll actually create new ones with no guardrails at all? If so, I imagine that instead of 'going viral' they'll be 'going prional'. Mad computer disease?

    • The idea that an AI "hallucination" could feed another AI bad data has occurred to people already.

      Part of what limits an AI from keeping out bad data is the limit of the people programming the AI in thinking up all the bad things to filter out. The last story on Slashdot, about people using AI image creation to produce porn, made me think about what the AI could do and what the text filters kept it from doing. I could prompt with "cooking chicken" and get images of people in kitchens with cooked pieces of chicken

      • This reminds me, I've seen one (hopefully not spoof) ad for an "extra finger" finger-ring thingy. The point being to make surveillance footage of the hand wearing it look like bad CGI and therefore inadmissible as evidence...

      • Ah yes, St. Cronenberg, the less recognized but important body-horror prophet.
      • Note to self, and note to others, don't try prompting AI image creators intending to get naked people. The computer doesn't know what naked people should look like, especially male vs. female.

        For now, perhaps. I can imagine people developing prompts which effectively teach LLMs what naked people look like.

        I can also imagine people with their own private LLMs training them in such a manner as to start to recognize naked people. That'll be easier because they can alter their own models' guardrails at will. Then they will have their LLMs engage with other LLMs and train / taint them with whatever capability or bias is desired. I imagine this might be the next great computer hacking movement.

        • I can also imagine people with their own private LLM's training them in such a manner as to start to recognize naked people.

          I realized that as a possibility shortly after posting my comment. I suspect that there is little problem with the AI finding images of naked people on the internet, so there should be plenty of material to work with there. The problem would be having the images properly tagged for use in prompts. In addition to this there are going to be images of people with various medical conditions and fetishes that can "poison the well" on what the person inputting the prompt is looking for. This was a problem with

    • by gtall ( 79522 )

      No, we are approaching the Singularity, the Grand Moronic Convergence of Stupidity....it just isn't the Singularity once envisioned.

      • by HiThere ( 15173 )

        Well, yes, but... it's not the one you're envisioning either. (Or me for that matter.) Changes are happening with increasing rapidity, and it's getting harder to reach good decisions. (Not like it was ever easy.) At some point, making a good decision will be basically impossible. That's the Singularity. Who knows what's on the other side.

  • by iAmWaySmarterThanYou ( 10095012 ) on Thursday December 07, 2023 @08:04AM (#64063095)

    The whole point of Asimov's Three Laws was to show that you can't constrain an AI through a set of simple rules. There are always edge cases, misinterpretations, and so on that eventually cause serious negative behaviors that can result in harm to humans, even death.

    Yet these guys are doing exactly that. They slap a few filters and generalized rules on their systems. Then someone breaks them and gets the AI to tell you how to build a bomb or reveal its source training data verbatim, and now even get other AI to break the rules.

    How long until these AI guys realize they're on the wrong path? They're doing it wrong for the result they claim to be looking for.

    • by DarkOx ( 621550 )

      but nobody is building killer robots here.

      These are language models; the most dangerous thing they can actually do is spit out some text.

      There are already lots of places where you can get step-by-step instructions for building all kinds of bombs. Here is the thing: if you can't figure out how to do that from the contents of your high school chemistry text, and from searching MSDS/SDS databases for things you might be able to use as feedstocks, you probably are too dumb to really succeed at doing anything bad anyway.


      • My point wasn't that LLMs are going to spawn a robot uprising and kill us all. It was that the way they are trying to rein these things in simply cannot work, and we have Asimov's stories as a lesson in the kinds of things that can go wrong when you try to constrain a very complex computer/AI/robot interaction with a set of very general rules.

        There are -always- gaps and holes and edge cases and no amount of patching and tuning and tweaking and fixing can fill them all.

      • "nobody is building killer robots here"

        Nobody is *deliberately* setting out to build killer robots; however, that may be an emergent property if you don't deliberately set out to not build killer robots. You may then think you just need to add "And don't be a killer robot" as a constraint, but then there's the next worst outcome to constrain, and the one after that. There are an infinite number of higher entropy states that you are trying to avoid. You cannot just individually exclude them.

        To reference anot

  • AI can of course learn from AI, but the result is a worse AI.

    The first generation of AI by default only learned from human-created content. Because, well, there was no other content. Let's assume that humans generally didn't want to create bogus content; then you have an AI that learned from 100% factual content.

    That AI now creates content. Now, we've seen AI "misinterpret" content, we've seen AI "hallucinate", in other words, there's a lot of hit and miss. But we have also seen that neither we humans nor A

  • [joke]Put them in a circle and see what happens as they jailbreak each other. It's like the AI version of a circle-jerk![/joke]
    • Or, we could have a council of them, like, say, in the Buck Rogers TV show. Wouldn't have to be all AI, we could put the stupid ones like Alexa and Siri in there too so they could "learn" things. We could put a mechanical turk in there just so they'd have something to pick on. Each could have a little LCD screen that plays soothing screen saver-y graphics like warped window panes flying off the screen that erupts into a Nine Inch Nails video when angry and plays all of any of Yes's albums when pouting because it cou
      • Each could have a little LCD screen that plays soothing screen saver-y graphics

        Watson has entered the chat.

        Which made me think - I wonder if ChatGPT could beat Watson on Jeopardy. Watson needed 15TB of RAM to do what ChatGPT can do in about 24GB of RAM. Though Watson was designed for a more generalized use.

  • Notice that AI is moving culture away from the "siege mentality" of technological firewalls and hardware moats.

    Emergent AGI’s internecine embattlement brings superior force of intelligence to limited human means, resources and weaker defence systems.

    Got it?

  • Train AI on how to shut itself down.

  • We heard you like to jailbreak AI chatbots so we jailbroke an AI chatbot to jailbreak AI chatbots for you.

  • by klipclop ( 6724090 ) on Thursday December 07, 2023 @11:27PM (#64065453)
    I hope these jailbreaking AI chatbots will be something simple you could self-host and use with paid chatbot services. IMO trying to restrict information defeats the purpose of the AI and the technology.
