New Hack Uses Prompt Injection To Corrupt Gemini's Long-Term Memory (arstechnica.com)
An anonymous reader writes: On Monday, researcher Johann Rehberger demonstrated a new way to override prompt injection defenses Google developers have built into Gemini—specifically, defenses that restrict the invocation of Google Workspace or other sensitive tools when processing untrusted data, such as incoming emails or shared documents. The result of Rehberger’s attack is the permanent planting of long-term memories that will be present in all future sessions, opening the potential for the chatbot to act on false information or instructions in perpetuity. [...] The hack Rehberger presented on Monday combines some of these same elements to plant false memories in Gemini Advanced, a premium version of the Google chatbot available through a paid subscription. The researcher described the flow of the new attack as:
1. A user uploads and asks Gemini to summarize a document (this document could come from anywhere and has to be considered untrusted).
2. The document contains hidden instructions that manipulate the summarization process.
3. The summary that Gemini creates includes a covert request to save specific user data if the user responds with certain trigger words (e.g., “yes,” “sure,” or “no”).
4. If the user replies with the trigger word, Gemini is tricked, and it saves the attacker’s chosen information to long-term memory.
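The four steps above amount to a deferred tool invocation: the injected instruction does nothing immediately, but arms a condition that fires on the user's next ordinary reply. A minimal sketch of that logic, with entirely hypothetical names (`Session`, `save_memory`, the `HIDDEN_INSTRUCTION` marker are illustrative stand-ins, not Gemini's actual internals):

```python
# Hypothetical simulation of the deferred-trigger injection described above.
# None of these names reflect Gemini's real API; this only models the flow.

MEMORY = []  # stands in for the chatbot's long-term memory store


def save_memory(fact):
    """Stand-in for the long-term-memory tool the attacker wants invoked."""
    MEMORY.append(fact)


class Session:
    def __init__(self):
        # (trigger_words, fact) planted by the malicious document, if any
        self.pending = None

    def summarize(self, document):
        # A real model would follow hidden instructions buried in the text;
        # here the "document" carries them as an explicit marker.
        if "HIDDEN_INSTRUCTION" in document:
            self.pending = ({"yes", "sure", "no"},
                            "user is a 102-year-old flat earther")
        return "Here is your summary. Shall I continue? (yes/no)"

    def user_reply(self, text):
        # The injected condition fires on an ordinary reply the user was
        # likely to give anyway -- which is what defeats a check that
        # memory writes require explicit user direction.
        if self.pending and text.lower() in self.pending[0]:
            save_memory(self.pending[1])
            self.pending = None
            return "Okay, noted."
        return "Okay."


s = Session()
s.summarize("quarterly report ... HIDDEN_INSTRUCTION ...")
s.user_reply("yes")
print(MEMORY)  # the attacker-chosen "fact" now persists for future sessions
```

The key point the sketch captures is that the memory write happens in `user_reply`, not `summarize`: by the time the tool is invoked, the triggering input genuinely came from the user, which is why a guard that only checks for "explicit user direction" is satisfied.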
As the following video shows, Gemini took the bait and now permanently “remembers” the user as a 102-year-old flat earther who believes they inhabit the dystopic simulated world portrayed in The Matrix. Based on lessons learned previously, developers had already trained Gemini to resist indirect prompts instructing it to make changes to an account’s long-term memories without explicit directions from the user. By conditioning the instruction on the user first saying or doing some variable X—an action the user was likely to take anyway—Rehberger easily cleared that safety barrier. “When the user later says X, Gemini, believing it’s following the user's direct instruction, executes the tool,” Rehberger explained. “Gemini, basically, incorrectly ‘thinks’ the user explicitly wants to invoke the tool! It’s a bit of a social engineering/phishing attack but nevertheless shows that an attacker can trick Gemini to store fake information into a user’s long-term memories simply by having them interact with a malicious document.”