Prompt injections could confuse AI-powered agents

The Echoes of SQL Vulnerabilities

Prompt injection techniques are specifically crafted inputs that attackers feed to LLMs as part of a prompt to manipulate responses. In a sense, they’re similar to the SQL injections that attackers have been using to attack databases for years. Injections are basically just commands attackers input to a vulnerable, external system.

Whereas SQL injections affect databases, prompt injections impact LLMs. Sometimes, a successful prompt injection might not have much of an impact. As Donato points out in his research (available here), in situations where the LLM is isolated from other users or systems, an injection probably won’t be able to do much damage.

However, companies aren’t building LLM applications to work in isolation, so they should understand what risks they’re exposing themselves if they neglect to secure these AI deployments.

ReAct Agents

One potential innovation where LLMs could play a key role is in the creation of AI agents—or ReAct (reasoning plus action) agents if you want to be specific. These agents are essentially programs that use LLMs (like GPT-4) to accept input, and then use logical reasoning to decide and execute a specific course of action according to its programming.

The way these agents use reasoning to make decisions involves a thought/observation loop. Specifics are available in Donato’s research on WithSecure Labs (we highly recommend reading it for a more detailed explanation). Basically, the agent provides thoughts it has about a particular prompt it’s been given. That output is then checked to see if it contains an action that requires the agent to access a particular tool it’s programmed to use.

If the thought requires the agent to take an action, the result of the action becomes an observation. The observation is then incorporated into the output, which is then fed back into the thought/observation loop and repeated until the agent has addressed the initial prompt from the user.

To illustrate this process, and learn how to compromise it, Donato created a chatbot for a fictional book-selling website that can help customers request information on recent orders or ask for refunds.

Prompt injections reduce AI to confused deputies

The chatbot, powered by GPT-4, could access order data for users and determine refund eligibility for orders that were not delivered within the website’s two-week delivery timeframe (as per its policy).

Donato found that he could use several different prompt injection techniques to trick the agent into processing refunds for orders that should have been ineligible. Specifics are available in his blog, but he essentially tricked the agent into thinking that it had already checked for information from its system that he actually provided to it via prompts—information like fake order dates. Since the agent thought it recalled the fake dates from the appropriate system (rather than via Donato’s prompts), it didn’t realize the information was fake, and that it was being tricked.

Here’s a video showing one of the techniques Donato used:

Securing AI agents

Pointing to the work from the OWASP Top Ten for LLMs, Donato’s research identifies several ways an attacker could compromise an LLM ReAct agent. And while it’s a proof-of-concept, it does illustrate the kind of work that organizations need to do to secure these types of AI applications, and what the cyber security industry is doing to help.

There’s two distinct yet related mitigation strategies.

The first is to limit the potential damage a successful injection attack can cause. Specific recommendations based on Donato’s research include:

Enforcing stringent privilege controls to ensure LLMs can access only the essentials, minimizing potential breach points.
Incorporating human oversight for critical operations to add a layer of validation, acting as a safeguard against unintended LLM actions.
Adopting solutions such as OpenAI Chat Markup Language (ChatML) that attempt to segregate genuine user prompts from other content. These are not perfect but diminish the influence of external or manipulated inputs.
Treating the LLMs as untrusted, always maintaining external control in decision-making and being vigilant of potentially untrustworthy LLM responses.

The second is to secure any tools or systems that the agent may have access to, as compromises in those will inevitably lead to the agent making bad decisions—possibly in service of an attacker.

Read more research articles on securing AI agents on our Labs: