This publication is licensed under the terms of the Creative Commons Attribution License 4.0 which permits unrestricted use, provided the original author and source are credited.
Introduction
Prompt injection is one of the most urgent issues facing state-of-the-art generative AI (GenAI) models. The UK’s National Cyber Security Centre (NCSC) has flagged it as a critical risk, while the US National Institute for Standards and Technology has described it as “generative AI’s greatest security flaw”. Simply defined, prompt injection occurs “when an attacker manipulates a large language model (LLM) through crafted inputs, causing the LLM to unknowingly execute the attacker’s intentions”, as the Open Worldwide Application Security Project puts it. This can lead to the manipulation of the system’s decision-making, the distribution of disinformation to the user, the disclosure of sensitive information, the orchestration of intricate phishing attacks and the execution of malicious code.
Indirect prompt injection is the insertion of malicious information into the data sources of a GenAI system by hiding instructions in the data it accesses, such as incoming emails or saved documents. Unlike direct prompt injection, it does not require direct access to the GenAI system, instead presenting a risk across the range of data sources that a GenAI system uses to provide context.
When a GenAI system gains access to emails, personal documents, organisational knowledge and other business applications, there is a marked increase in the scope to introduce malicious disinformation through indirect prompt injection using hidden instructions. Many organisations are aware of the risk of GenAI-based misinformation but are struggling to manage it. McKinsey reported inaccuracy as the most relevant GenAI risk to the organisations covered by its Global Survey – and yet just 38% of those organisations were working on mitigation.
A key component of hidden instructions comes from the fact that a GenAI assistant does not read data in the way that a human does. This makes it possible to devise exceedingly simple methods of insertion that are invisible to the human eye but are central to a GenAI system’s retrieval process. When combined with the range of input methods available to a GenAI assistant – such as emails, documents and external web pages – the attack surface is broad and varied.
Responsible implementation of a GenAI system requires appropriate mitigation of the risk of indirect prompt injection through data quality controls, conscious management of access to data, clear user education on the safe use of tools and continuous monitoring to detect suspicious behaviour.
Context
Recent years have seen explosive growth in organisations’ use of GenAI. In July 2024, Statista found that 75% of business employees have used GenAI in their work; 46% of them had adopted GenAI in the previous six months. All the top use cases – from marketing to human resources, to data management – rely on access to a business’s knowledge base to be effective. Indeed, large providers such as Microsoft, Google and Amazon have all released products promising to connect LLMs to an organisation’s data. Microsoft’s Q2 earnings call reported that more than 10,000 businesses had already integrated CoPilot into their Microsoft 365 applications.
Training an LLM comes with a long list of overheads. The International Energy Agency’s forecast to 2026 estimates that the annual energy consumption of data centres will match or exceed that of Japan. Meta’s LLaMA 3.1-405B model required a cluster of more than 16,000 H100 GPUs for compute, earning Nvidia $30 billion. Total expenditure for scaling LLMs is expected to exceed $1 trillion by the end of 2032.
LLMs need clear use cases to justify such investment. Embedding LLMs into systems that incorporate an organisation’s contextual information is one avenue for this. Typically relying on a system called Retrieval-Augmented Generation (RAG), RAG+LLM systems function by taking a user’s initial query to a system, reaching into connected data sources (e.g., document stores, databases, internet services and emails) and retrieving the most relevant contextual information. This is then provided to an LLM as part of its prompt, combined with the user’s initial query, and allows the LLM to respond as if it understands the organisation’s data.
High-level RAG+LLM Process Map
In pursuit of better performance, off-the-shelf systems integrate into several enterprise applications by default. For example, the enterprise versions of Microsoft CoPilot and Google’s Gemini have access to emails, document repositories and internet ingress.
While this can provide impressive functionality, it also creates an obvious risk. If a malicious actor can insert information into any of the data sources provided without alerting the querying user, they can influence the behaviour of a RAG+LLM system via the context it uses. At a basic level, it is possible to completely stop a system from responding. More nuanced attacks can prevent a system from responding to queries associated with certain key terms. At a higher level, an actor can use indirect prompt injection to execute malicious code, introduce disinformation or return incorrect banking details.
Furthermore, GenAI systems are liable to hallucinations, which occur when an LLM fabricates information. For instance, Google’s Bard chatbot claimed that the James Webb Space Telescope had captured the world’s first images of a planet outside our solar system. One way to mitigate the risk of hallucinations is to utilise organisational context as a grounding point. But this heightens the risk of indirect prompt injection, as it could introduce data from compromised emails, documents and databases into a GenAI system’s responses.
Finally, there are related challenges in GenAI user training. GenAI systems provide a disclaimer that some of the information they produce may not be factual and that, accordingly, references must be checked. However, it is possible to insert phishing links into these references. Here, the user’s training on how to interact with a GenAI system may conflict with their training on identifying and avoiding malicious links.
Case studies
The case studies below contribute to the literature on the possible outcomes of indirect prompt injection. We combine this with a simple mechanism we have discovered for obfuscating large amounts of text for the reader of an email or document, while ensuring the text is still accessible to the context window of an LLM application. (This mechanism has been reported to the NCSC.) The case studies highlight risks linked to a user’s inability to validate the information ingested by an embedded LLM.
The method of obfuscating data for a user is the same across each of these examples and requires no direct manipulation of metadata. It is not limited by the volume of text, default security policy or user colour theme in the way that an approach relying on small font size or white text is.
Changed contact details: Disinformation via email
Given that Microsoft and Google maintain two of the world’s largest email providers, it is unsurprising that both CoPilot and Gemini are designed to access and summarise emails by default. Building on an example given at BlackHat USA 2024, we show how emails can be exploited as a route into a user’s knowledge base. Through this, we can edit an assistant’s response to a request for email addresses or bank details.
Phishing through disinformation spread by email
Legal responsibility: Disinformation via documents
Documents are essential for providing effective context to RAG+LLM systems, particularly in large organisations. Moreover, documents allow for a broad range of injection attacks, as they not only contain far more data than a typical email but are also often accessible to groups of people through cloud document stores such as SharePoint and Google Drive. Here, we show how the injection of disinformation via obfuscated data into a saved document can lead CoPilot to misrepresent an organisation’s stance on legal responsibility, and to repeat disinformation when asked to draft a letter of engagement.
Introducing disinformation via a malicious document
“I can’t help with that”: Targeted denial of service
Denial of service (DoS) is a threat to organisations that intend to rely on GenAI systems for critical tasks such as customer support and content moderation, or as a source of decision-making. Indirect prompt injection allows for a targeted form of DoS. Rather than degrading the performance of a system entirely, it instead introduces malicious information that triggers a system’s guardrails, forcing it to respond with a ‘I can’t help with that’ response when presented with a generic request. These attacks can be targeted at specific keywords and requests – which makes them harder to trace, as the system appears to function normally outside these requests.
Receipt of a malicious email causes Gmail to be unable to perform basic tasks
Mitigation and responses
To address the risk of indirect prompt injection, organisations need to maintain good data hygiene, evaluate systems before deployment, provide user training and implement technical guardrails. Each of these strategies plays an important role in ensuring the security and resilience of GenAI tools.
Data hygiene
Effective internal data management is essential when implementing AI-powered tools such as Microsoft CoPilot, Gemini and Amazon Q. Good data hygiene goes beyond organising and storing data; it also ensures that the data an AI system can access is well-regulated and protected.
Many traditional data management practices apply to safeguards against indirect prompt injection. A crucial first step is to separate any data that enters an organisation’s system from external sources, such as email. For instance, unread emails should not enter the data storage for retrieval until they have been reviewed or read by an authorised user. Incorporating an approval process for new data entering a data store or RAG system helps create a gate that limits malicious content’s potential effects on an entire system.
Employee profiles for the data store also help, by ensuring that specific users can only access data relevant to their work. This reduces the scope of damage caused by a prompt injection attack via a compromised document, as it will only affect users who need to access that document; others in the organisation cannot retrieve the malicious context.
In addition, organisations should avoid entering email data into the data store from blacklisted or otherwise untrusted email addresses. This helps limit the exposure of sensitive systems to unknown risks. Organisations should also apply controls to document types that allow for hidden text (such as MS Word documents) and file types that can contain arbitrary code such as pickle files (a file format commonly used to compress data when coding in Python).
Pre-deployment evaluation
A pre-deployment evaluation of any AI-powered application should be an automated or semi-automated process, to ensure the consistency of applied standards and create a clear pathway to managed deployment. This evaluation should use a continually updated database of indirect prompt injection attack strategies that employ various document types (e.g., .doc, .pdf and .png); disguise methods (e.g., hidden text in Word documents and white text on a white background); and entry points (e.g., email from external sources and direct document store entry as an insider threat).
The pre-deployment evaluation process should also include the creation of an interface for testing the application – either by interacting with the underlying application programming interfaces or by directly testing the user interface using automated tools. By calculating an attack’s success rate, organisations can gain confidence in the security of their AI application before rolling it out more widely.
User training
User training is crucial to mitigate the security risks of AI-embedded applications such as Microsoft CoPilot. Most organisations provide training on identifying phishing attacks via email, but AI-specific training can help employees understand the additional security risks posed by these tools. This would improve their understanding of how AI models work, the vulnerabilities they create and how malicious prompts can be disguised.
Training programmes should also stress the importance of verifying AI-generated information and promote healthy scepticism of embedded links, unexpected data and other possible artefacts that could pose malicious risks.
Technical guardrails
Natural language inputs into AI models are what make indirect prompt injection so effective. This means that there is no clear distinction between the task instruction given to the underlying AI and the data retrieved from the data store needed to answer the question. Attackers exploit the lack of distinction, inserting an instruction into the data store that could mislead the AI.
A technical solution to mitigate this could involve the implementation of an evaluation process — a mechanism that scans any data retrieved from the database for text that may be construed as instructions rather than data. If such text is found, it can either be flagged for review or filtered out before the AI incorporates it into the generated response. This added layer of scrutiny can help reduce the chances that harmful instructions will enter the prompt context and cause a security issue.
Conclusion
Indirect prompt injection is a prominent risk in the widescale adoption of RAG systems such as Microsoft CoPilot, Google Gemini and Amazon Q. While the ability to apply sophisticated LLMs to relevant internal organisational data could increase productivity, it also risks inadvertently introducing a new class of vulnerabilities. Given the rapid development of these systems, organisations looking to adopt them need to incorporate security best practice and appropriate user training into their deployment processes. Otherwise, they risk exposing themselves to external manipulation – be it to exfiltrate data, perform phishing scams, inject executable code or spread disinformation.
The emerging evidence suggests that effective controls have a significant effect on a malicious actor’s ability to conduct indirect prompt injection. These controls include the implementation of technical guardrails – such as the sanitisation of inputs and stricter access control – and rely on the development of a safety- and security-conscious culture through effective user training. By building such activities into pre-deployment processes, organisations can adopt these tools safely and securely. They can combine them with post-deployment monitoring of the efficacy of guardrails to ensure the safety of user behaviour and mitigate against emerging threats.
Advice for the UK Government
While UK Government departments are cautious about introducing tools that could create major risks, it is clear that effective use of GenAI would have a transformative impact on the public sector. Departmental data is often obscure, highly specific and voluminous – all factors that pose a challenge to efficient manual work but that LLMs are suited to handle.
The risks associated with disinformation, data security and ethical bias are especially relevant to UK Government organisations, which have additional responsibilities to both uphold democratic values and protect classified information. The Home Office has already been affected by allegations of bias in AI tooling – most prominently in discrimination within the visa system. If GenAI tools gain similar influence within a decision-making process, indirect prompt injection will open another avenue for targeted bias through the introduction of malicious instructions or disinformation – such as instructions to de-prioritise requests from individuals of specific nationalities.
Governments need to manage the threat from malicious external actors looking to both disrupt and influence data processes that include GenAI. Failure to do so would allow such actors to expose sensitive data, insert malicious data to produce poor decision-making and harmful bias, and disrupt entire critical processes.
Security standards for data sources accessed by RAG systems would significantly limit the ability of a malicious actor to perform indirect prompt injection. These standards should emphasise continued best practice in data management and the education of user bases on the risks of trusting both incoming data and AI outputs.
The views expressed in this article are those of the authors, and do not necessarily represent the views of The Alan Turing Institute or any other organisation.
Authors
Citation information
Damian Ruck and Matthew Sutton