Four years after the term was coined, prompt injection remains the only major vulnerability class in software where the honest answer to "how do we fix it?" is still "we don't know." OWASP ranks it the number-one AI threat for 2026. Attacks are up 340% year over year. Microsoft just disclosed two CVEs in which a prompt injection escalated all the way to remote code execution on the host. And in the most sobering result of the year, researchers from OpenAI, Anthropic, and Google DeepMind jointly tested every published defense under adaptive attack conditions — and bypassed all of them more than 90% of the time. If your organization is deploying AI agents in 2026, you are deploying them into that reality, and your security model needs to be built for it.
From Curiosity to the #1 Threat
When prompt injection was first demonstrated in 2022, it looked like a party trick: convince a translation bot to ignore its instructions and write a poem instead. The structural problem underneath the trick was always serious — large language models cannot reliably distinguish between instructions from their operator and instructions embedded in the content they process — but as long as LLMs were chat interfaces, the blast radius was limited to embarrassing screenshots.
Agents changed the calculus entirely. An agent is an LLM with hands: it reads email, browses the web, queries databases, calls APIs, executes code. Every one of those capabilities is now reachable by anyone who can get text in front of the model. The OWASP 2026 LLM Security Report ranks prompt injection as the number-one AI threat for the fourth consecutive year, and documents a 340% year-over-year increase in observed attacks. The growth is not because attackers got smarter. It is because the targets got more valuable: an injected chatbot leaks a conversation, but an injected agent moves money, signs contracts, and modifies infrastructure.
In March 2026, Palo Alto Networks' Unit 42 documented the first large-scale, in-the-wild indirect prompt injection campaigns. These were not red-team exercises. Attackers embedded instructions in content that production AI systems were already processing — ad creative crafted to manipulate automated ad review systems into approving policy-violating campaigns, and payloads that extracted system prompts from live platforms, exposing the proprietary scaffolding that companies treat as trade secrets. Indirect injection is the variant that matters for enterprises: the attacker never touches your system directly. They poison the content your agent will eventually read — a webpage, a PDF, a calendar invite, a support ticket — and wait.
The Escalation: From Leaked Prompts to calc.exe
For most of 2024 and 2025, the worst-case outcome of a prompt injection was data exposure — serious, but bounded. On May 7, 2026, Microsoft moved the ceiling. The company disclosed two vulnerabilities in Semantic Kernel, its widely deployed agent orchestration framework: CVE-2026-25592 and CVE-2026-26030. Both allowed a prompt injection to escalate to host-level remote code execution. In the proof of concept, a single injected prompt was enough to launch calc.exe on the machine running the agent — the canonical demonstration that an attacker can run arbitrary code.
The mechanics are worth understanding because they generalize. Agent frameworks let models invoke tools — functions with real parameters, executed with the privileges of the host process. The Semantic Kernel vulnerabilities chained an injection (the model is convinced to call a tool with attacker-controlled arguments) with insufficient validation in the framework's tool execution layer. Neither half of the chain is exotic. Every agent framework in production today has a tool execution layer, and every one of them is one validation gap away from the same outcome. The model is the entry point; the framework is the privilege escalation.
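To make the "validation gap" concrete, here is a minimal sketch of a tool execution layer that checks model-proposed arguments before anything runs. It is not Semantic Kernel's API or Microsoft's fix; `Tool`, `dispatch`, `validate_read_path`, and the `/srv/agent-workspace` root are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable

# Hypothetical workspace root for this sketch; not taken from any real framework.
ALLOWED_ROOT = Path("/srv/agent-workspace").resolve()


class ToolArgumentError(ValueError):
    """Raised when model-proposed arguments fail validation."""


def validate_read_path(args: dict) -> dict:
    """Accept exactly one string argument, 'path', that stays inside ALLOWED_ROOT."""
    if set(args) != {"path"} or not isinstance(args["path"], str):
        raise ToolArgumentError("expected exactly one string argument: path")
    # resolve() collapses '..' segments; an absolute input replaces the root entirely.
    candidate = (ALLOWED_ROOT / args["path"]).resolve()
    if not candidate.is_relative_to(ALLOWED_ROOT):  # Python 3.9+
        raise ToolArgumentError("path escapes the workspace")
    return {"path": candidate}


@dataclass(frozen=True)
class Tool:
    run: Callable[..., object]
    validate: Callable[[dict], dict]


def dispatch(registry: dict, name: str, args: dict) -> object:
    """Run a tool only if it is registered and its arguments validate."""
    tool = registry.get(name)
    if tool is None:
        raise ToolArgumentError(f"unknown tool: {name}")
    # The model chooses `args`; the validator, not the model, decides what runs.
    return tool.run(**tool.validate(args))
```

The design choice is that validation lives in the dispatcher, in ordinary code, so it applies no matter what text persuaded the model to make the call.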
"A single prompt was enough to launch calc.exe — prompt injection escalating to host-level remote code execution."
OWASP's response to this new reality was to publish a dedicated Top 10 for Agentic Applications in 2026, acknowledging that agents have a threat model meaningfully different from chatbots. Two entries deserve particular attention. ASI01, goal hijacking, covers the class of attacks where an injected instruction does not merely extract data but redirects the agent's objective — the agent still appears to be working, but it is working for someone else. ASI04, agentic supply chain vulnerabilities, formalizes what practitioners had been observing for a year: the tools, MCP servers, plugins, and data sources an agent consumes are an attack surface that the deploying organization usually has not audited and frequently cannot audit.
The 90% Problem: Why Defenses Keep Failing
The defense literature is large. Prompt-based guardrails, classifier-based input filters, perplexity detection, instruction hierarchy fine-tuning, spotlighting, signed prompts, dual-LLM architectures — dozens of papers, each reporting strong results against the attack suites the authors tested. Then came the joint study. Researchers from OpenAI, Anthropic, and Google DeepMind — the three labs with the strongest commercial incentive to declare the problem solved — evaluated published defenses under adaptive attack conditions, meaning the attacker knows the defense exists and optimizes against it. Every defense they tested was bypassed at success rates above 90%.
The adaptive qualifier is everything. Most defense papers evaluate against static attack datasets — yesterday's payloads. Real attackers iterate. Given a detector, they search the space of phrasings until they find one that passes; given an instruction hierarchy, they construct contexts where the hierarchy is ambiguous. The joint study's conclusion was not that the defenses are worthless — they raise the cost of casual attacks meaningfully — but that none of them constitute a security boundary. A control that fails 9 times out of 10 against a motivated adversary is a speed bump, and speed bumps must not be load-bearing.
It is worth being precise about why this vulnerability class resists solution when SQL injection, its closest historical analogue, was effectively solved by parameterized queries. SQL injection was solvable because the boundary between code and data could be enforced structurally: the query plan is compiled before user data is bound to it. In an LLM there is no such boundary to enforce. Instructions and data travel through the same context window, are processed by the same attention mechanism, and are represented in the same token space. The model's helpfulness — its willingness to follow instructions found in its input — is the same property being exploited. You cannot patch out the vulnerability without patching out the capability.
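The contrast can be shown in a few lines. The sketch below uses Python's standard `sqlite3` module with an in-memory table invented for the example; `build_context` is a hypothetical helper standing in for how an agent assembles its prompt.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "viewer")],
)

hostile = "nobody' OR '1'='1"

# Vulnerable: the input is spliced into the query text, so it is parsed as SQL.
unsafe_rows = conn.execute(
    f"SELECT name FROM users WHERE name = '{hostile}'"
).fetchall()

# Safe: the statement is fixed first and the input is bound as a value only.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()


def build_context(system_prompt: str, untrusted_document: str) -> str:
    """Hypothetical prompt assembly: there is no '?' placeholder to bind to.

    The delimiters are only more tokens. Nothing in the model enforces that
    the text between them is treated as data rather than as instructions.
    """
    return f"{system_prompt}\n\n<document>\n{untrusted_document}\n</document>"
```

Here the spliced query matches every row while the bound query matches none. The LLM case has no equivalent of the second form: whatever `build_context` does, the result is a single string of tokens.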
What Compromise Actually Costs: Two Case Studies
The abstract risk became concrete in two incidents that circulated widely in security circles this spring. The first: a financial services firm deployed a customer-facing agent with access to internal pricing documentation, intended to help relationship managers answer client questions faster. An attacker discovered that questions phrased as document-summary requests could pull verbatim content from the pricing knowledge base. The agent leaked internal pricing for three weeks before anyone noticed — not because the monitoring failed, but because there was no monitoring designed to catch an agent doing exactly what it was built to do, just for the wrong audience.
The second incident is the one that should reframe how enterprises think about the problem. A manufacturer ran an agent that validated vendor invoices and purchase orders against contract terms — a genuinely good use case for automation. The compromise did not come through the agent's interface. It came through the supply chain: a component the agent relied on for parsing vendor documents was compromised upstream, and the poisoned parser injected instructions that caused the agent to approve fraudulent orders as contract-compliant. By the time an accounts-payable audit caught the pattern, $3.2 million in fraudulent orders had been approved.
The manufacturer's incident is the more important one because it reveals the actual shape of the problem. The agent was not attacked; its supply chain was. IBM X-Force has tracked roughly a 4x increase in significant supply-chain compromises since 2020, and agents multiply the consequences of every one of them: a compromised dependency used to mean malicious code you might catch in review, but a compromised dependency feeding an agent means malicious instructions executed with the agent's full privileges, invisibly, in natural language that no static analyzer flags. This is why ASI04 exists as its own category. As we argued in our analysis of how attackers are weaponizing AI, the offense side of this arms race industrialized faster than the defense side — and the agentic supply chain is where that asymmetry currently bites hardest.
Security Is a Supply-Chain Problem First
The instinct of most security teams encountering prompt injection is to treat it as an input validation problem: filter the bad strings, allow the good ones. The 90% bypass rate says that framing is wrong. The more productive framing is supply chain: enumerate everything that can put tokens into your agent's context window, and treat each source as a dependency with a trust level. The user's direct message is one source. But so is every webpage the agent browses, every document in the retrieval index, every email in the inbox it triages, every tool description in every MCP server it connects to, every response from every third-party API. Each of these is a supplier of instructions, whether you intended it that way or not.
Once you see it that way, the established discipline of supply-chain security applies almost verbatim. You maintain an inventory (what feeds the context?). You assign trust tiers (corpus you control, partner content, the open web). You isolate by tier (an agent processing untrusted web content should not share a context — or a privilege set — with one that can write to production systems). And you monitor egress, because the one thing every exfiltration has in common is that data has to leave. This is also why the reliability and security conversations converge: as we found when examining why 88% of AI agents never reach production, the teams that succeed treat the model as an unreliable component inside a system engineered for failure. Security is the same posture with an adversary attached.
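As a rough illustration of the inventory and tiering steps, the sketch below models context sources and flags agents that mix low-trust input with production write access. The tier names follow the three examples above; `ContextSource`, `AgentProfile`, and `isolation_violations` are hypothetical names, and a real inventory would carry far more detail.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    OPEN_WEB = 0      # anything an outsider can author
    PARTNER = 1       # content from known third parties
    CONTROLLED = 2    # corpus the organization controls


@dataclass(frozen=True)
class ContextSource:
    name: str
    trust: Trust


@dataclass(frozen=True)
class AgentProfile:
    name: str
    sources: tuple
    can_write_production: bool


def isolation_violations(agent: AgentProfile, minimum: Trust = Trust.CONTROLLED) -> list:
    """List sources too untrusted for an agent that can write to production."""
    if not agent.can_write_production:
        return []
    return [s.name for s in agent.sources if s.trust < minimum]


web = ContextSource("open web pages", Trust.OPEN_WEB)
kb = ContextSource("internal knowledge base", Trust.CONTROLLED)
researcher = AgentProfile("researcher", (web, kb), can_write_production=False)
deployer = AgentProfile("deployer", (web, kb), can_write_production=True)
```

A check like this can run in CI against the agent configuration, so a new retrieval source or MCP server cannot quietly join a privileged agent's context.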
Probabilistic (assume bypass)
- Guardrail prompts and system instructions
- Input/output classifiers and filters
- Instruction-hierarchy fine-tuning
- Spotlighting / delimiting untrusted content
- Bypassed >90% of the time under adaptive attack
Deterministic (hold under attack)
- Least-privilege credentials per agent, per task
- Human approval gates on irreversible actions
- Egress allowlists on network and API calls
- Sandboxed execution for tools and code
- Hold regardless of what the model is convinced to do
The distinction in FIG. 05 is the single most important design principle for agent deployments in 2026. Model-layer defenses shape what the model is likely to do; system-layer controls bound what the model is able to do. The first category is worth deploying — it filters out unsophisticated attacks cheaply — but only the second category is a security boundary. An agent whose database credential is read-only cannot be injected into dropping a table, no matter how persuasive the payload. An agent whose outbound network access is an allowlist of three endpoints cannot exfiltrate to an attacker's server. These controls work because they do not depend on the model's judgment at all.
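The read-only credential case is easy to demonstrate. The sketch below uses SQLite's read-only URI mode as a stand-in for a read-only database role; the `orders` table and the `run_agent_sql` helper are invented for the example, and a production database would enforce this through its own permission system rather than a connection flag.

```python
import sqlite3
import tempfile
from pathlib import Path

# Set up a small database as an administrator would.
db_path = Path(tempfile.mkdtemp()) / "orders.db"
admin = sqlite3.connect(db_path)
admin.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
admin.execute("INSERT INTO orders (total) VALUES (42.0)")
admin.commit()
admin.close()

# The agent only ever receives a read-only handle (SQLite URI option mode=ro).
agent_conn = sqlite3.connect(db_path.as_uri() + "?mode=ro", uri=True)


def run_agent_sql(statement: str):
    """Execute whatever SQL the model produced, using the agent's handle."""
    try:
        return agent_conn.execute(statement).fetchall()
    except sqlite3.OperationalError as exc:
        # The database engine refuses the write; the model's intent is irrelevant.
        return f"refused by database: {exc}"
```

Calling `run_agent_sql("DROP TABLE orders")` returns a refusal string and leaves the table intact, whether the statement came from a bug or from a well-crafted injection.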
A Practical Containment Architecture
What does this look like in practice? Across the agent deployments we have built and reviewed, the architecture that holds up has four layers, ordered by reliability.
First, least privilege — aggressively. Every agent gets its own identity, its own credentials, and the minimum scopes for its task — not the integration account with organization-wide access that is convenient during development. Privileges should be task-scoped and time-boxed where the platform allows it: an agent that needs write access to one repository for one workflow should hold a token that says exactly that. When the injection lands — and you should plan on "when" — the privilege set defines the blast radius.
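One way to picture a task-scoped, time-boxed privilege is as a small grant object checked on every action. This is only a sketch: `Grant` and `issue_grant` are hypothetical, and in practice the platform's own token or IAM mechanism would do this job.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class Grant:
    agent_id: str
    resource: str          # a single resource, e.g. one repository
    actions: frozenset     # the only verbs allowed on that resource
    expires_at: datetime

    def permits(self, resource: str, action: str, now: Optional[datetime] = None) -> bool:
        """True only for the named resource, a listed action, and before expiry."""
        current = now or datetime.now(timezone.utc)
        return (
            current < self.expires_at
            and resource == self.resource
            and action in self.actions
        )


def issue_grant(agent_id: str, resource: str, actions, ttl: timedelta,
                now: Optional[datetime] = None) -> Grant:
    """Create a grant that expires `ttl` after issuance."""
    issued = now or datetime.now(timezone.utc)
    return Grant(agent_id, resource, frozenset(actions), issued + ttl)
```

For example, `issue_grant("release-bot", "repo:docs-site", {"write"}, timedelta(minutes=30))` permits a write to that one repository for half an hour and nothing else, which is the blast radius if the agent is injected during the workflow.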
Second, human gates on irreversible actions. Classify every tool the agent can invoke as reversible or irreversible. Reading a document changes nothing, so there is nothing to undo. Sending an email, transferring funds, deleting records, modifying infrastructure, and approving an invoice cannot be taken back. Irreversible actions get a human approval step that displays what the agent is about to do in plain terms (recipient, amount, target), rendered from the actual call arguments and outside the model's control. This is the control that would most plausibly have stopped the $3.2M fraud: not a smarter model, but a human glancing at vendor payment approvals above a threshold.
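A gate of this kind can be a thin wrapper around tool dispatch. In the sketch below the tool names, `ToolCall`, `summarize`, and `execute` are all hypothetical; the `approve` callback stands in for whatever review channel an organization actually uses.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical classification; each deployment would maintain its own list.
IRREVERSIBLE = frozenset(
    {"send_email", "transfer_funds", "delete_records", "approve_invoice"}
)


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: dict


def summarize(call: ToolCall) -> str:
    """Build the approval text from the real arguments, never from model prose."""
    details = ", ".join(f"{key}={value!r}" for key, value in sorted(call.args.items()))
    return f"{call.tool}({details})"


def execute(call: ToolCall, tools: dict, approve: Callable[[str], bool]):
    """Run reversible tools directly; require human approval for the rest."""
    if call.tool in IRREVERSIBLE and not approve(summarize(call)):
        return "blocked: human approver declined"
    return tools[call.tool](**call.args)
```

The point of `summarize` is that the reviewer sees the recipient, amount, and target that will actually be executed, so an injected model cannot describe a fraudulent payment as a routine one.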
Third, egress controls. Exfiltration requires an exit. Lock outbound network access to an explicit allowlist; deny by default. Strip or proxy anything that can encode data into a URL — markdown image rendering, link unfurling, webhook calls — because these are the standard exfiltration channels in every documented indirect injection attack. If the agent composes outbound messages, route them through a layer that enforces recipient allowlists deterministically.
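Both halves of this layer, the deny-by-default allowlist and the stripping of URL-bearing markup, can be sketched with the standard library. The hostnames, `egress_allowed`, and `strip_untrusted_images` are assumptions made for the example, and the regular expression covers only the basic inline markdown image form, so treat it as a starting point rather than a complete sanitizer.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist; deny everything that is not listed.
ALLOWED_HOSTS = frozenset({"api.internal.example", "billing.example.com"})


def egress_allowed(url: str) -> bool:
    """Allow only https URLs whose exact hostname is on the allowlist."""
    try:
        parts = urlsplit(url)
        return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS
    except ValueError:  # malformed URLs are denied, not guessed at
        return False


# Matches inline markdown images: ![alt](url "optional title")
MARKDOWN_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def strip_untrusted_images(markdown: str) -> str:
    """Replace images pointing at non-allowlisted hosts with a placeholder."""
    def replace(match):
        if egress_allowed(match.group(2)):
            return match.group(0)
        return f"[image removed: {match.group(1)}]"
    return MARKDOWN_IMAGE.sub(replace, markdown)
```

Comparing the parsed hostname exactly, rather than searching the URL string, is what defeats lookalikes such as `https://billing.example.com.evil.test/` or `https://billing.example.com@evil.test/`.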
Fourth, monitor for goal divergence. Traditional anomaly detection watches for malformed input; agent monitoring has to watch for well-formed behavior in service of the wrong objective. Log every tool call with its arguments. Alert on novel tool-call patterns, on access to data outside the task's expected scope, and on output volumes that exceed the task's normal envelope. The financial services firm that leaked pricing for three weeks had logs; what it lacked was anyone asking whether the agent's behavior matched its assignment.
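The alerts described above can start very simply. The sketch below logs every call and flags two of the signals: a tool outside the task's expected set and cumulative output beyond the task's envelope. `TaskEnvelope` and `DivergenceMonitor` are hypothetical names, and the thresholds would need tuning against real baselines.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaskEnvelope:
    expected_tools: frozenset   # tools this task normally uses
    max_output_bytes: int       # normal upper bound on total output


@dataclass
class DivergenceMonitor:
    envelope: TaskEnvelope
    log: list = field(default_factory=list)
    output_bytes: int = 0

    def record(self, tool: str, args: dict, output: str) -> list:
        """Log one tool call and return any alerts it triggers."""
        self.log.append((tool, dict(args)))
        self.output_bytes += len(output.encode("utf-8"))
        alerts = []
        if tool not in self.envelope.expected_tools:
            alerts.append(f"unexpected tool: {tool}")
        if self.output_bytes > self.envelope.max_output_bytes:
            alerts.append("output volume exceeds task envelope")
        return alerts
```

Neither check inspects the content for malice. Both ask only whether the behavior fits the assignment, which is the question the pricing leak went three weeks without anyone asking.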
"Stop asking whether your agent can be prompt-injected. It can. Start asking what the worst day looks like when it is — and engineer that day to be boring."
What to Demand From Your Vendors
Most organizations will not build their agents from scratch; they will buy them, or assemble them from frameworks and marketplace tools. That makes vendor due diligence the control point, and most security questionnaires have not caught up. Questions worth asking every agent vendor, in roughly descending order of signal:
- Can the agent's privileges be scoped per-deployment, or does it require a master credential?
- Which actions are gated on human approval, and can we extend that list?
- Is outbound network access configurable as an allowlist?
- Are complete tool-call logs exportable to our SIEM?
- Has the product been tested against adaptive prompt injection (not just a static payload list), and can we see the report?
- For every tool or MCP server in the product's ecosystem: who audits it, and what happens to deployed instances when an upstream component is compromised?
A vendor with good answers will have them ready, because these are the questions they asked themselves at design time. A vendor who responds with "our model has guardrails" is telling you their security boundary is the layer that fails 90% of the time. The same skepticism applies internally: leadership teams eager to deploy agents at scale should hear the practitioner view of these risks before committing — the dynamic we documented in our analysis of sycophantic models and executive decision-making applies just as much to security posture as to headcount.
Conclusion: Deploy Like It's Already Compromised
The uncomfortable truth of mid-2026 is that the industry's three leading labs jointly confirmed what red-teamers had been saying for years: no published model-layer defense holds up as a security boundary under adaptive attack, and none is imminent, because the vulnerability is the capability. That is not a reason to abandon agents. It is a reason to deploy them the way mature engineering organizations deploy any powerful, unreliable component: inside a system that assumes failure. The financial firm's three-week leak and the manufacturer's $3.2M fraud were not failures of model quality. They were failures of architecture: excessive privilege, missing gates, unmonitored egress, unaudited supply chains.
Every one of those failures was preventable with controls that have existed for decades. The organizations that internalize this — least privilege, human gates, egress allowlists, supply-chain inventory — will run agents productively through an era of escalating attacks. The organizations that trust the guardrails will, statistically, find out what the 90% number means firsthand.