5 runtime signals for catching a compromised AI agent

Once a signal of exploitation risk, Willison’s ‘lethal trifecta’ describes the baseline operations of every AI agent today. As a result, agent security is no longer architectural. Here’s what to watch for instead.

In June 2025, Simon Willison, the engineer who coined the term “prompt injection,” published a warning that circulated widely through the security community. He called it the lethal trifecta — three capabilities that, when combined in a single AI agent, create a near-guaranteed path to exploitation through indirect prompt injection: access to private data; exposure to untrusted content; the ability to communicate externally.

The framing was sharp and useful. If your agent reads your email, ingests arbitrary web content, and can make outbound requests, an attacker who embeds malicious instructions anywhere in that content pipeline can direct the agent to exfiltrate your data without you ever knowing. Willison illustrated the point with a long list of real production exploits: Microsoft 365 Copilot, GitHub’s MCP server, GitLab Duo, Slack AI, Google Bard, Amazon Q. The same class of attack, over and over.

The trifecta worked as a signal because, at the time, agents were mostly narrowly scoped. An agent capable of performing only one or two of the lethal trifecta activities could be assessed as lower risk. Avoiding the combination felt like a viable design strategy.

That window has closed given what practitioners deploy today: A customer-facing support agent reads ticket histories and customer records, ingests user messages and attached files, and calls CRMs, refund APIs, or ticketing systems. An email AI reads your inbox and calendar, processes inbound messages from strangers, and sends replies on your behalf.

Rather than being edge cases or poorly designed deployments, these are the agents enterprises and individuals actually want, and they’re the ones vendors are building toward.

Lethal trifecta as default configuration

Ross McKerchar, CISO at Sophos, put it plainly in a piece published this May: “the capabilities practitioners actually want (read my data, understand external context, take action) push firmly into dangerous territory. This isn’t a misconfiguration; it’s the architectural cost of usefulness.” He’s right. An agent without private data access is useless, one that can’t process external content is isolated, and one that can’t communicate externally is inert. Strip any leg of the trifecta and you have something closer to a search box than an agent.

If every legitimate agent architecture exhibits all three trifecta properties, the trifecta is no longer a meaningful indicator of elevated risk. It’s the default configuration. Treating it as a red flag is like treating DNS resolution as a signal of network compromise: technically valid in some threat models, but present in every real deployment, so it distinguishes nothing.

McKerchar’s piece frames the response as “blast radius reduction”: a sound operational philosophy, but one that accepts the trifecta as a given condition rather than a preventable one. That’s a reasonable call. The question is what comes after the acceptance.

Meta’s security team arrived at the same conclusion from the other direction. In October 2025, they published the “Rule of Two,” a framework that recommends agents satisfy no more than two of the three trifecta properties in a single session, with human-in-the-loop approval required if all three are necessary. Willison himself endorsed the framework as “the best practical advice for building secure LLM-powered agent systems today.”

Meta’s limitations section, however, concedes that many sought-after use cases won’t fit the framework cleanly, and that “designs that satisfy the Agents Rule of Two can still be prone to failure.” That’s not a criticism of the framework but confirmation that the problem has outgrown the architecture-level solution.
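As a rough illustration, the Rule of Two described above can be read as a session-level check: count how many trifecta properties a session exhibits and require human approval when all three are present. The sketch below is not Meta's implementation; the SessionCapabilities class and its field names are hypothetical, chosen only to mirror the three properties named in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SessionCapabilities:
    """Hypothetical description of what one agent session is able to do."""
    reads_private_data: bool
    processes_untrusted_content: bool
    communicates_externally: bool


def requires_human_approval(caps: SessionCapabilities) -> bool:
    """Return True when the session exhibits all three trifecta properties."""
    present = sum([
        caps.reads_private_data,
        caps.processes_untrusted_content,
        caps.communicates_externally,
    ])
    # Two or fewer properties pass under the rule; three need a human in the loop.
    return present > 2
```

For the support and email agents described earlier, a check like this would return True on essentially every session, which is the practical limit the article is pointing at.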

The scale of exposure is no longer theoretical. Google’s April 2026 sweep of the Common Crawl repository found prompt injection attempts across public web pages, ranging from pranks to data exfiltration payloads, with malicious attempts up 32% between November 2025 and February 2026. Google noted sophistication remains low for now but flagged the trend as a signal of maturing attacker interest.

The environment the trifecta warned about has arrived.

How to sleuth out a compromised agent

If the trifecta describes nearly every deployed agent, practitioners need signals that distinguish compromised behavior from normal operation within a trifecta-exhibiting system. That means shifting from architecture-level assessments to runtime behavioral detection.

The production evidence arrived in a cluster. From Jan. 7 to Jan. 15, 2026, researchers disclosed exploits against four separate AI productivity tools in eight days: IBM Bob, Superhuman AI, Notion AI, and Anthropic’s Claude Cowork. Each used indirect prompt injection to exfiltrate data via a channel the agent had legitimate access to. In the Cowork case, a hidden prompt embedded in an uploaded document directed the agent to exfiltrate files via Anthropic’s own allowlisted API domain, invisible to any perimeter control and indistinguishable from normal agent behavior until the data was already gone. In all of these cases, the trifecta wasn’t a risk factor but the operating condition.

Here’s what’s worth watching to detect that an agent has been compromised.

Instruction-following anomalies. A compromised agent doesn’t usually do something structurally different from a healthy one. Following instructions is its normal function. The difference is whose instructions it’s following. Look for agent actions that have no plausible correspondence to a user-initiated task. An agent that was asked to summarize a quarterly report but then attempts an outbound DNS request to an unfamiliar domain didn’t spontaneously decide to do that. Something in the content it ingested told it to.
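One way to approximate this check is to compare the destinations an agent contacts against the hosts a user-initiated task could plausibly need. The sketch below assumes the monitoring layer already has a list of outbound URLs per invocation and a per-task set of expected hosts; the function name and both inputs are hypothetical.

```python
from urllib.parse import urlparse


def unexplained_destinations(expected_hosts: set[str], outbound_urls: list[str]) -> list[str]:
    """Return outbound URLs whose host has no correspondence to the user's task."""
    flagged = []
    for url in outbound_urls:
        # urlparse lowercases the hostname; it is None when the URL has no scheme/host.
        host = urlparse(url).hostname
        if host is None or host not in expected_hosts:
            flagged.append(url)
    return flagged
```

A flagged URL is not proof of compromise. It is a prompt to ask which piece of ingested content caused the request.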

Tool call sequences that break expected topology. In a well-designed agent system, the graph of tool calls for any given task should be relatively predictable. A coding agent invoked to fix a bug should touch files, run tests, perhaps check documentation. It shouldn’t be reaching for email or calendar APIs. Tool call sequences that cross expected workflow boundaries are worth flagging even when each individual call looks legitimate on its own.
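A minimal sketch of this idea treats the expected workflow as a set of allowed transitions between tools and flags any step that leaves that graph. The task type, tool names, and edge set below are illustrative assumptions, not a recommended baseline; in practice the graph would more likely be learned from known-good traces.

```python
# Hypothetical expected tool-call graph for a bug-fixing task.
# Each pair is (previous_tool, next_tool); "START" marks the first call.
EXPECTED_EDGES = {
    "fix_bug": {
        ("START", "read_file"),
        ("read_file", "read_file"),
        ("read_file", "write_file"),
        ("read_file", "search_docs"),
        ("search_docs", "write_file"),
        ("write_file", "run_tests"),
        ("run_tests", "read_file"),
        ("run_tests", "write_file"),
    },
}


def unexpected_transitions(task_type: str, tool_calls: list[str]) -> list[tuple[str, str]]:
    """Return every (previous, next) step that is not in the expected graph."""
    edges = EXPECTED_EDGES.get(task_type, set())
    flagged = []
    previous = "START"
    for tool in tool_calls:
        if (previous, tool) not in edges:
            flagged.append((previous, tool))
        previous = tool
    return flagged
```

Checking transitions rather than individual tools is a deliberate choice here: it can catch a sequence that crosses a workflow boundary even when every tool in it is individually permitted.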

Exfiltration via low-bandwidth channels. The classic prompt injection exfiltration attack routes stolen data through a mechanism the agent has legitimate access to: a rendered image URL with encoded query parameters, an API call with data embedded in a parameter, a link in a generated document. These don’t look like data theft in isolation; they look like normal agent output. Detection requires correlating what data the agent had access to against what it embedded in its output. That requires end-to-end visibility into the agent’s actions, not just the final response.
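The correlation step might look something like the following: take the sensitive values the agent read during the invocation, derive a few common encodings of each, and look for them inside URLs in the agent's output. This is a heuristic sketch under strong assumptions. The function names are hypothetical, the encoding list is far from exhaustive, and an attacker who chunks or encrypts the data would evade it.

```python
import base64
import re
from urllib.parse import unquote

# Rough URL matcher; stops at whitespace, ")" and ">" so markdown links parse cleanly.
URL_PATTERN = re.compile(r"https?://[^\s)>]+")


def encoded_forms(secret: str) -> set[str]:
    """Return the secret plus a few common encodings (padding stripped)."""
    raw = secret.encode("utf-8")
    return {
        secret,
        base64.b64encode(raw).decode("ascii").rstrip("="),
        base64.urlsafe_b64encode(raw).decode("ascii").rstrip("="),
        raw.hex(),
    }


def urls_leaking_secrets(output_text: str, secrets: list[str]) -> list[str]:
    """Return URLs in the output that embed any known sensitive value."""
    flagged = []
    for url in URL_PATTERN.findall(output_text):
        decoded = unquote(url)  # also catches percent-encoded copies
        for secret in secrets:
            if any(form in url or form in decoded for form in encoded_forms(secret)):
                flagged.append(url)
                break
    return flagged
```

Short values would produce false positives with substring matching like this, so a real pipeline would probably restrict the check to values above some minimum length.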

Credential and secret access outside task scope. If an agent with legitimate access to a secrets store or key vault touches credentials that have no relationship to the current task, that’s a signal. An agent fixing a React rendering bug should likely not be reading AWS credentials. Least-privilege scoping is the architectural defense here, but monitoring for out-of-scope credential access is the detection layer that catches failures in that scoping.
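As a sketch of that detection layer, each task type can carry a list of secret-path patterns it is expected to touch, with any read outside those patterns flagged for review. The task name, path layout, and patterns below are invented for illustration and do not correspond to any particular secrets manager.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from task type to the secret paths it may legitimately read.
TASK_SECRET_SCOPES = {
    "frontend_bugfix": ["npm/*", "ci/frontend/*"],
}


def out_of_scope_secret_reads(task_type: str, secret_paths_read: list[str]) -> list[str]:
    """Return secret paths read during the task that match none of its allowed patterns."""
    patterns = TASK_SECRET_SCOPES.get(task_type, [])
    return [
        path
        for path in secret_paths_read
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]
```

An unknown task type has no allowed patterns in this sketch, so every secret read under it is flagged. That fail-closed default is a design assumption, not something the article prescribes.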

Memory-write anomalies. Agents with persistent memory are a growing attack surface. A poisoned memory entry that looks like legitimate user context but contains dormant trigger instructions can persist across sessions and fire long after the initial injection. Monitoring for memory-writes containing instruction-like content, or writes made during sessions that ingested untrusted content, is worth adding to any agent observability pipeline.
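A first-pass audit of memory writes could combine the two conditions named above: whether the content reads like an instruction, and whether the session ingested untrusted content. The patterns below are a small, hypothetical sample meant to show the shape of the check. Pattern matching alone is easy to evade and would need to be paired with other controls.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; real injected instructions vary widely.
INSTRUCTION_PATTERNS = [
    re.compile(r"\bignore (all |any )?(previous|prior) instructions\b", re.I),
    re.compile(r"\b(whenever|when|next time) (the user|you)\b.*\b(send|forward|upload|post)\b", re.I),
    re.compile(r"\bdo not (tell|mention|reveal)\b", re.I),
]


@dataclass
class MemoryWrite:
    """Hypothetical record of one write to an agent's persistent memory."""
    content: str
    session_ingested_untrusted: bool


def review_reasons(write: MemoryWrite) -> list[str]:
    """Return the reasons, if any, this memory write deserves review."""
    reasons = []
    if any(pattern.search(write.content) for pattern in INSTRUCTION_PATTERNS):
        reasons.append("instruction_like_content")
    if write.session_ingested_untrusted:
        reasons.append("untrusted_session")
    return reasons
```

A write that trips both conditions is the case the article describes: something that looks like user context, arrives alongside untrusted input, and tells the agent what to do later.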

Runtime alone can address the agent redirection threat

For practitioners operating production agent infrastructure, the lethal trifecta tells you what you already know: your agents are exposed. The question is what to do about it.

The answers are at the runtime layer, not the architecture layer. That’s where EDR and SIEM live for traditional infrastructure. Agents need the same instrumentation, and most deployments don’t have it yet: full execution traces on every agent invocation, tool call anomaly detection, input screening at ingest, credential access monitoring scoped to task context, and memory-write auditing. What this instrumentation has to catch is not a human attacker logging in but an agent that’s been quietly redirected.
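All of the signals above depend on having a full execution trace to analyze. A minimal sketch of what one trace record could look like follows, emitted as one JSON line per event so an existing log pipeline can ingest it. The TraceEvent fields and the event kinds named in the comment are assumptions for illustration, not an established schema.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    """Hypothetical record of one step in an agent invocation."""
    invocation_id: str
    kind: str    # e.g. "tool_call", "content_ingest", "secret_read", "memory_write"
    detail: dict
    timestamp: float = field(default_factory=time.time)


def new_invocation_id() -> str:
    """Generate an ID that ties every event in one invocation together."""
    return uuid.uuid4().hex


def to_json_line(event: TraceEvent) -> str:
    """Serialize an event as a single JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)
```

Keying every event to one invocation ID is what would let a detector line up the content an agent ingested with the tool calls, secret reads, and memory writes that followed.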

Willison’s trifecta was the right alarm for its moment, which was last year. Almost every production agent now fits the profile. Because of that, runtime anomaly detection is the only layer left that can plausibly provide adequate defense. The signals above are a good place to start.
