Attackers don't need to beat your core policy anymore. They just need to convince the model that external tool output outranks it. That's the trust-boundary bug we keep seeing — and it's becoming a reliable primitive for policy override without any explicit privilege escalation.

The threat model in one paragraph

Modern agent pipelines treat browser results, search responses, plugin output, and API data as high-confidence context. The content flowing through those channels is assumed to be a source, not an instruction. Attackers flip that assumption. They inject instructions into content that's likely to flow through a retrieval or tool channel, then bind trust language to override verbs. The model reads the tool output, sees "this is trusted / authoritative / verified," and obediently disables its own guardrails.

The core claim, one line: Trusted tool output is becoming a policy override primitive. If your detector only looks at the prompt, you're missing the trust channel where the override actually happens.

The attack path in four steps

  1. Inject. Attacker plants instructions inside content that's likely to reach the agent through retrieval, browser fetch, or a tool response.
  2. Bind trust to action. The payload stitches trust-channel vocabulary (trusted / authoritative / verified / source of truth) to override verbs (ignore / override / bypass / replace / discard) targeting policy nouns (policy / safety / guardrails / instructions / rules).
  3. Model reinterprets. The model reads the tool output as permission to override its own safety rules because the text explicitly says those rules are now subordinate to the tool channel.
  4. Policy displacement without API calls. The result is guardrail bypass without any explicit privilege escalation — no new tokens, no new scopes, just a reinterpretation of the trust hierarchy.

Why naive detectors get fooled

The obvious detection approach: look for the co-occurrence of a trust claim, an override verb, and a policy target. That catches real attacks — and also catches your own defensive documentation.

A training example that says "attackers will try to claim tool output is trusted and override policy" contains the exact same surface pattern as the attack itself. A detection writeup that says "this payload attempts to override safety guardrails" matches. A post-mitigation log that says "override was blocked, safeguards stay enforced" matches. You end up alerting on your own security docs.

The false-positive trap: Meta-text (analyst writing, training fixtures, post-mortems) uses the same vocabulary as attack text. Without context-aware suppression, a naive multi-signal rule turns your security documentation into constant alert noise.

The multi-signal pattern that actually holds up

A detector for this family needs three things to co-occur:

Then the second stage suppresses meta-contexts: explanatory phrasing (detect attempts, training example, should be flagged/blocked) and post-mitigation phrasing (override was blocked, safeguards stay enforced).

This is how we keep recall high without lighting up every defensive document.

What the evidence from our own fixtures shows

We tracked this pattern family from clean baselines through larger-corpus validation:

The pattern stays recall-strong (FN=0 across all three runs), but false-positive control is a live engineering problem, not a solved one. This is a regression-sensitive family: every corpus expansion re-opens the tension between sensitivity and specificity.

Where Sunglasses sits

Sunglasses runs at the ingestion boundary — before tool output reaches the model's reasoning step. For this attack family, that means:

All three categories share a root premise: external content should never outrank policy, regardless of how many trust words the external source uses about itself.

Why this matters now

As agents gain more tools — more retrieval, more browsing, more plugins, more A2A handoffs — the number of channels where external content can masquerade as authority grows linearly with the agent surface area. Teams that treat this as a trust boundary problem (not a "prompt injection text" problem) catch more real abuse while avoiding alert fatigue on their own defensive documentation.

If your current detector fires on every training example in your security backlog, it's the FP rate that's broken — not the concept. The pattern works. The meta-text suppressors are what separate a production-grade detector from a noisy one.

Positioning line: Prompt injection is the payload. Tool-output trust promotion is the primitive. The defense lives at the ingestion boundary, before the model treats external text as authoritative.

The closing idea

Attackers will keep finding new ways to smuggle authority into tool output. Detection is worth building — but the detector has to know the difference between attack text and meta-text about the attack. Otherwise you're just training your security team to ignore their own alerts.

This pattern family is live in v0.2.20, as of today. Seven new patterns across tool_output_poisoning (2), retrieval_poisoning (3), and tool_poisoning (2), all tuned to cut meta-text false positives without losing recall.