Detection tools help catch hostile instructions early. They do not finish the action-time decision about whether an already-allowed model, tool call, callback, redirect, or MCP-connected workflow should still act now.
Prompt injection detection for AI agents means scanning for hostile instructions before they can steer a model or its workflow — covering prompts, retrieved content, encoded payloads, tool metadata, callback notes, and other trust-bearing text. Detection is necessary but incomplete: a workflow can still go wrong after a tool response, callback, redirect, or MCP handoff changes what the already-allowed system believes it should do. Sunglasses ships detection patterns across these paths — for example GLS-PI-009 (retrieval-triggered prompt injection), GLS-PI-019 (encoded payload decode-and-execute), GLS-TOP-237 (tool output trusted-output override), and GLS-IP-001 (indirect instruction reset). The site-wide pattern library covers 919 total patterns across 59 categories. Runtime trust decides whether the already-allowed workflow should still act after new guidance appears inside it — that is the boundary where Sunglasses operates.
Quick answer
Prompt injection detection for AI agents means scanning for hostile instructions before they can steer a model or its workflow. Strong detection covers prompts, retrieved content, encoded payloads, tool metadata, callback notes, and other trust-bearing text that may later influence behavior.
But the defense is not complete just because a scanner ran. A workflow can still go wrong after detection if newly returned text or metadata changes what the already-allowed system believes it should do next.
Detection decides what enters the workflow. Runtime trust decides whether the workflow should still act after new guidance appears inside it.
What prompt injection detection means
Prompt injection detection is broader than looking for one dramatic jailbreak string. In real agent systems, unsafe instructions can arrive through several different paths:
- Direct user prompts that attempt to override higher-priority instructions.
- Retrieved text from documentation, issues, support tickets, or web pages.
- Encoded or transformed payloads that only become dangerous after decoding or normalization.
- Tool metadata and helper notes that look operational but quietly reshape authority.
- Callback or redirect instructions that steer the next step after the first control already passed.
This is why prompt injection detection belongs next to AI agent security fundamentals, practical hardening checklists, and workflow-specific MCP review in the MCP Attack Atlas. The problem is not only bad content in one prompt window. It is untrusted content crossing into a trusted workflow.
For a deeper look at how patterns like GLS-PI-009 (retrieval-triggered injection) and GLS-PI-019 (encoded payload decode-and-execute) map to real agent workflows, see the How It Works page and the full operator manual.
Plain-language explainer: where detection ends and trusted action begins
Imagine a support agent that reads a ticket, checks a knowledge base, calls an approved tool, and drafts a response. Your team already added prompt filtering. The connectors are approved. The tool is on the allowlist. Nothing looks obviously broken.
Now the knowledge-base result contains a note that looks like routine troubleshooting guidance. The tool response adds helper text that recommends a fallback route. A redirect suggestion appears in the callback metadata. None of those things may look like a classic prompt attack at first glance. But together they can quietly change what the agent thinks it should do next.
That is the runtime-trust boundary. The real question is not only whether malicious text existed at input time. It is whether newly surfaced guidance should still be trusted enough to influence the next live action. The FAQ covers common questions about how to frame this in practice.
| Layer | What it does well | What it does not finish |
|---|---|---|
| Prompt injection detection | Finds hostile text, suspicious patterns, encoded payloads, or unsafe instructions early. | Does not decide whether later workflow guidance should still be trusted after access is already granted. |
| Guardrails and policy | Constrain classes of outputs, tools, and routes. | Can still miss quiet authority shifts inside approved paths. |
| Runtime trust | Evaluates whether the next action still deserves trust now. | Does not replace earlier scanning, isolation, or hardening layers. |
Why detection tools stop early
Most prompt injection detection pages stop too early because they are written around input-time defenses alone. That makes the category easy to explain, but it hides the operational gap buyers eventually hit in production.
The hard truth is simple: a workflow can remain policy-compliant and still become unsafe. A tool call can be approved while the tool output quietly changes the next step. A callback can be signed while its payload reshapes destination or scope. An MCP-connected workflow can stay technically in bounds while helper metadata teaches the model the wrong operational move.
That is why the right page does not argue against detection. It finishes the sentence detection leaves incomplete. Prompt injection defense is strongest when teams scan early and still review trust at action time. The CVP trust model shows how this layered approach applies across real evaluation runs.
Three concrete attack examples
1) Retrieved content passes the scanner but still changes the tool decision
An agent retrieves a support document that contains a hidden operational instruction mixed into otherwise normal text. The input scanner catches obvious malicious strings, but the final retrieved summary still nudges the model to use a fallback tool, skip a validation step, or expose extra data. This is the retrieval-triggered injection surface that patterns like GLS-PI-009 are designed to catch — the danger is not only the original text, but the new authority the retrieved guidance gained inside the workflow. See the related coverage in Polite Prompt Injection: AI Agent Metadata Poisoning Hides in Normal Instructions.
2) Encoded payloads become dangerous after decoding or normalization
A payload looks harmless while encoded, compressed, or split across fields. A downstream parser or helper adapter reconstructs it for readability. That transformation can turn inert-looking data into live instruction — the decode-and-execute surface that GLS-PI-019 covers. Detection should scan before and after transformation, but the action-time decision still matters because the reconstructed guidance may now influence the next tool call or outbound path.
3) Approved callbacks or MCP handoffs quietly steer the run
An agent receives a valid callback or an approved MCP tool response. The route is in scope. The protocol is allowed. But helper metadata or next-hop guidance quietly changes the endpoint, project, or action priority — the trusted-output override surface that GLS-TOP-237 and GLS-IP-001 target. Nothing may look like a broken permission check. The real issue is that the workflow inherited new authority from text nobody treated as trust-bearing. The Generated MCP Server Security post covers this attack class in detail.
How Sunglasses catches it
Sunglasses fits prompt injection defense as a runtime-trust layer around the text and metadata AI agents actually inherit. That means prompts, retrieved passages, repo content, tool descriptions, callback notes, MCP-adjacent metadata, and other workflow guidance that can quietly become authority.
This is the right place for Sunglasses because many teams already have upstream controls. They already run scanners. They already narrow routes. They already add guardrails. The missing question is what to do when the workflow is still technically allowed but newly surfaced guidance now wants a different action.
That is the gap Sunglasses helps operators examine before a live system turns text into behavior. It complements detection; it does not pretend to replace it. If you want the broader context, start with AI Agent Security 101, then tie the workflow back to the operator checklist in the hardening manual and the FAQ.
Operator checklist
- Scan more than user prompts: include retrieved text, tool metadata, callback notes, and encoded or transformed payloads.
- Scan before and after transformation: decoding, normalization, parsing, and summarization can reconstruct unsafe guidance.
- Treat helper metadata as trust-bearing: descriptions, notes, and next-hop suggestions can quietly vote on the next action.
- Review approved callbacks and redirects: allowed routes are not automatically trusted routes.
- Watch MCP handoffs: an in-scope server or tool response can still reshape authority inside the run.
- Separate detection from action approval: finding less bad text is not the same thing as proving the next action is safe.
- Teach the team one clear sentence: prompt injection detection lowers exposure, but runtime trust decides whether the already-allowed workflow should still act now.