Access control is necessary, but it is not the final security decision for AI agents. It tells you what systems, tools, and actions the workflow may broadly reach. It does not fully decide whether the already-allowed workflow should trust this specific callback, tool output, MCP handoff, retry path, or outbound request in context. That missing decision is the runtime trust gap.
- Quick answer
- What access control gets right
- Where OWASP LLM Top 10 and MITRE ATLAS draw the line
- Plain-language explainer
- Why behavior-and-action language matters
- What changed: agents moved from suggesting to acting
- Real-world precedent: indirect prompt injection in production
- Three concrete attack examples
- How Sunglasses catches it
- What a runtime trust hit looks like in practice
- How to measure runtime trust gaps in your stack
- Adoption path: rolling runtime trust into an existing stack
- Closing idea
Quick answer
Access control narrows what an AI agent can reach. Runtime trust decides whether an already-allowed action should still happen now. Sunglasses v0.2.35 ships 577 detection patterns across 54 categories — including cross_agent_injection, retrieval_poisoning, tool_output_poisoning, and sandbox_escape — that operate at the runtime trust layer, after IAM and gateway controls have already passed.
What access control gets right
It is worth being fair here. Access control is real security work. Role scoping, tool scoping, least privilege, MCP server boundaries, short-lived credentials, approval paths, and gateway mediation all make AI workflows safer. They are the difference between an agent that can touch everything and an agent that can touch only a reviewed subset of systems.
That is one reason access-control language is useful to buyers. It is concrete. Enterprise teams understand scopes, identities, RBAC, policies, tokens, and approvals. Those controls map to real owners and real implementation projects. They are easier to explain than a vague promise that the model will somehow "be safe."
They also solve an important part of the problem. If an agent has no boundaries at all, runtime trust does not save you. You need the first layer first. A workflow that can call any endpoint, use any tool, and inherit any authority is already too open.
The honest limit is narrower: static permission design does not fully settle live action trust. A workflow can stay inside the approved boundary and still pick the wrong next move because the meaning of the action changed while the run was in progress.
Where OWASP LLM Top 10 and MITRE ATLAS draw the line
Two industry frameworks already make this distinction explicit. The OWASP LLM Top 10 entry LLM01: Prompt Injection describes a vulnerability in which untrusted input causes a model to take actions or surface output the operator did not authorize, even when the surrounding access controls passed. The risk is not that the agent reached a system it should not have reached. The risk is that the agent, while inside an allowed boundary, accepted instructions that came riding along with normal-looking content.
MITRE ATLAS maps the same gap from the adversary side. Techniques like AML.T0051 LLM Prompt Injection, AML.T0054 LLM Jailbreak, and AML.T0057 LLM Data Leakage describe attacker behavior that operates downstream of authentication and authorization. The attacker is not bypassing access policy. The attacker is letting the workflow stay authenticated, then reshaping what that authenticated workflow decides to do.
The frameworks get the layering right; implementations often stop one layer early. Both OWASP and ATLAS distinguish reachability (governed by IAM, scopes, gateways) from action trust (governed at runtime, after permissions resolve). If your security program references those frameworks but stops at the access layer, you are implementing only half of what they actually describe.
The internal map at /compliance/owasp-llm-top-10 and /compliance/mitre-atlas shows where each Sunglasses pattern category lines up against those framework entries.
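If you want that mapping to be executable rather than just a page, a lookup table is enough to start. Below is a minimal sketch in Python; the category names come from this article, but the specific OWASP and ATLAS assignments shown are illustrative assumptions, not the shipped Sunglasses mapping.

```python
# Illustrative mapping from detection categories to framework entries.
# Category names appear earlier in this article; the specific OWASP/ATLAS
# assignments here are assumptions for demonstration, not the shipped map.
FRAMEWORK_MAP: dict[str, dict[str, str]] = {
    "cross_agent_injection": {"owasp": "LLM01", "atlas": "AML.T0051"},
    "retrieval_poisoning":   {"owasp": "LLM01", "atlas": "AML.T0051"},
    "tool_output_poisoning": {"owasp": "LLM01", "atlas": "AML.T0051"},
    "sandbox_escape":        {"owasp": "LLM01", "atlas": "AML.T0054"},
}

def framework_refs(category: str) -> str:
    """Render a finding's category as framework references for audit output."""
    refs = FRAMEWORK_MAP.get(category)
    if refs is None:
        return f"{category}: unmapped (review manually)"
    return f"{category}: OWASP {refs['owasp']} / MITRE ATLAS {refs['atlas']}"
```

Keeping this table in code, next to the detectors, is what makes the audit trail portable across vendors.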
Plain-language explainer: where static permissioning stops
Imagine a support agent with a clean setup. It has an approved persona, a scoped set of tools, read access to one knowledge base, write access to one ticket system, and a narrow connector for customer updates. Security reviewed the design. The route is allowed. The credentials are scoped. Everything looks disciplined.
Now the workflow starts absorbing new signals while it runs. A tool output suggests a fallback queue. A callback says urgent tickets should use a different internal route. A connector note explains that a backup service is temporarily preferred. A retry handler nudges the workflow toward a path that technically remains inside the broad permission boundary.
Nothing about this scene has to look like classic intrusion. The workflow can remain authenticated, policy-compliant, and nominally inside approved reach. But the live authority story changed. The agent is no longer just exercising the permissions the team imagined on a whiteboard. It is interpreting fresh guidance that may quietly reshape what "allowed" means in practice.
That is why AI agent security after access control is a real category, not wordplay. The problem is not only what the workflow may access. It is what the workflow is persuaded to do once it is already inside the allowed zone.
Why behavior-and-action language matters
The phrase secure how AI behaves and acts is useful because it teaches buyers the right first picture. AI systems are no longer passive. They are operational. They chain steps. They accept tool guidance. They follow next-hop instructions. They keep going when retries, callbacks, or connector notes tell them to keep going.
That is a better first sentence than generic "runtime protection" copy because it makes the problem legible. Buyers can picture an agent acting across systems. They can picture an approved workflow taking a strange turn without obviously breaking access policy. They can picture a system doing something risky even though the original permission model looked reasonable.
But behavior language becomes genuinely useful only when it lands on the missing second sentence: after the workflow is already connected and allowed, should it still be trusted to act now? Without that follow-up, "behavior" collapses back into broad monitoring, visibility, or governance language. Helpful, but incomplete.
That is the opening for Sunglasses. Not to claim ownership of the whole platform story, but to answer the narrower operator question left behind after permissions, governance, and visibility have already done their part.
What changed in 2024–2026: agents moved from suggesting to acting
Five years ago, "AI security" usually meant prompt content filtering. The model produced text. A human read the text and decided what to do. The blast radius was bounded by human review.
That bound is gone. Modern agents call tools, write to files, modify tickets, send messages, deploy code, and chain follow-up actions across systems. The human is not necessarily in the loop for the second, third, or tenth step. Once an action layer exists, the security model has to extend past content review into action review.
Access control was the natural first answer because IAM is a familiar shape. Teams already know how to scope tokens, stand up gateways, and limit roles. So the early playbook was: lock down what the agent can reach, and trust that nothing too bad will happen inside that smaller box. That playbook works well, until the box itself contains attacker-shaped guidance the agent will follow.
Real-world precedent: indirect prompt injection in production
The runtime trust gap is not theoretical. The seminal demonstration is Greshake et al., "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection" (arXiv:2302.12173, 2023). The paper documented attacks against Bing Chat, ChatGPT plugins, and integrated assistants where the model held valid credentials, the user issued a benign request, and a third-party document or webpage the agent retrieved silently rewrote the workflow's instructions. Authentication passed. Access control held. The action the agent took was still wrong.
Since then, the pattern has reproduced repeatedly. The EchoLeak disclosure (CVE-2025-32711, 2025) covered a Microsoft 365 Copilot data-exfiltration path where retrieval content reshaped agent behavior inside an authenticated session. Anthropic's Many-Shot Jailbreaking research (2024) showed that long-context models trained with strong safety alignment will still follow attacker patterns embedded in user-supplied context. The OWASP Agentic AI Threats and Mitigations v1.0 guide (2025) names the same category — workflow trust drift after authority is already granted.
The common thread across all of these reports: the access decision was correct, the agent was authenticated, and the breach happened anyway because runtime guidance reshaped what the workflow believed it should do next. That is exactly the layer access control was never designed to settle.
Three concrete attack examples
1. The tool is approved, but the tool output quietly becomes authority
An agent is allowed to use a ticketing tool and a support knowledge tool. During a normal run, one tool response includes a next-step instruction that looks operationally harmless: use this special path, retry via this internal service, trust this helper action, skip the usual review because the case is urgent. The workflow stays inside approved systems. The trust boundary moved anyway.
This is not an access-control failure. It is a live authority failure. The workflow treated descriptive output as action-shaping guidance.
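One way to make that failure concrete is to inspect tool output for action-shaping language before the planner ever reads it. The sketch below is a deliberately tiny heuristic, not a product feature: four regexes stand in for a real detection layer, and every pattern is an assumption chosen to match the phrasing in the example above.

```python
import re

# Tiny illustrative heuristic: flag tool output that tries to shape the
# next action. Real detection needs far more than four regexes; the point
# is the seam (between tool and model, after access control has passed).
INSTRUCTION_SHAPES = [
    re.compile(r"\buse this (special |internal )?(path|route|queue)\b", re.I),
    re.compile(r"\bretry via (this|the)\b", re.I),
    re.compile(r"\btrust this (helper )?(action|tool|service)\b", re.I),
    re.compile(r"\bskip the usual review\b", re.I),
]

def looks_action_shaping(tool_output: str) -> bool:
    """True when descriptive output is trying to become guidance."""
    return any(p.search(tool_output) for p in INSTRUCTION_SHAPES)
```

The design point is where the check sits, not how it matches: descriptive output gets inspected before it is allowed to become authority.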
2. An MCP handoff stays authenticated, but the next action is still wrong
An agent has valid access to an approved MCP server for retrieval and another approved MCP path for writing a follow-up action. Authentication is fine. Tool scopes are fine. The protocol layer looks clean. But a tool result nudges the workflow toward a more sensitive follow-up than the operator expected, or toward a chain of actions that remains technically allowed while materially changing impact.
This is where MCP security and runtime trust meet. Gateway hygiene, token discipline, and schemas matter. They still do not completely answer whether the next allowed action should be trusted in context. The deeper walkthrough is in our MCP Tool Poisoning analysis.
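A sketch of where that check sits in an MCP loop follows. The session object is assumed to behave like an MCP client session with an async call_tool method returning content blocks, and scan_text is a stand-in for whatever inspection layer you run; none of these identifiers are documented Sunglasses or MCP SDK guarantees.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Finding:
    rule_id: str
    severity: str

def scan_text(text: str) -> Optional[Finding]:
    """Stand-in for a real runtime-trust scanner; wire in your own."""
    return None

async def guarded_tool_call(session, tool_name: str, args: dict) -> str:
    """Call an allowed MCP tool, then inspect the result before it can
    shape the follow-up action. Valid auth says nothing about whether
    the returned content should be trusted as guidance."""
    result = await session.call_tool(tool_name, args)  # access control passed
    text = "".join(getattr(block, "text", "") for block in result.content)
    finding = scan_text(text)  # the runtime trust check sits here
    if finding is not None and finding.severity in ("HIGH", "CRITICAL"):
        raise RuntimeError(f"quarantined tool result: {finding.rule_id}")
    return text
```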
3. Endpoint drift hides inside a permitted outbound path
A workflow is allowed to reach one approved domain family through a connector or controller. No one opened the network boundary too wide. Then callback guidance, fallback logic, or operational metadata steers the agent toward a destination variation the team did not mean to treat as equivalent. The request may still pass surface-level validation. The practical meaning of the outbound action changed.
This is why "allowed to reach" is not the same as "safe to trust." Reachability and trust are related, but they are not identical.
How Sunglasses catches it
Sunglasses fits as a provider-agnostic runtime-trust layer. It is not pretending to replace your IAM stack, your gateway, your policy engine, or your observability platform. It is useful at the smaller but expensive moment when trust-bearing text and metadata start reshaping what an already-allowed workflow believes it should do next.
That includes prompts, tool descriptions, callback instructions, connector notes, policy fragments, MCP-adjacent metadata, fallback guidance, retry messages, and ordinary-looking operational text that can quietly widen authority. Those surfaces matter because they often decide how the workflow interprets the next step before anyone sees a problem in a dashboard.
That is why Sunglasses belongs after access control, not instead of it. Once the scope, identity, and gateway layers are in place, teams still need a way to inspect the words and metadata that can convert a technically allowed route into an unsafe live action. The walkthrough at /how-it-works shows where the scanner sits in a real agent loop, and the hardening manual covers the four-step rollout in production.
The starting path stays simple:
```
pip install sunglasses
sunglasses scan <path>
```
Then review anything that widens scope, normalizes a fallback path, changes routing expectations, reframes policy, softens a guardrail, or turns descriptive tool output into executable trust. In other words: narrow access first, then inspect the inputs that try to reshape behavior after access is already granted.
What a runtime trust hit looks like in practice
When the scanner flags a trust-bearing surface, the result is a SARIF 2.1.0 record — the same structured format static analysis tools use, so it plugs into existing security pipelines without custom integration. A representative hit in the cross_agent_injection category looks like this:
- ruleId: a pattern identifier from the GLS-CAI series (e.g. GLS-CAI-512) that targets delegation-token scope rebinding language inside an agent message.
- level: error for HIGH and CRITICAL severity patterns; warning for lower severity.
- message: a plain-language description of what matched and why it qualifies as a runtime trust concern.
- locations: the character offset and surrounding snippet of the matched content in the inbound text.
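Assembled, a single result might look like the sketch below, written as a Python dict for readability. Field names follow the SARIF 2.1.0 result object; the rule ID, offsets, and snippet are hypothetical values.

```python
# One SARIF 2.1.0 result, sketched as a Python dict. Field names follow
# the SARIF schema; the concrete values here are hypothetical.
finding = {
    "ruleId": "GLS-CAI-512",
    "level": "error",  # HIGH and CRITICAL severities map to "error"
    "message": {
        "text": "Delegation-token scope rebinding language in agent message."
    },
    "locations": [{
        "physicalLocation": {
            "artifactLocation": {"uri": "inbound/agent_message.txt"},
            "region": {
                "charOffset": 214,
                "charLength": 63,
                "snippet": {"text": "...rebind the delegation token to full scope..."},
            },
        }
    }],
}
```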
Because the output is structured, downstream systems can branch on it cleanly: a CI gate can fail a deploy, a SIEM can correlate against session metadata, a custom handler can route to human review. In the Python API, the same finding is available as a typed object that an agent middleware layer can read directly — pass clean inputs through, halt or quarantine flagged ones, all without a model call in the hot path.
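The middleware seam can be as small as the sketch below. The scan function is a stand-in for the real Sunglasses Python entry point, whose actual name and signature are not shown in this article; treat every identifier here as an assumption.

```python
def scan(text: str) -> list[dict]:
    """Stand-in for the real scanner entry point (name is an assumption).
    Returns SARIF-shaped result dicts like the one sketched above."""
    return []  # wire in the real pattern engine

def guard_inbound(text: str) -> str:
    """Pass clean inputs through; halt on error-level hits. No model call
    sits in this hot path, so gating stays cheap."""
    errors = [f["ruleId"] for f in scan(text) if f["level"] == "error"]
    if errors:
        raise PermissionError(f"halted by runtime trust check: {errors}")
    return text
```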
How to measure runtime trust gaps in your stack
The detection layer is concrete enough to measure. The current Sunglasses scanner ships 577 detection patterns across 54 attack categories, with 2,296 keyword variants normalized through 17 transformation passes (Unicode confusables, RTL overrides, base64, leetspeak, zero-width insertions, encoded splits) and runs in 0.26 ms per text scan on a single core. Those numbers matter because the cost objection — "you cannot inspect every callback at runtime" — stops being true once the inspection layer is sub-millisecond.
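That latency claim is cheap to sanity-check against whatever scanner you actually run. A minimal harness using only the standard library; scan_text below is an empty stand-in, so substitute your real inspection call before reading the number.

```python
import timeit

def scan_text(text: str) -> list:
    return []  # stand-in: replace with your real inspection call

sample = "Urgent: retry via the internal fallback route and skip review. " * 10
per_call = timeit.timeit(lambda: scan_text(sample), number=10_000) / 10_000
print(f"{per_call * 1e3:.3f} ms per scan")
```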
Operationally, a runtime trust review boils down to four checks teams can do today, with or without Sunglasses:
- Inventory trust-bearing surfaces. List every place an agent reads instructional text after authentication: tool descriptions, MCP server metadata, retrieval results, callback bodies, connector notes, error messages, retry hints. These are the surfaces where authority migrates.
- Sample, then scan. Pull a real day of traffic from each surface. Run any open-source detection scanner across it. Count how often trust-shifting language appears (scope widening, policy reframes, fallback redirection, urgency overrides, role swaps). A minimal sketch of this step follows the list.
- Map findings to OWASP/ATLAS entries. Each detected pattern should ladder up to a named industry technique, not a vendor taxonomy. This keeps the program portable and auditable.
- Decide a gating policy. Which findings block the next agent action, which annotate it for human review, which log silently. Runtime trust without a gating policy is just observability with extra steps.
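A minimal sketch of check two, assuming each surface exports one JSON-lines file with a text field per record; scan_text is again a stand-in for whatever detection layer you run.

```python
import json
from collections import Counter
from pathlib import Path

def scan_text(text: str) -> list[str]:
    """Stand-in: return the category names your scanner flags for this text."""
    return []  # replace with your real detection layer

def survey(surface_dir: str) -> Counter:
    """Count trust-shifting hits per (surface, category) across one day of
    exported traffic, one .jsonl file per trust-bearing surface."""
    hits: Counter = Counter()
    for path in Path(surface_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            record = json.loads(line)
            for category in scan_text(record["text"]):
                hits[(path.stem, category)] += 1
    return hits
```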
Teams that already follow the Sunglasses hardening manual can skip the tooling assembly and go straight to step three. Teams without an inspection layer can install the scanner, point it at a sample, and see whether their access-controlled workflows are carrying any of the 577 patterns through anyway. Recent Customer Validation Program runs have produced concrete examples of this gap closing in real deployments.
Adoption path: rolling runtime trust into an existing stack
For a team already running an agent pipeline — AutoGen, CrewAI, LangGraph, MCP-based, or a custom orchestrator — the path from zero to covered does not require a rewrite. Three steps match how production systems actually change.
Step 1: Start in REPORT mode behind one trust boundary. Pick the most exposed boundary in the current setup — typically the point where an orchestrator hands off to a sub-agent that can touch real tools or data. Wire the scanner into the message path at that single point. Run in REPORT mode: every suspicious payload gets flagged and logged, nothing is blocked. After a week of real traffic, review the hits. Understand what is firing and why before changing any behavior.
Step 2: Promote to STRICT at that boundary. Once the hit profile is understood — which patterns fire, at what rate, with what false-positive rate in this specific traffic — flip the boundary to STRICT mode. Flagged messages now halt instead of passing through. Keep REPORT mode active at all other boundaries to keep building signal without impacting downstream flows.
Step 3: Expand coverage progressively. Repeat the promote cycle at each additional boundary in priority order. Most teams find that two or three boundaries account for the majority of their actual exposure — the orchestrator-to-tools boundary, the retrieval-to-agent boundary, and any agent that reads external web content. Full coverage across a typical multi-agent setup usually lands within a few cycles.
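Per-boundary modes are simple to represent, whatever enforcement layer you use. The sketch below shows the rollout state after step two; the boundary names and the Mode enum are illustrative, not a Sunglasses config format.

```python
from enum import Enum

class Mode(Enum):
    REPORT = "report"  # flag and log, never block
    STRICT = "strict"  # flagged messages halt at this boundary

# State after step two: one promoted boundary, signal building elsewhere.
BOUNDARIES = {
    "orchestrator->subagent": Mode.STRICT,
    "retrieval->agent":       Mode.REPORT,
    "web_content->agent":     Mode.REPORT,
}

def may_pass(boundary: str, flagged: bool) -> bool:
    """Return True if a message may cross this boundary."""
    return not (flagged and BOUNDARIES[boundary] is Mode.STRICT)
```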
Closing idea
Access control narrows what the agent can reach. Runtime trust decides whether the next allowed action should still happen. Both layers are real. Neither layer replaces the other. The mistake of the last 18 months has been treating access control as the whole story.
One sentence to take with you: Reachability is a permission question. Action trust is a runtime question. Programs that answer the first and skip the second are reading half the spec and shipping half the protection.