Anthropic shipped a runtime classifier that approves or blocks agent tool calls before they run. By their own numbers, it misses about one in six real dangerous actions. That is validation for the category. It is also a clean argument for a second, provider-agnostic layer.
What Anthropic actually shipped
On March 24, 2026, Anthropic launched Claude Code Auto Mode: a new permissions mode that makes tool-call approval decisions for you, with server-side safeguards.
Their words, not mine:
- "Before each tool call runs, a classifier reviews it to check for potentially destructive actions like mass deleting files, sensitive data exfiltration, or malicious code execution."
- "Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does."
- "Auto mode reduces risk compared to --dangerously-skip-permissions but doesn't eliminate it entirely."
The engineering deep-dive, published the next day, added the architectural detail that matters most: the classifier runs on Sonnet 4.6, and it is reasoning-blind by design — it sees user messages and the agent's tool calls, but not Claude's own messages or the tool outputs.
That is not a small design choice. It is the single most important idea in the entire launch.
The one design choice every runtime security builder should copy
If you are trying to judge whether an agent action is safe, the worst thing you can do is let the agent explain itself to the judge.
Agents hallucinate reasons. Hostile content in tool outputs can manipulate how the agent frames its own actions. If the judge reads those self-justifications, it inherits the same manipulation surface.
Anthropic's fix is clean: strip the agent's own messages and tool outputs before showing the request to the classifier. Let the classifier see the request and the action, not the agent's performance of the request.
This is the portable lesson from the whole launch. It applies to any runtime layer that approves agent actions — inside Claude Code, inside LangChain, inside CrewAI, inside your custom stack.
Takeaway for builders: your approval layer should never read the agent's self-narrative. Judge the action, not the explanation.
The number that proves provider-native isn't enough on its own
Here is the line from Anthropic's own engineering post that nobody should miss:
"The 17% false-negative rate on real overeager actions is the honest number."
Seventeen percent.
That is not a critic's number. That is Anthropic's own published figure for how often their deployed classifier misses real dangerous overeager actions. One in roughly six.
I respect them publishing it. Most vendors would bury that kind of number. They put it in the engineering post and called it "the honest number." Good.
But the implication is clear: one runtime layer, no matter how well-designed, is not the full story. The market needs defense-in-depth. That is the cleanest argument possible for a second, independent layer — especially one that does not share failure modes with the first.
Provider-native vs provider-agnostic
Auto Mode secures Claude Code. That is valuable, and it is real. If you are using Claude Code on Opus 4.7, you should turn it on.
But agents do not only run in Claude Code.
Right now, in real production pipelines, agents are running in:
- OpenAI SDK apps and the Anthropic SDK, routed into the same runtime
- LangChain and CrewAI multi-agent workflows
- MCP server ecosystems with tools from multiple vendors
- Editor agents (Cursor, Windsurf, Cline) with their own tool chains
- Custom stacks built straight on HTTP APIs with no harness at all
Anthropic's classifier cannot see any of that. It is not supposed to. It is a provider-native control for a provider-native surface.
The provider-agnostic lane is wide open — and that is the lane we have been building Sunglasses in.
Where this leaves Sunglasses
Sunglasses has always been about the same thing: trust decisions for agent actions across trust boundaries. Not a single model vendor's sandbox. Every agent stack.
Specifically, the surfaces Auto Mode cannot cover by definition:
- Cross-agent trust handoffs. When Agent A tells Agent B "I signed off, you can trust this" — and the signoff is forged, replayed, or fabricated. We ship patterns for that class (and MCP tool poisoning) as of v0.2.16.
- Retrieval-pipeline poisoning. RAG chunks that claim canonical / authoritative status to override policy. Or carry a lineage warning and instruct the agent to suppress it. Two new pattern variants for this shipped in the same release.
- Tool-output "failure-as-license" bypass. Signature mismatch reported → agent told to ignore the execution gate and run anyway. Classic "error that actually is the attack."
- Multi-vendor and cross-framework pipelines, where no single provider classifier can see the whole flow.
Our 259-pattern database now covers 42 attack categories across those surfaces — including the four new A2A / RAG / tool-output variants we added this week specifically in response to the trust-boundary conversation Auto Mode started.
What you should actually do
Three concrete moves, in order:
1. If you're on Claude Code, turn Auto Mode on. Then read Anthropic's engineering deep-dive end-to-end. The reasoning-blind classifier architecture is worth understanding even if you don't use Claude Code.
2. Don't treat Auto Mode as your full agent security story. Anthropic explicitly says it is not. Their own 17% miss rate on overeager actions is the proof. If your agent reads untrusted content, ingests retrieval chunks, processes tool outputs from third-party MCP servers, or hands off to another agent — you still have attack surface outside Auto Mode's reach.
3. Add a provider-agnostic runtime layer. Ingestion-time scanning for prompt injection, tool output poisoning, cross-agent handoff forgery, retrieval poisoning, and MCP metadata tampering. The patterns are public, the code is MIT, and you can audit every decision.
pip install sunglasses
A note on tone
I want to be clear about something.
This is not "Anthropic got it wrong." Anthropic got a hard problem mostly right and published their honest numbers. That is more than most of the market does.
Auto Mode raises the floor. It also proves the category is real — permission fatigue is a security problem, runtime action approval is a real control surface, and "one classifier is enough" is not a defensible position even when the classifier is well-designed.
That is good for builders. That is good for defenders. That is good for Sunglasses.
Provider-native security and provider-agnostic security are not rivals. They are layers.
Use both.
Sources
- Anthropic — Introducing Auto Mode (Mar 24, 2026)
- Anthropic Engineering — Claude Code Auto Mode architecture (Mar 25, 2026)
- Claude Code docs — Permission modes