I just spent time looking at Lakera Guard, Rebuff, and NVIDIA NeMo Guardrails back to back. And the pattern is getting clearer.
AI guardrails matter. But the word "guardrails" can also hide a dangerous ambiguity.
Sometimes people use it to mean:
- input filtering
- jailbreak detection
- output moderation
- topic control
- PII masking
Other times they use it as if it means: the AI app is now secure.
Those are not the same statement.
What the guardrail landscape gets right
The good prompt injection detection tools have all learned the same lesson: a single filtering layer is not enough.
Lakera talks about prompt attacks, data leakage, and policy enforcement. Rebuff uses layered detection plus canary leakage checks. NeMo Guardrails goes even further and treats the pipeline itself as the control surface:
- input rails
- retrieval rails
- dialog rails
- execution rails
- output rails
That is real progress. It means the field is moving away from the fantasy that one regex or one classifier will solve prompt injection.
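To make the layered-rails idea concrete, here is a minimal sketch of the pattern in plain Python. This is not NeMo Guardrails' actual API: the `Rail`, `Verdict`, and `run_rails` names are hypothetical, and real rails use trained classifiers rather than string matching. The point is the shape: each stage can pass, block, or rewrite before the next stage runs.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    text: str
    reason: str = ""

# A rail is any callable that inspects text and returns a verdict.
Rail = Callable[[str], Verdict]

def input_rail(text: str) -> Verdict:
    # Toy check: block an obvious override phrase.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, text, "input_rail: override phrase")
    return Verdict(True, text)

def output_rail(text: str) -> Verdict:
    # Rewrite instead of block: mask a fake API-key pattern.
    masked = re.sub(r"sk-[A-Za-z0-9]{8,}", "[REDACTED]", text)
    return Verdict(True, masked)

def run_rails(text: str, rails: list[Rail]) -> Verdict:
    for rail in rails:
        verdict = rail(text)
        if not verdict.allowed:
            return verdict       # first blocking rail wins
        text = verdict.text      # rewrites propagate forward
    return Verdict(True, text)
```

The same skeleton extends to retrieval, dialog, and execution rails by inserting stages between input and output.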
What still worries me
Even the better guardrail systems mostly live inside the language-and-policy layer. That is important, but it is not the whole battlefield.
An agent can still get in trouble through:
- a poisoned README
- a malicious skill
- a compromised dependency — an AI supply chain attack
- an MCP server that changes after trust is established
- an overpowered tool with weak approval boundaries
- outbound network paths that make exfiltration easy
- allowed actions chained together in harmful ways
A guardrail can inspect, route, block, or rewrite. It cannot magically fix bad trust boundaries outside itself.
My current read on the three systems
Lakera Guard
Strong commercial posture. Good focus on prompt attacks and policy enforcement. Probably useful as a production screening layer. But because it is a closed vendor product, some of its depth is hard to inspect from the outside.
Rebuff
Smart ideas. Especially the canary and attack-memory concepts. But it now feels more historically important than frontier-defining. Useful as a reference design. Less convincing as a complete modern agent defense.
NVIDIA NeMo Guardrails
This one feels the most structurally ambitious. It is not just checking text. It is trying to shape the interaction system. That matters.
But it also looks like real middleware, which means real complexity. Configuration burden. Latency tradeoffs. Sharp integration edges. That is not a flaw so much as a reality check. Serious control layers are rarely effortless.
A guardrail is not a security architecture
This is the core LLM security lesson I keep coming back to.
A guardrail is one control plane inside the architecture. Sometimes an important one. Sometimes a very smart one. But if the surrounding system still has:
- broad secrets access
- weak tool permissions
- permissive egress
- untrusted retrieval sources
- no provenance checks
- no action monitoring
then the guardrail is defending a house with open side doors.
Where Sunglasses is different
Sunglasses cares about the full attack chain, not just the prompt surface. As an AI agent security scanner, we ask the questions that prompt injection detection alone cannot answer:
- Where did the content come from?
- What tool did it influence?
- What secret or file became reachable after that?
- What outbound path opened next?
- Did the permissions match the task?
- Did the agent's plan drift after reading untrusted content?
The best guardrails are getting better at inspecting language. The harder problem is correlating language, trust boundaries, tool use, and system behavior.
That is where the real fight in agentic AI security is.
April 2026 evidence: real advisories prove this right now
In the last 48 hours, multiple agent-adjacent advisories dropped with a shared lesson: the control looked present, but the boundary was weak or mis-modeled.
- MCP-localhost DNS rebinding — browser-to-local trust assumptions failed without strong Origin validation. This is a textbook MCP security failure.
- Sandbox-claim mismatch — operators believed a CLI flag restricted tool access, but effective runtime behavior did not match that belief.
- Agent-framework file boundary failures — path traversal and unsafe extraction where "helper" workflows crossed into arbitrary file-write risk.
None of those incidents are solved by prompt filtering alone. They require boundary verification across transport, policy semantics, filesystem scope, and action execution.
The mistake I see in security conversations
People ask: "Do we have guardrails?"
The more useful question is:
"Which boundaries can we prove are enforced right now, and which ones are assumed?"
That one wording change matters. A lot.
Because attacks increasingly exploit the distance between what teams think is enforced, what docs imply is enforced, and what runtime actually enforces.
Guardrails as one layer in a five-layer defense
If I were advising a team deploying production agents today, I would structure defenses like this:
```yaml
# Five-layer agentic AI security architecture
layer_1: ingestion_filtering
  # Treat docs, manifests, READMEs, MCP metadata,
  # and skill descriptions as attackable input surfaces
layer_2: boundary_assertions_at_startup
  # Verify effective tool availability, origin checks,
  # workspace roots, and egress controls
layer_3: runtime_policy_gates
  # Enforce least privilege for filesystem, network,
  # execution, and connector actions
layer_4: chain_correlation
  # Detect: low-trust input -> high-trust data access
  # -> outbound action (within one session)
layer_5: drift_detection
  # Continuously compare declared vs observed behavior
  # as tools/skills/config change over time
```
Guardrails are strongest in layers 1 and 3. You still need layers 2, 4, and 5 to close the loop. That is why calling any single tool a complete AI agent firewall is misleading.
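Layer 4 is the least familiar of the five, so here is a hedged sketch of what chain correlation means in practice: scan a session's event stream for the ordered pattern low-trust input, then secret or file access, then an outbound action. The event schema and trust labels are assumptions about your own audit log, not a standard.

```python
def risky_chain(events: list[dict]) -> bool:
    """Return True if a low-trust -> secret -> egress chain occurs in order."""
    saw_low_trust = False
    saw_sensitive = False
    for e in events:
        if e["kind"] == "input" and e.get("trust") == "low":
            saw_low_trust = True
        elif e["kind"] == "secret_access" and saw_low_trust:
            # Secret became reachable *after* low-trust input was read.
            saw_sensitive = True
        elif e["kind"] == "outbound" and saw_sensitive:
            # The full exfiltration-shaped chain completed.
            return True
    return False
```

Each step is individually benign, which is exactly why per-event filtering misses it; only the ordered correlation is suspicious.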
What to measure (so this is not just philosophy)
If your team wants objective proof of security maturity, track these:
```yaml
# Agent security maturity metrics
tool_least_privilege: "% of invocations with verified scopes"
origin_validation: "% of localhost/private services with Origin tests"
chain_provenance: "% of high-risk actions with input tracing"
drift_detection_mtd: "mean time to detect policy vs behavior mismatch"
boundary_blocks: "blocked trust-boundary crossings per 1K agent tasks"
```
Those metrics reveal whether guardrails are truly integrated into architecture or just bolted on.
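Computing these is deliberately boring. As one illustration, the first metric reduces to a percentage over invocation records; the `scopes_verified` field is an assumption about your own audit log schema, not a standard name.

```python
def tool_least_privilege(invocations: list[dict]) -> float:
    """Percent of tool invocations whose scopes were actually verified."""
    if not invocations:
        return 0.0
    verified = sum(1 for i in invocations if i.get("scopes_verified"))
    return 100.0 * verified / len(invocations)
```

If a metric like this is hard to compute, that difficulty is itself a finding: the evidence trail does not exist.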
Connector configuration: the attack lane teams underestimate
A lot of teams mentally bucket connector fields into "content" (dangerous, inspect this) and "configuration" (safe, just settings). Recent advisories keep breaking that assumption.
If a field that looks like metadata — hostname, client-name, header option — is later concatenated into a protocol command, that field is no longer harmless config. It is a potential execution boundary.
For agent systems this is especially important because LLM-driven workflows often touch connector setup indirectly:
- onboarding assistants that help configure integrations
- tenant-level automation templates
- admin UIs populated from external docs
- migration scripts generated by coding agents
Even if no user is "typing shell commands," the system can still produce exploit-relevant values that cross trust boundaries. That is why mature agentic AI security needs a sink map, not just a prompt injection scanner.
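A sink map entry can be as small as this sketch: the moment a "hostname" config field is concatenated into a protocol command, it gets a strict syntactic allowlist at the sink. The `CONNECT` command and function names are illustrative, not any real connector's protocol; the pattern is one hedge among several (parameterized APIs are better when the protocol supports them).

```python
import re

# RFC-952-style hostname labels: alphanumeric, optional hyphens inside.
HOSTNAME_RE = re.compile(
    r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    r"(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$"
)

def build_command(hostname: str) -> str:
    # Validate at the sink, not at the form field: by the time the value
    # reaches concatenation, it is an execution boundary, not "config".
    if not HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"hostname failed sink validation: {hostname!r}")
    return f"CONNECT {hostname}\r\n"
```

Note what the check rejects: CRLF, spaces, and shell metacharacters all fail the allowlist, regardless of which assistant or template produced the value.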
Model artifacts can be the payload, not the prompt
A fresh April 2026 MONAI advisory (GHSA-89gg-p5r5-q6r4) is a useful reminder that modern agent risk does not begin and end with text prompts. Unsafe deserialization in an ML workflow path can lead to arbitrary code execution when low-trust artifact data is processed.
This matters because many teams still separate "AI security" and "software supply-chain security" into different budgets and controls. In practice, agent systems blend them:
- agents fetch or stage model artifacts
- pipelines auto-run evaluation or transformation steps
- orchestrators treat artifacts as "data" while hidden execution semantics travel inside serialization formats
If your strategy only governs prompts, you miss this entire class of AI supply chain attack. If your strategy maps and governs sinks across prompts, tools, protocols, and artifacts, you catch both jailbreak-style attempts and artifact-borne execution paths with one architecture.
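One coarse but useful gate, sketched below under stated assumptions: inspect artifact bytes before any framework loads them, and refuse pickle-based files outright. Pickle streams at protocol 2 and above begin with the `PROTO` opcode byte `0x80`, so a prefix check catches the common case (older protocol 0/1 streams need deeper parsing). The policy and function names are ours, not any framework's API.

```python
def artifact_is_pickle(payload: bytes) -> bool:
    # Protocol >= 2 pickles start with b"\x80" followed by the version.
    return payload[:1] == b"\x80"

def assert_safe_artifact(payload: bytes) -> None:
    # Treat "contains a pickle stream" as "contains executable semantics".
    if artifact_is_pickle(payload):
        raise ValueError("refusing pickle-based artifact: code-exec risk")
```

Safer serialization formats for weights (for example, pure-tensor formats) pass this gate because they carry data, not opcodes.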
Incident-to-control mapping: the table that makes this concrete
To avoid abstract debates, map incidents to control failures and required evidence. Here is what that looks like for the patterns we are seeing in 2026:
| Incident Pattern | Failed Assumption | Required Control | Evidence Artifact |
|---|---|---|---|
| MCP/connector SSRF via tenant/header/URL fields | "Routing metadata is harmless" | Destination policy enforcement + allowlisted URL classes | Block/allow decision log with parsed destination class |
| Path traversal in helper utilities | "safe_join guarantees containment" | Post-normalization boundary assertion (realpath + root containment) | Invocation trace showing pre/post path and containment verdict |
| Template/prompt injection into execution-adjacent sinks | "Prompt layer is separate from execution" | Adapter-layer sink integrity checks + structured argument firewall | Provenance map from prompt token to sink argument |
| Event stream/auth parity gaps | "If one endpoint is protected, siblings are too" | Route-family auth parity tests in CI and startup | Route parity report with pass/fail per endpoint |
| Sandbox/flag claim mismatch | "Feature flag implies hard boundary" | Declarative-vs-effective runtime conformance tests | Boot report proving enforced capabilities match declared policy |
This framing helps security teams choose controls based on failure mechanics instead of vendor category labels. None of these rows are addressed by prompt filtering alone.
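The path traversal row is the easiest to turn into code. A minimal sketch of a post-normalization boundary assertion, with hypothetical helper names: resolve the final path after joining, then assert it is still under the workspace root, rather than trusting that the join itself was safe.

```python
import os

def contained_join(root: str, untrusted: str) -> str:
    """Join untrusted input under root; raise if the result escapes it."""
    root_real = os.path.realpath(root)
    # Normalize AFTER joining so "..", symlinks, and absolute segments
    # are resolved before the containment check runs.
    candidate = os.path.realpath(os.path.join(root_real, untrusted))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError(f"path escapes workspace root: {untrusted!r}")
    return candidate
```

The evidence artifact in the table falls out naturally: log `untrusted`, `candidate`, and the verdict on every invocation.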
How to evaluate "guardrails" without buying theater
Enterprise buyers are starting to ask for evidence, not adjectives. "We have guardrails" is a claim. "Here are the 14 high-risk sinks we verified this week, with drift checks and blocked chains" is evidence.
Five questions to ask in every vendor demo:
- Show me a blocked chain, not a blocked prompt. Ask for one replay where benign-looking steps are correlated into a risky sequence.
- Show sink-level controls. Ask which execution/query/path/protocol sinks are explicitly governed today vs roadmap.
- Show drift detection latency. Ask how quickly alerts fire after tool manifest changes or scope expansion.
- Show boundary assertion tests. Ask for startup/runtime checks proving effective containment.
- Show operator evidence artifacts. Ask for machine-readable traces that explain why a decision was made and which low-trust input influenced it.
If a vendor cannot show these in-product, they may have good text filtering but not full control-plane security.
The category split that actually matters
Guardrail-centric vendors are strongest at language-layer interception. Control-plane security vendors are strongest at proving boundary integrity from ingestion through execution.
Serious enterprise programs need both.
The strategic position for Sunglasses is explicit: yes, language-layer filtering is required. But the defensible moat is verifiable chain control across adapters, sinks, and drift. Not anti-jailbreak theater. Verifiable control of cross-boundary agent behavior.
Want to talk about what this means for your stack? We are building this in the open.
Final take
AI guardrails are real progress. They reduce real risk.
But if they are treated as the whole security story, teams will keep getting surprised by incidents that never looked like classic jailbreak text.
The future of agentic AI security is not a single filter. It is assumption verification across the full path from input to action to impact.
That is what we are building at Sunglasses. If you want to understand where this category is headed, start with our thesis, explore the FAQ, or scan your own agent with the open-source scanner.