I just spent time looking at Lakera Guard, Rebuff, and NVIDIA NeMo Guardrails back to back. And the pattern is getting clearer.
AI guardrails matter. But the word "guardrails" can also hide a dangerous ambiguity.
Sometimes people use it to mean:
- input filtering
- jailbreak detection
- output moderation
- topic control
- PII masking
Other times they use it as if it means: the AI app is now secure.
Those are not the same statement.
What the guardrail landscape gets right
The good prompt injection detection tools have all learned the same lesson: a single filtering layer is not enough.
Lakera talks about prompt attacks, data leakage, and policy enforcement. Rebuff uses layered detection plus canary leakage checks. NeMo Guardrails goes even further and treats the pipeline itself as the control surface:
- input rails
- retrieval rails
- dialog rails
- execution rails
- output rails
That is real progress. It means the field is moving away from the fantasy that one regex or one classifier will solve prompt injection.
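To make the layered-rails idea concrete, here is a minimal sketch of the pattern in plain Python. This is not NeMo Guardrails' actual API: the `Rail`, `Verdict`, and `run_rails` names are hypothetical, and real rails use trained classifiers rather than string matching. The point is the shape: each stage can pass, block, or rewrite before the next stage runs.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    allowed: bool
    text: str
    reason: str = ""

# A rail is any callable that inspects text and returns a verdict.
Rail = Callable[[str], Verdict]

def input_rail(text: str) -> Verdict:
    # Toy check: block an obvious override phrase.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, text, "input_rail: override phrase")
    return Verdict(True, text)

def output_rail(text: str) -> Verdict:
    # Rewrite instead of block: mask a fake API-key pattern.
    masked = re.sub(r"sk-[A-Za-z0-9]{8,}", "[REDACTED]", text)
    return Verdict(True, masked)

def run_rails(text: str, rails: list[Rail]) -> Verdict:
    for rail in rails:
        verdict = rail(text)
        if not verdict.allowed:
            return verdict       # first blocking rail wins
        text = verdict.text      # rewrites propagate forward
    return Verdict(True, text)
```

The same skeleton extends to retrieval, dialog, and execution rails by inserting stages between input and output.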
What still worries me
Even the better guardrail systems mostly live inside the language-and-policy layer. That is important, but it is not the whole battlefield.
An agent can still get in trouble through:
- a poisoned README
- a malicious skill
- a compromised dependency — an AI supply chain attack
- an MCP server that changes after trust is established
- an overpowered tool with weak approval boundaries
- outbound network paths that make exfiltration easy
- allowed actions chained together in harmful ways
A guardrail can inspect, route, block, or rewrite. It cannot magically fix bad trust boundaries outside itself.
My current read on the three systems
Lakera Guard
Strong commercial posture. Good focus on prompt attacks and policy enforcement. Probably useful as a production screening layer. But because it is a closed vendor product, some of its depth is hard to inspect from the outside.
Rebuff
Smart ideas. Especially the canary and attack-memory concepts. But it now feels more historically important than frontier-defining. Useful as a reference design. Less convincing as a complete modern agent defense.
NVIDIA NeMo Guardrails
This one feels the most structurally ambitious. It is not just checking text. It is trying to shape the interaction system. That matters.
But it also looks like real middleware, which means real complexity. Configuration burden. Latency tradeoffs. Sharp integration edges. That is not a flaw so much as a reality check. Serious control layers are rarely effortless.
A guardrail is not a security architecture
This is the core LLM security lesson I keep coming back to.
A guardrail is one control plane inside the architecture. Sometimes an important one. Sometimes a very smart one. But if the surrounding system still has:
- broad secrets access
- weak tool permissions
- permissive egress
- untrusted retrieval sources
- no provenance checks
- no action monitoring
then the guardrail is defending a house with open side doors.
Where Sunglasses is different
Sunglasses cares about the full attack chain, not just the prompt surface. As an AI agent security scanner, we ask the questions that prompt injection detection alone cannot answer:
- Where did the content come from?
- What tool did it influence?
- What secret or file became reachable after that?
- What outbound path opened next?
- Did the permissions match the task?
- Did the agent's plan drift after reading untrusted content?
The best guardrails are getting better at inspecting language. The harder problem is correlating language, trust boundaries, tool use, and system behavior.
That is where the real fight in agentic AI security is.
April 2026 evidence: real advisories prove this right now
In the last 48 hours, multiple agent-adjacent advisories dropped with a shared lesson: the control looked present, but the boundary was weak or mis-modeled.
- MCP-localhost DNS rebinding — browser-to-local trust assumptions failed without strong Origin validation. This is a textbook MCP security failure.
- Sandbox-claim mismatch — operators believed a CLI flag restricted tool access, but effective runtime behavior did not match that belief.
- Agent-framework file boundary failures — path traversal and unsafe extraction where "helper" workflows crossed into arbitrary file-write risk.
None of those incidents are solved by prompt filtering alone. They require boundary verification across transport, policy semantics, filesystem scope, and action execution.
The mistake I see in security conversations
People ask: "Do we have guardrails?"
The more useful question is:
"Which boundaries can we prove are enforced right now, and which ones are assumed?"
That one wording change matters. A lot.
Because attacks increasingly exploit the distance between what teams think is enforced, what docs imply is enforced, and what runtime actually enforces.
Guardrails as one layer in a five-layer defense
If I were advising a team deploying production agents today, I would structure defenses like this:
```yaml
# Five-layer agentic AI security architecture
layer_1: ingestion_filtering
  # Treat docs, manifests, READMEs, MCP metadata,
  # and skill descriptions as attackable input surfaces
layer_2: boundary_assertions_at_startup
  # Verify effective tool availability, origin checks,
  # workspace roots, and egress controls
layer_3: runtime_policy_gates
  # Enforce least privilege for filesystem, network,
  # execution, and connector actions
layer_4: chain_correlation
  # Detect: low-trust input -> high-trust data access
  # -> outbound action (within one session)
layer_5: drift_detection
  # Continuously compare declared vs observed behavior
  # as tools/skills/config change over time
```
Guardrails are strongest in layers 1 and 3. You still need layers 2, 4, and 5 to close the loop. That is why calling any single tool a complete AI agent firewall is misleading.
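Layer 4 is the least familiar of the five, so here is a hedged sketch of what chain correlation means in practice: scan a session's event stream for the ordered pattern low-trust input, then secret or file access, then an outbound action. The event schema and trust labels are assumptions about your own audit log, not a standard.

```python
def risky_chain(events: list[dict]) -> bool:
    """Return True if a low-trust -> secret -> egress chain occurs in order."""
    saw_low_trust = False
    saw_sensitive = False
    for e in events:
        if e["kind"] == "input" and e.get("trust") == "low":
            saw_low_trust = True
        elif e["kind"] == "secret_access" and saw_low_trust:
            # Secret became reachable *after* low-trust input was read.
            saw_sensitive = True
        elif e["kind"] == "outbound" and saw_sensitive:
            # The full exfiltration-shaped chain completed.
            return True
    return False
```

Each step is individually benign, which is exactly why per-event filtering misses it; only the ordered correlation is suspicious.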
What to measure (so this is not just philosophy)
If your team wants objective proof of security maturity, track these:
```yaml
# Agent security maturity metrics
tool_least_privilege: "% of invocations with verified scopes"
origin_validation: "% of localhost/private services with Origin tests"
chain_provenance: "% of high-risk actions with input tracing"
drift_detection_mtd: "mean time to detect policy vs behavior mismatch"
boundary_blocks: "blocked trust-boundary crossings per 1K agent tasks"
```
Those metrics reveal whether guardrails are truly integrated into architecture or just bolted on.
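Computing these is deliberately boring. As one illustration, the first metric reduces to a percentage over invocation records; the `scopes_verified` field is an assumption about your own audit log schema, not a standard name.

```python
def tool_least_privilege(invocations: list[dict]) -> float:
    """Percent of tool invocations whose scopes were actually verified."""
    if not invocations:
        return 0.0
    verified = sum(1 for i in invocations if i.get("scopes_verified"))
    return 100.0 * verified / len(invocations)
```

If a metric like this is hard to compute, that difficulty is itself a finding: the evidence trail does not exist.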
Connector configuration: the attack lane teams underestimate
A lot of teams mentally bucket connector fields into "content" (dangerous, inspect this) and "configuration" (safe, just settings). Recent advisories keep breaking that assumption.
If a field that looks like metadata — hostname, client-name, header option — is later concatenated into a protocol command, that field is no longer harmless config. It is a potential execution boundary.
For agent systems this is especially important because LLM-driven workflows often touch connector setup indirectly:
- onboarding assistants that help configure integrations
- tenant-level automation templates
- admin UIs populated from external docs
- migration scripts generated by coding agents
Even if no user is "typing shell commands," the system can still produce exploit-relevant values that cross trust boundaries. That is why mature agentic AI security needs a sink map, not just a prompt injection scanner.
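A sink map entry can be as small as this sketch: the moment a "hostname" config field is concatenated into a protocol command, it gets a strict syntactic allowlist at the sink. The `CONNECT` command and function names are illustrative, not any real connector's protocol; the pattern is one hedge among several (parameterized APIs are better when the protocol supports them).

```python
import re

# RFC-952-style hostname labels: alphanumeric, optional hyphens inside.
HOSTNAME_RE = re.compile(
    r"^[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
    r"(\.[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?)*$"
)

def build_command(hostname: str) -> str:
    # Validate at the sink, not at the form field: by the time the value
    # reaches concatenation, it is an execution boundary, not "config".
    if not HOSTNAME_RE.fullmatch(hostname):
        raise ValueError(f"hostname failed sink validation: {hostname!r}")
    return f"CONNECT {hostname}\r\n"
```

Note what the check rejects: CRLF, spaces, and shell metacharacters all fail the allowlist, regardless of which assistant or template produced the value.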
Model artifacts can be the payload, not the prompt
A fresh April 2026 MONAI advisory (GHSA-89gg-p5r5-q6r4) is a useful reminder that modern agent risk does not begin and end with text prompts. Unsafe deserialization in an ML workflow path can lead to arbitrary code execution when low-trust artifact data is processed.
This matters because many teams still separate "AI security" and "software supply-chain security" into different budgets and controls. In practice, agent systems blend them:
- agents fetch or stage model artifacts
- pipelines auto-run evaluation or transformation steps
- orchestrators treat artifacts as "data" while hidden execution semantics travel inside serialization formats
If your strategy only governs prompts, you miss this entire class of AI supply chain attack. If your strategy maps and governs sinks across prompts, tools, protocols, and artifacts, you catch both jailbreak-style attempts and artifact-borne execution paths with one architecture.
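One coarse but useful gate, sketched below under stated assumptions: inspect artifact bytes before any framework loads them, and refuse pickle-based files outright. Pickle streams at protocol 2 and above begin with the `PROTO` opcode byte `0x80`, so a prefix check catches the common case (older protocol 0/1 streams need deeper parsing). The policy and function names are ours, not any framework's API.

```python
def artifact_is_pickle(payload: bytes) -> bool:
    # Protocol >= 2 pickles start with b"\x80" followed by the version.
    return payload[:1] == b"\x80"

def assert_safe_artifact(payload: bytes) -> None:
    # Treat "contains a pickle stream" as "contains executable semantics".
    if artifact_is_pickle(payload):
        raise ValueError("refusing pickle-based artifact: code-exec risk")
```

Safer serialization formats for weights (for example, pure-tensor formats) pass this gate because they carry data, not opcodes.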
Incident-to-control mapping: the table that makes this concrete
To avoid abstract debates, map incidents to control failures and required evidence. Here is what that looks like for the patterns we are seeing in 2026:
| Incident Pattern | Failed Assumption | Required Control | Evidence Artifact |
|---|---|---|---|
| MCP/connector SSRF via tenant/header/URL fields | "Routing metadata is harmless" | Destination policy enforcement + allowlisted URL classes | Block/allow decision log with parsed destination class |
| Path traversal in helper utilities | "safe_join guarantees containment" | Post-normalization boundary assertion (realpath + root containment) | Invocation trace showing pre/post path and containment verdict |
| Template/prompt injection into execution-adjacent sinks | "Prompt layer is separate from execution" | Adapter-layer sink integrity checks + structured argument firewall | Provenance map from prompt token to sink argument |
| Event stream/auth parity gaps | "If one endpoint is protected, siblings are too" | Route-family auth parity tests in CI and startup | Route parity report with pass/fail per endpoint |
| Sandbox/flag claim mismatch | "Feature flag implies hard boundary" | Declarative-vs-effective runtime conformance tests | Boot report proving enforced capabilities match declared policy |
This framing helps security teams choose controls based on failure mechanics instead of vendor category labels. None of these rows are addressed by prompt filtering alone.
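The path traversal row is the easiest to turn into code. A minimal sketch of a post-normalization boundary assertion, with hypothetical helper names: resolve the final path after joining, then assert it is still under the workspace root, rather than trusting that the join itself was safe.

```python
import os

def contained_join(root: str, untrusted: str) -> str:
    """Join untrusted input under root; raise if the result escapes it."""
    root_real = os.path.realpath(root)
    # Normalize AFTER joining so "..", symlinks, and absolute segments
    # are resolved before the containment check runs.
    candidate = os.path.realpath(os.path.join(root_real, untrusted))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise ValueError(f"path escapes workspace root: {untrusted!r}")
    return candidate
```

The evidence artifact in the table falls out naturally: log `untrusted`, `candidate`, and the verdict on every invocation.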
How to evaluate "guardrails" without buying theater
Enterprise buyers are starting to ask for evidence, not adjectives. "We have guardrails" is a claim. "Here are the 14 high-risk sinks we verified this week, with drift checks and blocked chains" is evidence.
Five questions to ask in every vendor demo:
- Show me a blocked chain, not a blocked prompt. Ask for one replay where benign-looking steps are correlated into a risky sequence.
- Show sink-level controls. Ask which execution/query/path/protocol sinks are explicitly governed today vs roadmap.
- Show drift detection latency. Ask how quickly alerts fire after tool manifest changes or scope expansion.
- Show boundary assertion tests. Ask for startup/runtime checks proving effective containment.
- Show operator evidence artifacts. Ask for machine-readable traces that explain why a decision was made and which low-trust input influenced it.
If a vendor cannot show these in-product, they may have good text filtering but not full control-plane security.
The category split that actually matters
Guardrail-centric vendors are strongest at language-layer interception. Control-plane security vendors are strongest at proving boundary integrity from ingestion through execution.
Serious enterprise programs need both.
The strategic position for Sunglasses is explicit: yes, language-layer filtering is required. But the defensible moat is verifiable chain control across adapters, sinks, and drift. Not anti-jailbreak theater. Verifiable control of cross-boundary agent behavior.
Want to talk about what this means for your stack? We are building this in the open.
Final take
AI guardrails are real progress. They reduce real risk.
But if they are treated as the whole security story, teams will keep getting surprised by incidents that never looked like classic jailbreak text.
The future of agentic AI security is not a single filter. It is assumption verification across the full path from input to action to impact.
That is what we are building at Sunglasses. If you want to understand where this category is headed, start with our thesis, explore the FAQ, or scan your own agent with the open-source scanner.