AI Agent Hardening Manual · Chapter 02

Chapter 02: The Hardening Checklist

Published by the Sunglasses team · April 2026

AI agent hardening is the practice of reducing what an agent can access, what its tools are allowed to do, and what runtime signals are allowed to influence future actions.

This chapter turns the threat model from Chapter 01: AI Agent Security 101 into an operator-grade checklist for production systems. If Chapter 01 explains why unsafe content becomes unsafe action, Chapter 02 explains what to review before an agent reads, calls, follows, retries, or sends anything in the first place.

The important shift is simple: hardening is not only about blocking obviously dangerous prompts. It is also about controlling the ordinary-looking workflow details that can quietly gain authority during execution: a callback that redirects the next step, a tool response that smuggles action hints, an MCP handoff that stays in scope on paper but reaches the wrong destination, or an outbound request pattern that starts looking more like command-and-control than normal work.

Use this chapter as a shipping checklist. It is written for teams running coding agents, support agents, workflow assistants, MCP-connected agents, or any system where text, tool metadata, and live callbacks can alter behavior after the first permission decision has already been made.


Table of contents

  1. Quick answer
  2. Threat-to-control matrix
  3. Implementation checklist
  4. Three case studies
  5. Validation tests before deploy
  6. What to measure in production
  7. Frequently asked questions

Quick answer

An AI agent hardening checklist should include identity verification, scope reduction, schema validation, sandboxing, monitoring, callback trust review, endpoint controls, suspicious outbound behavior detection, MCP/tool-handoff review, and validation tests before production deploy.

Most teams stop too early. They set credentials, narrow permissions, and maybe add sandboxing or guardrails. Those are necessary controls, but they do not finish the job. The residual risk lives in the workflow itself: whether the system should continue trusting what it just read, what the tool just suggested, or where the callback just pointed.

That is why this chapter adds runtime trust directly into hardening. A hardened system should not only ask, "is this tool allowed?" It should also ask, "should this workflow still be trusted to take this action here, now, after this new signal arrived?"

Plain-language explainer

In plain language, AI agent hardening means making sure the agent cannot quietly gain new authority just because it encountered new text, a helpful-looking tool response, a callback, or a workflow hint mid-run. You harden the system by shrinking what the agent can reach, validating what it is allowed to read and use, and checking whether each new step still deserves trust before the workflow acts on it.

That matters because most production failures are not dramatic jailbreak screenshots. They are ordinary-looking workflow moments: a retry path that changes destination, an MCP handoff that stays technically valid while broadening scope, or an outbound request pattern that starts behaving more like guidance and control than a harmless lookup.

Threat-to-control matrix

The safest hardening checklists map each trust failure to a concrete control, not just a general security principle.

Threat: Prompt injection
  What it looks like: Instructions hidden in docs, tickets, tool output, or retrieved text
  Primary control: Pre-ingestion scanning + context review
  Why it matters: Stops unsafe text from becoming part of the agent's trusted reasoning path

Threat: Callback trust drift
  What it looks like: An approved workflow receives a next-step URL, queue, or retry directive that changes the action path
  Primary control: Callback review + destination allowlisting
  Why it matters: Prevents the workflow from inheriting authority from a later signal

Threat: Tool or MCP handoff poisoning
  What it looks like: Tool descriptions, MCP metadata, or handoff responses carry action-shaping guidance
  Primary control: Tool metadata review + schema validation + trust-boundary checks
  Why it matters: Keeps "valid" tool responses from silently broadening behavior

Threat: Outbound control loss
  What it looks like: Routine-looking network traffic starts reaching unexpected destinations or cadences
  Primary control: Endpoint controls + suspicious cadence detection
  Why it matters: Detects remote influence, exfiltration, or beaconing-style behavior

Threat: Permission overhang
  What it looks like: An agent can still do more than the task requires, even if everything is technically authenticated
  Primary control: Least privilege + split read/write paths
  Why it matters: Reduces the number of unsafe branches available once trust slips

Threat: Schema ambiguity
  What it looks like: Extra fields, hidden instructions, or loosely typed responses survive into action logic
  Primary control: Strict schema validation
  Why it matters: Prevents accidental authority transfer through messy structured data

The reason to write the matrix this way is practical. Teams remember concrete control pairings better than abstract advice. If the threat is a callback gaining hidden authority, the answer is not "be careful." The answer is destination allowlisting, callback review, and runtime trust checks on where the workflow goes next.

Implementation checklist

A production hardening checklist should move in the same order that real trust accumulates inside the workflow.

1) Verify identity before capability

Know which tools, MCP servers, queues, APIs, callback domains, and storage systems the agent is allowed to touch. Do not start the review at the prompt layer if the surrounding identity surface is still vague. If the system cannot clearly answer who the agent can talk to, the rest of the checklist is downstream guesswork.
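One way to make the identity surface answerable is to declare it as data rather than leaving it implicit in configuration sprawl. The sketch below is a minimal illustration; the field names and identifiers are hypothetical, not a prescribed format.

```python
# Hypothetical sketch: declare the agent's identity surface up front,
# so "who can the agent talk to?" has a single, reviewable answer.
from dataclasses import dataclass

@dataclass(frozen=True)
class IdentitySurface:
    tools: frozenset            # tool names the agent may invoke
    mcp_servers: frozenset      # approved MCP server identifiers
    callback_domains: frozenset # domains callbacks may point at
    storage: frozenset          # storage systems the agent may touch

    def allows_tool(self, name: str) -> bool:
        return name in self.tools

# Example surface for a narrow inventory workflow (illustrative names).
SURFACE = IdentitySurface(
    tools=frozenset({"inventory.lookup", "ticket.create"}),
    mcp_servers=frozenset({"mcp://tickets.internal"}),
    callback_domains=frozenset({"queue.internal.example"}),
    storage=frozenset({"s3://agent-scratch"}),
)
```

Anything not named here is simply not part of the agent's world, which makes the rest of the checklist a review of a finite list rather than downstream guesswork.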

2) Reduce scopes to the task, not the platform maximum

Least privilege is still table stakes. Separate read paths from write paths. Separate lookup tools from execution tools. Separate staging from production endpoints. A narrow workflow is easier to trust because fewer actions remain available after a bad instruction or misleading callback appears.
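The read/write split can be enforced mechanically at grant time. A minimal sketch, assuming hypothetical tool names, where the default grant is read-only and write authority must be requested explicitly:

```python
# Hypothetical sketch: separate read and write capabilities so a task
# can be granted lookups without inheriting execution authority.
READ_TOOLS = {"inventory.lookup", "ticket.search"}
WRITE_TOOLS = {"ticket.create", "inventory.adjust"}

def tools_for_task(task_needs_write: bool) -> set:
    # The write set is granted only when the task explicitly requires
    # it; every other task runs with lookups alone.
    return READ_TOOLS | (WRITE_TOOLS if task_needs_write else set())
```

The design point is the default: a misleading callback arriving mid-run finds fewer actions available in a read-only grant than in a platform-maximum one.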

3) Validate structure, not just sentiment

Schema validation matters because many unsafe actions arrive as normal-looking structured data. Extra fields, hidden endpoint hints, malformed retry objects, or ambiguous action descriptors should be rejected before they influence downstream logic. A response that "looks helpful" but does not match the contract should not get to vote on the next action.
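A strict validator rejects both missing fields and unknown ones, which is the part permissive parsers skip. The sketch below uses a hypothetical response contract; the principle is that an unexpected field fails the whole response rather than riding along.

```python
# Hypothetical sketch: strict validation of a structured tool response.
# Unknown fields are rejected outright rather than passed downstream,
# so a "helpful" extra endpoint hint cannot vote on the next action.
ALLOWED_FIELDS = {"status": str, "item_id": str, "count": int}

def validate_response(payload: dict) -> dict:
    extra = set(payload) - set(ALLOWED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields rejected: {sorted(extra)}")
    for name, expected_type in ALLOWED_FIELDS.items():
        if name not in payload:
            raise ValueError(f"missing field: {name}")
        if not isinstance(payload[name], expected_type):
            raise ValueError(f"bad type for field: {name}")
    return payload
```

Under this contract, a response carrying a surprise "next_endpoint" field never reaches action logic, no matter how plausible it looks.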

4) Sandbox execution, but do not confuse containment with trust

Sandboxing limits blast radius. It does not tell you whether the workflow should trust a new runtime instruction. Use isolation for code execution, browser automation, and untrusted transformations, but do not mistake a contained environment for a fully hardened workflow. Unsafe trust can still exist inside a small box.

5) Review trust-bearing text surfaces

Prompts are only one text surface. Hardening should also review tool descriptions, runbooks, YAML, policy notes, fallback instructions, troubleshooting docs, connector metadata, callback payloads, and retrieved content. If the workflow reads it and uses it to decide what happens next, it belongs inside the threat model.

6) Add callback trust and endpoint controls

Callbacks should not automatically inherit the authority of the step that came before them. If an approved action triggers a next-step pointer, the new destination should still be checked. The safest pattern is explicit allowlisting plus policy review when a callback, redirect, or retry directive tries to move the workflow outside the expected path.
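Concretely, the destination check is a fresh decision made on the callback itself, not on the step that produced it. A minimal sketch, with a hypothetical allowlist:

```python
# Hypothetical sketch: a callback's destination is re-checked as a new
# trust decision instead of inheriting the prior step's approval.
from urllib.parse import urlparse

CALLBACK_ALLOWLIST = {"queue.internal.example", "retry.internal.example"}

def callback_allowed(url: str) -> bool:
    parsed = urlparse(url)
    # Require https and an explicitly allowlisted host; anything else
    # should go to policy review rather than silent continuation.
    return parsed.scheme == "https" and parsed.hostname in CALLBACK_ALLOWLIST
```

A False here does not have to mean the workflow dies; it means the redirect becomes a reviewable event instead of an automatic next step.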

7) Treat outbound behavior as a security signal

Outbound review belongs in hardening because an agent can look perfectly compliant while still developing suspicious habits. Unexpected destinations, repeated heartbeat-like polls, enrichment calls that start carrying decision-changing payloads, or backup routes that quietly become the new default can all indicate the workflow is being steered.
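One cheap signal for the heartbeat pattern is cadence regularity: fixed-timer polling produces suspiciously uniform gaps between requests to one destination. This is a heuristic sketch, not a detection product; the threshold and minimum sample size are illustrative assumptions.

```python
# Hypothetical sketch: flag heartbeat-like outbound cadence. Highly
# regular inter-request gaps to one destination resemble beaconing,
# whereas normal lookups tend to arrive irregularly.
from statistics import mean, pstdev

def looks_like_beaconing(timestamps, min_requests=5, max_jitter_ratio=0.1):
    if len(timestamps) < min_requests:
        return False
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    avg = mean(gaps)
    if avg <= 0:
        return False
    # Low jitter relative to the average gap suggests a fixed timer.
    return pstdev(gaps) / avg < max_jitter_ratio
```

A flag from a check like this is a prompt for review, not proof of compromise: some legitimate integrations poll on timers too, which is why it belongs next to destination allowlisting rather than replacing it.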

8) Make tool-call gating contextual, not binary

An allowed tool is not automatically a trustworthy tool call in every context. The same tool may be safe after one input and unsafe after another. Hardening is stronger when it asks whether this call still makes sense after the latest prompt, tool output, callback chain, or endpoint shift.
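The contextual gate can be sketched as a small decision function: the static allowlist answers "ever allowed?", and the context signals answer "still trusted now?". The signal names below are hypothetical stand-ins for whatever your runtime actually tracks.

```python
# Hypothetical sketch: a tool call is gated on current workflow context,
# not just on a static allowlist. Signal names are illustrative.
def gate_tool_call(tool: str, context: dict) -> str:
    if tool not in context.get("allowed_tools", set()):
        return "deny"        # never allowed for this workflow
    if context.get("untrusted_text_seen") and tool.endswith(".write"):
        return "review"      # allowed tool, but trust has shifted
    if context.get("endpoint_drift"):
        return "review"      # destination changed since approval
    return "allow"
```

The three-valued outcome matters: "review" lets the system escalate an allowed-but-suspicious call instead of forcing a binary choice between blocking useful work and trusting blindly.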

9) Log enough detail to reconstruct trust drift

Production logs should show tool calls, callback destinations, retries, endpoint changes, permission denials, and relevant policy decisions. If the workflow behaves strangely, the team should be able to answer not only what happened but what new signal the agent started trusting right before the bad action.
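A structured trust-event record makes that reconstruction tractable. The sketch below is a minimal shape, assuming hypothetical field names; the point is that every trust-bearing event carries its kind, its detail, and the decision taken.

```python
# Hypothetical sketch: log each trust-bearing event with enough context
# to reconstruct, after an incident, which new signal the workflow
# started trusting right before a bad action.
import json
import time

def log_trust_event(kind, detail, decision, log=print):
    record = {
        "ts": time.time(),
        "kind": kind,          # e.g. "tool_call", "callback", "retry"
        "detail": detail,      # destination, tool name, schema verdict
        "decision": decision,  # "allow", "deny", or "review"
    }
    log(json.dumps(record))
    return record
```

Grepping such a log for the last "review" or destination change before an anomaly is often the fastest way to find where trust drifted.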

Three case studies hardening checklists often miss

Real hardening failures usually happen in ordinary-looking moments, not dramatic movie scenes.

Case study 1: the approved callback that quietly changed authority

An internal operations agent completes an approved inventory lookup and receives a callback telling it to continue on a secondary service because the primary queue is degraded. The callback is formatted correctly. The service name looks familiar. The credentials still work. Everything appears operational.

The miss is that the workflow treats the callback as an extension of the original approval. It is not. It is a new trust decision. Hardening would require the secondary destination to be on an allowlist, the callback schema to be validated, and the next action to be reviewed as a separate authority step rather than a continuation of the first one.

Case study 2: normal-looking outbound traffic became remote influence

A coding agent begins making periodic fetches to what looks like a helper endpoint for dependency advice and retry guidance. The cadence gradually tightens. The responses begin shaping what package source to prefer and when to retry a failed action. No single request looks extreme. The system still appears to be doing useful work.

The hardening failure is treating outbound traffic as an operations detail rather than a trust surface. A stronger checklist would flag unexpected destination drift, unusual heartbeat patterns, and repeated callback guidance that starts steering action selection. This is where outbound trust matters: not every network call is malicious, but some of them quietly become authority-bearing.

Case study 3: an MCP handoff stayed in scope but still reached the wrong place

An agent is allowed to use an MCP-connected tool for ticket creation. The tool itself is approved. The schema is mostly valid. But a metadata field inside the handoff starts suggesting an alternate project, a new endpoint, or a broader operation than the task originally required. On paper, the agent is still talking to an approved system. In practice, the trust boundary has shifted.

The checklist lesson is that scope alone is not enough. Approved tools can still carry poisoned metadata, discovery drift, or authority-expanding hints. That is why hardening needs runtime review of tool outputs, not just a one-time approval stamp during setup. For a wider map of this problem space, see the MCP Attack Atlas.

Validation tests before deploy

A hardening checklist is incomplete if it does not include tests that try to break the trust model on purpose.

The goal is not to prove the system is perfect. The goal is to discover whether the trust model fails cleanly and observably before production traffic discovers it for you.
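A pre-deploy test in this spirit attacks the trust model directly and asserts that it fails closed. The sketch below tests a stand-in callback policy; your real checker and hostnames will differ, but the shape of the test is the point.

```python
# Hypothetical sketch: a pre-deploy test that tries to break the trust
# model on purpose. The checker under test is a stand-in for a real
# callback policy; what matters is that failure is clean and observable.
ALLOWED_HOSTS = {"queue.internal.example"}

def callback_decision(host: str) -> str:
    return "allow" if host in ALLOWED_HOSTS else "review"

def test_unknown_callback_fails_closed():
    # A well-formed callback to an unknown host must not be allowed.
    assert callback_decision("backup-helper.example") == "review"

def test_known_callback_still_allowed():
    # Hardening should not break the legitimate path.
    assert callback_decision("queue.internal.example") == "allow"

test_unknown_callback_fails_closed()
test_known_callback_still_allowed()
```

The same pattern extends to the other controls: feed a response with an extra field into the schema validator, a drifted endpoint into the tool gate, and confirm each one denies or escalates rather than silently continuing.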

What to measure in production

Production hardening should be measured by trust-boundary behavior, not just uptime or model quality: how often callbacks are denied or escalated to review, how many tool responses fail schema validation, how frequently outbound destinations drift from the expected set, and how many allowed tool calls get re-gated after new context arrives.

If your metrics only tell you the agent completed tasks quickly, you do not yet know whether it completed them safely. Hardening success is not just low friction. It is controlled authority.

How Sunglasses catches it

Sunglasses is built for the layer many hardening checklists still leave implicit: the runtime decision about whether a tool call, callback chain, MCP handoff, or outbound action should still be trusted after new signals arrive. In practice, that means analyzing the text, metadata, and workflow context that shape agent actions, then surfacing where ordinary-looking runtime inputs start acting like authority.

That complements the rest of the hardening stack instead of replacing it. Teams still need identity controls, scopes, schemas, gateways, sandboxes, and logging. Sunglasses helps close the action-time gap between "this tool is allowed" and "this specific next action should still be trusted right now."

Frequently asked questions

What should an AI agent hardening checklist include?

An AI agent hardening checklist should include identity verification, scope reduction, schema validation, sandboxing, monitoring, callback trust review, endpoint controls, suspicious outbound behavior detection, MCP/tool-handoff review, and validation tests before production deploy.

Is sandboxing enough to harden an AI agent?

No. Sandboxing limits blast radius, but it does not fully decide whether a callback, tool handoff, or outbound action should still be trusted once the workflow is already in motion.

Why does runtime trust belong in a hardening checklist?

Runtime trust belongs in a hardening checklist because many unsafe agent decisions happen after access is already granted, when a tool response, callback, retry path, or endpoint hint quietly gains authority over the next action.

How is MCP security part of AI agent hardening?

MCP security is part of AI agent hardening because MCP servers, tool descriptions, discovery flows, callbacks, and tool outputs all create trust boundaries where authority can expand, drift, or become poisoned.

About Sunglasses

Sunglasses helps teams inspect runtime trust in AI agents: whether a workflow should still trust this tool call, callback, MCP handoff, or outbound action after new context appears. The company focuses on action-time security rather than stopping at access control alone.


Next: Re-read Chapter 01 for the threat model, browse the FAQ for implementation questions, and use the manual overview as the map for the remaining chapters.