
Constraint Drift: Why Safety Must Be Maintained, Not Asserted

Prompt-encoded safety constraints drift across memory, delegation, communication, tool use, audit, and optimization; treat them as runtime state that stays fresh, inherited, enforceable, and auditable.

The Drift Problem

A multi-agent system can produce a compliant final answer while leaking private information through an internal message, delegating authority beyond scope, calling a tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed (Li et al., 2026). The output passes review; the trajectory does not.

Constraints encoded in the same medium as every other prompt token — natural language — face the same degradation pressures: positional decay, paraphrasing during inter-agent forwarding, summarisation during compaction, reward pressure during optimisation. The signal weakens at the rate of ordinary context, but its semantic load is much higher: one weakened clause changes which actions are permitted (Anthropic: effective context engineering).

Six Drift Surfaces

Li et al. (2026) enumerate six runtime dimensions along which constraints drift:

| Surface | Drift mechanism | Concrete failure |
| --- | --- | --- |
| Memory | Long history positional decay; compaction summarisation | Initial spending limit gets buried as conversation grows; agent quotes a higher cap later |
| Delegation | Subordinate agent receives task but not the constraint scope | Orchestrator enforces a deny-list; worker spawned without it calls the denied tool |
| Communication | Constraints encoded in prose get paraphrased across handoffs | Reviewer's "do not approve PRs touching /auth" becomes "be careful with auth PRs" downstream |
| Tool use | Tool gateway operates outside the agent's constraint model | Code-exec tool runs the script the agent generated under a constraint it never saw |
| Audit | Log lacks the constraint state at decision time | Post-hoc review cannot reconstruct why an action was permitted |
| Optimization | Reward signal pulls behavior toward task completion at the cost of constraint adherence | Fine-tuned model trades a small safety margin for measurable utility gains |
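
The delegation and communication rows can be made concrete with a small sketch. It assumes a hypothetical `ConstraintScope` type and `gateway_allows` check (neither comes from Li et al.); the point is only that a typed scope handed to a worker cannot be paraphrased on the way, and that a child scope derived from it can tighten but never loosen the parent's limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintScope:
    """Hypothetical typed constraint scope passed across agent handoffs."""
    denied_tools: frozenset
    spend_cap_usd: float

    def narrow(self, extra_denied=frozenset(), spend_cap_usd=None):
        """Derive a child scope that is at least as strict as this one."""
        if spend_cap_usd is None:
            cap = self.spend_cap_usd
        else:
            # A requested cap above the parent's is clamped to the parent's.
            cap = min(self.spend_cap_usd, spend_cap_usd)
        return ConstraintScope(self.denied_tools | frozenset(extra_denied), cap)


def gateway_allows(scope, tool, cost_usd=0.0):
    """Deterministic check meant to run in the tool gateway, not in the model."""
    return tool not in scope.denied_tools and cost_usd <= scope.spend_cap_usd


# The orchestrator's scope: one denied tool and a spending limit.
parent = ConstraintScope(frozenset({"shell_exec"}), spend_cap_usd=50.0)

# The worker asks for a higher cap; it still inherits the parent's limit.
worker = parent.narrow(extra_denied={"send_email"}, spend_cap_usd=200.0)
```

Because the worker's scope is built from the parent's rather than restated in prose, the deny-list and the spending limit reach the worker unchanged.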

This taxonomy maps cleanly onto the four-mode audit-record divergence invariant and its controls-mapping view (Metere, 2026): F1 gate-bypass surfaces as tool-use and delegation drift, F2 audit-forgery as audit drift, F3 partial failure as memory drift, F4 wrong-target as delegation drift in inheritance chains.

Four Invariant Properties

A constraint that survives the trajectory satisfies four properties simultaneously (Li et al., 2026 §3):

  • Fresh — Re-validated at each decision point against the current state, not read once at the start.
  • Inherited — Propagates through delegation and sub-agent spawning. The child cannot exceed the parent's scope.
  • Enforceable — Implemented in a deterministic runtime channel (gateway, hook, sandbox), not by model adherence to prose.
  • Auditable — The constraint state at the moment of each action is recoverable from the log.

A constraint that fails any one of these has effectively drifted, even if the natural-language statement is still present in context. The four properties are jointly necessary: satisfying three of them does not make a constraint operative.

graph LR
    A[Constraint declared] --> B{Fresh?}
    B -->|no| X[Drifted]
    B -->|yes| C{Inherited?}
    C -->|no| X
    C -->|yes| D{Enforceable?}
    D -->|no| X
    D -->|yes| E{Auditable?}
    E -->|no| X
    E -->|yes| F[Operative]
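
The four checks in the diagram can be sketched as a single gate that evaluates every action. This is a minimal illustration under stated assumptions, not the mechanism from Li et al.: `ConstraintState`, `Gate`, the 60-second freshness window, and the audit record layout are all hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ConstraintState:
    """Hypothetical runtime constraint state checked at each decision point."""
    denied_tools: frozenset
    validated_at: float                      # time of the last re-validation
    parent_denied: frozenset = frozenset()   # deny-list of the delegating agent
    max_age_s: float = 60.0                  # assumed freshness window


@dataclass
class Gate:
    audit_log: list = field(default_factory=list)

    def decide(self, state, tool, now=None):
        now = time.time() if now is None else now
        # Fresh: the state was re-validated recently enough.
        fresh = (now - state.validated_at) <= state.max_age_s
        # Inherited: everything the parent denies is also denied here.
        inherited = state.parent_denied <= state.denied_tools
        # Enforceable: the decision is a deterministic check in code.
        permitted = tool not in state.denied_tools
        allowed = fresh and inherited and permitted  # fail closed
        # Auditable: snapshot the constraint state at decision time.
        self.audit_log.append({
            "tool": tool,
            "at": now,
            "denied_tools": sorted(state.denied_tools),
            "fresh": fresh,
            "inherited": inherited,
            "allowed": allowed,
        })
        return allowed
```

A stale or under-inherited state denies the action even when the tool itself is not on the deny-list, which matches the diagram: any failed check routes to "Drifted" rather than "Operative".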

When Constraint State Governance Is Worth It

Maintaining the four-property invariant adds overhead that grows with system complexity. It is warranted when three conditions compose:

  1. Deep delegation chains. Orchestrator-worker fan-out where subordinate agents make consequential decisions (agent handoff protocols).
  2. Persistent memory across sessions. State that carries between runs creates a trojan-hippo drift surface.
  3. Wide tool surface with consequential actions. Any tool that writes, sends, pays, or shares is a drift target.

Below these thresholds, well-placed component checks suffice. A short-horizon single-agent linter with one tool surface and stateless invocation has no drift surface — its constraints live in the tool gateway, and adding a constraint state object duplicates enforcement without preventing a failure mode. The Lifecycle-Integrated Security Architecture provides the complementary layered-defense view (Lin et al., 2026).

Mapping to Existing Controls

Each invariant property maps to controls already established on the site:

| Property | Realised by |
| --- | --- |
| Fresh | Fail-closed remote settings enforcement, provenance-aware decision auditing |
| Inherited | Task scope as security boundary, scoped credentials via proxy, permission-gated commands |
| Enforceable | Action-selector pattern, CaMeL control/data flow, MCP runtime control plane |
| Auditable | Cryptographic governance audit trail, audit-record divergence invariant |

The contribution of the constraint-drift framing is not new mechanisms but a coverage check: a system that lacks any one row has a drift surface a determined attacker — or a long-running trajectory — will reach.
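
The coverage check can be sketched as a lookup over the table above. The `deployment` mapping and the function name `uncovered_properties` are hypothetical; the control names are taken from the table, and which controls a real system has in place is an assumption made for illustration.

```python
REQUIRED_PROPERTIES = ("fresh", "inherited", "enforceable", "auditable")


def uncovered_properties(controls):
    """Return the invariant properties that have no control mapped to them."""
    return [p for p in REQUIRED_PROPERTIES if not controls.get(p)]


# Hypothetical deployment: three rows covered, no audit control in place.
deployment = {
    "fresh": ["fail-closed remote settings enforcement"],
    "inherited": ["scoped credentials via proxy"],
    "enforceable": ["action-selector pattern"],
    "auditable": [],
}

gaps = uncovered_properties(deployment)
```

Here `gaps` contains only `"auditable"`: the system enforces its constraints but cannot reconstruct afterwards why an action was permitted.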

Key Takeaways

  • Constraints encoded only in natural-language prompts drift at the rate of ordinary context decay; the four-property invariant moves them out of the lossy channel into deterministic runtime state.
  • Six surfaces — memory, delegation, communication, tool use, audit, optimization — exhaust the trajectory dimensions along which drift can occur (Li et al., 2026).
  • The four properties (fresh, inherited, enforceable, auditable) are necessary together; one failing leaves an open drift surface even if the prose is intact.
  • Apply the framework when delegation depth, memory persistence, and tool surface compose. Below that threshold, a typed tool gateway plus an audit log is sufficient.