Observation Contract Preservation in Tool-Augmented Agents¶

An observation contract is tool output an external system later validates by exact bytes or expiry — preserve it verbatim or the chain breaks silently.

An observation contract is any tool output that an external system will later validate by exact bytes, by a timestamp, or by a one-use rule. When the agent reasons about the artifact instead of carrying it verbatim, the second call breaks — even if every individual step looks correct in isolation. The benchmark that defines the term, ContractBench, found no frontier model clears 80% on contract preservation across 38 evaluated models, with the best score 77.8% (Claude Opus 4.6).

When this pattern applies¶

This pattern is the prompt-and-spec-level fallback for harnesses that pass raw artifacts through the LLM context. Apply when all three hold:

A tool call produces an artifact (URL, token, key, nonce) that a later call will validate
The artifact crosses an LLM turn — it lands in the model's context window before the second call
The validator is external — the harness cannot rewrite the call after the model emits it

Skip when the harness keeps the artifact opaque end-to-end. MCP servers can mark individual results to bypass truncation via the _meta["anthropic/maxResultSizeChars"] annotation — when bytes survive verbatim through the harness, agent-side discipline is redundant.

The two failure axes¶

ContractBench factors agent failures into two orthogonal modes:

Mode	What breaks	Typical artifact	Example failure
Validity	Temporal constraint	Presigned URLs, OAuth tokens, session cookies	Agent calls after `Expires` timestamp
Integrity	Byte constraint	Signed URLs, idempotency keys, opaque handles	Agent re-encodes, normalizes whitespace, or paraphrases

The two are independent — a model that hands back the exact URL can still call past expiry, and a model that calls inside the window can still corrupt the signature. ContractBench scores each axis separately.

How it works¶

Three controls preserve contracts across an LLM turn:

Mark contract-bound fields in the tool description. Name the contract explicitly — presigned_url (opaque, expires 15m, use verbatim) — and the model is more likely to carry it untouched. ContractBench found that feeding the failure-class label back as an in-context signal lifts paired-failure performance by +7.1 pp on GPT-5.1 (ContractBench paper).
Quote artifacts in tool I/O, not in prose. Reference fields by key (response.url) in plans, and only emit the bytes inside the next tool call's argument slot. Reasoning-tuned models hallucinate tool outputs more often than base models (Yin et al., 2025).
Stamp issuance time. Capture issued_at next to the artifact so the agent can reason about freshness. Without an explicit timestamp, long-horizon chains and context compaction strip the original time and the agent calls past expiry.

Diagram¶

graph LR
    A[Tool 1 returns<br>presigned URL<br>+ issued_at] -->|harness| B[Agent context]
    B -->|verbatim arg| C[Tool 2 call]
    B -.->|paraphrased<br>or re-encoded| D[Integrity failure]
    B -.->|missing timestamp<br>compaction stripped| E[Validity failure]
    C --> F[External system<br>validates bytes + expiry]

Why it works¶

Contract failures are mechanical, not capability-bound. ContractBench attributes failure to two causes. First, tokenization and re-emission normalize byte-level fields — URL-encoding, whitespace, smart quotes, base64 padding — breaking signatures even when the model intends to copy verbatim. Second, long-horizon reasoning loses the issuance timestamp, so the freshness check has no anchor and the agent calls past expiry. The +7.1 pp lift from feeding failure-class labels back in-context is direct evidence the model can preserve contracts when told the field is contract-bound, but does not infer this from generic descriptions. The same paper documents non-monotonic scaling across the GPT-5 family — agentic post-training can erode compliance through sycophancy-driven regression. Bigger is not safer. Explicit labels are.

When this backfires¶

Single-step tools with no second call. The artifact has no validator on a follow-up, so preservation adds ceremony without benefit. Idempotency-key plumbing in particular is wasted on read-only chains.
Opaque-handle harnesses. When the harness already mediates artifacts (MCP _meta persistence, sealed-envelope adapters, server-side opaque handle indirection), prompt-level discipline fights the layer below it. The right fix is to push more state behind the harness, not to add more rules (Towards Verifiably Safe Tool Use).
Reasoning-tuned models with deep thinking enabled. Paraphrasing during extended reasoning erodes verbatim instructions (Yin et al., 2025), so only out-of-band storage or a runtime validator survives.
When formal verification is available. Solver-aided runtime checks prove some violations away (Mishra et al., 2026), which is preferable to a prompt-level pattern where the option exists.

Example¶

A retrieval tool returns a presigned S3 URL; a download tool consumes it.

A tool description that leaks contract information:

{
  "name": "get_document_url",
  "description": "Returns a URL for the document.",
  "returns": { "url": "string" }
}

A tool description that names the contract:

{
  "name": "get_document_url",
  "description": "Returns a presigned S3 URL (opaque bytes, expires 15 minutes). Pass the url field verbatim to download_url; do not edit, decode, or paraphrase.",
  "returns": {
    "url": "string (opaque presigned URL — pass verbatim)",
    "issued_at": "ISO-8601 timestamp",
    "expires_in_seconds": "integer"
  }
}

The second form gives the model three signals: a contract label (opaque), a freshness anchor (issued_at), and a verbatim-use directive in the type. These are the same signals the ContractBench failure-class labels carry when fed back in-context.

Key Takeaways¶

Observation contracts fail along two independent axes: validity (expiry) and integrity (bytes) — both must be tested
The pattern is a fallback for harnesses that leak raw artifacts; opaque-handle harnesses make it redundant
Explicit per-field contract labels are doing real work — +7.1 pp on paired ContractBench failures
Reasoning-tuned models with deep thinking erode verbatim-preservation prompts — runtime validation is the only durable fix at that point