Dual-Write Append-Mirror for Agent Transcript Externalization
Write the agent transcript to local disk first; forward each batch to a remote store as a best-effort mirror that the agent never blocks on.
Dual-write append-mirror is the externalization shape the Claude Agent SDK ships: the claude subprocess "always writes to local disk first; the SDK then forwards each batch to append()" on an attached SessionStore, with the store positioned as "a mirror, not a replacement" (Claude Agent SDK: Persist sessions to external storage). It applies to self-hosted agents that need local execution context (project filesystem, locally-running MCP servers) and off-host durability for at least one of: multi-host resume on shared-nothing replicas, container reclamation on restart or scale-down, or compliance retention in operator-governed storage (Session storage; Hosting the Agent SDK). When none of those hold — a single developer laptop, sessions that never migrate — the local JSONL alone is the source of truth; when local execution context is not load-bearing, Anthropic's Managed Agents treats the session log as the only durable substrate and removes the dual-write surface entirely (Anthropic: Managed Agents).
The Adapter Surface
A SessionStore is "an object with two required methods, append and load, and three optional methods" (Session storage):
| Method | Required | Called when |
|---|---|---|
| append | Yes | After each batch of transcript entries is written locally |
| load | Yes | Once before the subprocess spawns, when resume is set; returns null for unknown sessions |
| listSessions | No | By listSessions() and by continue: true; SDK throws if called and not implemented |
| delete | No | By deleteSession(); deleting the main key "must cascade to all subkeys for that session"; if undefined, deletion is a no-op, "which suits append-only backends" |
| listSubkeys | No | During resume, to discover subagent transcripts; without it, "only the main transcript is restored" |
The minimum surface is intentionally narrow — two methods make adapter authorship cheap, and a 13-contract conformance suite ships with the SDK to validate behaviour (Session storage; examples/session-stores). Reference adapters exist for S3 (one JSONL part file per append()), Redis (RPUSH/LRANGE per transcript), and Postgres (one row per entry in a jsonb table) (Session storage).
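As a rough illustration of how small the required surface is, the sketch below implements only append and load over an in-memory Map. The SessionKey and Entry shapes, the class name, and the key-flattening scheme are assumptions made for this example, not the SDK's published types; check the real signatures in the SDK before adapting it.

```typescript
// Hypothetical shapes for illustration; the SDK's real types may differ.
type SessionKey = { sessionId: string; subpath?: string };
type Entry = Record<string, unknown>;

class InMemorySessionStore {
  private readonly transcripts = new Map<string, Entry[]>();

  // Flatten the key so the main transcript and each subkey get separate slots.
  private slot(key: SessionKey): string {
    return key.subpath ? `${key.sessionId}/${key.subpath}` : key.sessionId;
  }

  // Append-only: batches are concatenated in arrival order, never rewritten.
  async append(key: SessionKey, entries: Entry[]): Promise<void> {
    const existing = this.transcripts.get(this.slot(key)) ?? [];
    this.transcripts.set(this.slot(key), [...existing, ...entries]);
  }

  // Unknown sessions return null rather than an empty array.
  async load(key: SessionKey): Promise<Entry[] | null> {
    return this.transcripts.get(this.slot(key)) ?? null;
  }
}
```

An in-memory store like this is only useful for tests, since it offers no off-host durability; the point is that a real backend needs little more than these two methods to get started.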
Mirror-Failure Semantics
When the remote store fails, the agent does not. "If append() rejects or times out, the error is logged, a { type: "system", subtype: "mirror_error" } message is emitted into the iterator, and the query continues. The local transcript is already durable on disk, so a store outage does not interrupt the agent or lose data locally. Batches that fail are not retried, so monitor for mirror_error if you need to detect store data loss" (Session storage).
Three operational consequences follow from the no-retry policy:
- Alerting on mirror_error is the only signal that the remote copy has drifted from local truth.
- The adapter is responsible for any retry semantics it wants; the SDK will not call append() again on the same batch.
- An adapter that silently swallows duplicate-key errors (e.g., a Postgres adapter that deduplicates at insert time) can drop data while emitting a mirror_error that an operator may ignore.
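Because retries are left to the adapter, one option is to wrap the backend write in a bounded retry before letting the failure surface. The sketch below is a minimal version of that idea; withRetry, AppendFn, and the default attempt count and delay are hypothetical names and values chosen for illustration, not SDK settings.

```typescript
type Entry = Record<string, unknown>;
type AppendFn = (key: string, entries: Entry[]) => Promise<void>;

// Wraps a backend append with bounded retries and exponential backoff.
// maxAttempts and baseDelayMs are illustrative defaults, not SDK options.
function withRetry(append: AppendFn, maxAttempts = 3, baseDelayMs = 50): AppendFn {
  return async (key, entries) => {
    let lastError: unknown;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        await append(key, entries);
        return; // success: stop retrying
      } catch (err) {
        lastError = err;
        if (attempt < maxAttempts) {
          // Back off 1x, 2x, 4x ... the base delay between attempts.
          const delay = baseDelayMs * 2 ** (attempt - 1);
          await new Promise<void>((resolve) => setTimeout(resolve, delay));
        }
      }
    }
    // Rethrow so the caller still observes the failure after the last attempt.
    throw lastError;
  };
}
```

Two caveats apply. Retried writes need to be idempotent in the backend, for example keyed by an entry identifier, or a retry after a partial success will duplicate entries. The total retry time also has to fit inside whatever timeout applies to append(), since a timed-out call is reported as a failure regardless of what the adapter does afterwards.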
Compaction vs Raw: Two Reads, Two Consumers
The dual-write shape forces a clean split between what the agent sees on resume and what the audit trail holds. "getSessionMessages({ sessionStore }) returns the linked message chain the agent would see on resume. After auto-compaction, earlier turns are replaced by a summary, so a session whose store holds 503 raw entries may return 18 messages from getSessionMessages. For the full raw history, including pre-compaction turns and metadata entries, call store.load(key) directly" (Session storage).
Two consumers, two reads:
- Resume engine → getSessionMessages() → post-compaction chain.
- Audit trail, compliance review, debugging → store.load(key) → raw history.
A team that conflates the two will misread incidents — the documentation makes the distinction explicit, but the surface is non-obvious until something goes wrong.
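A toy model can make the gap between the two reads concrete. The sketch below assumes entries are linked by a parentUuid field and that a compaction summary starts a new chain; both the RawEntry shape and the linkedChain function are hypothetical simplifications for illustration, not the SDK's actual data model or algorithm.

```typescript
// Hypothetical entry shape; real transcript entries carry more fields.
type RawEntry = {
  uuid: string;
  parentUuid: string | null;
  kind: "message" | "summary" | "metadata";
  text: string;
};

// Walk parent links back from the newest non-metadata entry.
// Anything not reachable from that entry is left out of the result.
function linkedChain(raw: RawEntry[]): RawEntry[] {
  const byUuid = new Map<string, RawEntry>();
  for (const entry of raw) byUuid.set(entry.uuid, entry);

  const chain: RawEntry[] = [];
  let cursor: RawEntry | undefined = [...raw].reverse().find((e) => e.kind !== "metadata");
  while (cursor) {
    chain.unshift(cursor);
    cursor = cursor.parentUuid ? byUuid.get(cursor.parentUuid) : undefined;
  }
  return chain;
}

const raw: RawEntry[] = [
  { uuid: "u1", parentUuid: null, kind: "message", text: "first question" },
  { uuid: "a1", parentUuid: "u1", kind: "message", text: "first answer" },
  { uuid: "s1", parentUuid: null, kind: "summary", text: "summary of earlier turns" },
  { uuid: "u2", parentUuid: "s1", kind: "message", text: "follow-up" },
  { uuid: "a2", parentUuid: "u2", kind: "message", text: "latest answer" },
  { uuid: "m1", parentUuid: null, kind: "metadata", text: "title update" },
];

console.log(raw.length, linkedChain(raw).length); // prints "6 3"
```

In this model the raw read returns all six entries, while the chain read returns only the summary and the two turns after it. An incident review that needs the first question and answer has to go to the raw history.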
Fork Is Not Byte-Copy
forkSession reads source entries, "rewrites every sessionId field and remaps message UUIDs, then appends the transformed entries under a new key. An adapter-level copy or CopyObject shortcut would produce a transcript that still references the old session ID, so the SDK does not use one" (Session storage). Custom adapters that try to optimise forks with backend-native copy primitives (S3 CopyObject, Postgres INSERT INTO ... SELECT) break the transcript silently. The conformance suite catches this — running it is part of adapter ownership.
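The rewrite step described above can be sketched in a few lines. The Entry shape and the forkEntries function below are hypothetical and simplified; the sketch only shows why a byte-level copy cannot work, since every identifier and every parent link has to change together.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical entry shape for illustration only.
type Entry = { uuid: string; parentUuid: string | null; sessionId: string; text: string };

function forkEntries(source: Entry[], newSessionId: string): Entry[] {
  // First pass: allocate a fresh UUID for every source entry.
  const remap = new Map<string, string>();
  for (const entry of source) remap.set(entry.uuid, randomUUID());

  // Second pass: rewrite the session ID and re-point parent links
  // at the new UUIDs so the forked chain is self-contained.
  return source.map((entry) => ({
    ...entry,
    sessionId: newSessionId,
    uuid: remap.get(entry.uuid)!,
    parentUuid: entry.parentUuid === null ? null : (remap.get(entry.parentUuid) ?? null),
  }));
}
```

The source array is left untouched, and the result would then be appended under the new session's key. A backend-native copy skips both passes, which is exactly the failure mode the paragraph above warns about.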
Flush Mode and Retention
The default session_store_flush: "batched" mode waits for end-of-turn before delivering frames; v0.1.73 added an "eager" mode that "delivers frames to SessionStore.append() in near-real-time instead of waiting for the end-of-turn flush, enabling live-tailing UIs, cross-process resume, and crash-durability use cases" (claude-agent-sdk-python CHANGELOG). Eager trades remote-write volume for live observability — on per-request-priced backends (S3 PUT, DynamoDB write capacity) the bill scales with agent verbosity.
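A back-of-the-envelope estimate shows how the write volume can diverge. The sketch below assumes an upper-bound model in which batched mode issues one append() per turn and eager mode issues one per frame, and that each append() maps to one billable request (as with one part file per append on S3). Real batching will fall somewhere in between, and the function names, frame counts, and price are illustrative inputs, not measured values.

```typescript
type FlushMode = "batched" | "eager";

// Upper-bound model: batched = one append per turn, eager = one append per frame.
function estimateAppendCalls(turns: number, framesPerTurn: number, mode: FlushMode): number {
  return mode === "batched" ? turns : turns * framesPerTurn;
}

// Converts a request count into cost for a backend priced per 1,000 requests.
function estimateRequestCost(appendCalls: number, pricePerThousandRequests: number): number {
  return (appendCalls / 1000) * pricePerThousandRequests;
}

// Example inputs (assumptions): 200 turns, 25 frames per turn.
const batchedCalls = estimateAppendCalls(200, 25, "batched"); // 200
const eagerCalls = estimateAppendCalls(200, 25, "eager"); // 5000
console.log({ batchedCalls, eagerCalls });
```

Under these assumptions a single session produces 25 times as many write requests in eager mode. Whether that matters depends on session volume and the backend's pricing, so plug in your own numbers before choosing a mode.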
Retention sits on the adapter, not the SDK: "The SDK never deletes from your store on its own. Retention is the adapter's responsibility: implement TTLs, S3 lifecycle policies, or scheduled cleanup according to your compliance requirements. Local transcripts under CLAUDE_CONFIG_DIR are swept independently by the cleanupPeriodDays setting" (Session storage).
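For backends without native TTLs or lifecycle rules, the scheduled-cleanup option might look like the sketch below. It assumes a Map-backed store whose keys are either a session ID or a session ID followed by a slash and a subkey, and that each slot records its last write time; sweepExpired, Stored, and that key layout are hypothetical choices for this example.

```typescript
// Hypothetical storage slot: last write timestamp (ms since epoch) plus entries.
type Stored = { lastAppendAt: number; entries: unknown[] };

const DAY_MS = 24 * 60 * 60 * 1000;

// Deletes every key (main transcript and subkeys) of sessions that have been
// idle longer than the retention window. Returns the expired session IDs.
function sweepExpired(
  store: Map<string, Stored>,
  retentionDays: number,
  now: number = Date.now(),
): string[] {
  const cutoff = now - retentionDays * DAY_MS;

  // A session's age is judged by its most recent write across all of its keys.
  const newestWrite = new Map<string, number>();
  store.forEach((value, key) => {
    const sessionId = key.split("/")[0];
    newestWrite.set(sessionId, Math.max(newestWrite.get(sessionId) ?? 0, value.lastAppendAt));
  });

  const expired: string[] = [];
  newestWrite.forEach((timestamp, sessionId) => {
    if (timestamp < cutoff) expired.push(sessionId);
  });

  // Cascade: remove the main key and all subkeys together.
  const expiredSet = new Set(expired);
  Array.from(store.keys()).forEach((key) => {
    if (expiredSet.has(key.split("/")[0])) store.delete(key);
  });
  return expired;
}
```

Judging age per session rather than per key avoids deleting a main transcript while one of its subagent transcripts is still being written, which keeps the cascade consistent with the delete contract.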
Why It Works
Dual-write append-mirror works because it inverts the dependency between the agent's liveness and the durability layer's liveness. The authoritative write lands on a single substrate (local JSONL) before any secondary copy is attempted, so a remote-store outage degrades replication but not the primary write — the standard write-ahead-log replication shape (Architecture Weekly: The Write-Ahead Log). The agent's loop sees the local write succeed and continues; the remote write happens asynchronously, and its failure is surfaced as data (a mirror_error system message) rather than as a control-flow interruption. Hosting docs name this directly as one of three things to know about SessionStore: "Mirror, not replacement: the subprocess writes to local disk first, and the store receives a copy of each batch. Local writes remain authoritative" (Hosting the Agent SDK). The naive alternative — store-as-replacement, where the agent blocks on the remote round-trip — couples agent liveness to store liveness; a slow store stalls the loop and a flaky store kills sessions. The dual-write inversion preserves the unconditional fast-path and pushes the consistency burden onto the operator's monitoring.
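The inversion can be shown in miniature. In the sketch below an in-memory array stands in for the local JSONL file, and a mirror failure or timeout is returned as a value instead of thrown. writeBatch, withTimeout, the error field on the returned message, and the timeout default are hypothetical names and values for illustration; the sketch also awaits the mirror so the outcome is easy to inspect, whereas a non-blocking variant would queue the batch instead.

```typescript
type Entry = Record<string, unknown>;
type MirrorError = { type: "system"; subtype: "mirror_error"; error: string };

// Rejects if the wrapped promise does not settle within timeoutMs.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("mirror append timed out")), timeoutMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Local write first; the mirror write can fail without failing the caller.
async function writeBatch(
  localLog: Entry[],
  mirrorAppend: (entries: Entry[]) => Promise<void>,
  batch: Entry[],
  timeoutMs = 1000,
): Promise<MirrorError | null> {
  // Stand-in for the authoritative local write, which happens unconditionally.
  localLog.push(...batch);
  try {
    await withTimeout(mirrorAppend(batch), timeoutMs);
    return null;
  } catch (err) {
    // Failure becomes data for the operator, not an exception for the agent loop.
    return { type: "system", subtype: "mirror_error", error: String(err) };
  }
}
```

Swapping the order of the two writes, or letting the catch block rethrow, turns this back into the store-as-replacement design, where the caller's progress depends on the remote store being healthy.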
When This Backfires
- Atomic-coupling expectations. The general "dual-write is an anti-pattern" critique targets designs that need atomicity across two stores (Confluent: The Dual-Write Problem). This pattern is not symmetric: local is authoritative, mirror is best-effort. Treating the two as needing to agree on every write recreates the anti-pattern; transactional outbox or CDC is the right shape if you genuinely need atomicity.
- Append idempotency violations in the adapter. The SDK does not retry failed batches. An adapter that swallows duplicate-key errors at insert time loses data while logging a mirror_error that an operator may ignore.
- CLAUDE.md and working-directory artifacts are not mirrored. Resuming on a different host with the transcript intact but no CLAUDE.md yields a behaviourally different agent (Hosting the Agent SDK). Mount a shared volume or sync those separately; the pattern delivers transcript portability, not full agent portability.
- Compaction divergence between resume and audit consumers. Anyone assuming getSessionMessages() and store.load() return the same data will misread incidents; the 503-vs-18 gap is a documented surface that bites teams who do not internalise it.
- Single-tenant, single-host deployments. When the host is a developer laptop and the session never migrates, the local JSONL is already the source of truth, and the mirror adds a failure surface (mirror_error) with no resume benefit. The pattern pays for itself only when at least one of {multi-host resume, container reclamation, compliance retention} is real.
- Incompatible options. "Because the mirror depends on local writes, sessionStore cannot be combined with persistSession: false; the SDK throws if you set both. It also throws if combined with enableFileCheckpointing, since file-history backup blobs are written directly to local disk and are not mirrored to the store" (Session storage).
- Managed-Agents-style centralised durability fits better. When local execution context is not load-bearing (no project filesystem, no local MCP servers), Anthropic's Managed Agents architecture treats the session log as the single durable substrate and removes the two-system reconciliation surface entirely (Anthropic: Managed Agents). Use that path when build-vs-buy points to buy.
Example
A team operating self-hosted Agent SDK workers in a Kubernetes hybrid-session pattern wants threads to survive pod reclamation without losing in-flight reasoning.
Before — naive externalization (store-as-replacement):
Agent loop --[blocking write]--> Remote store
           \--[on store error]--> session fails, user sees error
A slow store stalls the loop; a flaky store kills sessions.
After — dual-write append-mirror with SessionStore:
import { query } from "@anthropic-ai/claude-agent-sdk";
import { S3Client } from "@aws-sdk/client-s3";
import { S3SessionStore } from "./S3SessionStore"; // copied from examples/session-stores/s3

// Assumed to be provided by the host application: a metrics client
// (metrics) and the ID of the session to resume (previousSessionId).

// The store is a mirror: the subprocess still writes the local JSONL first.
const store = new S3SessionStore({
  bucket: "my-claude-sessions",
  prefix: "transcripts",
  client: new S3Client({ region: "us-east-1" }),
});

for await (const message of query({
  prompt: "Continue investigation",
  options: { sessionStore: store, resume: previousSessionId },
})) {
  // A failed or timed-out append() arrives as a message; the loop keeps running.
  if (message.type === "system" && message.subtype === "mirror_error") {
    metrics.increment("session_store.mirror_error"); // operator alert path
  }
}
The local JSONL under CLAUDE_CONFIG_DIR is authoritative; S3 receives a best-effort copy of each batch; a regional S3 outage emits mirror_error into the message stream and the agent keeps stepping. On pod loss, a new pod loads the transcript from S3 via resume, hydrates the agent, and continues (Hosting the Agent SDK: Hybrid sessions; Session storage).
Key Takeaways
- The Claude Agent SDK ships dual-write append-mirror as the externalization shape: local-first with the remote store as a best-effort mirror, not a replacement
- Mirror failures emit a { type: "system", subtype: "mirror_error" } message and the query continues; the SDK does not retry, so operator alerting on this event is the only data-loss signal
- The adapter surface is two required methods (append, load) plus three optional (listSessions, delete, listSubkeys); a 13-contract conformance suite ships with the SDK to validate it
- getSessionMessages() returns the post-compaction chain the agent sees on resume; store.load(key) returns the raw history: two reads for two consumers (resume engine vs audit trail)
- forkSession rewrites session IDs and remaps message UUIDs at the SDK layer; adapter-level CopyObject shortcuts produce silently-broken forks that the conformance suite catches
- Retention is the adapter's responsibility: the SDK never deletes from your store; TTLs, S3 lifecycle, scheduled cleanup live on the storage side
- The pattern is Qualified: apply it when local execution context is load-bearing and at least one of {multi-host resume, container reclamation, compliance retention} is real; otherwise the local JSONL alone suffices or Managed Agents is the better path
Related
- Remote Agent Host Sessions over SSH and Dev Tunnels — where the host is the durability boundary instead of an attached store; complementary externalization shape for editor-attached topologies
- Cloud-Agent Three-Layer State Decoupling — the higher-level state-split this pattern sits inside; conversation state is the layer the mirror externalizes
- Session Harness Sandbox Separation for Long-Running Agents — the three-primitive architecture under which the session log is the durable substrate
- Long-Running Agents: Durability and Resumability Across Sessions — the operational shape that makes off-host transcript durability worth running
- Delta Channels: Bounded Checkpoint Storage for Append-Only Agent State — keeps the append-only mirror linear in storage cost over long sessions