Long-Running Agents: Durability and Resumability Across Sessions¶

A long-running agent makes progress across many sessions and sandboxes by moving state out of the context window into durable artifacts that resume it.

What "Long-Running" Means¶

Three problems share an operational surface (Osmani: Long-running Agents, 2026-04-30):

Long-horizon reasoning — planning over many dependent steps; a model story. METR's task-completion time horizon doubles roughly every seven months.
Long-running execution — a process running for hours or days, the model invoked thousands of times; a harness story.
Persistent agency — identity that outlives any task; a memory story.

This page covers execution: agents that survive session boundaries, sandbox crashes, and HITL pauses.

Three Walls¶

Three failure modes recur across published write-ups (Osmani):

Finite context. Even a 1M-token window fills, and context rot sets in well before the hard cap. No context window on the roadmap holds a 24-hour run.

No persistent state. A new session starts blank — Anthropic likens it to "engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift" (long-running Claude).

Unreliable self-verification. Models skew positive on their own work; without a separate evaluator the agent ships half-done with full confidence.

The Convergent Design¶

Anthropic, Cursor, Google, and open-source practitioners converge on the same shape. Five primitives recur:

graph TD
    A[External done-condition<br/>feature-list.json] --> H[Harness loop<br/>stateless]
    H -->|emit events| L[Durable session log<br/>append-only]
    H -->|execute| S[Sandbox<br/>cattle, not pets]
    L -->|resume| H2[Any harness instance<br/>wake sessionId]
    H -->|generate| W[Worker]
    W --> J[Judge / evaluator<br/>separate from generator]
    J -->|verify against<br/>done-condition| A

1. External Done-Condition¶

Write completion criteria before the agent starts. Anthropic calls it the feature list; Cursor calls it the planner's task spec. On disk so the agent cannot quietly redefine done mid-run (Osmani).

2. Durable Session Log¶

Session state lives outside the harness process — an append-only log of every thought, tool call, and observation. Anthropic exposes getEvents() and wake(sessionId) so any harness instance can reboot a failed run and continue from the event stream (Managed Agents). See Session Harness Sandbox Separation.

3. Stateless Harness, Disposable Sandbox¶

The harness holds no run state; the sandbox is provisioned per session and destroyed after, so crash recovery becomes architectural. Anthropic reports p50 time-to-first-token dropped ~60% and p95 over 90% by starting inference against the session log before the sandbox finishes provisioning (Managed Agents). See Deep Agent Runtime.

4. Separate Evaluator¶

Generation and evaluation run as different roles, sometimes different models. Cursor's production design splits planner / worker / judge after flat coordination failed; a coding-tuned model proved worse for extended autonomous work because it "tended to stop early and take shortcuts" (Cursor: Scaling Long-Running Coding).

5. Checkpoint Cadence¶

Write intermediate state every N units of work — not every step (waste), not only at the end (catastrophic on failure). Trajectory logging via progress files is the filesystem form; managed runtimes ship thread_id-keyed checkpoints with run-level cancel/resume.

Beyond Summarisation: Full Context Resets¶

Compaction-as-summarisation is not enough at day-plus durations. Anthropic resorts to full context resets — the harness tears the session down and rebuilds from a structured handoff file (Osmani). The Ralph Wiggum loop is the bash form: every iteration starts fresh and reads the filesystem before acting.

When the Pattern Is Overhead¶

The primitives pay only when work exceeds a single session. Four conditions where they do not:

Short-horizon interactive work. When the task fits one HITL session, checkpoint/resume adds latency without reliability gain.
Pre-PMF or small-scope agents. A scoped credential and session timeout are smaller and more portable before scale or compliance forces the trade-off (Agent Stack Bets).
Underspecified done-conditions. Without external completion criteria a long run only amplifies self-grading harm.
Unbounded session log. Append-only logs grow linearly; long sessions force compaction with irreversible discards (Managed Agents).

Open Problems¶

Four areas remain unsolved (Osmani):

Cost. Without budgets and circuit breakers, an agent can burn a week's API spend in an afternoon.
Security. Credentials and shell access yield a far larger attack surface than a chat session; brain/hands separation is part of the answer.
Alignment drift. Goals summarised and re-summarised lose fidelity. Hooks and judges defend; nothing eliminates.
Verification. Auditing 24 hours of autonomous activity is a human-time problem; structured artifacts (PRs, commits, test runs) make it tractable.

Anthropic's Project Vend — a Claude instance running a vending business for a month — "failed in informative ways," an early catalogue of week-plus coherence failures.

Example¶

Anthropic's published long-running coding harness is the reference structure. Two agents and three artifacts:

Initializer agent — runs once. Sets up the environment, expands the prompt into a structured feature-list.json (every feature marked failing initially), writes init.sh for future sessions to bootstrap from.
Coding agent — woken repeatedly. Each session reads claude-progress.txt, runs git log to see prior commits, picks one feature, implements, runs tests, updates progress, commits with a descriptive message.
Test ratchet — "it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality" sits in the prompt to block the very common failure of an agent deleting failing tests to make them pass.

The plain-bash equivalent is the Ralph loop: a for loop that picks the next task from prd.json, builds a prompt, calls the agent, runs checks, appends to progress.txt, and updates the task list. Same shape, no managed runtime — state lives in three files on disk.

Key Takeaways¶

A long-running agent is one whose run survives session boundaries, sandbox crashes, and human-in-the-loop pauses by moving state out of the context window into durable artifacts.
Three walls — finite context, no persistent state, unreliable self-grading — force the same convergent design across Anthropic, Cursor, Google, and Ralph-style open-source practice.
Five primitives recur: external done-condition, durable session log, stateless harness with disposable sandbox, separate evaluator, deliberate checkpoint cadence.
Compaction-as-summarisation is not enough at day-plus durations; full context resets driven by a structured handoff file are part of the operational shape.
The pattern is overhead for short-horizon work, pre-PMF agents, underspecified tasks, and unbounded session logs — apply it when uninterrupted units of work genuinely exceed a single session.
Cost, security, alignment drift, and human verification of 24-hour activity remain open; budgets, circuit breakers, and structured artifacts are the current answers.