Stateful Iteration State-Carry: Typed Persistent State for Long Agent Loops
Carry agent state across iterations through a state-read tool instead of replaying the full transcript each turn, which converts O(N²) loop token cost to O(N).
When This Pattern Applies
This refactor pays back only under specific conditions. Apply it when all of the following hold (Jabbarvaziri, 2026):
- The loop is long-horizon: tens of iterations per run, not single-digit conversational turns.
- Per-iteration observations are large: source code, benchmark output, full search results — not single scalars.
- The loop runs unattended in production: the engineering cost of typed state and a checkpointer is amortised across many runs.
- The agent's next decision usually needs only a subset of prior state, not the whole trajectory.
If any condition is missing — short loops, tiny observations, exploratory single-shot runs — the simpler stateless approach with prompt caching at the provider tier hits the same cost line for less engineering work.
The Cost Curve the Pattern Targets
A stateless ReAct loop appends each Thought-Action-Observation triple to the message history, then re-sends the whole transcript on the next call. Per-call input grows linearly with step n, and total cost across N steps is O(N²) — every prior observation is re-billed on every subsequent inference. The pattern is mechanical and reproducible across providers (Augment Code, 2026; Towards Data Science, 2026).
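The quadratic growth can be written out under a simple model. As an illustrative assumption (not taken from the cited sources), let the fixed prefix be s tokens and let each step add a constant o tokens to the transcript, so call n sends the prefix plus the n−1 earlier triples:

```latex
T_{\text{stateless}}(N)
  = \sum_{n=1}^{N} \bigl( s + (n-1)\,o \bigr)
  = N s + o \,\frac{N(N-1)}{2}
  = O(N^{2})
```

The linear term comes from the prefix and the quadratic term from re-sent observations, which is why large observations dominate the bill on long loops.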
Stateful state-carry lifts the experimental record out of the transcript into a typed object that lives outside the prompt. The agent reads specific fields via a tool only when the current decision needs them. The conversation window stays approximately fixed-size; total cost across N steps becomes O(N) (Jabbarvaziri, 2026).
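The two cost curves can be compared with a small simulation. This is a sketch under stated assumptions: the function names, the constant per-step observation size, and the token counts are hypothetical, and the model ignores output tokens and tool-call overhead. It does not reproduce the paper's measurements.

```python
def stateless_total(n_steps: int, prefix: int, obs: int) -> int:
    """Total input tokens when every call re-sends the whole transcript."""
    # Call k (1-based) carries the prefix plus the k-1 earlier observations.
    return sum(prefix + (k - 1) * obs for k in range(1, n_steps + 1))


def stateful_total(n_steps: int, prefix: int, window: int) -> int:
    """Total input tokens when every call sends a fixed-size window."""
    return n_steps * (prefix + window)


if __name__ == "__main__":
    # Hypothetical sizes: 500-token prefix, 1000-token observations,
    # and a 2000-token working window for the stateful loop.
    for n in (10, 20, 40):
        print(n, stateless_total(n, 500, 1000), stateful_total(n, 500, 2000))
```

With these made-up sizes, each doubling of the step count quadruples the stateless total (50,000, then 200,000, then 800,000) while the stateful total only doubles (25,000, then 50,000, then 100,000).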
The paper measured both regimes on two benchmarks:
| Task | Iterations | Stateless tokens | Stateful tokens | Reduction |
|---|---|---|---|---|
| Hyperparameter tuning (small observations) | 15 | 24,465 | 2,492 | 90% |
| Code optimization (large observations) | 40 | 1,275K | 627K | 52% |
Optimization quality was comparable on both tasks — token reduction did not degrade outcomes (Jabbarvaziri, 2026).
How To Apply It Tool-Agnostically
The paper's reference implementation uses LangGraph, but the pattern — state lives outside the prompt, accessed by tool call — is framework-agnostic, and resembles a code-native memory substrate specialised for iteration loops:
- Define a typed state object for the loop's experimental record — current best metric, last hyperparameter set, recent failure traces, working files. Keep fields minimal; every additional field becomes another schema migration (Mem0, 2026).
- Expose state through tools the agent can call: read_state(field), update_state(field, value), list_recent_attempts(n). The agent invokes these when it needs context, instead of expecting full history in the transcript.
- Persist state via a checkpointer, not in-memory. InMemorySaver loses everything on restart and undermines durability (Towards AI, 2026); production loops use Redis, Postgres, or DynamoDB-backed stores (AWS, 2026).
- Trim the message window to recent turns only. The state object — not the transcript — is the source of truth for prior iterations.
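The steps above can be sketched with the standard library alone. This is a minimal illustration, not the paper's LangGraph implementation: the `LoopState` fields, the `StateStore` class, and the `checkpoints` table are hypothetical, and SQLite stands in for the Redis, Postgres, or DynamoDB store a production loop would use. Only the three tool names come from the text.

```python
import json
import sqlite3
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class LoopState:
    """Hypothetical experimental record; keep the field list short."""
    best_metric: Optional[float] = None
    last_params: dict = field(default_factory=dict)
    attempts: list = field(default_factory=list)  # one dict per iteration


class StateStore:
    """Persists one LoopState per run_id as a JSON row in SQLite."""

    def __init__(self, path: str, run_id: str):
        self.run_id = run_id
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints "
            "(run_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
        )
        row = self.db.execute(
            "SELECT state FROM checkpoints WHERE run_id = ?", (run_id,)
        ).fetchone()
        # Resume from the stored checkpoint if one exists.
        self.state = LoopState(**json.loads(row[0])) if row else LoopState()

    def _save(self) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO checkpoints (run_id, state) VALUES (?, ?)",
            (self.run_id, json.dumps(asdict(self.state))),
        )
        self.db.commit()

    # Tools exposed to the agent.
    def read_state(self, field_name: str):
        return getattr(self.state, field_name)

    def update_state(self, field_name: str, value) -> None:
        if not hasattr(self.state, field_name):
            raise KeyError(f"unknown state field: {field_name}")
        setattr(self.state, field_name, value)
        self._save()  # write through on every update

    def list_recent_attempts(self, n: int) -> list:
        return self.state.attempts[-n:] if n > 0 else []
```

Writing through on every `update_state` call trades some write volume for the guarantee that a restarted process resumes from the last completed update. Rejecting unknown field names keeps the agent from silently growing the schema.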
Why It Works
The causal mechanism is purely about where state lives, not about model reasoning. A stateless loop encodes the experimental record in the message transcript, which the inference call must re-process every turn — the provider re-bills the full prefix on each call. A stateful loop encodes the same record in a typed object outside the prompt and exposes it through tools, so per-call input is bounded by the working set the current step touches, not by cumulative history. The asymptotic effect (O(N²) → O(N)) is a direct consequence of decoupling the experimental record from the transcript (Jabbarvaziri, 2026). The 90% and 52% reductions are not optimisations on top of the same architecture — they are the gap between a quadratic and a linear cost curve at finite iteration counts.
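The bounded per-call input can be seen in a toy loop. This is a sketch with hypothetical names (`trim_window`, `keep_last`) that counts messages rather than tokens; in a real loop the older observations would live in the typed state object rather than in a Python list.

```python
def trim_window(system_prompt: str, transcript: list, keep_last: int) -> list:
    """Return the messages actually sent: fixed prefix plus the last few turns."""
    recent = transcript[-keep_last:] if keep_last > 0 else []
    return [system_prompt, *recent]


transcript = []  # full history, kept here only to show what is NOT sent
sizes = []       # number of messages sent on each call
for step in range(1, 21):
    transcript.append(f"observation {step}")
    sent = trim_window("system prompt", transcript, keep_last=4)
    sizes.append(len(sent))
```

After 20 steps the history holds 20 observations, but no call sent more than five messages: the system prompt plus the four most recent turns.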
When This Backfires
Conditions under which the pattern is worse than the alternative:
- Short loops with stable prefixes. Below ~10 iterations, prompt caching at the provider tier already converts the dominant cost line to roughly O(1) for the static portion — Anthropic charges ~10% of input price on cache hit, OpenAI ~50% (NeuralTrust, 2026). The stateful refactor adds engineering cost the cached stateless loop avoids.
- State schema churn. When the experimental shape changes often — new fields, renamed metrics, restructured observations — every schema change risks breaking persisted checkpoints. Projects have failed because "every additional field increases complexity exponentially" and the state object became a monolith (Mem0, 2026).
- Concurrent / multi-replica execution without isolation. Shared state corrupts silently under concurrent writes; the failure surfaces several state transitions downstream from the cause, which makes root-cause analysis hard (Focused.io, 2026).
- Tasks that need the full trajectory. Causal debugging, exploratory branching, transcript replay — pruning to typed state discards the audit trail that made the loop debuggable.
- In-memory persistence in restart-prone environments. InMemorySaver loses everything on restart; durable state requires a Postgres/DynamoDB backend, shifting cost from tokens to operations (AWS, 2026).
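One common way to guard against the silent concurrent-write corruption described above is optimistic locking with a version column. This technique is not taken from the cited sources; the sketch below assumes a hypothetical `loop_state` table and uses SQLite purely for illustration.

```python
import sqlite3


class StaleWriteError(RuntimeError):
    """Raised when another writer updated the row first."""


def init_db(db: sqlite3.Connection, run_id: str) -> None:
    db.execute(
        "CREATE TABLE IF NOT EXISTS loop_state "
        "(run_id TEXT PRIMARY KEY, version INTEGER NOT NULL, best_metric REAL)"
    )
    db.execute(
        "INSERT OR IGNORE INTO loop_state (run_id, version, best_metric) "
        "VALUES (?, 0, NULL)",
        (run_id,),
    )
    db.commit()


def read(db: sqlite3.Connection, run_id: str):
    """Return (version, best_metric) for the run."""
    return db.execute(
        "SELECT version, best_metric FROM loop_state WHERE run_id = ?",
        (run_id,),
    ).fetchone()


def write(db: sqlite3.Connection, run_id: str,
          expected_version: int, best_metric: float) -> int:
    """Compare-and-swap: the UPDATE matches only if the version is unchanged."""
    cur = db.execute(
        "UPDATE loop_state SET best_metric = ?, version = version + 1 "
        "WHERE run_id = ? AND version = ?",
        (best_metric, run_id, expected_version),
    )
    db.commit()
    if cur.rowcount != 1:
        raise StaleWriteError(f"{run_id}: version {expected_version} is stale")
    return expected_version + 1
```

A replica that read the row before another writer committed gets a `StaleWriteError` at the point of the conflict, instead of overwriting the newer value and failing several transitions later.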
Contrast With Prompt Caching
The two approaches attack the same cost curve from different layers:
| | Prompt caching | Stateful state-carry |
|---|---|---|
| Where the fix lives | Provider infra | Application code |
| Engineering cost | Order static content first; lock tool schemas | Design typed state, write state-read tools, operate a checkpointer |
| What gets cheaper | Static prefix (system prompt, tool defs) | Growing observation history |
| Best for | Stable prefix, short to medium loops | Long loops, large observations |
| Failure mode | Cache misses on prefix mutation | State schema churn, concurrent corruption |
They are complementary at the boundary case: a long-horizon stateful loop still benefits from a cached static prefix on the residual transcript. They compete on short loops with stable observations, where caching alone is usually enough (see When This Backfires).
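A rough cost model shows why caching wins on short loops and fades on long ones. This is a simplified sketch: the function name and all token sizes are hypothetical, only the static prefix is treated as cacheable (matching the table above), and cache-write surcharges and cache expiry are ignored. The 0.1 hit rate corresponds to the roughly 10% cache-hit price quoted earlier.

```python
def stateless_cost(n_steps: int, prefix: int, obs: int,
                   hit_rate: float = 1.0) -> float:
    """Billed input in full-price token equivalents for a stateless loop.

    hit_rate is the fraction of the normal price charged for a cached
    prefix (1.0 means no caching). The growing history is billed in full.
    """
    total = 0.0
    for k in range(1, n_steps + 1):
        # The first call pays full price for the prefix; later calls hit the cache.
        prefix_cost = prefix if k == 1 else prefix * hit_rate
        history_cost = (k - 1) * obs
        total += prefix_cost + history_cost
    return total


def saving(n_steps: int, prefix: int, obs: int, hit_rate: float) -> float:
    """Fraction of cost removed by prefix caching alone."""
    uncached = stateless_cost(n_steps, prefix, obs)
    cached = stateless_cost(n_steps, prefix, obs, hit_rate)
    return 1 - cached / uncached


if __name__ == "__main__":
    # Short loop, small observations vs long loop, large observations.
    print(round(saving(5, 4000, 200, 0.1), 3))
    print(round(saving(40, 4000, 5000, 0.1), 3))
```

Under these made-up sizes, prefix caching removes about 65% of the cost of the 5-step loop but under 4% of the 40-step loop, where re-sent observations dominate. That residual is the part state-carry targets.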
Key Takeaways
- The token saving comes from where state lives, not from changing the model — typed state outside the prompt makes per-iteration input independent of step count.
- Measured reductions were 90% at 15 iterations with small observations and 52% at 40 iterations with large observations. The percentage was lower on the larger task, but the absolute saving was far greater (roughly 648K tokens versus roughly 22K).
- The refactor only pays back for long unattended loops with large observations; short or exploratory loops should reach for prompt caching first.
- State schema discipline matters as much as the lift itself — every added field becomes a migration surface, and shared state under concurrency corrupts silently.
Related
- Static Content First for Cache Hits — the prompt-caching alternative that targets the same O(n²) curve at the provider tier
- Elastic Context Orchestration — per-turn vocabulary (Skip, Compress, Snippet, Rollback, Delete) for selective context retention in ReAct loops
- Context Compression Strategies — tiered offloading and summarisation for long-running agents
- Code-Native Memory Substrates for Coding Agents — typed external memory for code-bearing agents, from a different angle
- Autonomous Research Loops — the curriculum module that puts loop architectures and termination design together