Exception Handling and Recovery Patterns
Exception handling decides whether a failing agent recovers and continues or fails catastrophically — corrupting state, losing progress, and repeating work.
Exception handling for coding agents is a progressive escalation — self-correct, fallback, degrade gracefully, escalate — that absorbs tool errors, model failures, and crashes without losing accumulated work.
The progressive failure hierarchy
graph LR
A[Self-Correct] --> B[Fallback]
B --> C[Degrade Gracefully]
C --> D[Escalate]
style A fill:#2d5a2d
style B fill:#4a4a1a
style C fill:#5a3a1a
style D fill:#5a1a1a
Self-correct: detect the error and retry or adjust. Most tool errors resolve here — a failed file read triggers a path correction, a syntax error triggers a fix.
Fallback: when the primary approach fails repeatedly, switch to an alternative strategy or model — the threshold an agent circuit breaker makes explicit.
Degrade gracefully: deliver partial results rather than failing entirely.
Escalate: surface the failure to a human with enough context. Use this as a last resort, not a first response.
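The four levels above can be sketched as a single control-flow ladder. This is a minimal illustration, not a reference implementation: `run_with_escalation`, `Outcome`, and the `primary`, `fallback`, and `partial` callables are hypothetical names, and a plain retry stands in for the model adjusting its approach between attempts.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Outcome:
    """Result of one pass through the recovery ladder."""
    level: str                                   # which level produced the result
    value: Optional[str] = None                  # full or partial result, if any
    errors: list = field(default_factory=list)   # error messages collected so far


def run_with_escalation(
    primary: Callable[[], str],
    fallback: Callable[[], str],
    partial: Callable[[], Optional[str]],
    max_attempts: int = 3,
) -> Outcome:
    errors = []
    # Level 1: self-correct. A real agent would adjust its input between
    # attempts; this sketch simply retries the primary approach.
    for attempt in range(max_attempts):
        try:
            value = primary()
            level = "primary" if attempt == 0 else "self-correct"
            return Outcome(level, value, errors)
        except Exception as exc:
            errors.append(f"primary: {exc}")
    # Level 2: fallback to an alternative strategy.
    try:
        return Outcome("fallback", fallback(), errors)
    except Exception as exc:
        errors.append(f"fallback: {exc}")
    # Level 3: degrade gracefully by returning whatever partial work exists.
    salvaged = partial()
    if salvaged is not None:
        return Outcome("degrade", salvaged, errors)
    # Level 4: escalate, handing the collected errors to a human as context.
    return Outcome("escalate", None, errors)
```

The returned `errors` list is what makes the last level useful: an escalation carries every failure seen on the way up.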
Git-based recovery
Git is the primary recovery mechanism for coding agents. Anthropic recommends this approach for long-running agents (multi-agent research system):
- Commit frequently, so each commit is a checkpoint
- Write progress files (for example, claude-progress.txt) that survive session crashes; see Goal Monitoring and Progress Tracking
- Revert to known-good states with git revert
Git operations are cheap, atomic, and reversible. See Rollback-First Design for the broader principle.
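A checkpoint helper along these lines might combine the progress file and the commit. This is a sketch under assumptions: the JSON layout of the progress file, the `checkpoint` and `write_progress` names, and the injectable `run` parameter are all illustrative. Only `git add` and `git commit` are real Git commands.

```python
import json
import os
import subprocess
import tempfile
from pathlib import Path


def write_progress(path: Path, state: dict) -> None:
    """Write the progress file atomically so a crash never leaves it half-written."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp, path)  # atomic rename within the same directory


def checkpoint(repo: Path, message: str, state: dict, run=subprocess.run) -> None:
    """Record progress, then commit the working tree as one checkpoint."""
    write_progress(repo / "claude-progress.txt", state)
    run(["git", "add", "--all"], cwd=repo, check=True)
    # `git commit` exits non-zero when nothing changed; check=True surfaces that.
    run(["git", "commit", "-m", message], cwd=repo, check=True)
```

Writing the progress file before committing means the commit itself contains the record of what was done, so a fresh session can read it straight from the checked-out tree.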
Model-driven error adaptation
Anthropic's multi-agent research system reports that telling the model a tool is failing and letting it adapt works "surprisingly well" — the model reroutes without explicit fallback logic in the harness.
The simplest strategy is to catch the tool error, include the message in the agent's next context, and let the model decide. This outperforms rigid retry logic for novel failure modes, provided the retried action is safe to repeat (see idempotent agent operations).
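In code, the catch-and-forward strategy can be as small as the following sketch. The message dictionary shape and the `call_tool_for_model` name are assumptions for illustration; a real harness would use whatever tool-result format its model API expects.

```python
from typing import Callable


def call_tool_for_model(
    name: str,
    tool: Callable[..., str],
    args: dict,
    messages: list,
) -> None:
    """Run a tool and append its result, or its error, to the model's context."""
    try:
        content = tool(**args)
        status = "ok"
    except Exception as exc:
        # Forward the error text instead of retrying; the model decides what to do.
        content = f"{type(exc).__name__}: {exc}"
        status = "error"
    messages.append(
        {"role": "tool", "name": name, "status": status, "content": content}
    )
```

The harness never raises to the caller here. The error becomes ordinary context, and the next model turn can correct a path, pick another tool, or give up.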
When model-driven adaptation fails
This breaks down for silent failures — the agent produces output without detecting the underlying error (stale data, partial writes, skipped validation). The model has to know something went wrong. Add output validation and freshness checks for failure modes the model cannot observe directly.
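One way to make such failures observable is to check each tool result before the model sees it. The sketch below assumes tool results are dictionaries carrying a `fetched_at` Unix timestamp; that field, the `check_tool_output` name, and the thresholds are hypothetical.

```python
import time
from typing import Optional


def check_tool_output(
    result: dict,
    required_keys: set,
    max_age_s: float,
    now: Optional[float] = None,
) -> list:
    """Return a list of problems; an empty list means the output passed."""
    now = time.time() if now is None else now
    problems = []
    # Partial writes and skipped steps often show up as missing fields.
    missing = sorted(required_keys - set(result))
    if missing:
        problems.append(f"missing fields: {', '.join(missing)}")
    # Stale data looks valid unless its age is checked explicitly.
    fetched_at = result.get("fetched_at")
    if fetched_at is None:
        problems.append("no fetched_at timestamp; freshness unknown")
    elif now - fetched_at > max_age_s:
        age = now - fetched_at
        problems.append(f"stale data: {age:.0f}s old, limit {max_age_s:.0f}s")
    return problems
```

Any returned problems can be appended to the model's context like a tool error, which turns a silent failure into one the model can adapt to.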
Durable execution
For agents that must survive process crashes, durable execution frameworks checkpoint state as the run progresses, so a restart resumes instead of starting over.
LangGraph provides three durability modes:
| Mode | Behavior | Use case |
|---|---|---|
| exit | Persist only at graph exit | Human-in-the-loop gates |
| async | Persist asynchronously while next step runs | Long-running research |
| sync | Persist synchronously before each step | Mission-critical workflows |
State is checkpointed to a configurable backend (Postgres, DynamoDB, others); after a crash, the agent resumes from the last checkpoint. LangChain pairs durability with retries, timeouts, and error handlers built into the graph runtime, which gives the progressive hierarchy a concrete, framework-level reference for fault tolerance (LangChain, fault tolerance in LangGraph).
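DBOS takes a decorator-based approach: @DBOS.workflow and @DBOS.step persist execution state automatically, so a recovered workflow resumes from its last completed step instead of re-running finished work. A step interrupted mid-execution can run again on recovery, so steps with side effects should still be idempotent.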
Both solve the same problem: a 30-minute run should not lose all progress to a crash.
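The underlying idea can be shown without either framework. The sketch below is plain Python, not the LangGraph or DBOS API: `run_durably` and the JSON checkpoint layout are assumptions chosen for illustration, and the checkpoint is written synchronously after each step.

```python
import json
import os
from pathlib import Path


def run_durably(steps: list, checkpoint: Path) -> dict:
    """Run (name, function) steps in order, persisting state after each one.

    On restart, steps already recorded in the checkpoint are skipped.
    """
    state = {"completed": [], "data": {}}
    if checkpoint.exists():
        state = json.loads(checkpoint.read_text())
    for name, step in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run; do not repeat the work
        state["data"] = step(state["data"])
        state["completed"].append(name)
        # Write to a temp file and rename so the checkpoint is never half-written.
        tmp = checkpoint.with_name(checkpoint.name + ".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, checkpoint)
    return state["data"]
```

A step that crashes partway through is not recorded and runs again on the next start, which is why steps with external side effects need to be safe to repeat.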
Model fallback
When a model provider fails, route to an alternative. LangChain's ModelFallbackMiddleware chains models automatically (Primary → Fallback 1 → Fallback 2), handling outages and rate limits — though different models may produce different results for the same prompt.
That divergence is the trap: a fallback that succeeds silently can mask a quality regression rather than surface a failure. One practitioner account describes silent LLM fallbacks breaking agent pipelines downstream and argues for an explicit recovery layer that makes the switch observable rather than transparent (Towards Data Science — LLM fallbacks break agent pipelines). Treat a fallback as a degraded-mode signal worth logging, not a transparent substitution.
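A fallback chain that stays observable might look like the sketch below. It does not use LangChain's middleware; `complete_with_fallback`, the `(name, call)` pairs, and the returned dictionary shape are hypothetical, chosen so that callers can see which model answered.

```python
import logging

logger = logging.getLogger("agent.fallback")


def complete_with_fallback(prompt: str, models: list) -> dict:
    """Try each (name, call) pair in order and report which model answered."""
    failures = []
    for index, (name, call) in enumerate(models):
        try:
            text = call(prompt)
        except Exception as exc:
            failures.append(f"{name}: {exc}")
            continue
        degraded = index > 0  # any answer not from the primary model
        if degraded:
            # Log the switch so a quality regression can be traced to it later.
            logger.warning("degraded mode: %s answered after %s", name, failures)
        return {"text": text, "model": name, "degraded": degraded, "failures": failures}
    raise RuntimeError("all models failed: " + "; ".join(failures))
```

Returning the `degraded` flag alongside the text lets downstream steps tighten validation, or stop, when the answer did not come from the primary model.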
Circuit breakers for tool calls
A circuit breaker tracks consecutive failures for a tool and disables it after a threshold.
stateDiagram-v2
[*] --> Closed
Closed --> Open: N consecutive failures
Open --> HalfOpen: Timeout expires
HalfOpen --> Closed: Probe succeeds
HalfOpen --> Open: Probe fails
In the closed state, calls proceed and failures are counted. In the open state, calls are blocked and the agent uses alternatives, as the agent circuit breaker state machine specifies. In the half-open state, a single probe tests recovery.
Most coding agents use a lighter-weight version: count failures, tell the model the tool is unreliable, and let model-driven adaptation handle routing. Full state machines are more common in multi-agent systems with shared tool infrastructure.
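The full state machine fits in a small class. This is a minimal single-process sketch: the `CircuitBreaker` name, the default threshold and cooldown, and the injectable `clock` are assumptions, and a shared multi-agent deployment would need the state stored somewhere all agents can read.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Per-tool breaker: closed -> open after N failures -> half-open after a cooldown."""

    def __init__(
        self,
        threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None
        self.probing = False

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_s:
            return "half-open"
        return "open"

    def allow(self) -> bool:
        """Closed: always allow. Open: block. Half-open: allow a single probe."""
        state = self.state
        if state == "closed":
            return True
        if state == "half-open" and not self.probing:
            self.probing = True
            return True
        return False

    def record_success(self) -> None:
        # A successful call, including a probe, closes the breaker.
        self.failures = 0
        self.opened_at = None
        self.probing = False

    def record_failure(self) -> None:
        self.probing = False
        self.failures += 1
        if self.opened_at is not None or self.failures >= self.threshold:
            # Open the breaker, or re-open it after a failed probe.
            self.opened_at = self.clock()
```

The lighter-weight version described above keeps only the `failures` counter and, once it passes the threshold, tells the model the tool is unreliable instead of blocking calls.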
The rollback-over-prevention philosophy
Let agents make recoverable mistakes rather than preventing all mistakes: sandbox execution, review gates before permanent effects, session trees for fork/explore/discard, checkpoints at every meaningful boundary. Restrictive permissions limit capability more than they reduce risk. See Rollback-First Design.
When this backfires
The progressive hierarchy adds latency and complexity. These conditions favor failing fast instead:
- Short-lived tasks with no side effects — a task under 30 seconds with no external writes gains nothing from recovery logic, and retry overhead exceeds the benefit.
- Cascading failures in multi-agent systems — when agents share infrastructure (databases, queues, tool APIs), recovery attempts amplify load on stressed components. Circuit breakers and agent backpressure must be coordinated across agents, not per-agent.
- Silent corruption without validation — recovery requires detection. Writing to external systems without output validation means an agent that "recovers" may compound bad state. Fail fast to a human when intermediate state cannot be verified.
Example
A coding agent tasked with refactoring a module hits a test failure after changing a function signature:
- Self-correct: the agent reads the test error, identifies the mismatched argument, and fixes the call site. Tests pass.
- On the next file, the same refactor produces a circular import. The agent retries twice and fails both times, which is safe only because the edits are idempotent.
- Fallback: the agent abandons the automated refactor for that file and applies a manual re-export to break the cycle.
- A third file depends on an external service that is down. The agent cannot run integration tests.
- Degrade gracefully: the agent commits the passing unit-tested changes and leaves the integration-dependent file unchanged, noting the skip in its progress log.
- The agent encounters a permissions error trying to update a protected config file.
- Escalate: the agent opens a draft PR with its completed work and flags the config change for human review, including the error message and the intended edit.
Throughout, the agent commits after each successful file change (git commit -m "refactor: update signature in <file>"), so any revert affects only one file.
Key Takeaways
- Treat failure as a progressive escalation — self-correct, then fallback, then degrade gracefully, then escalate — so recoverable errors never reach a human.
- Git is the cheapest recovery substrate: frequent commits and progress files turn a crash into a resumable checkpoint rather than lost work.
- Model-driven adaptation (tell the model the tool failed, let it reroute) beats rigid retry logic for novel errors — but only when the failure is observable, not silent.
- Durable-execution frameworks and circuit breakers add fault tolerance for crash-survival and unreliable tools; reach for them when state must outlive the process.
- Recovery requires detection — fail fast to a human whenever intermediate state cannot be validated.
Related
- Rollback-First Design
- Agent Circuit Breaker — per-tool state machine implementation with configuration thresholds
- Tail Control for Agent Workflows: a percentile-based reliability framing (per-step p95 timeouts, hedged re-draws, graceful degradation) that names the workflow-level bound within which the progressive recovery hierarchy operates
- Idempotent Agent Operations
- Agent Harness
- Human-in-the-Loop Placement
- Loop Detection
- Trajectory Logging and Progress Files
- Agent Self-Review Loop