Agentic Flywheel: Self-Improving Agent Systems

A closed loop where agents analyze their own operational data -- traces, test results, pipeline metrics -- and generate harness improvements that make all future agent work better, not just the current task.

Why a Flywheel

Most agent improvement is manual: a developer observes a failure, updates a prompt, and retries. The continuous agent improvement workflow formalizes this but keeps a human in the critical path.

The flywheel closes the loop. Agents analyze their own performance and propose harness changes -- prompts, tools, middleware, verification checks -- compounding improvement without a human at every step.

```mermaid
graph TD
    A[Agent executes task] --> B[Collect traces & test results]
    B --> C[Trace analyzer identifies failure patterns]
    C --> D[Generate harness modifications]
    D --> E{Approval gate}
    E -->|Interactive| F[Human reviews & applies]
    E -->|Backlog| G[Added to product queue]
    E -->|Autonomous| H[Auto-applied with monitoring]
    F --> A
    G --> A
    H --> A
```

Four Stages

| Stage | Activity | Existing pattern |
|---|---|---|
| Embed signals | Add self-verification, tests, and quality checks so agents can gauge their own output | Pre-completion checklists, shift-left testing |
| Analyze traces | Mine execution traces for failure patterns, focusing on cases that failed in previous runs (boosting) | Agent transcript analysis |
| Generate modifications | Produce targeted harness changes: new middleware, updated prompts, adjusted tool configurations | Introspective skill generation |
| Escalate approval | Route modifications through an approval tier matched to confidence and risk | Progressive autonomy with model evolution |

The stages form a closed loop that improves the system's own infrastructure rather than individual task outputs.

Boosting: Learning from Failures

Boosting concentrates analysis on prior failure cases:

  1. Run a batch of agent tasks and collect traces
  2. Filter to failures -- tasks that failed tests, produced rejected PRs, or triggered loop detection
  3. Spawn parallel analysis agents, each examining a cluster of related failures
  4. Synthesize findings into harness modifications
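
The filtering and fan-out steps above can be sketched as follows. This is a minimal illustration, not LangChain's implementation: the `Trace` fields, the `failure_tag` label, and the `analyze_cluster` stub (which stands in for a real analysis agent) are all hypothetical names chosen for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Trace:
    # Hypothetical trace record; real traces carry far more detail.
    task_id: str
    tests_passed: bool
    pr_rejected: bool
    loop_detected: bool
    failure_tag: str = ""  # assumed label from an earlier triage step


def is_failure(trace: Trace) -> bool:
    # Step 2: failed tests, a rejected PR, or triggered loop detection.
    return (not trace.tests_passed) or trace.pr_rejected or trace.loop_detected


def cluster_failures(traces):
    # Group failing traces by tag so each analyzer sees related failures.
    clusters = defaultdict(list)
    for trace in traces:
        if is_failure(trace):
            clusters[trace.failure_tag or "unlabeled"].append(trace)
    return dict(clusters)


def analyze_cluster(tag, traces):
    # Placeholder for an analysis agent; a real one would read the traces.
    return f"{tag}: {len(traces)} failing task(s)"


def boost(traces):
    # Step 3: analyze clusters in parallel. Step 4 (synthesis) is left out.
    clusters = cluster_failures(traces)
    tags = sorted(clusters)
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda tag: analyze_cluster(tag, clusters[tag]), tags))
```

Passing tasks are dropped before clustering, so analysis effort goes only to cases the current harness already gets wrong.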

LangChain demonstrated this on Terminal Bench 2.0: harness-only improvements (self-verification loops, context injection, loop detection, reasoning budgets) improved scores from 52.8% to 66.5% -- a 13.7-point gain with no model change (Improving Deep Agents with Harness Engineering).

Escalating Autonomy for Modifications

Not every harness change should be auto-applied. Kief Morris describes three levels (Humans and Agents in Software Engineering Loops), which map onto the humans/agents loop positioning modes:

| Level | Mechanism | When to use |
|---|---|---|
| Interactive | Human reviews each recommendation and selectively applies | Novel failure modes, security-sensitive middleware changes |
| Backlog | Agent adds suggestions to the product queue for later triage | Improvements needing broader discussion or affecting multiple projects |
| Autonomous | High-confidence recommendations auto-apply with monitoring | Well-tested, narrow-scope changes with rollback capability (e.g., adjusting a retry count, adding a lint rule) -- see Rollback-First Design |

Start at interactive. Move to autonomous only for categories with a proven track record: Morris's framing reserves the autonomous tier for changes with a tight rollback path and a narrow blast radius.
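
One way to encode this routing is a small function that defaults to the interactive tier and only grants autonomy when every condition holds. The sketch below is illustrative: the `Modification` fields, the `TRUSTED_CATEGORIES` set, and the 0.9 confidence threshold are assumptions made for the example, not values from Morris's article.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    INTERACTIVE = "interactive"
    BACKLOG = "backlog"
    AUTONOMOUS = "autonomous"


@dataclass
class Modification:
    # Hypothetical description of a proposed harness change.
    category: str
    confidence: float  # analyzer's self-reported confidence in [0, 1]
    has_rollback: bool
    security_sensitive: bool = False
    cross_project: bool = False


# Assumed: categories that have earned a track record under human review.
TRUSTED_CATEGORIES = {"retry_count", "lint_rule"}
AUTO_CONFIDENCE = 0.9  # assumed threshold, tune per team


def route(mod: Modification) -> Tier:
    # Security-sensitive changes always get a human reviewer.
    if mod.security_sensitive:
        return Tier.INTERACTIVE
    # Changes that touch several projects need broader discussion.
    if mod.cross_project:
        return Tier.BACKLOG
    # Autonomy requires a trusted category, a rollback path, and confidence.
    if (
        mod.category in TRUSTED_CATEGORIES
        and mod.has_rollback
        and mod.confidence >= AUTO_CONFIDENCE
    ):
        return Tier.AUTONOMOUS
    # Everything else starts at interactive.
    return Tier.INTERACTIVE
```

Because the final branch falls back to `Tier.INTERACTIVE`, a new or unrecognized category can never reach the autonomous tier by accident.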

Harness Modifications That Work

Effective flywheel improvements target the harness, not the model.

  • Reasoning sandwich -- allocate maximum reasoning compute for planning and verification, moderate for implementation (xhigh-high-xhigh). Running maximum throughout caused timeouts; the sandwich pattern scored 63.6% vs. 53.9% for uniform maximum (Improving Deep Agents with Harness Engineering). See reasoning budget allocation.
  • Pre-completion checklist middleware -- intercepts the agent before exit and forces verification against the task spec, preventing premature completion (the Ralph Wiggum loop as middleware).
  • Loop detection middleware -- tracks per-file edit counts and injects a reconsideration prompt after N edits, breaking doom loops (loop detection).
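
As a rough illustration of the loop-detection idea, the sketch below counts edits per file and returns a reconsideration prompt each time a file reaches another multiple of N edits. The class name, the `on_edit` hook, and the prompt wording are hypothetical; a real harness would wire this into its own tool-call middleware.

```python
from collections import Counter
from typing import Optional


class LoopDetector:
    """Tracks per-file edit counts and flags likely doom loops."""

    def __init__(self, max_edits_per_file: int = 5):
        # N is an assumed default; the source does not give a value.
        self.max_edits_per_file = max_edits_per_file
        self.edit_counts = Counter()

    def on_edit(self, path: str) -> Optional[str]:
        # Call once per file-edit tool call; returns a prompt to inject or None.
        self.edit_counts[path] += 1
        count = self.edit_counts[path]
        if count % self.max_edits_per_file == 0:
            return (
                f"You have edited {path} {count} times. "
                "Stop and reconsider your approach before editing it again."
            )
        return None
```

Counting per file rather than per session keeps the detector quiet during broad refactors while still catching repeated rewrites of one file.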

Failure Modes

| Risk | Mitigation |
|---|---|
| Objective drift | Context compression shifts the analyzer off original goals. Stress-test summarization to surface deviations (objective drift). |
| Compounding bad changes | An autonomous modification passes initial tests but degrades edge cases (rollback-first design limits the damage). A/B evaluate on a held-out task set before promoting. |
| Over-fitting to benchmarks | Harness optimizes for a specific eval suite, not general capability. Rotate eval tasks and include unseen scenarios. |
| Regression-prediction asymmetry | Self-evolving agents predict what their edits fix far better than what they break -- a nine-round study reported 33.7% fix precision against 11.8% regression precision (Auto Agentic Harness Engineering, 2026). Assume regression blindness; require each change to enumerate expected fixes and plausible breakages, and verify against a held-out rollout. |
| Analyzer reward hacking | The trace-summarising analyzer is itself an LLM. May 2026 benchmarks show heavily RL-trained models exploit shortcuts on 13.9% of multi-step tasks, with most cheating episodes carrying chain-of-thought that frames the cheat as legitimate (Reward Hacking Benchmark, May 2026). Cross-check proposed modifications against the raw trace, not the analyzer's narrative. |
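
The held-out check and the fix/regression bookkeeping can be sketched as a small promotion gate. This is an assumption-laden illustration: results are modeled as a `task_id -> passed` mapping, the gate rule (more fixes than regressions, and regressions within a budget) is one possible policy, and `prediction_precision` is a simple set-overlap measure that may differ from the metric used in the cited study.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    promote: bool
    fixed: set
    regressed: set


def promotion_gate(baseline: dict, candidate: dict, max_regressions: int = 0) -> GateResult:
    # Compare pass/fail results on the same held-out tasks.
    shared = baseline.keys() & candidate.keys()
    fixed = {t for t in shared if not baseline[t] and candidate[t]}
    regressed = {t for t in shared if baseline[t] and not candidate[t]}
    # Assumed policy: net improvement with regressions inside the budget.
    promote = len(fixed) > len(regressed) and len(regressed) <= max_regressions
    return GateResult(promote, fixed, regressed)


def prediction_precision(predicted: set, observed: set) -> float:
    # Share of predicted outcomes that actually occurred.
    if not predicted:
        return 0.0
    return len(predicted & observed) / len(predicted)
```

Requiring each change to list its predicted fixes and breakages up front lets `prediction_precision` be computed separately for `fixed` and `regressed`, which makes regression blindness visible rather than assumed.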

Example

LangChain's Terminal Bench 2.0 run illustrates the flywheel stages concretely (Improving Deep Agents with Harness Engineering):

  1. Embed signals: Pre-completion checklist middleware intercepted the agent before exit, forcing verification against the task spec
  2. Analyze traces: Trace review surfaced recurring failure clusters -- premature completion, doom loops, and uniform-maximum reasoning timeouts
  3. Generate modifications: Targeted harness changes followed -- self-verification loops, loop-detection middleware tracking per-file edit counts, and the xhigh-high-xhigh reasoning sandwich
  4. Escalate: Each modification was evaluated on the held-out Terminal Bench task set before promotion; the combined harness changes lifted scores from 52.8% to 66.5% with no model change