Observability-Driven Harness Evolution

Pair every harness edit with a self-declared prediction, then verify it against the next round's outcome. The mismatch, not the score, drives convergence.

The Mechanism

Autonomous harness evolution tends to fail in one recurring way: an agent edits prompts, tools, or middleware, scores degrade, and no one can tell which edit caused the change. Trajectories run into millions of tokens and edits compound, so without per-edit attribution the loop collapses into trial and error.

Agentic Harness Engineering (AHE) instruments the loop with three observability pillars so each edit becomes a falsifiable contract (Lin et al., 2026):

Pillar	What it makes legible	Effect on the loop
Component observability	Every editable harness element has a file-level representation; the action space is explicit and revertible	Edits are scoped and rollback is one operation
Experience observability	Multi-million-token trajectories distilled into a layered, drill-down evidence corpus	The evolving agent can actually consume past runs as evidence
Decision observability	Each edit ships with a self-declared prediction, verified against the next round's outcomes	Per-edit attribution; predictions either match or falsify

graph TD
    A[Inspect trajectory corpus] --> B[Propose edit to harness component]
    B --> C[Declare prediction:<br/>'this edit will improve X by Y']
    C --> D[Apply edit to file-level component]
    D --> E[Run eval round]
    E --> F{Prediction verified?}
    F -->|Match| G[Keep edit; update model<br/>of what works]
    F -->|Falsified| H[Revert; recorded as<br/>diagnostic signal]
    G --> A
    H --> A
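The cycle in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AHE implementation: `Edit`, `run_round`, `toy_eval`, and the dict-based harness are hypothetical names, and a prediction is reduced to an expected band of score change in percentage points.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Harness = Dict[str, str]  # component path -> file contents (assumed representation)


@dataclass
class Edit:
    component: str                     # harness file the edit touches
    new_text: str                      # proposed replacement contents
    predicted_pp: Tuple[float, float]  # predicted score change band, in pp


def run_round(
    harness: Harness,
    edit: Edit,
    baseline: float,
    evaluate: Callable[[Harness], float],
    log: List[dict],
) -> Tuple[Harness, float]:
    """Apply one edit, evaluate it, then keep or revert based on the prediction."""
    candidate = {**harness, edit.component: edit.new_text}  # original is untouched
    score = evaluate(candidate)
    delta = score - baseline
    low, high = edit.predicted_pp
    verified = low <= delta <= high
    log.append({"component": edit.component, "delta_pp": delta, "verified": verified})
    if verified:
        return candidate, score  # match: keep the edit
    return harness, baseline     # falsified: revert, the log entry remains


# Toy usage with a stand-in evaluator (hypothetical scoring rule).
def toy_eval(h: Harness) -> float:
    return 66.0 if "import-cycle" in h["checklist.md"] else 64.0


log: List[dict] = []
harness = {"checklist.md": "run tests"}
edit = Edit("checklist.md", "run tests; import-cycle check", (1.5, 3.0))
harness, score = run_round(harness, edit, 64.0, toy_eval, log)
```

The falsified branch returns the untouched harness, so a revert costs nothing beyond discarding the candidate, while the log keeps the diagnostic signal.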

Why Predictions Convert Noise to Signal

Score-only loops produce one bit per round: better or worse. A predicted outcome produces two bits, score direction and prediction accuracy, and the second bit indicates whether the change is explained by the agent's mental model or by chance.

An improvement with a falsified prediction signals an accidental win: the edit worked for a reason the agent did not understand. A regression with a matched prediction means the agent correctly anticipated it — useful for ruling out a hypothesis. This is hypothesis-driven debugging applied to harness mutations: the prediction is the hypothesis, the eval round is the experiment, the mismatch is the diagnostic.
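The two bits give four possible outcomes per round. The sketch below enumerates them; the function name and the labels for the two cases the text does not name ("understood win", "unexplained regression") are assumptions made for illustration, and a zero delta is counted as not improved.

```python
def classify_round(score_delta_pp: float, prediction_verified: bool) -> str:
    """Map (score direction, prediction accuracy) to a diagnostic label."""
    improved = score_delta_pp > 0  # assumption: a flat score is not an improvement
    if improved and prediction_verified:
        return "understood win"          # label assumed, not from the source
    if improved and not prediction_verified:
        return "accidental win"          # worked for a reason the agent missed
    if not improved and prediction_verified:
        return "anticipated regression"  # useful for ruling out a hypothesis
    return "unexplained regression"      # label assumed, not from the source


labels = [
    classify_round(+2.0, True),
    classify_round(+2.0, False),
    classify_round(-0.4, True),
    classify_round(-0.4, False),
]
```

A score-only loop would collapse these four labels into two, which is the attribution the second bit recovers.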

Reflective optimization without this discipline collapses on defective seeds. Gao et al. (2026) measured GEPA dropping GSM8K accuracy from 23.81% to 13.50% on a poor seed prompt: an optimizer working from opaque, label-free trajectories cannot escape a local optimum. Decision observability is the interpretable trace such optimizers lack.

Empirical Result

Ten AHE iterations lifted pass@1 on Terminal-Bench 2 from 69.7% to 77.0%, surpassing the human-designed Codex-CLI harness (71.9%) and self-evolving baselines ACE and TF-GRPO (Lin et al., 2026). The frozen harness transferred without re-evolution: top aggregate success on SWE-bench-verified at 12% fewer tokens than the seed, and +5.1 to +10.1pp gains across three alternate model families — evidence that the evolved components encode general engineering experience, not benchmark-specific tuning.

For comparison, LangChain's manually driven harness changes on Terminal Bench 2.0 moved scores from 52.8% to 66.5% (LangChain, 2026). AHE automates that loop without losing attribution.

Relationship to Adjacent Patterns

Pattern	Scope	Driver
Harness Engineering	Discipline of designing agent environments	Human-led, ongoing
Harness Hill-Climbing	One-variable-at-a-time search using eval scores	Human-driven, no predictions
Agentic Flywheel	Closed-loop self-improvement at high level	Mixed autonomy tiers
Self-Rewriting Meta-Prompt Loop	Prompt edits only	Autonomous, weight-free
Observability-driven evolution	Full harness, file-level components	Autonomous, prediction-verified
Runtime Scaffold Evolution	In-session tool synthesis	Autonomous, ephemeral

Hill-climbing isolates one variable per iteration so attribution is mechanical; AHE isolates predictions per edit so attribution is semantic. The two are compatible — single-variable change reduces the surface area each prediction must cover.

When This Backfires

Defective seed harness: the loop assumes the agent's prior model is roughly correct; on a degenerate seed the same opacity that traps GEPA can trap AHE. A pre-loop validator on the seed is required, not just a per-edit gate.
Weak benchmarks: verified predictions only matter against an eval that captures real failure modes. A benchmark rewarding surface patterns lets the loop converge to a local maximum that fails in production. Rotate eval tasks; see incident-to-eval synthesis.
Sub-frontier models: predicting the effect of edits to your own harness is meta-reasoning. AHE was evaluated on frontier models; weaker ones would likely produce miscalibrated predictions that degrade signal, mirroring the capability threshold in runtime scaffold evolution.
Narrow-scope agents: file-level component representations, layered trajectory corpora, and prediction registries are infrastructure work. For small task sets, manual edits reach a good-enough result faster.

Example

A team's autonomous coding agent has plateaued at 64% pass@1 on their internal eval suite. They wire AHE around the existing harness:

Component observability — each prompt fragment, tool description, and middleware hook becomes a file in harness/components/. Edits are git commits; rollbacks are reverts.
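A file-per-component layout with one-step rollback might look like the sketch below. It uses a small in-process store as a stand-in for git so that it runs anywhere; `ComponentStore` and its methods are hypothetical names, and a real setup would rely on commits and reverts as described above.

```python
import tempfile
from pathlib import Path
from typing import Dict


class ComponentStore:
    """File-per-component store with one-step rollback (stand-in for git)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self._previous: Dict[str, str] = {}  # contents before the latest edit

    def read(self, name: str) -> str:
        return (self.root / name).read_text(encoding="utf-8")

    def edit(self, name: str, new_text: str) -> None:
        path = self.root / name
        # Remember the prior contents; a brand-new component reverts to empty.
        self._previous[name] = path.read_text(encoding="utf-8") if path.exists() else ""
        path.write_text(new_text, encoding="utf-8")

    def revert(self, name: str) -> None:
        # Rollback is a single operation scoped to one component file.
        (self.root / name).write_text(self._previous.pop(name), encoding="utf-8")


store = ComponentStore(Path(tempfile.mkdtemp()))
store.edit("precompletion-checklist.md", "- run tests\n")
store.edit("precompletion-checklist.md", "- run tests\n- import-cycle check\n")
store.revert("precompletion-checklist.md")
restored = store.read("precompletion-checklist.md")
```

Because every edit names exactly one file, the revert undoes that edit and nothing else, which is what lets a prediction be scoped to a single change.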

Experience observability — a pipeline distills each run into a hierarchical record: per-task summary on top, per-turn rationale below, full token stream at the leaves. The agent reads the top two layers and drills down only on flagged failures.
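One way to picture the layered record is as nested dataclasses with a reader that only descends to the leaves when a task is flagged. The field names, the `flagged` marker, and `evidence_view` are assumptions for illustration; the source does not specify a schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TurnRecord:
    rationale: str  # middle layer: why the agent acted on this turn
    tokens: str     # leaf layer: full token stream for the turn


@dataclass
class TaskRecord:
    task_id: str
    summary: str           # top layer: per-task summary
    flagged: bool = False  # assumed to be set by the distillation pipeline
    turns: List[TurnRecord] = field(default_factory=list)


def evidence_view(record: TaskRecord) -> List[str]:
    """Return the lines the evolving agent reads for one task."""
    lines = [f"{record.task_id}: {record.summary}"]
    lines += [f"  turn {i}: {t.rationale}" for i, t in enumerate(record.turns, 1)]
    if record.flagged:  # drill down to the leaves only on flagged failures
        lines += [f"    tokens[{i}]: {t.tokens}" for i, t in enumerate(record.turns, 1)]
    return lines


ok = TaskRecord("task-1", "passed", turns=[TurnRecord("ran tests", "<stream>")])
bad = TaskRecord("task-2", "failed: circular import", flagged=True,
                 turns=[TurnRecord("skipped import check", "<stream>")])
ok_view = evidence_view(ok)
bad_view = evidence_view(bad)
```

The unflagged task costs two lines of context; the flagged one adds its token stream, so the expensive layer is only paid for where it is likely to be diagnostic.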

Decision observability — proposing an "import-cycle check" for the pre-completion checklist, the agent writes a prediction:

edit_id: 2026-04-30-checklist-import-cycle
component: harness/components/precompletion-checklist.md
prediction:
  metric: pass@1
  direction: increase
  magnitude_pp: "+1.5 to +3.0"
  confidence: medium
  rationale: "12% of failed runs in last 50 traces hit circular imports
    that completion left in place; checklist gate should catch them."

The next round scores 67.1%, a gain of about 3.1pp over the 64% baseline. That matches the predicted direction and sits at the upper edge of the +1.5 to +3.0pp band (0.1pp over, which the team counts as a match), so the edit is kept and the accuracy log updated. A later edit predicts +5pp from a tool-description change but scores −0.4pp; it is reverted, and the falsified prediction is logged as a signal that the agent's model of tool selection is incomplete. After ten cycles, every retained edit has a verified prediction and every reverted edit a logged falsification.
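Checking a declared band against an observed delta, and keeping a running accuracy log, can be sketched as follows. The helper names, the optional `tolerance_pp` parameter, and the "+4.0 to +6.0" band used for the tool-description edit are assumptions; only the "+1.5 to +3.0" format comes from the prediction record above.

```python
import re
from typing import List, Tuple


def parse_band(magnitude_pp: str) -> Tuple[float, float]:
    """Parse a band such as '+1.5 to +3.0' into (low, high) percentage points."""
    numbers = re.findall(r"[+-]?\d+(?:\.\d+)?", magnitude_pp)
    if len(numbers) != 2:
        raise ValueError(f"expected two numbers in {magnitude_pp!r}")
    low, high = sorted(float(n) for n in numbers)
    return low, high


def is_verified(magnitude_pp: str, delta_pp: float, tolerance_pp: float = 0.0) -> bool:
    """True when the observed delta falls inside the declared band."""
    low, high = parse_band(magnitude_pp)
    return low - tolerance_pp <= delta_pp <= high + tolerance_pp


def hit_rate(outcomes: List[bool]) -> float:
    """Share of predictions that were verified; 0.0 for an empty log."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


accuracy_log = [
    is_verified("+1.5 to +3.0", 2.0),   # inside the band: keep the edit
    is_verified("+4.0 to +6.0", -0.4),  # hypothetical band: falsified, revert
]
rate = hit_rate(accuracy_log)
```

A falling hit rate over successive rounds is a cheap early warning that the agent's predictions are miscalibrated, independent of whether the score itself is moving.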

Key Takeaways

The unique mechanism is predictions paired with edits, not the loop — the contribution is per-edit attribution, not autonomy
Component observability and revertible edits are prerequisites: without them, predictions cannot be scoped to a single change
Verified-prediction discipline is the interpretable trace that opaque reflective optimizers (Gao et al., 2026) lack
Evidence is on frontier models against held-out benchmarks; defective seeds, weak benchmarks, and sub-frontier models are documented failure modes
Combine with single-variable discipline from harness hill-climbing — narrow edits make narrow predictions, which are easier to falsify

Harness Engineering — manual harness discipline that AHE automates
Harness Hill-Climbing — eval-driven local search, one-variable-at-a-time, human-driven
Agentic Flywheel — closed-loop self-improvement framework that AHE instantiates with observability pillars
Self-Rewriting Meta-Prompt Loop — autonomous prompt rewriting (subset of harness components)
Runtime Scaffold Evolution — ephemeral in-session tool synthesis
Hypothesis-Driven Debugging — same predict-then-verify logic applied to bug fixes
Rollback-First Design — reversibility as a precondition for the loop
Incident-to-Eval Synthesis — sourcing eval tasks from real failures so verified predictions track production reality