Harness Hill-Climbing: Eval-Driven Iterative Improvement of Agent Harnesses¶

Use eval scores as the optimization signal to systematically improve agent harness configuration, replacing ad-hoc prompt tweaking with a structured feedback loop.

The loop¶

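Harness hill-climbing applies local search to agent configuration: run a benchmark suite, make one targeted change, re-score, and keep the change only if the score improves. Then repeat. It is the manual counterpart to automated DSPy prompt search. There are no model changes and no retraining. The eval score is the optimization signal: it plays the role of the objective the search climbs, not a true gradient, because each step is accepted or rejected by comparing two scores.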

graph TD
    A[Baseline eval run] --> B[Generate candidate change]
    B --> C[Score candidate on task suite]
    C --> D{Score improved?}
    D -->|Yes| E[Adopt change as new baseline]
    D -->|No| F[Discard change]
    E --> B
    F --> B
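
The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not LangChain's implementation: `hill_climb`, `candidates`, and `score` are hypothetical names, and `score` stands in for whatever runs the task suite and returns a single number.

```python
from __future__ import annotations

from typing import Callable, Iterable, TypeVar

Config = TypeVar("Config")


def hill_climb(
    baseline: Config,
    candidates: Iterable[Callable[[Config], Config]],
    score: Callable[[Config], float],
) -> tuple[Config, float, list[tuple[int, float, bool]]]:
    """Greedy local search: adopt a candidate only if it beats the current baseline."""
    best, best_score = baseline, score(baseline)  # baseline eval run
    history: list[tuple[int, float, bool]] = []
    for i, propose in enumerate(candidates):
        candidate = propose(best)            # one targeted change to the current baseline
        candidate_score = score(candidate)   # score candidate on the task suite
        adopted = candidate_score > best_score
        history.append((i, candidate_score, adopted))
        if adopted:                          # adopt as new baseline, otherwise discard
            best, best_score = candidate, candidate_score
    return best, best_score, history
```

Each entry in `candidates` receives the current best configuration, so a discarded change never leaks into later iterations.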

LangChain applied this on Terminal Bench 2.0 and moved from 52.8% to 66.5% through harness-only changes (LangChain: Improving Deep Agents with Harness Engineering). Each iteration targeted one variable at a time.

What to tune¶

Tunable variables with measurable impact:

Component	What changes	Signal
System prompt wording	Phrasing of constraints, persona, output format	Task pass rate
Tool descriptions	What each tool claims to do; inclusion/exclusion of examples	Tool call accuracy
Reasoning budget	Token allocation for planning vs. implementation phases	Score vs. cost
Pre-completion checklist	Verification steps agent runs before declaring done	Premature-exit rate
Loop-detection thresholds	Edit counts or retry limits before intervention	Loop frequency
Context injection timing	When reference docs or prior state load into context	Task coherence

The reasoning sandwich pattern is a concrete example: allocating maximum reasoning compute for planning and verification phases with moderate compute for implementation scored 63.6% vs. 53.9% for uniform maximum — a measurable delta from a single configuration change (LangChain).
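
One way to make these components tunable one at a time is to hold them in a single immutable config object. The sketch below is an assumption about how such a config might look: the class name, field names, default values, and effort labels (`"max"`, `"moderate"`, `"high"`) are all hypothetical and do not come from any particular framework. The reasoning sandwich then becomes one value of the per-phase effort field.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class HarnessConfig:
    """Hypothetical container for the tunable components listed in the table."""
    system_prompt: str = "You are a coding agent."
    tool_descriptions: dict[str, str] = field(default_factory=dict)
    # Reasoning effort per phase; labels are illustrative, not a real API.
    reasoning_effort: dict[str, str] = field(
        default_factory=lambda: {
            "planning": "high",
            "implementation": "high",
            "verification": "high",
        }
    )
    pre_completion_checklist: tuple[str, ...] = ()
    max_edits_per_file: int = 10          # loop-detection threshold
    context_injection: str = "upfront"    # when reference docs load


baseline = HarnessConfig()

# A candidate that differs from the baseline in exactly one field.
sandwich = replace(
    baseline,
    reasoning_effort={
        "planning": "max",
        "implementation": "moderate",
        "verification": "max",
    },
)
```

Because the dataclass is frozen, `replace` returns a new object and the baseline stays untouched, which keeps every candidate a clean one-field diff.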

Eval design for tuning¶

The task suite must be representative of production work yet kept separate from it. Otherwise you measure fit to the eval fixture, not real capability.

Isolation: use one task set for tuning and a second, held-out set for final validation. Never tune against the validation set. This is the same discipline as train/validation/test splits in model training, and the same defense against the held-out test gap.
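
A simple way to keep the two sets apart is to assign each task by a stable hash of its ID, so a task never moves between sets across runs. The sketch below assumes tasks have string IDs; the function names and the 30% validation fraction are illustrative choices, not recommendations from the source.

```python
import hashlib


def assign_split(task_id: str, validation_fraction: float = 0.3) -> str:
    """Deterministically map a task ID to 'tuning' or 'validation'."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # First 4 bytes as an integer, scaled into [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "validation" if bucket < validation_fraction else "tuning"


def split_tasks(task_ids, validation_fraction: float = 0.3) -> dict:
    """Partition task IDs into disjoint tuning and validation lists."""
    splits = {"tuning": [], "validation": []}
    for task_id in task_ids:
        splits[assign_split(task_id, validation_fraction)].append(task_id)
    return splits
```

Hashing the ID rather than shuffling means newly added tasks land in a set without reshuffling the existing ones.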

Breadth: include tasks where the target behavior should trigger and tasks where it should not. A harness optimized only on positive cases will over-trigger. Anthropic's eval guidance specifies testing both directions explicitly (Demystifying Evals for AI Agents).

Grading: prefer deterministic outcome graders (pass/fail, schema checks) over LLM-as-judge for the tuning loop. They are cheaper to run repeatedly and keep evaluator variance out of the signal. Use pass^k (the probability that all k independent trials of a task succeed) rather than single-trial pass rate when consistency matters.
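
To make the difference between capability and consistency concrete, here is a sketch of both metrics computed from `n` recorded trials of one task, of which `c` passed. The combinatorial estimators are an assumption about how to compute them (they follow the forms commonly used in code-generation and agent benchmarks); the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k trials passes (capability)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that all k trials pass (consistency, i.e. pass^k)."""
    if not 0 < k <= n or not 0 <= c <= n:
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)
```

For a task that passed 3 of 4 trials, `pass_at_k(4, 3, 2)` is 1.0 while `pass_hat_k(4, 3, 2)` is 0.5: the agent can solve the task, but not reliably.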

Overfitting risk¶

A harness tuned to a specific eval suite can score high on that suite while degrading on real workloads — it over-indexes on surface patterns in eval tasks rather than the underlying capability.

Signs: the tuning-suite score keeps rising while production error rates stay flat or increase; harness changes that "work" are narrow prompt additions that match eval phrasing; the held-out validation score does not track the tuning score.

Mitigations:

Rotate eval tasks: periodically replace tuning tasks with fresh ones drawn from production traces; see Incident-to-Eval Synthesis
Held-out validation: run a final check on a task set that never touched the tuning loop before promoting a harness change
Monitor production: treat the eval score as a leading indicator; production outcomes are ground truth
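
The third sign above, a validation score that stops tracking the tuning score, can be checked mechanically. The sketch below assumes you record one tuning score and one held-out validation score per accepted iteration, as fractions between 0 and 1. The function names and the `max_gap` threshold of 0.05 are hypothetical and should be set from your own run-to-run noise.

```python
def overfit_gap(tuning_scores: list, validation_scores: list) -> float:
    """Gain on the tuning set minus gain on the validation set since iteration 0."""
    tuning_gain = tuning_scores[-1] - tuning_scores[0]
    validation_gain = validation_scores[-1] - validation_scores[0]
    return tuning_gain - validation_gain


def looks_overfit(tuning_scores: list, validation_scores: list,
                  max_gap: float = 0.05) -> bool:
    """Flag a run whose tuning gains are not matched on held-out tasks."""
    return overfit_gap(tuning_scores, validation_scores) > max_gap
```

A flagged run is a prompt to rotate tasks and inspect recent harness changes, not proof of overfitting; production outcomes remain the ground truth.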

When this backfires¶

Hill-climbing finds a local optimum, not a global one — if the baseline sits in a poor region of configuration space, iteration converges to the nearest local peak. Three further conditions degrade the loop:

Benchmark cost exceeds benefit: building a graded task suite takes real effort. For narrow-scope agents, ad-hoc prompt editing reaches good-enough performance faster.
Component interdependencies: single-variable iteration assumes harness components are approximately orthogonal. When prompt phrasing, tool descriptions, and reasoning budget interact, changing one variable can mask or amplify the effect of another, so one-at-a-time deltas become misleading. Where that happens, isometric harness ablation ranks per-subsystem contribution instead.
Benchmark-to-production drift: the eval suite is a snapshot. If the production workload shifts after tuning, the optimized configuration may degrade on new task types — see Incident-to-Eval Synthesis.
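
Component interdependence can be tested directly with a small 2x2 experiment: score the baseline, each change alone, and both changes together. The sketch below is a generic interaction check, not the isometric harness ablation method; the function name is hypothetical and the scores are assumed to be fractions from the same task suite.

```python
def interaction_effect(base: float, a_only: float, b_only: float, both: float) -> float:
    """Difference between the combined gain and the sum of the individual gains.

    Near zero: the two changes are roughly independent.
    Clearly negative: they overlap or interfere.
    Clearly positive: they reinforce each other.
    """
    combined_gain = both - base
    additive_gain = (a_only - base) + (b_only - base)
    return combined_gain - additive_gain
```

A result that sits well outside run-to-run noise suggests the two components should be tuned together rather than in separate iterations.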

One change at a time¶

The hill-climbing loop depends on isolating variables. Changing system prompt wording and tool descriptions in the same iteration conflates two signals — you cannot attribute a score delta to either change.

Single-variable changes make rollback unambiguous; multi-variable changes require untangling which component caused the regression. This is the same principle as incremental verification: small, checkpointed steps, each reversible. Persist each accepted baseline in the benchmark-snapshots table of the quality score rubric so a later regression has a date-stamped configuration to revert to.
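
The single-variable rule can be enforced in code before a candidate is ever scored. The sketch below assumes configurations are flat dictionaries; the helper names are hypothetical, and the JSON row is only a stand-in for whatever format your benchmark-snapshots table actually uses.

```python
import json
from datetime import date


def changed_keys(baseline: dict, candidate: dict) -> list:
    """Return the sorted config keys whose values differ between the two dicts."""
    keys = baseline.keys() | candidate.keys()
    return sorted(k for k in keys if baseline.get(k) != candidate.get(k))


def require_single_change(baseline: dict, candidate: dict) -> str:
    """Raise unless exactly one variable differs; return that variable's name."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return diff[0]


def snapshot_row(config: dict, score: float, changed: str, on: date) -> str:
    """Serialize an accepted baseline as a date-stamped JSON record."""
    return json.dumps(
        {"date": on.isoformat(), "score": score, "changed": changed, "config": config},
        sort_keys=True,
    )
```

Storing the changed variable alongside the full configuration means a later regression can be traced to a specific iteration and reverted without guessing.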

Relationship to continuous improvement¶

Hill-climbing is an eval-mediated version of the continuous agent improvement loop — that loop uses human observation; hill-climbing substitutes measurement. Use continuous improvement to identify which component to target, then hill-climbing to find the best configuration.

The agentic flywheel extends this: agents propose candidate changes automatically, with the eval loop as the validation gate.

Key Takeaways¶

Run a baseline eval suite, change one harness variable, re-score, keep or discard; repeat
LangChain moved its Terminal Bench 2.0 score from 52.8% to 66.5% through harness-only changes using this loop
Tune on one task set; validate on a separate held-out set; monitor production outcomes as ground truth
Prefer deterministic outcome graders (test suites) over LLM-as-judge for tight iteration cycles
Changing one variable per iteration makes score changes attributable and rollback unambiguous
Eval overfitting is real: rotate tuning tasks with fresh ones from production traces and check the held-out validation score to catch it

Agent Harness — the initializer + coding agent structure that harness hill-climbing optimizes
Harness Engineering — designing reliable agent environments
Agentic Flywheel — automated harness self-improvement using the same eval signal
Continuous Agent Improvement — human-driven observation-to-update loop
pass@k and pass^k Metrics — capability vs. consistency metrics for eval measurement
Incident-to-Eval Synthesis — sourcing eval tasks from production failures
Reasoning Budget Allocation — reasoning sandwich as a concrete tunable component
DSPy Programmatic Prompt Optimization — automated prompt search as an alternative to the manual hill-climbing loop, using the same eval signal