
Isometric Harness Ablation

Pin the model, remove one harness subsystem at a time, rerun the benchmark, record the drop. The per-subsystem drop table ranks where to invest next.

Related lesson: Eval-Driven Harness Improvement — this concept features in a hands-on lesson with quizzes.

The Methodology

Harness engineering treats the agent environment as the dominant lever on output quality. Isometric ablation quantifies which part of that environment is doing the work. The walkinglabs course calls it the "controlled variable exclusion test": keep the model fixed, remove subsystems one at a time, measure which removal causes the biggest drop (walkinglabs). Isometric ablation names the same procedure here, by analogy with isometric exercise: hold one variable fixed while loading another.

graph TD
    A[Pin model + benchmark slice] --> B[Baseline run: full harness]
    B --> C[Remove subsystem 1]
    C --> D[Rerun, record drop]
    D --> E[Restore, remove subsystem 2]
    E --> F[Rerun, record drop]
    F --> G[...repeat for each subsystem]
    G --> H[Rank by drop magnitude]
    H --> I[Invest in highest-drop subsystem]

The five subsystems to ablate, following the five-subsystem harness model from the same source (walkinglabs):

| Subsystem | Concrete artifact | What removing it tests |
| --- | --- | --- |
| Instructions | AGENTS.md / CLAUDE.md | How much of the result depended on project context |
| Tools | Shell, file edit, test runner access | How much of the result depended on action affordances |
| Environment | Lockfiles, runtime versions, services | How much of the result depended on reproducible runtime |
| State | PROGRESS.md, commits, session memory | How much of the result depended on cross-turn continuity |
| Feedback | Verification commands, lint, test signals | How much of the result depended on closed-loop correction |
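The procedure can be sketched as a short loop. This is a minimal illustration, not the source's implementation: `run_ablation`, `fake_benchmark`, and the placeholder scores are all hypothetical names and values, and a real `run_benchmark` would launch the pinned model against the benchmark slice with only the listed subsystems enabled.

```python
from typing import Callable, Dict, FrozenSet

SUBSYSTEMS = ("instructions", "tools", "environment", "state", "feedback")


def run_ablation(
    run_benchmark: Callable[[FrozenSet[str]], float],
) -> Dict[str, Dict[str, float]]:
    """Score the full harness once, then once per removed subsystem."""
    full = frozenset(SUBSYSTEMS)
    baseline = run_benchmark(full)
    rows = {}
    for name in SUBSYSTEMS:
        # Only `name` is missing; every other subsystem is restored.
        ablated = run_benchmark(full - {name})
        rows[name] = {
            "baseline": baseline,
            "ablated": ablated,
            "drop": baseline - ablated,
        }
    return rows


def fake_benchmark(enabled: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a real benchmark run (placeholder scores)."""
    placeholder = {
        "instructions": 35,
        "tools": 0,
        "environment": 60,
        "state": 75,
        "feedback": 50,
    }
    missing = set(SUBSYSTEMS) - enabled
    if not missing:
        return 80  # full harness
    (removed,) = missing  # this stub only handles single-subsystem removal
    return placeholder[removed]


rows = run_ablation(fake_benchmark)
for name, row in rows.items():
    print(f"{name:12s} drop = {row['drop']} pp")
```

Passing the enabled set into the benchmark function, rather than mutating a shared harness, is one way to keep each run independent of the previous one.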

The Output Table

Each ablation run produces one row. The table is the deliverable.

| Removed subsystem | Baseline score | Ablated score | Drop |
| --- | --- | --- | --- |
| (none: baseline) | 80% | | |
| Instructions | 80% | 35% | 45 pp |
| Tools | 80% | 0% | 80 pp |
| Environment | 80% | 60% | 20 pp |
| State | 80% | 75% | 5 pp |
| Feedback | 80% | 50% | 30 pp |

The drops rank the subsystems. The operational rule: upgrade the highest-drop subsystem first. Near-zero drops mark simplification candidates — they consume maintenance budget without earning their place (walkinglabs).
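A small sketch of the ranking rule, using the scores from the table above. The function name `rank_drops` and the `near_zero_pp` cutoff of 5 pp are assumptions for illustration; the source does not say where "near-zero" begins, and a sensible cutoff depends on the benchmark's run-to-run noise.

```python
from typing import Dict, List, Tuple


def rank_drops(
    baseline: float, ablated: Dict[str, float], near_zero_pp: float = 5.0
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Return (subsystems sorted by drop, simplification candidates)."""
    drops = [(name, baseline - score) for name, score in ablated.items()]
    ranked = sorted(drops, key=lambda item: item[1], reverse=True)
    # Assumed cutoff: drops at or below near_zero_pp count as "near-zero".
    candidates = [name for name, drop in ranked if drop <= near_zero_pp]
    return ranked, candidates


ablated_scores = {
    "instructions": 35,
    "tools": 0,
    "environment": 60,
    "state": 75,
    "feedback": 50,
}
ranked, simplify = rank_drops(80, ablated_scores)
print(ranked[0])  # highest-drop subsystem: upgrade this first
print(simplify)   # near-zero drops: simplification candidates
```

With these numbers the ranking is tools, instructions, feedback, environment, state, and only state falls under the assumed cutoff.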

Why the Same-Model Constraint Matters

The "isometric" qualifier is load-bearing. Changing model and harness together confounds the score delta — you cannot attribute it to either lever. Pinning the model makes the agent a function of environment alone, so the delta measures environmental marginal product — the standard ablation argument applied to non-model components (arxiv 2604.25850: Agentic Harness Engineering).

Anthropic's 2D retro-game-maker comparison shows the scale: same prompt, same model class, two harnesses. A solo agent produced a non-functional prototype in 20 minutes for $9; a Planner + Generator + Evaluator harness produced a working application in 6 hours for $200. After upgrading to Opus 4.6, Anthropic dropped the sprint construct but kept the planner and evaluator, a removal that corresponds to one row in an isometric ablation table (Anthropic: Harness design for long-running application development).

Pairing With Hill-Climbing and Impermanence

The methodology slots between two adjacent practices:

  • Harness hill-climbing optimizes a single dimension — one change per iteration, accept if the score improves. It assumes you already picked the right dimension; isometric ablation tells you which. The ablation table maps the terrain; the climber traverses it.
  • Harness impermanence flags scaffolding for deletion when native model capability subsumes it. Near-zero ablation drop is the leading indicator: if removing a subsystem barely changes the score, the model already does the work the scaffold provided. Add such subsystems to the simplification log.

The full cycle: ablate to rank, hill-climb the top-ranked subsystem, re-ablate to confirm the rank changed, retire low-drop subsystems.

When This Backfires

Three failure conditions matter:

  • No graded benchmark slice exists. The methodology needs a representative, deterministic eval set. Without one, "remove instructions, rerun" produces noisy drops dominated by per-run variance — the same precondition that gates harness hill-climbing and any eval-driven improvement loop.
  • Components interact non-additively. When feedback and state co-depend — state captures eval output that feedback consumes — removing one alone undercounts it. The Agentic Harness Engineering paper found "harness components interact non-additively, so stacking effective edits caps the aggregate gain" (arxiv 2604.25850). Single-component ablation ranks; it does not quantify in isolation.
  • High per-run variance swamps small drops. On a mature harness with small remaining drops, single-trial ablation cannot separate signal from noise. Use pass^k or multi-trial averaging to estimate the noise floor, and do not trust any drop that falls below it.
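The noise-floor check can be sketched as follows. Everything here is an illustrative assumption rather than the source's method: the trial scores are invented, the two-standard-error threshold (`z=2.0`) is a rough rule of thumb, and `pass_hat_k` assumes pass^k means the probability that k independent trials of a task all succeed, estimated from `n` trials with `c` successes as C(c, k) / C(n, k).

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(trials: Sequence[float]) -> Tuple[float, float]:
    """Mean score and standard error of the mean across repeated runs."""
    mean = statistics.fmean(trials)
    stderr = statistics.stdev(trials) / math.sqrt(len(trials))
    return mean, stderr


def drop_vs_noise(
    baseline_trials: Sequence[float], ablated_trials: Sequence[float], z: float = 2.0
) -> Tuple[float, float, bool]:
    """Return (drop, noise floor, whether the drop exceeds the floor)."""
    b_mean, b_se = mean_and_stderr(baseline_trials)
    a_mean, a_se = mean_and_stderr(ablated_trials)
    drop = b_mean - a_mean
    noise = z * math.sqrt(b_se**2 + a_se**2)  # assumed threshold, not a formal test
    return drop, noise, drop > noise


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of the chance that k trials all pass, given c passes in n trials."""
    return math.comb(c, k) / math.comb(n, k)


# Invented trial scores (percent) for illustration only.
baseline = [80, 78, 82, 79, 81]
without_state = [75, 80, 77, 82, 76]
without_instructions = [35, 33, 38, 36, 34]

print(drop_vs_noise(baseline, without_state))         # small drop, inside the noise
print(drop_vs_noise(baseline, without_instructions))  # large drop, well outside it
print(pass_hat_k(n=5, c=4, k=2))
```

In this invented data the state drop averages 2 pp against a noise floor of roughly 3 pp, so it should not be ranked on a single trial.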

A near-zero drop is also not proof a subsystem is useless: others can compensate when one is removed, masking its true contribution — the compensatory-masquerade caution from neural-network ablation studies. Treat drops as a ranking signal, not a precise measurement.
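One way to probe for compensation is to ablate a pair of subsystems together and compare the joint drop with the sum of the two single drops. The sketch below assumes a hypothetical `score` function and invented scores; a positive result means the pair loses more together than the single-removal rows suggest.

```python
from typing import Callable, FrozenSet, Sequence

ALL = ("instructions", "tools", "environment", "state", "feedback")


def interaction_pp(
    score: Callable[[FrozenSet[str]], float],
    all_subsystems: Sequence[str],
    a: str,
    b: str,
) -> float:
    """Joint drop minus the sum of single drops for subsystems a and b."""
    full = frozenset(all_subsystems)
    base = score(full)
    drop_a = base - score(full - {a})
    drop_b = base - score(full - {b})
    drop_ab = base - score(full - {a, b})
    return drop_ab - (drop_a + drop_b)


# Invented scores keyed by the set of removed subsystems (illustration only).
hypothetical = {
    frozenset(): 80,
    frozenset({"state"}): 75,
    frozenset({"feedback"}): 50,
    frozenset({"state", "feedback"}): 20,
}


def fake_score(enabled: FrozenSet[str]) -> float:
    """Hypothetical stand-in for a benchmark run."""
    return hypothetical[frozenset(ALL) - enabled]


# Single drops are 5 pp and 30 pp, but the joint drop is 60 pp.
print(interaction_pp(fake_score, ALL, "state", "feedback"))
```

Pairwise runs grow quadratically with the number of subsystems, so it may be enough to test only pairs with a plausible dependency, such as state and feedback.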

Example

A team using GPT-4o on a TypeScript + React frontend codebase (~20,000 LOC) ran the methodology in reverse, adding subsystems one at a time instead of removing them, with the model fixed throughout (walkinglabs):

| Stage | Subsystems present | Success rate |
| --- | --- | --- |
| 1: bare baseline | README only | 20% |
| 2: add instructions | AGENTS.md with stack, conventions | 60% |
| 3: add feedback | Verification commands listed | 80% |
| 4: add state | PROGRESS.md between sessions | 80–100% |

Reading the same data as an ablation table: removing instructions costs 40 pp, feedback 20 pp, and state between 0 and 20 pp. Instructions is the load-bearing subsystem here, so invest there first. State is the smallest contributor and the candidate to drop if maintenance gets costly, which is the retire-cleanly case that harness impermanence handles.

Order-of-addition is a real artifact: a different order yields different per-stage deltas. The ranking is more stable than the exact magnitudes.
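The per-stage deltas can be derived mechanically from the staged table. In this sketch each success rate is stored as a (low, high) pair so that the 80–100% range at stage 4 carries through; the `stage_deltas` helper is a hypothetical name, and the deltas describe this one order of addition only.

```python
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # (low, high) success rate in percent

# Success rates from the staged table; single values become (x, x).
stages: List[Tuple[str, Range]] = [
    ("bare baseline", (20, 20)),
    ("instructions", (60, 60)),
    ("feedback", (80, 80)),
    ("state", (80, 100)),
]


def stage_deltas(stages: List[Tuple[str, Range]]) -> Dict[str, Range]:
    """Gain attributed to each added subsystem, as a (min, max) range in pp."""
    deltas = {}
    for (_, (prev_lo, prev_hi)), (name, (lo, hi)) in zip(stages, stages[1:]):
        # Smallest possible gain: new low minus previous high, and vice versa.
        deltas[name] = (lo - prev_hi, hi - prev_lo)
    return deltas


for name, (lo, hi) in stage_deltas(stages).items():
    print(f"{name:12s} {lo} to {hi} pp")
```

Rerunning the same calculation on a different order of addition is a direct way to see how far the magnitudes move while the ranking holds.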

Key Takeaways

  • Pin the model, remove one of {instructions, tools, environment, state, feedback}, rerun the benchmark, record the drop — the resulting table ranks investment priorities
  • The same-model constraint converts the score into a measurement of environmental marginal product; changing both model and harness conflates two signals
  • Subsystems with near-zero drop are simplification candidates — pair with harness impermanence to retire them cleanly
  • The methodology requires a graded benchmark slice; without one, drops are noise rather than signal
  • Single-component ablation ranks subsystems but does not quantify them precisely — components interact non-additively
  • Anthropic's same-prompt, same-model-class retro-game-maker comparison (solo agent vs. Planner+Generator+Evaluator harness) is a worked instance of the methodology at scale