Runtime Harness Adaptation
In deterministic, rule-governed environments, evolve a four-layer harness (environment contracts, procedural skills, action realization, trajectory regulation) from interaction-failure trajectories. The model stays frozen and still gets the lift; the layers transfer across backbones to the extent that the encoded structure is environment-side, not model-side.
Runtime harness adaptation fixes recurring LLM-agent failures by editing the model-environment interface, not the model. Each recurring failure in a training trajectory becomes a rule, skill, validator, or monitor at one of four layers; the harness is held fixed at evaluation. Xu et al. (2026) report improvements in 116 of 126 model-environment settings (about 92%) across 18 backbones, with an average relative gain of 88.5%, and harnesses evolved from a single 4B model transferred to the 17 others.
When the Qualifier Holds
The technique generalises only in deterministic, rule-governed environments with stable tool interfaces and stable success criteria. The authors flag fully open-ended tasks as out of scope (Xu et al., 2026). The seven benchmark environments share fixed APIs, stable policy documents, and reproducible grading: Airline, Retail, Telecom (τ-bench / τ²-bench), ALFWorld, WebShop, OS Interaction, DBBench (AgentBench).
Coding-agent work sits in between: refactoring a typed codebase with linters and tests is rule-governed; building a novel feature from a vague prompt is not. Use the four-layer surface for the rule-governed slices.
The Four Adaptation Layers
```mermaid
graph LR
    M[Frozen model] --> EC[Environment<br/>Contract]
    EC --> PS[Procedural<br/>Skills]
    PS --> AR[Action<br/>Realization]
    AR --> E[Environment]
    E --> TR[Trajectory<br/>Regulation]
    TR --> M
```
| Layer | What it does | Failure mode it catches |
|---|---|---|
| Environment contract | Makes stable constraints, policy clauses, tool-use rules, and known pitfalls explicit before the first turn | Valid syntax, wrong tool usage — model never saw the rule |
| Procedural skill | Skill library distilled from training trajectories; retrieves task-relevant skills as non-parametric guidance | Reasoning gaps the model could fill if shown the right procedure once |
| Action realization | Gate between model output and environment; verifies executability, canonicalises interface errors, blocks deterministically failing actions | Intent that does not map to an executable action; repeated bad-argument calls |
| Trajectory regulation | Post-execution monitor for repetition, stagnation, budget exhaustion; triggers recovery | Degenerate loops, premature termination, runaway budgets |
Each layer is sourced from observed trajectory failures, not from a priori design. The harness evolves: new failure → new rule, skill, validator, or monitor at the matching layer (Xu et al., 2026).
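The failure-to-layer routing can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the `Harness` class, the `Layer` enum, and the four failure-category labels are hypothetical names chosen for this example, and a real taxonomy would come from clustering observed trajectories.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    ENVIRONMENT_CONTRACT = "environment_contract"
    PROCEDURAL_SKILL = "procedural_skill"
    ACTION_REALIZATION = "action_realization"
    TRAJECTORY_REGULATION = "trajectory_regulation"


# Hypothetical failure categories (assumed labels, not from the paper).
FAILURE_TO_LAYER = {
    "policy_violation": Layer.ENVIRONMENT_CONTRACT,
    "reasoning_skip": Layer.PROCEDURAL_SKILL,
    "bad_argument": Layer.ACTION_REALIZATION,
    "loop_or_stall": Layer.TRAJECTORY_REGULATION,
}


@dataclass
class Harness:
    # One list of fixes (rules, skills, validators, monitors) per layer.
    entries: dict[Layer, list[str]] = field(
        default_factory=lambda: {layer: [] for layer in Layer}
    )
    frozen: bool = False

    def evolve(self, failure_category: str, fix: str) -> Layer:
        """Record a fix at the layer that matches the failure category."""
        if self.frozen:
            raise RuntimeError("harness is frozen for evaluation")
        layer = FAILURE_TO_LAYER[failure_category]
        if fix not in self.entries[layer]:  # skip exact duplicates
            self.entries[layer].append(fix)
        return layer

    def freeze(self) -> None:
        """Stop accepting edits once evaluation starts."""
        self.frozen = True


harness = Harness()
harness.evolve("bad_argument", "reject order_id values that are not strings")
harness.evolve("loop_or_stall", "interrupt after three identical actions")
harness.freeze()
```

Keeping the routing in an explicit table makes placement reviewable, and the `frozen` flag mirrors the requirement that the harness is not edited during evaluation.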
Why It Works
Deterministic, rule-governed environments expose a stable interface that the model has not memorised but the harness can encode externally. Failures here cluster at the model-environment seam (invalid tool arguments, wrong API format, repetition loops from ambiguous feedback) rather than in reasoning ability. Encoding the interface once and gating execution against it converts per-call inference into retrieval and gating, which LLMs perform more consistently than reconstruction from weights (Zhou et al., 2026, externalization survey; Xu et al., 2026). Cross-backbone transfer holds only to the extent that what is encoded is environment-side, not model-side: when an "environment" rule is actually a quirk of the source model's tool-call style, transfer collapses (Cursor).
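To make "gating execution against the interface" concrete, here is a minimal sketch of two seam-level checks: a pre-execution gate for tool calls and a post-execution trajectory monitor. Everything in it is assumed for illustration: the tool names in `TOOL_SCHEMAS`, the function names, and the thresholds are hypothetical and are not taken from Xu et al. or any benchmark.

```python
from collections import Counter
from typing import Optional

# Hypothetical tool schema: tool name -> required argument names and types.
TOOL_SCHEMAS = {
    "get_order": {"order_id": str},
    "cancel_order": {"order_id": str, "reason": str},
}


def gate_action(tool: str, args: dict) -> tuple[bool, str]:
    """Return (ok, message); block calls that would fail deterministically."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return False, f"unknown tool '{tool}'; valid tools: {sorted(TOOL_SCHEMAS)}"
    missing = [name for name in schema if name not in args]
    if missing:
        return False, f"{tool}: missing arguments {missing}"
    wrong = [name for name, typ in schema.items() if not isinstance(args[name], typ)]
    if wrong:
        return False, f"{tool}: wrong type for arguments {wrong}"
    return True, "ok"


def check_trajectory(
    actions: list[str], max_repeats: int = 3, budget: int = 30
) -> Optional[str]:
    """Return a recovery reason, or None if the trajectory looks healthy."""
    if len(actions) >= budget:
        return "budget_exhausted"
    # Counts every occurrence of the latest action, not only consecutive ones.
    if actions and Counter(actions)[actions[-1]] >= max_repeats:
        return "repetition"
    return None
```

The gate returns one canonical message per failure type, so the model sees the same feedback for the same mistake instead of whatever raw error the environment would have produced.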
When This Backfires
- Open-ended tasks. When tool interfaces, success criteria, or contracts shift per task (research agents, exploratory coding, novel-feature builds), the four layers have no stable surface to encode. The authors flag this directly (Xu et al., 2026). Reach for harness impermanence instead.
- Frontier model migrations. When the next model handles an action natively — structured output, native tool-call repair, internal stopping — action-realization and trajectory-regulation layers become depreciating capital. Cursor measured a 30% drop on GPT-5-Codex when reasoning summaries were stripped by a harness rule designed for a prior model (Cursor).
- Provider-specific overfit. Skills or contracts distilled from one model's trajectories can mask model-specific bias. Procedural-skill retrieval can fire a skill that contradicts the deployed model's preferred pattern (per-model harness tuning; Cursor).
- No eval loop. Evolution assumes a deterministic benchmark that can credibly score interventions. Without one, layers accumulate without falsification — see observability-driven harness evolution.
Practical Application
- Confirm the qualifier. Deterministic and rule-governed? If not, prefer the five-failure-layers diagnostic.
- Mine failure trajectories. Collect runs that miss the success criterion; cluster by failure category.
- Place each fix at the right layer. Policy violation → environment contract. Reasoning skip → procedural skill. Bad argument → action realization. Loop or stall → trajectory regulation. Mis-placement is a transferability bug.
- Hold the harness fixed at evaluation. Adapt from training trajectories; evaluate the frozen harness on held-out tasks.
- Re-ablate on model swap. Reported transfer can hide model-side rules. Run isometric harness ablation per backbone; drop rules whose lift evaporates.
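The last step can be sketched as a simple leave-one-out comparison per backbone. This is a simplified stand-in and may differ from the isometric ablation procedure referenced above. The function name, the rule and backbone names, the scores, and the `min_lift` threshold are all made up for illustration.

```python
def rules_to_drop(
    full: dict[str, float],
    ablated: dict[str, dict[str, float]],
    min_lift: float = 0.01,
) -> dict[str, list[str]]:
    """List, per backbone, the rules whose removal costs less than min_lift.

    full[backbone] is the score with the whole harness.
    ablated[rule][backbone] is the score with that one rule removed.
    """
    drops: dict[str, list[str]] = {backbone: [] for backbone in full}
    for rule, per_backbone in ablated.items():
        for backbone, score_without in per_backbone.items():
            lift = full[backbone] - score_without
            if lift < min_lift:
                drops[backbone].append(rule)
    return drops


# Illustrative numbers only, not measured results.
full_scores = {"model_a": 0.62, "model_b": 0.55}
ablated_scores = {
    "validate_order_id": {"model_a": 0.50, "model_b": 0.45},
    "strip_reasoning_summaries": {"model_a": 0.58, "model_b": 0.55},
}
drops = rules_to_drop(full_scores, ablated_scores)
```

In this made-up data, `strip_reasoning_summaries` earns its place on `model_a` but contributes nothing on `model_b`, which is the pattern to look for after a model swap.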
Key Takeaways
- The four-layer surface — environment contract, procedural skills, action realization, trajectory regulation — is the unit of harness adaptation; place each fix at the matching layer.
- Cross-model transfer holds only to the extent the encoded structure is environment-side. Re-ablate on every model swap.
- The technique is scoped to deterministic, rule-governed environments. Open-ended tasks defeat the assumption — use harness impermanence and per-model harness tuning instead.
- Evolution from failure trajectories assumes a deterministic eval loop. Without one, layers accumulate without falsification.
Related
- Harness Engineering — the broader discipline of designing agent environments
- Per-Model Harness Tuning — when transfer breaks, declare model-keyed overrides
- Isometric Harness Ablation — quantify which layer earns its weight on each backbone
- Five-Failure-Layers Diagnostic — attribute failures before reaching for a model swap
- Harness Impermanence — design layers for removal when the next model subsumes them