Five-Failure-Layers Diagnostic: Attribute Before Swapping the Model
Force every observed agent failure through a fixed harness-layer attribution before swapping models. "The model is dumb" almost always resolves to a specific layer.
The Discipline
When a coding agent fails, the default reaction is to suspect the model. The discipline inverts that: name the responsible harness layer first, and swap the model only after every layer has been audited and ruled out. A fixed enumeration enforces the discipline, because open-ended root-cause exploration lets teams gravitate to "try a bigger model" when that conclusion is easier than fixing the actual gap.
The five working layers, derived from harness engineering literature (OpenAI, Anthropic, LangChain):
graph TD
F[Agent failure] --> T[Task specification]
F --> C[Context provision]
F --> E[Execution environment]
F --> V[Verification feedback]
F --> S[State management]
T -.-> M[Swap model<br/>only after all 5 cleared]
C -.-> M
E -.-> M
V -.-> M
S -.-> M
The Five Layers
| Layer | Diagnostic question | Canonical fix |
|---|---|---|
| Task specification | Did the prompt define a verifiable goal, or did the agent have to guess? | Write an explicit Definition of Done — endpoints, schemas, commands (OpenAI: humans specify intent) |
| Context provision | Could the agent find architectural conventions, version pins, and project rules in the repo? | A short AGENTS.md (~100 lines) pointing into structured docs/ (OpenAI) |
| Execution environment | Did the agent waste context on broken dev setup, missing deps, or wrong tool versions? | An init.sh the agent reads at session start; isolated worktree-per-task |
| Verification feedback | Were there machine-checkable signals (tests, lint, type checks) the agent could run, or did it self-grade? | Wire pre-completion checklists and CI gates to close the verification gap |
| State management | Did the agent recover prior progress, or did each session re-discover everything? | Progress files, structured feature lists, git commit log; Anthropic uses a JSON feature list because models are less likely to inappropriately rewrite JSON than Markdown (Anthropic) |
The 5-layer enumeration is one working cut. Anthropic's own table names four (premature victory, broken environment, premature feature-completion, undocumented run procedure); agent-debugging names a different four (missing context, conflicting instructions, missing tools, capability ceiling). The exact count matters less than the discipline of fixed attribution — pick a working enumeration and force every failure through it.
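One way to make the enumeration genuinely fixed is to encode it as a closed type, so an attribution outside the five layers is rejected rather than silently accepted. The following is a minimal sketch under that assumption; `Layer`, `Attribution`, and `attribute` are hypothetical names chosen for illustration, not part of any cited harness.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    """The five working layers from the table above."""
    TASK_SPECIFICATION = "task specification"
    CONTEXT_PROVISION = "context provision"
    EXECUTION_ENVIRONMENT = "execution environment"
    VERIFICATION_FEEDBACK = "verification feedback"
    STATE_MANAGEMENT = "state management"


@dataclass(frozen=True)
class Attribution:
    failure_id: str  # hypothetical identifier for the observed failure
    layer: Layer
    evidence: str    # what in the run pointed at this layer


def attribute(failure_id: str, layer_name: str, evidence: str) -> Attribution:
    """Record an attribution, rejecting any layer outside the fixed enumeration."""
    try:
        layer = Layer(layer_name)
    except ValueError:
        raise ValueError(
            f"{layer_name!r} is not one of the fixed layers; "
            "record it in the novel-failure log instead"
        ) from None
    return Attribution(failure_id, layer, evidence)
```

With this shape, `attribute("run-17", "context provision", "no AGENTS.md in repo")` succeeds, while a free-form bucket such as `"model is dumb"` raises `ValueError`. Swapping in a different working enumeration means editing `Layer` in one place.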
Why a Fixed Enumeration
Free-form root-cause analysis lets the team invent new layers to avoid investigating the real one. A fixed list closes that escape. The cost — that genuinely novel failure modes get mis-attributed — is real but bounded; the more common failure is teams concluding "the model isn't good enough" without checking the five layers (Anthropic: model swap is the most expensive option).
Independent quantification: LangChain raised Terminal Bench 2.0 from 52.8% to 66.5% through pure harness changes — no model change (LangChain). The layers were the bottleneck.
Diagnostic Loop
- Run the agent. Observe the failure.
- Attribute to one of the five layers — the agent debugging step. If unattributable, add it to a separate "novel failure" log — do not invent a sixth bucket on the fly. Recent work operationalizes this attribution step directly: a method that localizes which harness layer is responsible for a failure from failed-trajectory evidence, rather than leaving the layer to a manual guess (From Failed Trajectories to Reliable LLM Agents). Runtime harness adaptation takes the next step — turning each attributed failure into a rule, skill, validator, or monitor at the matching interface layer.
- Fix that layer. Commit the fix back into the repo so all future sessions inherit it; this is the loop that runtime harness adaptation automates.
- Re-run the same task. If it succeeds, the attribution was correct.
- If all five layers have been cleared on the same task class and the agent still fails — only then evaluate a model swap.
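The steps above can be sketched as a small control loop. This is an illustrative sketch, not a published implementation: `run_task`, `has_gap`, and `fix_layer` are hypothetical callables standing in for re-running the agent, auditing a layer, and committing a fix to the repo. The novel-failure log is omitted here to keep the sketch short.

```python
from typing import Callable

LAYERS = (
    "task specification",
    "context provision",
    "execution environment",
    "verification feedback",
    "state management",
)


def diagnostic_loop(
    run_task: Callable[[], bool],
    has_gap: Callable[[str], bool],
    fix_layer: Callable[[str], None],
) -> tuple[str, list[str]]:
    """Audit layers in a fixed order; suggest a model swap only as the last step."""
    fixed: list[str] = []
    if run_task():
        return "no failure", fixed
    for layer in LAYERS:
        if not has_gap(layer):
            continue  # layer audited and ruled out
        fix_layer(layer)  # in practice: commit the fix back into the repo
        fixed.append(layer)
        if run_task():  # re-run the same task to confirm the attribution
            return "harness fixed", fixed
    return "evaluate model swap", fixed


# Simulated scenario: the task passes once every open gap is closed.
open_gaps = {"context provision", "verification feedback"}
outcome, fixed_layers = diagnostic_loop(
    run_task=lambda: not open_gaps,
    has_gap=lambda layer: layer in open_gaps,
    fix_layer=open_gaps.discard,
)
print(outcome, fixed_layers)
```

The only path that returns "evaluate model swap" is the one where every layer has been visited and the task still fails, which mirrors the last step of the loop.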
Example
A team using Claude Sonnet 4 to add an API endpoint observes repeated failures: the code runs locally but fails in staging because of wrong SQLAlchemy syntax. The default reaction: "we need Opus."
Layer-by-layer attribution:
- Task specification: prompt was "add user preferences endpoints under /api/v2/users", with no schema, no auth contract, no completion criteria. Gap identified.
- Context provision: no AGENTS.md; project conventions (SQLAlchemy 2.0, OAuth 2.0) lived only in Slack. Gap identified.
- Execution environment: dev container worked; not the layer.
- Verification feedback: no pre-completion hook running pytest && mypy; the agent self-declared completion. Gap identified.
- State management: single-session task; not the layer.
The team writes an AGENTS.md (~80 lines, version pins and conventions), adds completion criteria to the prompt template, and wires a pre-commit hook running pytest tests/api/ && mypy src/. The same model then succeeds across three independent runs at ~60% better context efficiency (Walking Labs: Lecture 01, a re-tellable scenario the lecture documents).
The model was never the problem. The harness was.
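The verification fix in this example could be wired as a small check runner that the agent, or a hook, must pass before declaring completion. The sketch below assumes the two commands from the example; `CHECKS`, `run_checks`, `main`, and the injectable `runner` parameter are hypothetical names for illustration. Unlike `pytest tests/api/ && mypy src/`, which stops at the first failure, this version runs every check and reports all failures.

```python
import subprocess
import sys

# Commands taken from the example; the paths are specific to that scenario.
CHECKS = [
    ["pytest", "tests/api/"],
    ["mypy", "src/"],
]


def run_checks(checks, runner=subprocess.run):
    """Run each check and return the commands that failed (empty list = all passed)."""
    failed = []
    for cmd in checks:
        try:
            ok = runner(cmd).returncode == 0
        except FileNotFoundError:
            ok = False  # a tool that is not installed counts as a failed check
        if not ok:
            failed.append(" ".join(cmd))
    return failed


def main() -> int:
    """Return a non-zero exit code if any check failed, so a hook can block on it."""
    failed = run_checks(CHECKS)
    for cmd in failed:
        print(f"check failed: {cmd}", file=sys.stderr)
    return 1 if failed else 0
```

A hook script would end with `sys.exit(main())`. The `runner` parameter exists only so the logic can be exercised without invoking the real tools.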
When This Backfires
- Novel failure modes outside the enumeration — prompt-injection-induced behavior, cross-session memory poisoning, MCP tool-call hallucinations against a freshly added server. Forcing these into the five buckets produces wrong fixes. Keep an explicit "novel failure" log so genuinely new modes accumulate visibly instead of being mis-attributed.
- Mature harnesses near the capability ceiling — once a team has spent months on legibility, mechanical enforcement, and verification, remaining failures genuinely concentrate at the model layer (consistent capability fallacy). The five-layer audit becomes ceremony that delays the right call.
- Single-engineer prototypes — the discipline assumes a team that benefits from a shared vocabulary. Solo work on throwaway code pays the ceremony cost without the coordination return.
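The "novel failure" log mentioned in the first item can be as simple as an append-only JSON Lines file kept in the repo. This is a sketch under that assumption; the field names and the helpers `log_novel_failure` and `read_novel_failures` are hypothetical, and any durable, reviewable format would serve the same purpose.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_novel_failure(log_path: Path, task: str, description: str) -> dict:
    """Append one unattributable failure as a JSON line and return the entry."""
    entry = {
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "description": description,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_novel_failures(log_path: Path) -> list[dict]:
    """Load all logged entries; a missing file means nothing has been logged yet."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Because entries are only appended, recurring descriptions become visible over time, which is the signal that the working enumeration may need a deliberate revision rather than an ad hoc sixth bucket.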
Key Takeaways
- "The model is dumb" almost always resolves to a specific harness layer once forced through a fixed attribution.
- Use a working enumeration of five layers — task spec, context, execution environment, verification, state — but treat the exact count as a working cut, not canonical. The discipline of fixed attribution is what matters.
- Model swap is the last hypothesis, not the first. LangChain demonstrated +13.7 points on Terminal Bench 2.0 from harness changes alone.
- Keep a "novel failure" log so the enumeration cannot quietly grow to fit anything.
Related
- Harness Engineering — the discipline the five layers operationalize as a diagnostic loop
- Agent Debugging — a different working enumeration (4 modes) for diagnosing bad agent output
- The Consistent Capability Fallacy — why "the model already passed a harder task" is not evidence
- Pre-Completion Checklists — the verification-feedback layer in concrete form
- AGENTS.md as Table of Contents — the context-provision layer in concrete form
- Trajectory Decomposition Diagnosis — finer-grained diagnosis below the harness-layer level
- Harness Engineering Method Map — design dimensions the layer fixes draw from
- Runtime Harness Adaptation — places each diagnosed failure at one of four interface layers in deterministic, rule-governed environments