Trajectory Decomposition: Diagnose Where Coding Agents Fail¶
Decompose an agent's trajectory into search, read, and edit stages, scoring each independently to diagnose where a failing run went wrong.
The problem with binary outcomes¶
pass@k metrics tell you whether an agent solved a problem. Outcome grading tells you whether the final state is correct. Neither tells you where the agent went wrong when it fails.
A coding agent that fails a SWE-bench task could have failed at any point: wrong files, wrong functions, or wrong edits. Binary metrics collapse these into a single "fail," making targeted improvement impossible.
Three-stage decomposition¶
The TRAJEVAL framework decomposes every agent trajectory into three stages, each measured with standard information retrieval metrics. [Source: TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis]
graph LR
S[Search<br/>File localization] --> R[Read<br/>Function comprehension]
R --> E[Edit<br/>Modification targeting]
S -.-> SP[Precision: did it open<br/>only relevant files?]
S -.-> SR[Recall: did it find<br/>all necessary files?]
R -.-> RP[Precision: did it read<br/>only needed functions?]
R -.-> RR[Recall: did it read<br/>all needed functions?]
E -.-> EP[Precision: did it edit<br/>only the right locations?]
E -.-> ER[Recall: did it edit<br/>all required locations?]
| Stage | What it measures | Precision question | Recall question |
|---|---|---|---|
| Search | File localization | Did it open only relevant files? | Did it find all necessary files? |
| Read | Function comprehension (semantic context loading) | Did it examine only needed functions? | Did it examine all needed functions? |
| Edit | Modification targeting | Did it change only the right locations? | Did it change all required locations? |
Compare each stage against the reference patch to compute precision and recall independently — the complement to outcome grading, which scores only the final state.
Stage independence is why this works: you compute precision and recall at each stage against the same reference, so a failure in one stage does not distort the scores in others.
What the evidence shows¶
Analysis of 16,758 trajectories across three architectures and seven models reveals patterns that binary metrics hide. [Source: TRAJEVAL]
Universal over-reading¶
All agents examine about 22x more functions than necessary — a structural property of how agents explore code, not a model-specific bug. Reducing read scope has the highest ROI for most configurations.
Model-specific failure stages¶
Different models fail at different stages:
| Model | Primary failure stage | Implication |
|---|---|---|
| GPT-5 | Edit (targets wrong locations) | Improve edit targeting — search and read are adequate |
| Qwen-32B | Search (misses files entirely) | Improve file discovery — edits are accurate when it finds the right code |
A single Pass@1 score would rank both equally. Stage decomposition reveals they need opposite interventions.
Predictive power¶
Stage-level metrics predict Pass@1 within 0.87-2.1% MAE at the model level, reconstructing aggregate outcomes while providing richer diagnostic signal.
Applying this in practice¶
1. Log trajectories with stage boundaries¶
Capture which files were opened (search), which functions were read (read), and which locations were modified (edit). Trajectory logging provides the capture layer.
2. Compute per-stage precision and recall¶
For each failed task, compare agent actions against the reference at each stage:
search_precision = |files_opened ∩ files_in_patch| / |files_opened|
search_recall = |files_opened ∩ files_in_patch| / |files_in_patch|
Apply the same formula at the read and edit levels.
3. Diagnose before optimizing¶
| Symptom | Stage bottleneck | Likely fix |
|---|---|---|
| Low search recall | Agent misses relevant files | Better repository maps, improved file discovery tools |
| Low read precision | Agent reads too many functions | Tighter context filtering, semantic context loading |
| Low edit precision | Agent modifies wrong locations | More specific edit instructions, constraint-based prompting |
4. Inject real-time feedback¶
Stage-level signals extend beyond post-hoc analysis. Feeding trajectory diagnostics back during execution improved two models by 2.2-4.6 percentage points while reducing token costs by 20-31% — aligning with agent self-review loops. [Source: TRAJEVAL]
When to use it, and when not to¶
Outcome grading is the right default — it avoids penalizing valid alternative solutions. Add trajectory decomposition when an agent is failing and you need to know why, when comparing models by failure profile, or when deciding which component (search, context, edit) to improve next.
Skip it when:
- No reference patch: precision and recall require a known-correct solution. Open-ended tasks and production settings without ground truth cannot be evaluated this way — fall back to outcome grading, which needs no reference patch.
- Non-sequential stages: the model assumes forward-linear traversal. Agents that interleave stages (read → search → read → edit) produce ambiguous per-stage metrics.
- Uninstrumented trajectories: stage decomposition requires trajectory logs that separate file-open, function-read, and location-edit events. Agents wrapped in opaque APIs or sandboxes cannot be decomposed.
Key Takeaways¶
- Binary pass/fail metrics hide where agents fail — decompose into search, read, and edit stages
- All current agents over-read by ~22x — reducing read scope is the highest-ROI optimization for most setups
- Different models fail at different stages — diagnose first, then apply model-specific fixes
- Stage-level feedback during execution (not just post-hoc) improves outcomes and reduces cost
- Use outcome grading for scoring, trajectory decomposition for diagnosis — they serve different purposes
Related¶
- pass@k Metrics — the aggregate metric trajectory decomposition diagnoses
- Outcome Grading — complement to trajectory decomposition for scoring
- Completion Failure Taxonomy — categorizes why code suggestions fail
- Behavioral Testing for Non-Deterministic AI Agents — stage-level behavioral verification approach
- Trajectory-Opaque Evaluation Gap — where outcome-only grading misses safety and robustness signals that trajectory-aware auditing catches