Skip to content

Trajectory Decomposition: Diagnose Where Coding Agents Fail

Decompose an agent's trajectory into search, read, and edit stages, scoring each independently to diagnose where a failing run went wrong.

The problem with binary outcomes

pass@k metrics tell you whether an agent solved a problem. Outcome grading tells you whether the final state is correct. Neither tells you where the agent went wrong when it fails.

A coding agent that fails a SWE-bench task could have failed at any point: wrong files, wrong functions, or wrong edits. Binary metrics collapse these into a single "fail," making targeted improvement impossible.

Three-stage decomposition

The TRAJEVAL framework decomposes every agent trajectory into three stages, each measured with standard information retrieval metrics. [Source: TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis]

graph LR
    S[Search<br/>File localization] --> R[Read<br/>Function comprehension]
    R --> E[Edit<br/>Modification targeting]

    S -.-> SP[Precision: did it open<br/>only relevant files?]
    S -.-> SR[Recall: did it find<br/>all necessary files?]
    R -.-> RP[Precision: did it read<br/>only needed functions?]
    R -.-> RR[Recall: did it read<br/>all needed functions?]
    E -.-> EP[Precision: did it edit<br/>only the right locations?]
    E -.-> ER[Recall: did it edit<br/>all required locations?]
Stage What it measures Precision question Recall question
Search File localization Did it open only relevant files? Did it find all necessary files?
Read Function comprehension (semantic context loading) Did it examine only needed functions? Did it examine all needed functions?
Edit Modification targeting Did it change only the right locations? Did it change all required locations?

Compare each stage against the reference patch to compute precision and recall independently — the complement to outcome grading, which scores only the final state.

Stage independence is why this works: you compute precision and recall at each stage against the same reference, so a failure in one stage does not distort the scores in others.

What the evidence shows

Analysis of 16,758 trajectories across three architectures and seven models reveals patterns that binary metrics hide. [Source: TRAJEVAL]

Universal over-reading

All agents examine about 22x more functions than necessary — a structural property of how agents explore code, not a model-specific bug. Reducing read scope has the highest ROI for most configurations.

Model-specific failure stages

Different models fail at different stages:

Model Primary failure stage Implication
GPT-5 Edit (targets wrong locations) Improve edit targeting — search and read are adequate
Qwen-32B Search (misses files entirely) Improve file discovery — edits are accurate when it finds the right code

A single Pass@1 score would rank both equally. Stage decomposition reveals they need opposite interventions.

Predictive power

Stage-level metrics predict Pass@1 within 0.87-2.1% MAE at the model level, reconstructing aggregate outcomes while providing richer diagnostic signal.

Applying this in practice

1. Log trajectories with stage boundaries

Capture which files were opened (search), which functions were read (read), and which locations were modified (edit). Trajectory logging provides the capture layer.

2. Compute per-stage precision and recall

For each failed task, compare agent actions against the reference at each stage:

search_precision = |files_opened ∩ files_in_patch| / |files_opened|
search_recall    = |files_opened ∩ files_in_patch| / |files_in_patch|

Apply the same formula at the read and edit levels.

3. Diagnose before optimizing

Symptom Stage bottleneck Likely fix
Low search recall Agent misses relevant files Better repository maps, improved file discovery tools
Low read precision Agent reads too many functions Tighter context filtering, semantic context loading
Low edit precision Agent modifies wrong locations More specific edit instructions, constraint-based prompting

4. Inject real-time feedback

Stage-level signals extend beyond post-hoc analysis. Feeding trajectory diagnostics back during execution improved two models by 2.2-4.6 percentage points while reducing token costs by 20-31% — aligning with agent self-review loops. [Source: TRAJEVAL]

When to use it, and when not to

Outcome grading is the right default — it avoids penalizing valid alternative solutions. Add trajectory decomposition when an agent is failing and you need to know why, when comparing models by failure profile, or when deciding which component (search, context, edit) to improve next.

Skip it when:

  • No reference patch: precision and recall require a known-correct solution. Open-ended tasks and production settings without ground truth cannot be evaluated this way — fall back to outcome grading, which needs no reference patch.
  • Non-sequential stages: the model assumes forward-linear traversal. Agents that interleave stages (read → search → read → edit) produce ambiguous per-stage metrics.
  • Uninstrumented trajectories: stage decomposition requires trajectory logs that separate file-open, function-read, and location-edit events. Agents wrapped in opaque APIs or sandboxes cannot be decomposed.

Key Takeaways

  • Binary pass/fail metrics hide where agents fail — decompose into search, read, and edit stages
  • All current agents over-read by ~22x — reducing read scope is the highest-ROI optimization for most setups
  • Different models fail at different stages — diagnose first, then apply model-specific fixes
  • Stage-level feedback during execution (not just post-hoc) improves outcomes and reduces cost
  • Use outcome grading for scoring, trajectory decomposition for diagnosis — they serve different purposes
Feedback