
Trajectory Pre-Filter for Failure Diagnosis (TrajAudit)

Pre-filter long agent trajectories with pattern matching and seed an investigator LLM with a test-report-derived preliminary diagnosis so long-context attention concentrates on failure-relevant spans.

The technique applies when a coding agent has already failed a repository-scale task, the trajectory is long and noisy, and a structured test-failure report exists. Two pre-filters wrap the investigator LLM: a pattern-matching noise filter that strips redundant program structure and verbose code context, and a preliminary diagnosis module that converts the test-failure report into prior hypotheses the investigator consults before traversing the trajectory (TrajAudit, arxiv 2605.26563).

When This Applies

Confirm all three conditions before adopting:

  • Trajectory length exceeds the investigator's effective long-context budget. TrajAudit targets repository-level coding-agent runs whose trajectories are "very long, while long-context reasoning remains a known weakness of LLMs" (arxiv 2605.26563). Short trajectories that already fit comfortably in the investigator's window do not benefit — the filter adds latency without recall improvement.
  • A structured test-failure report exists. The preliminary diagnosis module is seeded from the test-failure artifact; without one, the investigator has no prior to anchor on and the second pre-filter contributes nothing (arxiv 2605.26563).
  • Trajectory noise is dominated by predictable patterns. Pattern matching only removes failure-irrelevant content when "redundant program structure and verbose code context" actually compose the bulk of trajectory tokens (arxiv 2605.26563). Trajectories whose noise is heterogeneous or context-dependent leave the filter with little to cut.

If any condition is missing, fall back to agent debugging for the four-mode taxonomy, or trajectory decomposition when the goal is per-stage grading rather than root-cause localization.
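The three conditions can be sketched as a small gating check. This is only an illustration: `RunStats`, `should_prefilter`, and both thresholds are hypothetical names and values, not part of TrajAudit, and the 50k default simply echoes the small-harness figure used later on this page.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    trajectory_tokens: int     # total tokens in the captured trajectory
    has_test_report: bool      # a structured test-failure report exists
    pattern_noise_tokens: int  # tokens matched by the noise patterns


def should_prefilter(stats, context_budget_tokens=50_000, min_noise_share=0.5):
    """Return (decision, reasons); reasons lists every unmet condition.

    Both thresholds are assumed defaults and should be tuned per investigator model.
    """
    reasons = []
    if stats.trajectory_tokens <= context_budget_tokens:
        reasons.append("trajectory fits the investigator's budget")
    if not stats.has_test_report:
        reasons.append("no structured test-failure report")
    noise_share = stats.pattern_noise_tokens / max(stats.trajectory_tokens, 1)
    if noise_share < min_noise_share:
        reasons.append("noise is not dominated by predictable patterns")
    return (not reasons, reasons)
```

Returning the list of unmet conditions, rather than a bare boolean, makes it easier to decide which fallback applies.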

The Two Pre-Filters

graph LR
    A[Failed run<br/>trajectory + test report] --> B[Noise filter<br/>pattern match]
    A --> C[Preliminary diagnosis<br/>from test report]
    B --> D[Investigator LLM<br/>retrieves on demand]
    C --> D
    D --> E[Localized failure cause]

1. Noise filter

Pattern matching and keyword detection strip failure-irrelevant trajectory content before the investigator sees it. Redundant program structure (repeated imports, scaffolding boilerplate) and verbose code context (full file dumps where a function suffices) are the named targets (arxiv 2605.26563). The filter is heuristic, not semantic: it cuts the tokens competing for the investigator's attention rather than summarizing their meaning.
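A minimal sketch of such a filter, assuming the trajectory is available as a list of step strings and that the agent works on Python code. The regex, the `filter_noise` name, and the `max_lines` cutoff are illustrative assumptions; the paper's actual patterns are not reproduced here.

```python
import re

# Heuristic: a line that looks like a Python import statement.
IMPORT_RE = re.compile(r"^\s*(?:import\s+\S.*|from\s+\S+\s+import\s+\S.*)$")


def filter_noise(steps, max_lines=30):
    """Drop repeated import lines and truncate long dumps in each step."""
    seen_imports = set()
    filtered = []
    for step in steps:
        kept = []
        for line in step.splitlines():
            if IMPORT_RE.match(line):
                key = line.strip()
                if key in seen_imports:
                    continue  # repeated import: redundant program structure
                seen_imports.add(key)
            kept.append(line)
        if len(kept) > max_lines:
            # Verbose code context: keep the head and record how much was cut.
            elided = len(kept) - max_lines
            kept = kept[:max_lines] + [f"[... {elided} lines elided by noise filter ...]"]
        filtered.append("\n".join(kept))
    return filtered
```

Leaving an explicit elision marker keeps the cut visible, so a later stage can ask for the full span when a hypothesis points into it.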

2. Preliminary diagnosis

The test-failure report (stack trace, assertion failure, error message) is converted into initial diagnostic hypotheses before the investigator traverses the trajectory. The hypotheses act as a prior — the investigator confirms, refines, or rejects them rather than starting from a blank slate against the full noisy trajectory (arxiv 2605.26563).
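One way to picture the conversion, assuming the report contains a standard Python traceback: each stack frame becomes a candidate location, with the innermost frame ranked first. `Hypothesis` and `preliminary_diagnosis` are hypothetical names, and ranking by frame depth is an assumption made for this sketch rather than the paper's method.

```python
import re
from dataclasses import dataclass

# Matches CPython traceback frame lines: File "path", line N, in func
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


@dataclass
class Hypothesis:
    path: str
    line: int
    func: str
    rank: int  # 0 = innermost frame, i.e. closest to the raised error


def preliminary_diagnosis(report):
    """Return (error_message, hypotheses) derived from a traceback string."""
    frames = [m.groupdict() for m in FRAME_RE.finditer(report)]
    hypotheses = [
        Hypothesis(f["path"], int(f["line"]), f["func"], rank)
        for rank, f in enumerate(reversed(frames))
    ]
    text = report.strip()
    # The last line of a traceback carries the exception type and message.
    error = text.splitlines()[-1] if text else ""
    return error, hypotheses
```

The output is a prior, not a verdict: the investigator is still expected to confirm, refine, or reject each candidate.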

3. On-demand retrieval

This third step is not a pre-filter; it describes how the investigator consumes the output of the two above. The investigator retrieves filtered trajectory spans on demand rather than ingesting everything, which is closer to retrieval-augmented investigation than to full-context inspection. The filter and the prior together decide what to pull next (arxiv 2605.26563).
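On-demand retrieval can be sketched as a keyword-scored lookup over the filtered steps. The scoring rule and the `retrieve_spans` name are assumptions for illustration only; a real investigator would more likely issue retrieval calls as a tool and choose its own queries.

```python
def retrieve_spans(filtered_steps, keywords, already_seen=(), k=3):
    """Return up to k (index, text) pairs that mention the keywords most often.

    Steps whose index is in already_seen are skipped, so repeated calls
    walk further down the ranking instead of returning the same spans.
    """
    seen = set(already_seen)
    scored = []
    for idx, text in enumerate(filtered_steps):
        if idx in seen:
            continue
        score = sum(text.count(kw) for kw in keywords)
        if score > 0:
            scored.append((score, idx))
    # Highest score first; ties resolved by trajectory order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(idx, filtered_steps[idx]) for _, idx in scored[:k]]
```

Keywords would typically come from the preliminary diagnosis, for example the file paths and function names in the stack trace.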

What the Evidence Shows

TrajAudit reports the following on RootSE, a benchmark of 93 real-world failure instances from software maintenance tasks that the authors describe as "the most complex trajectory diagnosis benchmark to date" (arxiv 2605.26563):

| Metric | Result |
|---|---|
| Localization accuracy vs. baselines | +24.4 percentage points |
| Token consumption | At least 18% reduction |

The reported gains come on RootSE specifically. Generalization to trajectories whose noise profile or test-report shape diverges from RootSE is not established.

Why It Works

Long-context LLM degradation is the load-bearing mechanism. Recall on long contexts tends to fall as irrelevant tokens fill the window; trajectory noise (redundant program structure, verbose code context) consumes the investigator's attention budget without contributing to localization. Pre-filtering moves that irrelevant content out of the budget, and the preliminary diagnosis anchors the investigator's first hypothesis, reducing the number of trajectory traversals before convergence (arxiv 2605.26563). The two effects compound: a shorter, denser context window with an anchored prior is the regime LLMs handle best.

When This Backfires

The technique adds infrastructure overhead and introduces failure modes of its own. Three conditions where it harms more than it helps:

  1. Bug lives in the filter's blind spot. The noise filter is pattern-matching, not semantic. When the failure root cause sits in code the filter classifies as redundant context — boilerplate that turned out to matter, a "scaffolding" file that the bug routed through — the filter actively removes the evidence the investigator needs. Aggregate localization accuracy on a benchmark does not tell you the false-negative rate on filter-discarded spans.
  2. Symptom and cause are decoupled. The preliminary diagnosis is seeded from the test-failure report, so it biases the investigator toward the region of the trajectory consistent with the test's framing of failure. When the symptom (test fails on assertion A) and root cause (state corrupted three calls earlier in module B) are decoupled, the prior anchors the investigator in the wrong place.
  3. No test-failure artifact. Runtime correctness bugs that pass tests but produce wrong output, agent loops without test runs, and open-ended generation tasks all lack the structured failure report the preliminary diagnosis module consumes. Without that seed, the second pre-filter contributes nothing and the noise-filter overhead remains.
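A cheap guard against the first failure mode is to audit what the filter threw away. The sketch below flags discarded lines that mention any name from the preliminary diagnosis. It is an assumed safeguard, not something the paper describes, and `audit_discarded` is a hypothetical helper.

```python
def audit_discarded(original_lines, kept_lines, keywords):
    """Return discarded lines that mention a diagnosis keyword.

    A line counts as discarded only if no identical copy survived the
    filter, so a dropped duplicate of a kept import is not flagged.
    """
    kept = set(kept_lines)
    flagged = []
    for line in original_lines:
        if line not in kept and any(kw in line for kw in keywords):
            flagged.append(line)
    return flagged
```

A non-empty result suggests restoring those spans, or bypassing the filter for that run, before the investigator starts.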

A steelman of the opposite approach — feed the investigator the raw, unfiltered trajectory — is reasonable for small harnesses (<50k trajectory tokens), heterogeneous noise profiles, or workloads where the cost of dropping a critical line outweighs the latency saved. As frontier models' long-context reasoning improves, the noise-filter layer's marginal value shrinks.

Example

The TrajAudit evaluation runs on RootSE, 93 real-world failure instances drawn from software-maintenance tasks (arxiv 2605.26563). Walking through the pipeline on an illustrative instance from this class:

A repository-level coding agent fails a maintenance task. The captured trajectory holds file dumps, tool-call results, repeated imports, and intermediate diffs. The test runner emits an assertion failure with a stack trace pointing into one module.

Without pre-filter — the investigator LLM receives the full trajectory plus the test report. Localization scans the long trajectory against a vague prior; content near the centre is reached late or not at all.

With pre-filter

  1. Noise filter strips repeated imports, full file dumps where a function would suffice, and scaffolding boilerplate.
  2. Preliminary diagnosis converts the stack trace into hypotheses anchored on the modules named in the trace and their direct callers.
  3. The investigator retrieves filtered spans on demand, starting from those candidates, confirming or refining as it pulls more.

The LLM operates on a smaller, prior-anchored window — the regime its long-context attention handles well. The trade-off: if the root cause sits in code the noise filter classified as redundant, the investigator never sees it.

Key Takeaways

  • Two pre-filters wrap the investigator LLM: a pattern-matching noise filter and a preliminary diagnosis seeded from the test-failure report.
  • The technique applies when trajectories are long enough that long-context degradation bites, a structured test-failure artifact exists, and trajectory noise is dominated by predictable patterns.
  • Reported gains on RootSE are +24.4 percentage points in localization accuracy and at least 18% token reduction across 93 software-maintenance failure instances (arxiv 2605.26563).
  • The mechanism is long-context degradation plus prior anchoring — the filter shifts irrelevant content out of the attention budget and the test-report-derived prior anchors the investigator's first hypothesis.
  • Failure modes: filter blind spots drop evidence the investigator needs, symptom-cause decoupling anchors the prior in the wrong region, and absence of a test-failure artifact removes the second pre-filter's input.