Skip to content

Dominator-Graph Trajectory Invariants for Non-Deterministic Agents

Validate non-deterministic agent runs by checking which states must dominate success — not by replaying a scripted sequence.

Dominator-graph trajectory invariants lift the compiler-theory notion of dominance (A dominates B if every path to B passes through A) onto agent traces. A small set of successful runs is merged into a graph, the dominator tree is computed, and new runs are checked by topological subsequence matching against the mandatory states — not against an exact sequence (Sharma, Mittal, Hu — GitHub Blog, 2026-05-06; paper at arxiv 2605.03159).

When to Reach for It

This technique is qualified — it earns its complexity only when all of these hold:

  • The success path branches and reconverges (UI flows, multi-tool workflows). Flat checkpoint lists cannot express conditional ordering.
  • The end state is ambiguous — you cannot write a deterministic assertion like "PR opened" or "test suite passing". If you can, outcome grading is simpler and equally sound.
  • You can collect 2–10 successful traces. The technique learns from positive examples, not failure logs alone (GitHub Blog).
  • You have multimodal-LLM budget for state-equivalence checks. The published method calls an external multimodal model to decide when two observed states mean the same thing.

If any fails, prefer simpler tools: outcome assertions, pre-completion checklists, or trajectory-match modes from agentevals (strict, unordered, superset, subset).

How It Works

Four stages, mirroring the published method (GitHub Blog, arxiv 2605.03159):

  1. Capture traces. Each successful run is recorded as a sequence of (state, action) pairs. For a UI agent, states are screenshots; for a tool-using agent, states are observable post-conditions of each tool call (filesystem state, API response shape, exit codes).
  2. Build a Prefix Tree Acceptor (PTA). Traces are merged into a directed graph: nodes are observable states, edges are actions. Branching captures non-deterministic variation (a loading screen that appears in some runs); convergence captures where alternative paths rejoin.
  3. Merge equivalent states. A three-tier comparison decides when two nodes are the same state: perceptual hashes / SSIM for near-identical visuals, multimodal LLM analysis for semantic equivalence, conservative merging only when both signals agree.
  4. Compute dominators, validate by topological subsequence matching. The Lengauer–Tarjan algorithm computes the dominator tree in near-linear time O((V+E)·α(V+E)) (Lengauer & Tarjan, TOPLAS 1979; Boost Graph Library). A new run passes if its observed states include the dominator subtree in the required logical order — gaps and detours between dominators are allowed.
graph LR
    T[2-10 successful<br>traces] --> PTA[Prefix Tree<br>Acceptor]
    PTA --> MERGE[State-equivalence<br>merge]
    MERGE --> DOM[Dominator tree<br>= essential states]
    DOM --> CHECK[Topological subsequence<br>match new runs]

Why It Works

A node A dominates a node B iff every observed successful path through B also passes through A. The dominance relation is a sound necessary-condition test: if A dominates B in the learned graph and a new run reaches B without passing through A, the run is either following an unobserved success variant or failing — both warrant review. The relation generalises flat "must visit checkpoint" semantics to a graph: it expresses conditional dominance ("if state X was reached, the run must have come through Y first") that flat lists and unordered trajectory-match modes cannot.

The empirical evidence: on a VS Code extension agent test scenario, the dominator-tree method scored 100% F1 vs. 69.8% for agent self-assessment (40-point recall improvement), and 52.2% F1 vs. 0% on discriminating execution noise from genuine product regressions (GitHub Blog, 2026-05-06). The baseline is "agent grades its own run" — the comparison against a tuned outcome-only or curated-checkpoint baseline is not published, so the published delta proves the mechanism works, not that it beats every simpler alternative.

When This Backfires

The "When to Reach for It" conditions cover the positive case. Three further failure modes are worth naming because they bite teams that apply the technique correctly:

  • Failure modes are duration- or negation-based. The technique validates ordering of included states. It does not catch "stuck on a loading screen for 90 seconds" (duration) or "must NOT have called the production-API tool" (negation). Pair with deterministic guardrails for those invariants.
  • Traces are very short or strictly linear. A two-step agent has no branching to exploit; dominator analysis degenerates to a flat assertion the team should write directly.
  • Trace diversity is low. Any dominance relation learned from a small sample under-approximates the true must-precede relation: an incidental state may appear in every captured trace and be erroneously marked mandatory. Calibrate by intentionally varying the inputs of captured runs (Sharma et al., arxiv 2605.03159).

Example

A Copilot Cloud Agent task: "open a PR that fixes the failing test in repo R." Successful runs vary on file-tree vs. search palette, test-first vs. implementation-first reading, local-test vs. CI-only verification. A scripted assertion ("read_file then run_tests then create_pr") fails on the alternative paths; outcome grading ("PR exists and CI is green") loses signal when CI is flaky.

Dominator-graph invariants learned from 3 successful traces capture the structure:

  • repo opened dominates every later state.
  • failing test identified dominates fix proposed.
  • fix proposed dominates PR created.
  • read_file and run_tests are not mandatory dominators — they are optional variations.

A new run that creates a PR without first reaching failing test identified fails the invariant, even though the outcome ("PR exists") looks correct.

Key Takeaways

  • Dominator analysis lifts compiler theory onto agent traces — it expresses "must precede" as a graph relation, not a flat list.
  • The technique is qualified: it pays off on branching, ambiguous-outcome workflows with 2–10 successful traces and multimodal-LLM budget; it is overkill when the end state is directly assertable.
  • The reported 100% / 52.2% F1 numbers are against agent self-assessment, not against tuned outcome or checkpoint baselines — frame results honestly when comparing.
  • Pair dominator-graph invariants with deterministic guardrails for the failure modes it does not catch: duration and negative constraints.
  • The Lengauer–Tarjan dominator step is near-linear; the cost is in state-equivalence merging (multimodal LLM calls), not in the dominator computation itself.
Feedback