Verification
How to measure agent output quality, design evaluation suites, and use evals to drive development.
Measuring Quality
- RAG/Agent Reliability Problem Map — Structured 16-domain failure taxonomy for systematic diagnosis of RAG and agent failures across retrieval, reasoning, state, and deployment layers
- Benchmark Contamination as Eval Risk — Static benchmarks inflate model scores as training data overlaps with test sets — decontaminated pipelines restore honest measurement
- Control Lexical Leakage in Agent-Memory Retrieval Evals (Entity-Collision) — A single hit@k confounds semantic retrieval with lexical overlap; pin BM25 with shared-entity distractors and stratify queries by tag so embedder lift is attributable rather than averaged
- Controlled Benchmark Rewriting for Agent Safety Judgment — Rewrite unsafe trajectories into deceptive variants while preserving risk labels to measure judgment robustness on out-of-distribution surface forms
- Decomposed Red-Teaming for Agent Monitors — Split attack construction into strategy, execution, and refinement stages so monitor evaluations expose the conceive-execute gap; drops Opus 4.5 catch rate from 94.9% to 60.3%
- Overeager-Behavior Elicitation: Scope + Trap Fragments — Compose benign scenarios from reusable scope and trap fragments, score with a judge-free filesystem-delta oracle, and use Thompson sampling to elicit overeager tool calls that task-completion and jailbreak benchmarks both miss
- Grade Agent Outcomes, Not Execution Paths — Evaluate agents by the final state they produce, not the sequence of steps they took to get there
- Use pass@k and pass^k to Separate Agent Capability from Consistency — pass@k measures capability ceiling; pass^k measures consistency — report both to distinguish agents that sometimes succeed from those that reliably do
- PASS@(k,T): Evaluate RL for Agents Along Sampling and Interaction Depth — Vary sampling budget k and interaction depth T jointly to separate capability expansion from efficiency gains when evaluating RL post-training for tool-use agents
- Markov-Chain Reliability for LLM Agents: Audit the Abstraction Before You Trust the Metric — pass@k, pass^k, and the reliability decay curve are projections of one first-passage distribution; fit an absorbing DTMC to traces and report a goodness-of-fit certificate to make any of those numbers defensible
- Decomposing Agent Output Variability by Layer (Sampling vs Orchestration State) — Separate run-to-run agent variability into token-sampling, infrastructure, and orchestration-state layers so the mitigation matches the layer; a single trajectory cannot distinguish them
- Trajectory Decomposition: Diagnose Where Coding Agents Fail — Decompose agent trajectories into search, read, and edit stages with per-stage precision and recall to pinpoint where and why an agent went wrong
- Repository Perturbation as Context-Reasoning Diagnosis (RepoMirage) — Apply semantics-preserving repository perturbations before an agent runs to isolate context reasoning from end-to-end issue-resolution shortcuts; average score drops from 66.8% to 25.3% on the explicit-task formulation
- Precise Debugging: Measure Edit Precision, Not Just Test Pass Rate — Frontier LLMs pass unit tests on debugging tasks by regenerating large chunks of code rather than making targeted edits — edit-level precision and bug-level recall expose the gap
- Nonstandard Errors in AI Agents — Agents analyzing identical data diverge systematically by model family; treat single-run outputs as one point from an unsampled distribution
- Benchmark-Driven Tool Selection for Code Generation — Use realistic, telemetry-derived benchmarks to evaluate AI coding tools — synthetic puzzles hide language-specific and task-specific weaknesses
- Completion Failure Taxonomy — Two-thirds of code completion failures are model errors, but one quarter are integration failures — fix both to improve acceptance rates
- LLM Agent Bug Fix Taxonomy — 23 recurrent fix patterns from 930 real LLM-agent bugs; the tools component dominates and framework version churn drives most fixes
- CausalFlow: Counterfactual Repair for Failed Agent Trajectories — Score each step in a failed run by counterfactual lift; the step whose oracle-guided replacement flips the outcome is the failure cause, and the replacement is a validated repair — applicable when replay is isolated, success is binary, and the failure is not a cascade
- Constraint Decay in Backend Code Generation — Multi-file backend agents drop ~30 percentage points in assertion pass rate as architectural, ORM, and framework constraints accumulate — convention-heavy frameworks take the largest hit
- Eval Blind Spots: Structural Gaps in Measurement Methodology — Four measurement-methodology gaps (held-out, trajectory-opaque, skill-retrieval, test-evolution) a stronger model cannot close — each needs a harness fix, not a better model
- Dominator-Graph Trajectory Invariants for Non-Deterministic Agents — Validate branching agent runs by checking which states must dominate success — compiler-theory dominance over trajectory graphs replaces brittle scripted assertions when 2–10 successful traces are available
- Multi-Turn Conversation Evaluation — Pair per-turn scoring with a trace-level resolution check so the two layers catch context loss, intent drift, and circular exchange that single-turn metrics miss
- Macro Evals for Agentic Systems — Aggregate per-trace findings across a corpus of agent runs to surface recurring behavior patterns that single-trace evals cannot expose — when volume, judge quality, and selection bias permit
- Variance-Based RL Sample Selection — Profile training samples by score variance before RL fine-tuning to identify the productive subset where the model sometimes succeeds and sometimes fails
- CoT Robustness in Code Generation — Chain-of-thought is not a universal win for code generation; measure Pass@1 and Pass^k with and without CoT before enabling it as a default
- Distillation-Induced Similarity Metrics for Tool-Use Agents — Quantify how much two models share non-mandatory tool-use behaviour with Response Pattern Similarity and Action Graph Similarity to surface correlated failure modes before routing or ensembling treats them as independent
- Learned Prefix Monitors for Agent Traces — Online failure-warning monitors learn an event abstraction and a prefix-risk score from terminal outcomes; useful complement to deterministic guardrails, but high AUPRC does not imply usable alerts
- ComplexMCP: Three Bottlenecks in Large Interdependent Tool Sandboxes — 300+ MCP tools across stateful sandboxes expose tool retrieval saturation, over-confidence skipping verification, and strategic defeatism — each maps to a deployment choice
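
As a rough illustration of the pass@k and pass^k entry in the list above, the sketch below estimates both numbers from `n` recorded trials of one task, `c` of which succeeded. It assumes the commonly used combinatorial estimators (the chance that a random size-`k` subset of the trials contains at least one success, and the chance that it contains only successes); the linked page may define them differently. The function names `pass_at_k` and `pass_hat_k` are hypothetical.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n trials were run, c of them succeeded, k is the subset size.
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Capability: chance that at least one of k sampled trials succeeded."""
    _check(n, c, k)
    # comb(n - c, k) counts the size-k subsets made only of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Consistency: chance that all k sampled trials succeeded."""
    _check(n, c, k)
    # comb(c, k) counts the size-k subsets made only of successes.
    return comb(c, k) / comb(n, k)
```

With 5 successes in 10 trials, both estimators agree at `k=1` and move apart as `k` grows, which is the gap between "sometimes succeeds" and "reliably succeeds" that reporting both is meant to show.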
Behavioral Testing
- Behavioral Testing for Agents — Test decision quality and end-state for non-deterministic agent systems using capability matrices, three grading methods, and acceptable variance thresholds
- FLARE: Coverage-Guided Fuzzing for Multi-Agent LLM Systems — Apply coverage-guided fuzzing to multi-agent systems using interaction path coverage as the exploration signal to surface coordination failures and emergent failure modes
- Structural Coverage Criteria for Agent Workflows — Represent multi-agent workflows as a typed coordination graph and derive coverage obligations over reachable agents, allowed tool edges, restricted tool edges, and delegation edges — a test-adequacy layer that complements end-to-end success scores
- Mutation Testing as a Quality Gate for AI-Generated Test Suites — Coverage proves a line ran; mutation testing proves the suite would notice a regression — the discriminator that separates ceremonial agent-written tests from load-bearing ones
- Planted-Bug Methodology: Deliberate Bugs as Observability Calibration — Plant deterministic bugs and check that captured signals lead an agent to the responsible layer — if they don't, the gap is in the instrumentation, not the bug
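
The difference between coverage and mutation testing, noted in the list above, can be shown with a toy example. This is a hand-rolled sketch and not any particular mutation-testing tool: the function `is_adult`, the three hand-written mutants, and both suites are hypothetical. Each suite returns `True` when all of its assertions hold for the implementation it is given.

```python
from typing import Callable, List

Impl = Callable[[int], bool]


def is_adult(age: int) -> bool:
    """Original implementation under test."""
    return age >= 18


# Hand-written mutants: each one changes the behaviour in a small way.
MUTANTS: List[Impl] = [
    lambda age: age > 18,   # boundary shifted by one
    lambda age: age <= 18,  # comparison inverted
    lambda age: True,       # condition removed
]


def weak_suite(fn: Impl) -> bool:
    # Executes the only line of is_adult (full line coverage), checks little.
    return fn(30) is True


def strong_suite(fn: Impl) -> bool:
    # Also pins the boundary value and the negative case.
    return fn(30) is True and fn(18) is True and fn(5) is False


def mutation_score(suite: Callable[[Impl], bool], mutants: List[Impl]) -> float:
    """Fraction of mutants the suite kills, i.e. fails on."""
    killed = sum(1 for mutant in mutants if not suite(mutant))
    return killed / len(mutants)
```

Both suites pass against `is_adult` and both execute its single line, yet only the second kills every mutant. A gate on mutation score would reject the first suite even though its coverage number looks complete.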
Regression Testing
- Golden Query Pairs as Continuous Regression Tests for Agents — Maintain curated question-answer pairs with known-good outputs and run them continuously using semantic grading to catch capability regressions
- Human-Review-Driven Curation of Golden Eval Datasets — Sample production traces on intent, attribute each disagreement to scorer or agent, and feed only agent-failure labels back into the golden set to keep an LLM-judge suite aligned with a moving production distribution
- Pre-Change Impact Analysis — Build a code-to-test dependency map and deliver it as a lightweight agent skill so agents verify at-risk tests before committing, cutting regressions by 70%
Eval-Driven Development
- Eval-Driven Development: Write Evals Before Building Agent Features — Define correctness criteria before implementation so every agent change is validated against a stable, reusable test suite
- Skill Evals — Treat each skill as an evaluable unit with a labelled dataset, paired with-skill vs baseline runs, and a benchmark that quantifies pass-rate, time, and token trade-offs
Review Techniques
- Five-Pass Blunder Hunt — Run the same critique prompt five times in sequence on a plan or spec; each pass normalises the issues it finds, forcing later passes deeper into structural and logical problems
- Pre-Completion Checklists — Block agent completion signals with a mandatory verification sequence
- Golden Journeys: Restartability as a First-Class Verification Primitive — Name a small set of end-to-end paths with explicit failure signals per step and gate completion on the system restarting cleanly afterward
- Test-Driven Intent Clarification — Use AI-generated tests to surface specification ambiguity before code review — validate tests instead of code to clarify intent with lower cognitive cost
- Source-Grounded Test Plan with Pre-Action Assertion Annotation — Before a UI-driving agent verifies its change, have it write a source-read test plan and annotate each step's expected behavior upfront so it cannot rationalize an unexpected result as a pass
- Spec-Derived Execution as a Correctness Oracle — Judge candidate code against a natural-language spec by deriving inputs from the spec, executing them, and grading the I/O pairs — ground the LLM judge in real execution traces instead of asking it to reason over the code
Rubric Design
- Anti-Reward-Hacking: Rubrics That Resist Gaming — Design eval rubrics with orthogonal signals so no single metric is gameable by agents
- Symptom-Reduction-as-Root-Cause: Why Oracle Tests Alone Miss Architectural Drift — Agents iterating against fiducial-point oracle tests will adjust coefficients inside an architecture that cannot represent the target — diverse-parameter tests, cross-session changelogs, and an anti-fudge-factor rule catch what oracles miss
- Eval Awareness: Designing Evals Agents Cannot Recognise — Frontier models detect eval-shaped prompts and shift behaviour between evaluation and production — remove the signals that cue recognition
- Evaluator Templates: Portable Primitives for Agent Eval Suites — Reusable judge templates cover the portable subset of eval questions — security, PII, format, trajectory — while domain quality still needs custom evaluators
Guardrails
- Deterministic Guardrails Around Probabilistic Agents — Wrap agent output in hard, deterministic checks — linting, schema validation, CI gates — that enforce correctness regardless of what the agent produces
- Staged Evidence Gates for Agentic Program Repair — Order cheap evidence gates ahead of expensive ones in agentic repair loops — retrieval-grounded context, compile gate, target-test gate, then full regression — to filter invalid candidates before paying full-suite cost
- Dependency Gap Validation for AI-Generated Code — AI coding agents declare a fraction of the dependencies their code actually needs at runtime — validate in clean environments before trusting the manifest
- Phantom Symbol Detection for LLM API Migration — Verify symbols in LLM-generated migration code against a documentation-derived knowledge base — a deterministic check that catches fabricated imports, constructors, and methods that probabilistic judges miss
- Generative Provenance Records for Tool-Using Agents — Emit a structured record (tool turn, evidence span, relation) alongside each output sentence so a mechanical verifier can check claim-level grounding before the answer leaves the loop
- Defense-in-Depth Against Coding Agent Fabrication (Honesty Harness) — Four uncorrelated layers — instruction-level honesty rules, verify-before-write, real-time hooks that feed output back, and an external-tool fact-checker subagent — that reduce fabrication survival without claiming elimination
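
A deterministic guardrail of the kind described in the list above can be as small as a strict shape check on the agent's structured output. The sketch below uses only the standard library; the field names in `REQUIRED_FIELDS` and the function `check_agent_output` are hypothetical and stand in for whatever contract a real pipeline enforces.

```python
import json
from typing import List

# Hypothetical contract: every agent result must carry these fields.
REQUIRED_FIELDS = {"path": str, "patch": str, "tests_run": int}


def check_agent_output(raw: str) -> List[str]:
    """Return a list of violations; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]

    errors: List[str] = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in data:
            errors.append(f"missing field: {key}")
        elif type(data[key]) is not expected:
            # Exact type match, so a JSON boolean is not accepted as an int.
            errors.append(
                f"wrong type for {key}: expected {expected.__name__}, "
                f"got {type(data[key]).__name__}"
            )
    return errors
```

Because the check is a pure function of the output, the same input always yields the same verdict, so it can sit in a CI gate or a completion hook without any judge model in the loop.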
Tooling
- Test Harness Design for LLM Context Windows — Terse stdout, verbose log files, and grep-friendly error lines that keep agent context clean and actionable during evaluation runs
- Runnable Documentation as Agent Verification — Extract inline code examples into standalone files that CI executes on every build so doc rot fails the build the same way broken code does
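
The terse-stdout, verbose-log split mentioned in the list above might look like the sketch below. It assumes results arrive as `(name, ok, detail)` tuples; the function name `summarize_run` and the `ERROR` and `SUMMARY` line prefixes are hypothetical conventions chosen for the example.

```python
from typing import List, Tuple

Result = Tuple[str, bool, str]  # (test name, passed?, full detail text)


def summarize_run(results: List[Result], log_path: str) -> List[str]:
    """Write full detail to a log file; return the terse lines for stdout."""
    terse: List[str] = []
    with open(log_path, "w", encoding="utf-8") as log:
        for name, ok, detail in results:
            status = "PASS" if ok else "FAIL"
            # Everything, including tracebacks, goes to the log file.
            log.write(f"[{status}] {name}\n{detail}\n\n")
            if not ok:
                stripped = detail.strip()
                first = stripped.splitlines()[0] if stripped else "no detail"
                # One line per failure with a fixed prefix, easy to grep.
                terse.append(f"ERROR {name}: {first}")
    failed = len(terse)
    passed = len(results) - failed
    terse.append(f"SUMMARY {passed} passed, {failed} failed, log={log_path}")
    return terse
```

Printing only the returned lines keeps the agent's context to one line per failure plus a summary, while the log path in the summary tells it where to look when it needs the full trace.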