Reproduce-Before-Report Verification Gate

A reproduce-before-report verification gate drops any reviewer finding the verifier cannot reproduce against actual code behavior.

The reproduce-before-report gate is a structural step in a multi-agent code review pipeline. A reviewer raises a candidate finding. A fresh-context verifier then tries to build a concrete reproduction against the diff before the finding ships. Findings it cannot reproduce are dropped without a trace. The gate sits at the reporter boundary: the reviewer flags freely, and the verifier turns soft suspicion into hard evidence or discards it.
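
The flow above can be sketched in a few lines of Python. This is a minimal illustration, not Anthropic's implementation: `Finding`, `VerifiedFinding`, `gate`, and `toy_verify` are hypothetical names, and the toy verifier is a string check standing in for a real fresh-context agent.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Finding:
    """A candidate defect raised by the reviewer (hypothetical shape)."""
    location: str  # e.g. "list_utils.py:10"
    claim: str     # what the reviewer says is wrong


@dataclass(frozen=True)
class VerifiedFinding:
    finding: Finding
    evidence: str  # the reproduction: an input, a code path, or a citation


# A verifier sees only the diff and the finding; it returns evidence or None.
Verifier = Callable[[str, Finding], Optional[str]]


def gate(diff: str, candidates: list[Finding], verify: Verifier) -> list[VerifiedFinding]:
    """Keep only findings the verifier can reproduce; drop the rest silently."""
    reported = []
    for finding in candidates:
        evidence = verify(diff, finding)
        if evidence is not None:
            reported.append(VerifiedFinding(finding, evidence))
        # No else branch: non-reproducing findings leave no trace in the report.
    return reported


def toy_verify(diff: str, finding: Finding) -> Optional[str]:
    """Stand-in verifier: 'reproduces' a claim only if its text appears in the diff."""
    return f"found in diff: {finding.claim}" if finding.claim in diff else None


diff = "+    return items[0]\n"
candidates = [
    Finding("list_utils.py:10", "items[0]"),
    Finding("list_utils.py:11", "os.remove"),
]
report = gate(diff, candidates, toy_verify)
print([v.finding.location for v in report])  # ['list_utils.py:10']
```

The reviewer's list can be as long and speculative as it likes; only the return value of `gate` reaches the user.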

What "reproduce" means at the gate

The verifier is not a second opinion. It takes the diff plus the reviewer's claim ("function X will crash on empty input on line 42") and tries to build evidence: an input that triggers the failure, a code path that shows the contradiction, or a citation to the specific behavior in the source. Anthropic's managed Code Review exposes this artifact — each finding "includes a collapsible extended reasoning section you can expand to understand why Claude flagged the issue and how it verified the problem" (Code Review docs). REVIEW.md makes the reproduction bar a per-repo dial; the canonical example is "behavior claims need a file:line citation in the source, not an inference from naming" (Code Review docs).
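
One way to make the file:line bar mechanical is to check that every citation in a claim resolves to a real line in the source. The sketch below assumes citations are written as `path:line` and that the sources are available as an in-memory mapping; `cited_lines` and the sample file are hypothetical.

```python
import re

# Matches citations such as "app/user.py:2" (assumed format).
CITATION = re.compile(r"(?P<path>[\w./-]+):(?P<line>\d+)")


def cited_lines(claim: str, sources: dict[str, str]) -> list[tuple[str, int, str]]:
    """Return (path, line_number, line_text) for each citation that resolves to real source."""
    resolved = []
    for match in CITATION.finditer(claim):
        path, line_no = match["path"], int(match["line"])
        lines = sources.get(path, "").splitlines()
        if 1 <= line_no <= len(lines):
            resolved.append((path, line_no, lines[line_no - 1].strip()))
    return resolved


sources = {"app/user.py": "def name(user):\n    return user.profile.name\n"}

grounded = cited_lines("unchecked dereference at app/user.py:2", sources)
ungrounded = cited_lines("get_user_safe probably does not handle None", sources)
dangling = cited_lines("crash at app/user.py:40", sources)

print(grounded)    # [('app/user.py', 2, 'return user.profile.name')]
print(ungrounded)  # []
print(dangling)    # []
```

A claim inferred from a function name yields no citation, and a citation to a line that does not exist yields none either; under this bar both would be dropped.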

Why the verifier must be independent

Independence is the property that does the work. A verifier that shares context with the reviewer inherits its reasoning errors and confirms its own false positives. Two dimensions matter:

Context freshness: the verifier starts with the diff and the claim, not the reviewer's chain of thought.
Model diversity: Zheng et al. (2023) describe a self-enhancement bias in which LLM judges may favor answers they generated themselves. A verifier from the same model family as the reviewer is therefore prone to repeating same-family confabulations.
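
Both dimensions can be enforced where the verifier's input is assembled. The sketch below is illustrative only: `ReviewerOutput`, `verifier_context`, `independence_gaps`, and the family labels are hypothetical, and a real pipeline would attach these checks to whatever agent framework it uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewerOutput:
    claim: str
    chain_of_thought: str  # reviewer-private reasoning; must not reach the verifier
    model_family: str      # hypothetical label, e.g. "family-a"


def verifier_context(diff: str, out: ReviewerOutput) -> dict[str, str]:
    """Context freshness: pass the diff and the claim, nothing else."""
    return {"diff": diff, "claim": out.claim}


def independence_gaps(out: ReviewerOutput, verifier_family: str, shares_session: bool) -> list[str]:
    """List the ways a proposed verifier fails to be independent of the reviewer."""
    gaps = []
    if shares_session:
        gaps.append("verifier shares the reviewer's session")
    if verifier_family == out.model_family:
        gaps.append("verifier and reviewer share a model family")
    return gaps


out = ReviewerOutput(
    claim="function X will crash on empty input on line 42",
    chain_of_thought="the name suggests no guard, so it probably crashes",
    model_family="family-a",
)
context = verifier_context("+ def X(items): return items[0]\n", out)

print(sorted(context))                                          # ['claim', 'diff']
print(independence_gaps(out, "family-b", shares_session=False))  # []
print(len(independence_gaps(out, "family-a", shares_session=True)))  # 2
```

Building the context from an explicit allow-list, rather than deleting fields from the reviewer's output, means new reviewer fields stay private by default.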

Claude Code's /code-review ultra (aliased from /ultrareview) names this in its comparison against single-pass /review: ultrareview is "multi-agent fleet with independent verification" (ultrareview docs).

Why it works

A reproduction is harder to fake than a judgment. A reviewer can hallucinate a defect with no external grounding. A verifier that must produce a reproduction cannot do the same: the reproduction either holds against the code or it does not. Multi-agent verification with conditionally independent agents in CodeX-Verify improved accuracy from 32.8% to 72.4%, and measured agent-pair correlations of 0.05 to 0.25 indicate that the agents' errors largely do not overlap. The pattern generalizes. Agent-as-a-Judge frames tool-augmented verification against real-world observations as the structural fix for LLM-as-a-Judge's "shallow single-pass reasoning, and the inability to verify assessments against real-world observations" (Agent-as-a-Judge survey, arXiv:2601.05111).
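
A back-of-the-envelope calculation shows why independence carries the effect. Every number below is made up for illustration and is not taken from CodeX-Verify or any other study: 20 real defects and 80 false positives among 100 candidates, a verifier that reproduces 90% of real defects, and a false-confirmation rate that is low for an independent verifier and high for one that shares the reviewer's context.

```python
def precision(true_reported: float, false_reported: float) -> float:
    """Share of reported findings that are real defects."""
    return true_reported / (true_reported + false_reported)


def after_gate(true_pos: float, false_pos: float,
               reproduce_rate: float, false_confirm_rate: float) -> tuple[float, float]:
    """Findings that survive the gate, split into (real, false)."""
    return true_pos * reproduce_rate, false_pos * false_confirm_rate


# Hypothetical reviewer output: 100 candidates, 20 of them real.
true_pos, false_pos = 20, 80

ungated = precision(true_pos, false_pos)
independent = precision(*after_gate(true_pos, false_pos, 0.9, 0.1))
shared_context = precision(*after_gate(true_pos, false_pos, 0.9, 0.8))

print(round(ungated, 2))         # 0.2
print(round(independent, 2))     # 0.69
print(round(shared_context, 2))  # 0.22
```

Under these assumed rates the gate helps only to the extent that the verifier's mistakes are uncorrelated with the reviewer's; a verifier that confirms most of what it is shown leaves precision nearly where it started.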

Silent drop is the default

Non-reproducing findings are dropped, not appended with a low-confidence flag. The /code-review ultra docs state the purpose: "every reported finding is independently reproduced and verified, so the results focus on real bugs rather than style suggestions" (ultrareview docs). Surfacing unverified findings re-creates the noise problem the gate exists to solve.

Implementing the gate

Part	Job	Anthropic implementation
Reviewer	High-recall candidate generation	Per-class agents (correctness, security, regressions, edge cases) in parallel
Verifier	Reproduction against actual code	Fresh-context agent given diff + claim; reproduced or dropped
Drop policy	Discard non-reproducing findings	Silent — not surfaced to user
Reasoning artifact	Expose verification evidence	"How it verified the problem" collapsible (Code Review docs)

A pipeline that fans out reviewers but reports every candidate is not running this gate — it is fan-out, which amplifies false positives.
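
The difference between bare fan-out and a gated pipeline can be sketched as follows. The reviewers and verifier here are toy stand-ins (plain functions and a substring check), and `fan_out`, `review_with_gate`, and `toy_verify` are hypothetical names; a real system would call agents in their place.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

Reviewer = Callable[[str], list[str]]           # diff -> candidate claims
Verifier = Callable[[str, str], Optional[str]]  # (diff, claim) -> evidence or None


def fan_out(diff: str, reviewers: dict[str, Reviewer]) -> list[str]:
    """Run every per-class reviewer in parallel and pool their candidates."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda review: review(diff), reviewers.values())
    return [claim for claims in results for claim in claims]


def review_with_gate(diff: str, reviewers: dict[str, Reviewer], verify: Verifier) -> dict[str, str]:
    """Fan out, then report only the claims that come back with evidence."""
    report = {}
    for claim in fan_out(diff, reviewers):
        evidence = verify(diff, claim)
        if evidence is not None:
            report[claim] = evidence
    return report


# Toy reviewers: each always raises one candidate, whether or not it is real.
reviewers = {
    "correctness": lambda diff: ["divides by len(xs) without an empty check"],
    "security": lambda diff: ["passes user input to eval"],
}


def toy_verify(diff: str, claim: str) -> Optional[str]:
    """Stand-in verifier: confirm a claim only if its telltale code appears in the diff."""
    needles = {
        "divides by len(xs) without an empty check": "/ len(xs)",
        "passes user input to eval": "eval(",
    }
    needle = needles.get(claim)
    return f"diff contains {needle!r}" if needle and needle in diff else None


diff = "+    return sum(xs) / len(xs)\n"
candidates = fan_out(diff, reviewers)
report = review_with_gate(diff, reviewers, toy_verify)

print(len(candidates))  # 2
print(list(report))     # ['divides by len(xs) without an empty check']
```

Returning `candidates` to the user is the fan-out-only pipeline; returning `report` is the gated one.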

When this backfires

Small or trivial diffs. /code-review ultra runs "typically cost $5 to $20" (ultrareview docs); managed Code Review averages "$15-25 in cost, scaling with PR size, codebase complexity, and how many issues require verification" (Code Review docs). On a typo fix, gate cost dominates false-positive cost, so use a single-pass diff-based review.
Same-family verifier with no context reset. Without independence, the verifier repeats the reviewer's confabulations and passes through the overcorrection bias the gate exists to filter. A verifier that runs the same model in the same session is a verification gate in name only.
Reproduction-incapable defect classes. Race conditions, distributed-system failures, and intermittent flakes cannot be reproduced on demand against a static diff. Silent-drop then becomes silent-drop of the highest-stakes findings. Route those classes to a separate "suspected-but-not-reproduced" channel, or accept the recall loss.
Air-gapped or ZDR environments. Cloud fleet implementations are "not available when using Claude Code with Amazon Bedrock, Google Cloud Vertex AI, or Microsoft Foundry, and it is not available to organizations that have enabled Zero Data Retention" (ultrareview docs).
High-iteration PRs with push triggers. Cost grows linearly with the number of pushes: 30 pushes at roughly $15-25 each comes to $450-750. Use @claude review once to suppress push-triggered re-verification (Code Review docs).
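
Several of these cases reduce to routing decisions that can be made before or after the gate runs. The sketch below is one possible encoding under stated assumptions: the `Route` names, the defect-class labels, and the five-line triviality threshold are all hypothetical, and the cost figures reuse the $15-25 average quoted above.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    SINGLE_PASS = "single-pass diff review"
    GATED = "reproduce-before-report gate"
    SUSPECTED = "suspected-but-not-reproduced channel"


# Classes that cannot be reproduced on demand against a static diff.
HARD_TO_REPRODUCE = {"race-condition", "distributed-failure", "intermittent-flake"}


def route_review(changed_lines: int, trivial_threshold: int = 5) -> Route:
    """Skip the gate for trivial diffs. The threshold is an illustrative guess."""
    return Route.SINGLE_PASS if changed_lines <= trivial_threshold else Route.GATED


def route_unreproduced(defect_class: str) -> Optional[Route]:
    """Where a non-reproducing finding goes: a side channel, or None for a silent drop."""
    return Route.SUSPECTED if defect_class in HARD_TO_REPRODUCE else None


def push_cost_range(pushes: int, low: int = 15, high: int = 25) -> tuple[int, int]:
    """Total cost in dollars if every push triggers a full review."""
    return pushes * low, pushes * high


print(route_review(changed_lines=1).value)   # single-pass diff review
print(route_unreproduced("race-condition"))  # Route.SUSPECTED
print(route_unreproduced("null-deref"))      # None
print(push_cost_range(30))                   # (450, 750)
```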

Example

A REVIEW.md snippet that raises the verification bar on a backend repo:

## Verification bar

Behavior claims need a file:line citation in the source code,
not an inference from naming. If a finding says "this function
will crash on null input," cite the line where the unchecked
dereference happens; if you cannot cite it, drop the finding.

For data-flow claims (taint, leakage, scope), cite both the
source line and the sink line. Findings that name the sink
but cannot trace the source are not reproductions — drop them.

For race-condition suspicions, cite the two interleavings.
If you cannot construct both, route the finding to the
"unverified-suspicion" channel in the summary, not as an
inline Important finding.

This configuration makes the verifier's reproduction artifact explicit per defect class: code-path bugs need source citations, taint bugs need source-plus-sink, races need an interleaving. Findings that fail their class-specific reproduction bar are dropped from the inline review rather than posted as low-confidence noise.
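
The same class-specific bar can be written as a small lookup, which is useful for a self-hosted verifier that has to apply it programmatically. This is a sketch under assumptions: the class labels, evidence keys, and `disposition` function are hypothetical and are not part of the REVIEW.md format.

```python
from dataclasses import dataclass, field

# Evidence each defect class must supply before it can be posted inline.
REQUIRED_EVIDENCE = {
    "code-path": ("citation",),
    "data-flow": ("source", "sink"),
    "race": ("interleaving_a", "interleaving_b"),
}


@dataclass
class Finding:
    defect_class: str
    evidence: dict[str, str] = field(default_factory=dict)


def disposition(finding: Finding) -> str:
    """Return 'inline', 'unverified-suspicion', or 'drop' for a finding."""
    required = REQUIRED_EVIDENCE.get(finding.defect_class)
    if required is None:
        return "drop"  # assumption: unknown classes have no bar they could meet
    if all(finding.evidence.get(key) for key in required):
        return "inline"
    # Races that cannot be constructed are routed to the summary, not dropped.
    return "unverified-suspicion" if finding.defect_class == "race" else "drop"


findings = [
    Finding("code-path", {"citation": "app/user.py:2"}),
    Finding("data-flow", {"sink": "db/query.py:88"}),  # sink named, source missing
    Finding("race", {"interleaving_a": "T1 reads, T2 writes, T1 writes"}),
]
print([disposition(f) for f in findings])
# ['inline', 'drop', 'unverified-suspicion']
```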

Key Takeaways

The gate sits between reviewer and user — high-recall reviewing stays freely speculative; the user only sees what survives independent reproduction
Independence (fresh context, ideally different model family) is the load-bearing property; same-family verifiers inherit reviewer biases
Silent drop of non-reproducing findings is the default — surfacing them with a confidence flag re-creates the noise problem the gate exists to solve
Multi-agent verification with conditionally independent agents lifted accuracy from 32.8% to 72.4% in CodeX-Verify — the gain is empirical, not speculative
Cost runs $5-25 per review with verification as a named scaling factor — appropriate for substantial pre-merge changes, not for typo PRs
The managed cloud implementations are unavailable today to ZDR organizations and to Claude Code on Amazon Bedrock, Google Cloud Vertex AI, or Microsoft Foundry; a self-hosted fleet is the only option there

Cloud Parallel Review Pattern — the full fan-out + verify + aggregate architecture in which this gate is the verify step
LLM Code Review Overcorrection — the false-positive bias that motivates the gate
Signal Over Volume in AI Review — the precision principle the gate enforces at the reporter boundary
Adversarial Multi-Model Development Pipeline (VSDD) — the broader fresh-context independent-adversary pattern this gate instantiates for code review
Evaluator-Optimizer Pattern — the two-role generator/evaluator loop the gate inherits and constrains
Tiered Code Review — risk-class routing that decides which depths apply the gate
Committee Review Pattern — a voting-style contrast that does not gate at the reporter boundary
Agentic Code Review Architecture — the tool-calling reviewer shape the verifier composes with