Reviewer's Playbook for Agent-Authored Pull Requests

A time-boxed inspection priority order for reviewing agent-authored PRs — CI changes first, then duplicated utilities, then the critical path, then evidence.

Reviewing an agent-authored pull request verifies context, not correctness — the agent has already produced code that parses and usually passes its own tests. The defects that ship are the ones the agent could not catch itself: weakened CI, reimplemented utilities, boundary cases the happy path doesn't reach, and unsourced assumptions about contracts. This playbook is the inspection priority order GitHub's engineering team published on 7 May 2026 (GitHub Blog), with the conditions under which it stops working.

The ten-minute inspection order

GitHub sequenced the recommended order so the earliest steps catch the most expensive defects:

Slot	Step	What you are checking
1–2 min	Scan and classify	File list, diff size, PR description complete
2–3 min	CI changes first	`.github/workflows/`, coverage thresholds, skipped tests
3–5 min	Scan for duplicates	New utilities that already exist elsewhere
5–8 min	Trace the critical path	Input → transforms → output for boundary cases
8–9 min	Security boundaries	Untrusted input in workflows, shell-execution paths
9–10 min	Require evidence	A test that fails on the pre-change behavior

Source: GitHub Blog, "Agent pull requests are everywhere. Here's how to review them."

The order is load-bearing. CI changes come first as the cheapest hard stop: "Before reading a single line of app code, look at anything touching .github/workflows, test configs, coverage settings, or build scripts" (GitHub Blog). Evidence comes last because it is the most expensive, worth demanding only once the cheaper checks have not already disqualified the PR.
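As a rough illustration of the "CI changes first" stop, the sketch below filters a PR's changed-file list down to the paths that deserve attention before any application code is read. The pattern lists and the `ci_sensitive` helper are hypothetical, not part of GitHub's guidance; a real repository would substitute its own workflow, test-config, coverage, and build-script locations.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns; replace with the repository's actual layout.
PATH_PATTERNS = [".github/workflows/*"]          # matched against the full path
NAME_PATTERNS = [                                # matched against the file name only
    "pytest.ini", "tox.ini", ".coveragerc", "codecov.yml",
    "jest.config.*", "Makefile",
]


def ci_sensitive(changed_paths):
    """Return the changed paths that touch CI, test config, coverage, or build files."""
    hits = []
    for path in changed_paths:
        name = PurePosixPath(path).name
        if any(fnmatchcase(path, pat) for pat in PATH_PATTERNS) or any(
            fnmatchcase(name, pat) for pat in NAME_PATTERNS
        ):
            hits.append(path)
    return hits
```

An empty result does not prove CI is untouched (skipped tests can live inside ordinary test files), so this only narrows where the reviewer looks first.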

The five defect classes the playbook targets

Each step in the order targets a specific known failure mode of agent-authored code, not a generic quality concern (GitHub Blog):

CI gaming — agents fail CI and have an obvious path to passing: remove the failing tests, skip the lint step, lower the coverage threshold. Coverage delta is the canonical tell.
Code reuse blindness — agents rarely confirm a utility doing the same thing does not already exist. The symptom is two near-identical functions with different names. Sourcegraph's 1,281-run study shows agents on grep-only retrieval modify 2 of 7 affected files; with structural search they modify all 7 (Sourcegraph).
Hallucinated correctness — code that compiles, passes every test, and is wrong: off-by-one pagination, missing permission checks on untested branches, validation that handles every sampled case and none of the boundary ones.
Untrusted input in workflows — PR bodies, issue bodies, or commit messages interpolated into agent prompts without sanitization; model output executed as shell without validation.
Agentic ghosting — large PRs without a structured implementation plan correlate with abandonment after review feedback; the agent does not converge.

Where defects hide beyond the diff

The non-obvious failure mode is that defects often hide outside the changed lines:

Adjacent unchanged files where the contract the agent assumed diverges from the contract the codebase actually exposes — Sourcegraph's "Partial Completion" pattern (Sourcegraph)
Existing utilities that the agent reimplemented because keyword search returned hundreds of matches and it picked the wrong one — "Wrong File, Wrong Symbol" (Sourcegraph)
Test files in unrelated directories that exercise the modified code path through a different entrypoint and now fail intermittently
Workflow files the agent edited to make CI pass without acknowledging the change in the PR description

Steps 3 and 4 (minutes 3–8) target this surface explicitly: scan for duplicated utilities across the repository, then trace the critical path through unchanged code. Diff-only review misses this category entirely (see Diff-Based Review).

Distinguishing unidiomatic-but-valid from fabricated

The single most useful heuristic is to demand a test that fails on the pre-change behavior. "If the agent can't write a test that would have caught the bug it claims to fix, the fix is incomplete or the understanding is wrong" (GitHub Blog). The test proves the agent understood the actual defect, not a plausible-sounding one, and it pins the contract for future edits to the same area.

Three diff-level tells separate a reasonable unidiomatic approach from a fabricated one:

Unidiomatic but valid — symbols resolve, imports are real, the approach differs from house style but works; comment to align with conventions, do not block.
Fabricated contract — the agent calls response.paginate(cursor) on a type whose class never defines that method, or reads a retry_after_ms config key the schema does not declare.
Phantom dependency — the diff adds import lodash-es with no matching package.json entry, or calls a v3 endpoint method the installed SDK only exposes through v2.

Run the flagged code path locally before approving — agents fabricate confidently, and the surface markers alone are too weak to trust.
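The phantom-dependency tell lends itself to a quick check before running anything. The sketch below assumes a JavaScript or TypeScript diff and a `package.json` manifest; the regular expression, `package_name`, and `phantom_dependencies` are hypothetical helpers written for this illustration, and the check deliberately ignores details such as Node built-in modules and workspace packages.

```python
import json
import re

# Matches the module specifier in `from '...'`, `import '...'`, and `require('...')`.
IMPORT_RE = re.compile(r"""(?:from|import|require\()\s*['"]([^'"]+)['"]""")


def package_name(specifier):
    """Reduce an import specifier to its package name; None for relative paths."""
    if specifier.startswith((".", "/")):
        return None
    parts = specifier.split("/")
    # Scoped packages keep their scope: "@scope/pkg/sub" -> "@scope/pkg".
    return "/".join(parts[:2]) if specifier.startswith("@") else parts[0]


def phantom_dependencies(added_lines, package_json_text):
    """Return packages imported in the added lines but absent from the manifest."""
    manifest = json.loads(package_json_text)
    declared = set()
    for section in ("dependencies", "devDependencies", "peerDependencies"):
        declared.update(manifest.get(section, {}))

    missing = []
    for line in added_lines:
        for specifier in IMPORT_RE.findall(line):
            name = package_name(specifier)
            if name and name not in declared and name not in missing:
                missing.append(name)
    return missing
```

A hit here is a prompt to look closer, not a verdict: the package may be hoisted from a parent workspace, which is one more reason to run the flagged path locally.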

Effort budget: AI versus human PRs

The review-effort budget shifts in two directions. Automated review handles more of the mechanical first pass — style, type mismatches, error handling, missing edge cases — freeing human attention for the semantic judgment agents cannot self-check (see Agent-Assisted Code Review). The human role narrows to context verification: did the agent understand the codebase, the contract, and the intent.

Request a smaller PR rather than approving as-is when (GitHub Blog):

Diff touches more than five unrelated files
Purpose cannot be described in one sentence
PR body is empty or boilerplate
CI is failing with only test-file changes

These are the conditions under which the 10-minute review cannot be completed; restructuring the PR is cheaper than rushing the review.
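The four split conditions above can be written down as a simple predicate, which makes it easier to apply them consistently across reviewers. This is a sketch under stated assumptions: the `PullRequestSummary` fields are filled in by the reviewer (or by tooling the team already has), and the boilerplate placeholders are invented examples rather than a list from the GitHub post.

```python
from dataclasses import dataclass

# Hypothetical placeholder bodies treated as "boilerplate".
BOILERPLATE_BODIES = {"", "n/a", "todo", "see title"}


@dataclass
class PullRequestSummary:
    unrelated_files_touched: int    # reviewer's count of unrelated files in the diff
    purpose_in_one_sentence: bool   # can the purpose be stated in one sentence?
    body: str                       # PR description text
    ci_failing: bool
    only_test_files_changed: bool


def reasons_to_request_split(pr):
    """Return the split conditions this PR meets; an empty list means proceed."""
    reasons = []
    if pr.unrelated_files_touched > 5:
        reasons.append("touches more than five unrelated files")
    if not pr.purpose_in_one_sentence:
        reasons.append("purpose cannot be described in one sentence")
    if pr.body.strip().lower() in BOILERPLATE_BODIES:
        reasons.append("PR body is empty or boilerplate")
    if pr.ci_failing and pr.only_test_files_changed:
        reasons.append("CI is failing with only test-file changes")
    return reasons
```

Returning the matched reasons rather than a bare boolean lets the reviewer paste them into a single consolidated comment.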

The human-factors reading of the same limit comes from Jon Udell, who names the "unreviewable PR" as an anti-pattern and reframes "human in the loop" as "agent in the loop" — the fix is to size agent PRs to stay reviewable rather than to expand review effort to match them (Simon Willison).

When this backfires

The playbook describes the target review, not the achievable one. Five conditions degrade it:

High agent-PR volume exceeds reviewer bandwidth — past roughly 400 lines per diff, sustained review degrades into surface scanning regardless of checklist quality (Atomic Robot). The framework assumes the reviewer can spend 10 focused minutes per PR; volume kills that assumption.
CRA-only review: when a code review agent (CRA) applies the checklist mechanically with no human at the keyboard, merge rates drop to about 45% versus about 68% for human-only review (CRA-Only Review and the Merge Rate Gap). The playbook is for humans.
Grep-only repositories — two steps (scan for duplicates, trace critical path) presume keyword and structural search. Without code intelligence, the reviewer faces the same retrieval ceiling Sourcegraph measured for agents (Sourcegraph).
Comment volume as a correction signal — each extra reviewer comment on an agent PR correlates with a 2.8 percentage-point decrease in merge probability (versus +2.7% for human PRs), interpreted as required corrections rather than productive alignment (arXiv:2601.18749). A thorough checklist that surfaces ten findings per PR amplifies ghosting; consolidate into a single review round.
Alert fatigue past the rubber-stamp threshold — two-thirds of surveyed developers bypass or delay security checks under pressure (CodeAnt). When the checklist becomes a comfort blanket producing approvals without genuine inspection, removing PRs from the queue (tiered routing, structural agent constraints, smaller scopes) yields more than refining the checklist.

Why it works

AI coding agents optimize for syntactic correctness and surface plausibility — code that parses, passes type checks, and matches training-data patterns — not for semantic correctness, architectural fit, or whether the result solves the problem (Atomic Robot). The inspection order is built around that asymmetry: CI changes catch the gaming move, duplicate scans the reuse-blindness move, critical-path tracing the boundary-case move, and evidence-by-failing-test the hallucinated-correctness move. Without an order that targets the generator's known weaknesses, generic checklist review degrades to the surface review the agent already produced.

Example

A reviewer opens an agent-authored PR that adds rate limiting to a public API. They follow the order:

1–2 min: 4 files changed, 180 lines, PR description names the rate-limit policy. Tractable size.
2–3 min: .github/workflows/ci.yml is unchanged; coverage threshold unchanged. No CI gaming. Proceed.
3–5 min: Search the repo for "rate limit" — finds an existing internal/middleware/throttle.go the agent ignored, reimplementing the same token-bucket logic in handlers/. Block with a "use existing utility" comment.
(if the duplicate had not existed) 5–8 min: Trace request → limiter → handler. The agent handles the normal case but the limiter check returns silently on zero-quota tenants instead of 429-ing. Boundary-case miss.
9–10 min: Demand a test that calls the endpoint with a zero-quota tenant and asserts a 429. The agent cannot produce one without admitting the silent-return path was unintended.

Two of the six steps (the duplicate scan and the critical-path trace) caught the substantive defects. The CI and security steps guard against the catastrophic ones. Total reviewer time: under 10 minutes.
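The evidence step in this walkthrough can be made concrete. The sketch below models the limiter check in Python rather than the Go of the example repository, and both `check_quota_before` and `check_quota_after` are hypothetical stand-ins for the pre-change and corrected behavior; the point is only that the demanded test fails against the former and passes against the latter.

```python
def check_quota_before(quota, used):
    """Pre-change behavior as described: zero-quota tenants fall through silently."""
    if quota == 0:
        return None  # silent return, so the request proceeds
    return 429 if used >= quota else None


def check_quota_after(quota, used):
    """Assumed corrected behavior: any tenant at or over quota is rejected."""
    return 429 if used >= quota else None


def zero_quota_is_rejected(check):
    """The evidence test: a zero-quota tenant must receive a 429."""
    return check(quota=0, used=0) == 429
```

Running `zero_quota_is_rejected` against both versions is what "fails on the pre-change behavior" means in practice: a test that passes on both proves nothing about the fix.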

Key Takeaways

The inspection order is sequenced for cost: CI changes first, evidence last; running the steps out of order wastes the cheapest stops
Defects in agent PRs hide outside the changed lines — duplicated utilities, adjacent unchanged code, workflow edits — not just inside the diff
The evidence test (a failing-before test) is the strongest heuristic for distinguishing genuine fixes from fabricated ones
The review budget shifts from correctness verification (for human-authored PRs) to context verification (for agent-authored PRs); automated review owns the mechanical first pass
The playbook degrades to surface review past ~400 lines per diff, in CRA-only setups, and in grep-only repos — recognise these conditions and reduce volume rather than refining the checklist

Agent-Authored PR Integration and Merge Predictors — collaboration signals and merge-rate predictors from the inverse angle
Agent-Assisted Code Review — agents owning the mechanical first pass the playbook leaves to automation
Diff-Based Review — diff as review boundary; complementary view of what the playbook targets
Tiered Code Review — when to escalate from AI-only to mandatory human review
CRA-Only Review and the Merge Rate Gap — the merge-rate consequences when no human runs the playbook
Signal Over Volume in AI Review — why high comment counts predict lower merge rates on agent PRs
Agent PR Volume vs. Value — bandwidth pressure that erodes the 10-minute budget
Security Review Gap in AI-Authored PRs — why step 5 (security boundaries) is a hard stop, not a heuristic