Human-AI Review Synergy

AI reviewer suggestions are adopted 16.6% of the time, versus 56.5% for human suggestions. Treat that gap as a design input rather than a failure, and structure collaboration around each reviewer's measured strengths.

The evidence

An empirical study of 278,790 code review conversations across 300 GitHub projects (2022-2025) quantifies how AI and human reviewers differ (arxiv:2603.15911).

Adoption rate gap

Human suggestions are adopted 56.5% of the time versus 16.6% for AI, a gap of 39.9 percentage points. Over half of unadopted AI suggestions are either incorrect or addressed through an alternative fix by the developer (arxiv:2603.15911). AI review output therefore needs triage; it cannot be treated as equivalent to a human comment.

Complexity cost

Adopted AI suggestions produce larger increases in cyclomatic complexity (+0.085 vs -0.002 for humans) and code size (+0.216 statements vs +0.108 for humans) (arxiv:2603.15911). Adopted human suggestions leave complexity essentially flat. AI suggestions tend toward addition, which risks technical debt even when the suggestion is correct.

Verbosity and focus

AI agents produce 29.6 tokens per line of code reviewed versus 4.1 for humans, roughly a 7x difference (arxiv:2603.15911). Over 95% of AI comments target code improvement or defect detection, while human comments spread across understanding questions (17-31%), knowledge transfer (4-6%), testing feedback, and social communication (arxiv:2603.15911). AI review therefore largely misses mentoring, architectural questioning, and team knowledge sharing.
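A team that wants to track the same ratio locally could compute it from its own review data. The sketch below is illustrative only: `ReviewComment` is a hypothetical record type, and tokens are approximated by whitespace-separated words, which will not match whatever tokenizer the study used.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    """Hypothetical record of one review comment (not a real API type)."""
    author_kind: str      # "ai" or "human"
    body: str             # comment text
    lines_reviewed: int   # lines of code the comment covers


def tokens_per_loc(comments: list[ReviewComment], author_kind: str) -> float:
    """Whitespace-token count per reviewed line for one kind of author."""
    selected = [c for c in comments if c.author_kind == author_kind]
    total_lines = sum(c.lines_reviewed for c in selected)
    if total_lines == 0:
        # No reviewed lines for this author kind: report zero, not an error.
        return 0.0
    total_tokens = sum(len(c.body.split()) for c in selected)
    return total_tokens / total_lines
```

Comparing `tokens_per_loc(comments, "ai")` with `tokens_per_loc(comments, "human")` gives a rough local counterpart to the 29.6 vs 4.1 figures, subject to the tokenization caveat above.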

Conversation dynamics

Between 85% and 87% of AI-initiated reviews end without follow-up discussion, and their rejection rates run 7.1-25.8% versus 0.9-7.8% for human conversations. Reviews of AI-generated code require 11.8% more review rounds than reviews of human-written code (arxiv:2603.15911).

Structuring the collaboration

flowchart LR
    PR[PR Submitted] --> AI[AI Agent Review]
    AI --> Triage{Developer Triage}
    Triage -->|Adopt| Fix[Apply Fix]
    Triage -->|Reject| Skip[Discard]
    Triage -->|Alternative| Alt[Own Fix]
    Fix --> Human[Human Review]
    Skip --> Human
    Alt --> Human
    Human --> Merge[Merge Decision]

The data supports a specific model.

AI first, human last. AI agents handle the mechanical first pass — defect detection and code improvement suggestions. The human reviewer makes the final decision, covering design judgment, knowledge transfer, and architectural fit that AI misses (arxiv:2603.15911).

Triage AI output rather than rubber-stamping it. With a 16.6% adoption rate, and over half of unadopted suggestions either incorrect or superseded by the developer's own fix, treating AI review output as a to-do list works against you. Evaluate each suggestion on its own merits.

Monitor the effect on complexity. Track whether adopted AI suggestions increase cyclomatic complexity or code size more than human suggestions do. A technically correct suggestion can still harm the architecture.
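One way to monitor this for Python code is to compare an approximate complexity score before and after a suggestion is applied. The following is a minimal sketch, assuming whole-file source strings are available for both versions. The function names are hypothetical, and the metric is a simplified decision-point count, not necessarily the measure used in the study.

```python
import ast

# Node types counted as one decision point each (a simplification).
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp)


def approx_complexity(source: str) -> int:
    """Approximate cyclomatic complexity: 1 + number of decision points."""
    count = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            count += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths
            count += len(node.values) - 1
    return count


def complexity_delta(before: str, after: str) -> int:
    """Positive result means the adopted suggestion added branches."""
    return approx_complexity(after) - approx_complexity(before)
```

Logging `complexity_delta` per adopted suggestion, split by AI and human origin, would show whether the pattern reported in the study holds in a given codebase.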

Constrain AI verbosity. At 7x the tokens per line of code, unconstrained AI review output creates the same alert fatigue that signal-over-volume design addresses. Configure review agents with confidence thresholds and severity filters.
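A confidence threshold and severity filter can be as simple as the sketch below. Everything in it is an assumption for illustration: the `Finding` fields, the severity scale, and the default thresholds are hypothetical, and real review agents may expose different metadata or none at all.

```python
from dataclasses import dataclass

# Hypothetical severity scale; higher means more serious.
SEVERITY_RANK = {"info": 0, "minor": 1, "major": 2, "critical": 3}


@dataclass
class Finding:
    """Hypothetical AI review finding; field names are illustrative."""
    message: str
    severity: str       # a key of SEVERITY_RANK
    confidence: float   # agent's self-reported score in [0, 1]


def filter_findings(findings, min_confidence=0.8, min_severity="major",
                    max_comments=5):
    """Keep findings that clear both thresholds, most serious first."""
    floor = SEVERITY_RANK[min_severity]
    kept = [f for f in findings
            if f.confidence >= min_confidence
            and SEVERITY_RANK[f.severity] >= floor]
    # Sort by severity, then confidence, both descending.
    kept.sort(key=lambda f: (SEVERITY_RANK[f.severity], f.confidence),
              reverse=True)
    # Cap the number of comments posted to the PR.
    return kept[:max_comments]
```

The default values are placeholders, not recommendations from the study; tune them against the adoption rate observed in each repository.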

Use multi-agent verification. A second AI agent that validates the first agent's findings can filter incorrect suggestions before they reach the developer, which reduces triage load. This connects to the committee review pattern. The approach has a limit: when both agents share the same training distribution and no executable specification anchors the review, their failures are correlated and can echo rather than cancel (arxiv:2603.25773).
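The verification step can be sketched as a filter between the first agent and the developer. In this sketch `verifier` is a hypothetical callable standing in for the second agent; how it is implemented (a different model family, a test run against an executable specification) is left open and is not specified by the source.

```python
from typing import Callable, Iterable


def verify_findings(
    findings: Iterable[str],
    verifier: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split findings into (confirmed, dropped) using a second reviewer.

    `verifier` is a hypothetical stand-in for the second agent: it returns
    True when it independently agrees that a finding is valid.
    """
    confirmed: list[str] = []
    dropped: list[str] = []
    for finding in findings:
        (confirmed if verifier(finding) else dropped).append(finding)
    return confirmed, dropped
```

Keeping the `dropped` list rather than discarding it allows periodic auditing, which is one way to notice when the two agents are agreeing on the same mistakes.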

When this backfires

The AI-first, human-last model adds triage cost to every PR. In high-velocity repositories where AI suggestion quality is low (below 10% adoption), the developer triage burden can exceed the defect-catch benefit. Measure adoption rates per repository before committing to the pattern.
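Per-repository measurement can start from a simple log of triage outcomes. The sketch below assumes a hypothetical event stream of `(repo, adopted)` pairs; the 10% default mirrors the threshold mentioned above and should be treated as a starting point, not a validated cutoff.

```python
from collections import defaultdict


def adoption_rates(events):
    """Map repo name -> fraction of AI suggestions that were adopted.

    `events` is an iterable of (repo, adopted) pairs, where `adopted` is a
    bool recorded at triage time (a hypothetical logging format).
    """
    totals = defaultdict(int)
    adopted = defaultdict(int)
    for repo, was_adopted in events:
        totals[repo] += 1
        adopted[repo] += int(was_adopted)
    return {repo: adopted[repo] / totals[repo] for repo in totals}


def repos_below(rates, threshold=0.10):
    """Repos whose adoption rate is under the threshold, sorted by name."""
    return sorted(repo for repo, rate in rates.items() if rate < threshold)
```

Repositories returned by `repos_below` are candidates for tightening the agent's filters or pausing AI review until suggestion quality improves.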

When AI-generated code is reviewed by AI agents from the same model family, correlated failures emerge: the reviewing agent reasons from the same training distribution as the generating agent and misses the same classes of error (arxiv:2603.25773). Executable specifications or cross-family agent panels mitigate this.

Teams without confidence-threshold or severity-filter tooling are likely to experience the 7x verbosity gap as alert fatigue. Over time this can degrade trust in AI review output and push effective adoption below the 16.6% baseline.

Why it works

AI agents are strong at matching code against known defect patterns and applying consistent rules without fatigue. Human reviewers contribute judgment AI lacks: architectural fit, knowledge transfer, and evolving team conventions. The sequential model works because AI pre-triage can reduce the human reviewer's cognitive load: known-pattern defects are addressed first, freeing attention for higher-level concerns. The data is consistent with this role division: over 95% of AI comments target defects and improvements, while human comments distribute across understanding, knowledge transfer, and design (arxiv:2603.15911).

Key Takeaways

AI review suggestions are adopted at less than one-third the rate of human suggestions (16.6% vs 56.5%); design workflows around this reality
Adopted AI suggestions increase code complexity more than human suggestions; monitor for technical debt accumulation
AI review concentrates on two categories (defects and improvements, over 95% of comments), while humans also provide mentoring, knowledge transfer, and architectural feedback
The human-last principle ensures design judgment and team context inform the merge decision
85-87% of AI-initiated reviews end without discussion; AI review is a broadcast, not a conversation

Law of Triviality in AI PRs — why large AI-generated diffs get rubber-stamped while small changes attract debate
Signal Over Volume in AI Review — design principle for the verbosity problem quantified here (29.6 vs 4.1 tokens/LOC)
Agent-Assisted Code Review — prescriptive guide for AI-first review that this page provides empirical backing for
Agent PR Volume vs. Value — PR authoring acceptance rates, complementary to the review suggestion adoption rates here
Committee Review Pattern — multi-agent verification approach suggested by the study
Tiered Code Review — risk-based routing that aligns with the human-last principle
Agentic Code Review Architecture — tool-calling architecture for the AI review stage
Agent-Authored PR Integration — reviewer engagement as merge predictor, complementary to the adoption rate findings here
Predicting Reviewable Code — pre-flagging AI-generated functions reviewers will delete, addressing review burden from the AI side
Review-Then-Implement Loop — closing the loop between AI review findings and automated fixes
Diff-Based Review — focusing review on changes rather than full outputs, relevant to managing AI review verbosity
PR Description Style as a Lever — PR description structure as a configurable parameter affecting reviewer engagement
CRA-Only Review and the Merge Rate Gap — empirical data on how CRA-only review affects merge outcomes; complementary evidence to the adoption rate findings here
Self-Improving Code Review Agents — Learned Rules — adaptive agents that extract accept/reject signals to improve future suggestion adoption rates, directly addressing the 16.6% baseline