Tunable Effort Levels for Code Review Agents¶
Expose review depth as a per-PR dial backed by a published bug-discovery curve, so reviewers and routing policies trade thoroughness against cost where it matters.
The Primitive¶
A tunable code-review agent ships with named effort levels — each pinning a point on a published bug-discovery curve. A reviewer or routing policy picks one per PR.
Cursor's Bugbot externalised this dial on 2026-05-11:
| Level | Bug-discovery rate | Reviewer-stated trade-off |
|---|---|---|
| Default | 0.7 bugs/run | Optimised for efficiency and speed |
| High | 0.95 bugs/run | "More reasoning time; more expensive and slower, but finds more bugs" |
| Custom | Operator-defined | Natural-language policy: "Describe when Bugbot should use default or high effort" |
High finds 35% more bugs at constant 80% resolution rate — additional flags are addressed at merge time, not silently dismissed.
The per-PR analogue of heuristic effort scaling (agent-tiered per query) and interactive effort sliders (operator-tiered per turn) — here the unit is one PR and the decision sits with a reviewer or routing policy.
Why a Dial, Not a Constant¶
Single-calibration agents force a compromise. Calibrate for thoroughness and routine PRs drown in commentary — developers override more than 30% of flags until the tool is functionally disabled. Calibrate for signal and the agent misses real regressions in high-stakes code.
GitHub Copilot's review began as the implicit binary form: in 29% of reviews the agent stays silent and in 71% it surfaces actionable feedback. A multi-level dial generalises it — silence is the lowest rung, deep agentic exploration the highest. By mid-2026 Copilot had externalised the dial too: admins set a per-repository analysis tier of low — the fast, cost-efficient default — or a new medium tier that routes complex logic, security-sensitive code, and cross-service changes to a higher-reasoning model (GitHub, 2026-06-02), and weeks later widened the surface with new configurations and controls for tuning the review agent (GitHub, 2026-06-12). Cursor binds the choice to one PR, Copilot to one repository — coarser-grained, but the same effort-routing primitive.
Calibration is the Pattern¶
Effort labels are hedge words without a published curve.
Cursor publishes two metrics:
- Bug-discovery rate — bugs found per run, measured on
BugBench, a curated benchmark of real diffs with human-annotated bugs (Building Bugbot) - Resolution rate — share of flagged bugs authors address at merge, classified by an LLM-as-judge validated against humans
Holding resolution rate constant across levels is the calibration commitment: High costs more but does not flood with low-quality flags. Resolution rate gates whether the higher-effort flags are useful.
Routing Axes¶
The dial composes with structured routing across four axes:
graph TD
PR[Pull Request] --> Router{Routing policy}
Router -->|Critical path<br/>auth, payments, crypto| High[High effort]
Router -->|Large changeset<br/>1000+ LOC| High
Router -->|Low-trust author<br/>new contributor| High
Router -->|High historical defect rate<br/>in touched dirs| High
Router -->|Otherwise| Default[Default effort]
- File-path criticality — auth, payment, or crypto paths route to High. Same axis tiered code review uses for human escalation.
- Change size — diffs above an LOC threshold route to High.
- Author trust — first-time contributors or external forks route to High. Same signal
CODEOWNERSalready encodes. - Historical defect rate — directories with elevated post-merge bug rates route to High. Cursor exposes no signal; encode in Custom.
Cursor's Custom level encodes this as a natural-language policy evaluated per PR.
Cost-Performance Tie¶
High-effort review on critical paths is cheaper than post-merge remediation. Cursor's per-run pricing is $1.00-$1.50 by PR size, and effort levels require usage-based billing.
The arithmetic only holds when routing concentrates High on high-stakes runs. A reviewer toggling High on every PR pays the premium and accumulates the alert fatigue the dial existed to avoid.
When This Backfires¶
- Resolution rate is not precision. It counts both "developer fixed it" and "developer made it go away." The 35%-more-bugs claim at High does not rule out more false positives that authors muted. Teams needing a precision floor (security under 3% FPR, style under 2%) measure noise themselves.
- Routing policy drift in Custom. Natural-language Custom is an instruction file with the same primacy and drift issues. A correct policy in May 2026 may misroute six months later, and no eval is exposed by default.
- High-effort default creates alert fatigue. Signal over volume is the dominant trust factor. Pinning High at the policy level burns attention on routine PRs and un-funds the high-stakes runs the dial protects.
- Cost-blind defaults. Without per-PR cost ranges, reviewers toggle High on everything, collapsing the dial back to a fixed-pipeline calibration.
- No vendor-provided author or defect-rate signal. Cursor exposes path-based routing but not author-trust or directory-defect-rate routing. Teams encode those manually in Custom, with no audit trail.
Tunable effort assumes a published curve and a routing policy that concentrates the higher tier on the runs that pay for it. Without both, the dial reduces to per-PR cost optionality with no signal benefit.
Example¶
A monorepo with mixed-risk paths uses a Custom policy to concentrate High effort:
Use High effort when:
- The PR touches src/auth/, src/payments/, or src/crypto/
- The diff exceeds 500 lines
- The author has fewer than 10 merged PRs in this repo
- More than half the changed files have had a post-incident commit in the last 90 days
Use Default effort otherwise.
A 30-line fix in tests/integration/ from a long-tenured engineer runs at Default — 0.7 bugs/run, fast, and silence on style reads as real silence. A 1,200-line refactor touching src/auth/session.ts from a first-time contributor runs at High — 0.95 bugs/run, and the reviewer reads each flag knowing the routing concentrated effort here on purpose.
Pinning Default at the team level and letting Custom escalate inverts biasing-up: it preserves the silence-as-output contract for routine work while making the higher tier opt-in by policy, not per-PR clicking.
Key Takeaways¶
- Effort levels in code review agents are the per-PR analogue of heuristic effort scaling and interactive effort sliders, but with the unit of work bound to one PR and the decision delegated to a reviewer or routing policy.
- The dial is meaningful only with a published bug-discovery curve. Effort labels without numbers are hedge words.
- Resolution rate gates whether the higher-effort flags are useful; published rates are not the same as precision, and teams that need a precision floor measure it themselves.
- Routing policies concentrate High on the axes that pay for it — file-path criticality, change size, author trust, historical defect rate — the heuristic siblings of risk-score threshold calibration. Pinning High globally collapses the pattern back to a fixed-pipeline calibration.
- Custom natural-language policies are themselves drift-prone instruction surfaces. Treat them like CLAUDE.md: review periodically, gate changes through eval.
Related¶
- Tiered Code Review — path-based effort routing as the human-review counterpart
- Signal Over Volume in AI Review — why silence is a valid output and why High-by-default backfires
- Self-Improving Code Review Agents — Learned Rules — Bugbot's other published lever, complementary to per-PR effort
- Heuristic Effort Scaling — agent-decided effort tiers, the alternative to operator-set per-PR levels
- Interactive Effort Sliders — per-turn reasoning budget control in Claude Code
/effort - Cost-Aware Agent Design — broader cost-routing context