Skip to content

Tunable Effort Levels for Code Review Agents

Expose review depth as a per-PR dial backed by a published bug-discovery curve, so reviewers and routing policies trade thoroughness against cost where it matters.

The Primitive

A tunable code-review agent ships with named effort levels — each pinning a point on a published bug-discovery curve. A reviewer or routing policy picks one per PR.

Cursor's Bugbot externalised this dial on 2026-05-11:

Level Bug-discovery rate Reviewer-stated trade-off
Default 0.7 bugs/run Optimised for efficiency and speed
High 0.95 bugs/run "More reasoning time; more expensive and slower, but finds more bugs"
Custom Operator-defined Natural-language policy: "Describe when Bugbot should use default or high effort"

High finds 35% more bugs at constant 80% resolution rate — additional flags are addressed at merge time, not silently dismissed.

The per-PR analogue of heuristic effort scaling (agent-tiered per query) and interactive effort sliders (operator-tiered per turn) — here the unit is one PR and the decision sits with a reviewer or routing policy.

Why a Dial, Not a Constant

Single-calibration agents force a compromise. Calibrate for thoroughness and routine PRs drown in commentary — developers override more than 30% of flags until the tool is functionally disabled. Calibrate for signal and the agent misses real regressions in high-stakes code.

GitHub Copilot's review began as the implicit binary form: in 29% of reviews the agent stays silent and in 71% it surfaces actionable feedback. A multi-level dial generalises it — silence is the lowest rung, deep agentic exploration the highest. By mid-2026 Copilot had externalised the dial too: admins set a per-repository analysis tier of low — the fast, cost-efficient default — or a new medium tier that routes complex logic, security-sensitive code, and cross-service changes to a higher-reasoning model (GitHub, 2026-06-02), and weeks later widened the surface with new configurations and controls for tuning the review agent (GitHub, 2026-06-12). Cursor binds the choice to one PR, Copilot to one repository — coarser-grained, but the same effort-routing primitive.

Calibration is the Pattern

Effort labels are hedge words without a published curve.

Cursor publishes two metrics:

  • Bug-discovery rate — bugs found per run, measured on BugBench, a curated benchmark of real diffs with human-annotated bugs (Building Bugbot)
  • Resolution rate — share of flagged bugs authors address at merge, classified by an LLM-as-judge validated against humans

Holding resolution rate constant across levels is the calibration commitment: High costs more but does not flood with low-quality flags. Resolution rate gates whether the higher-effort flags are useful.

Routing Axes

The dial composes with structured routing across four axes:

graph TD
    PR[Pull Request] --> Router{Routing policy}
    Router -->|Critical path<br/>auth, payments, crypto| High[High effort]
    Router -->|Large changeset<br/>1000+ LOC| High
    Router -->|Low-trust author<br/>new contributor| High
    Router -->|High historical defect rate<br/>in touched dirs| High
    Router -->|Otherwise| Default[Default effort]
  • File-path criticality — auth, payment, or crypto paths route to High. Same axis tiered code review uses for human escalation.
  • Change size — diffs above an LOC threshold route to High.
  • Author trust — first-time contributors or external forks route to High. Same signal CODEOWNERS already encodes.
  • Historical defect rate — directories with elevated post-merge bug rates route to High. Cursor exposes no signal; encode in Custom.

Cursor's Custom level encodes this as a natural-language policy evaluated per PR.

Cost-Performance Tie

High-effort review on critical paths is cheaper than post-merge remediation. Cursor's per-run pricing is $1.00-$1.50 by PR size, and effort levels require usage-based billing.

The arithmetic only holds when routing concentrates High on high-stakes runs. A reviewer toggling High on every PR pays the premium and accumulates the alert fatigue the dial existed to avoid.

When This Backfires

  • Resolution rate is not precision. It counts both "developer fixed it" and "developer made it go away." The 35%-more-bugs claim at High does not rule out more false positives that authors muted. Teams needing a precision floor (security under 3% FPR, style under 2%) measure noise themselves.
  • Routing policy drift in Custom. Natural-language Custom is an instruction file with the same primacy and drift issues. A correct policy in May 2026 may misroute six months later, and no eval is exposed by default.
  • High-effort default creates alert fatigue. Signal over volume is the dominant trust factor. Pinning High at the policy level burns attention on routine PRs and un-funds the high-stakes runs the dial protects.
  • Cost-blind defaults. Without per-PR cost ranges, reviewers toggle High on everything, collapsing the dial back to a fixed-pipeline calibration.
  • No vendor-provided author or defect-rate signal. Cursor exposes path-based routing but not author-trust or directory-defect-rate routing. Teams encode those manually in Custom, with no audit trail.

Tunable effort assumes a published curve and a routing policy that concentrates the higher tier on the runs that pay for it. Without both, the dial reduces to per-PR cost optionality with no signal benefit.

Example

A monorepo with mixed-risk paths uses a Custom policy to concentrate High effort:

Use High effort when:
- The PR touches src/auth/, src/payments/, or src/crypto/
- The diff exceeds 500 lines
- The author has fewer than 10 merged PRs in this repo
- More than half the changed files have had a post-incident commit in the last 90 days

Use Default effort otherwise.

A 30-line fix in tests/integration/ from a long-tenured engineer runs at Default — 0.7 bugs/run, fast, and silence on style reads as real silence. A 1,200-line refactor touching src/auth/session.ts from a first-time contributor runs at High — 0.95 bugs/run, and the reviewer reads each flag knowing the routing concentrated effort here on purpose.

Pinning Default at the team level and letting Custom escalate inverts biasing-up: it preserves the silence-as-output contract for routine work while making the higher tier opt-in by policy, not per-PR clicking.

Key Takeaways

  • Effort levels in code review agents are the per-PR analogue of heuristic effort scaling and interactive effort sliders, but with the unit of work bound to one PR and the decision delegated to a reviewer or routing policy.
  • The dial is meaningful only with a published bug-discovery curve. Effort labels without numbers are hedge words.
  • Resolution rate gates whether the higher-effort flags are useful; published rates are not the same as precision, and teams that need a precision floor measure it themselves.
  • Routing policies concentrate High on the axes that pay for it — file-path criticality, change size, author trust, historical defect rate — the heuristic siblings of risk-score threshold calibration. Pinning High globally collapses the pattern back to a fixed-pipeline calibration.
  • Custom natural-language policies are themselves drift-prone instruction surfaces. Treat them like CLAUDE.md: review periodically, gate changes through eval.
Feedback