Human-Equivalent Hours for Autonomous Coding Agent Productivity
Estimate the human engineering hours an autonomous agent's output would have taken — credible only on PR-gated sessions with a paired downstream signal.
When the Metric Is Credible
Human-equivalent hours is an estimator of counterfactual human time, not a direct measurement of output. It is credible under specific conditions and misleading outside them. Apply it only when:
- Sessions terminate in a merged PR or pass an independent quality classifier. Cognition's calibration includes a PR-producing session only if at least one of its PRs merges; non-PR sessions go through a classifier that drops 1–20% as unproductive (Cognition, 2026-06-04).
- The aggregate covers enough sessions to escape the noise floor. Cognition's held-out r_log = 0.74 places ~50% of estimates within a factor of 2; below ~20 sessions, month-over-month deltas sit inside that band (Cognition, 2026-06-04).
- A second, observed signal trends alongside it: PR review time, defect rate, or merge-to-revert ratio. Without one, the metric is unanchored (see When This Backfires).
Outside these conditions, prefer cost per merged PR alongside review-time-to-merge — both are observed, not estimated.
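The three conditions above can be sketched as a simple gate. This is a minimal illustration, not Cognition's implementation: the `Session` fields, the function names, and the `MIN_SESSIONS` constant are all hypothetical, with the threshold borrowed from the ~20-session figure quoted above.

```python
from dataclasses import dataclass

# Hypothetical threshold, taken from the ~20-session figure in the text.
MIN_SESSIONS = 20


@dataclass
class Session:
    has_pr: bool             # the session opened at least one PR
    any_pr_merged: bool      # at least one of those PRs merged
    passed_classifier: bool  # an independent classifier judged the session productive


def is_gated(session: Session) -> bool:
    """PR sessions count only if a PR merged; non-PR sessions need the classifier."""
    if session.has_pr:
        return session.any_pr_merged
    return session.passed_classifier


def metric_is_credible(sessions: list[Session], has_downstream_signal: bool) -> bool:
    """Apply all three conditions: gating, sample size, and a paired observed signal."""
    gated = [s for s in sessions if is_gated(s)]
    return len(gated) >= MIN_SESSIONS and has_downstream_signal
```

If `metric_is_credible` returns `False`, the fallback suggested above applies: report cost per merged PR and review-time-to-merge instead.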
The Definition
Cognition operationalises the metric by asking "how long would a human engineer have taken to produce the same output?" — chosen because hours already denominate salaries and contractor rates, so the result is directly comparable to existing finance and headcount instruments (Cognition, 2026-06-04).
The estimator rests on four design principles:
- Reason about the human's path, not the agent's — discount retries, environment setup, and non-core artifacts the human would not have produced.
- Credit only work the user did not specify — measure the agent's independent contribution against the user's initial problem statement, not the full diff.
- Account for codebase familiarity — infer the exploration time a human would have needed in an unfamiliar codebase.
- Assume relevant expertise — the reference engineer already has the required skills; do not credit the agent for skill-substitution.
The uncalibrated model is corrected via log-space linear regression:
h = 2.28 × m^0.923
where m is the uncalibrated estimate and h the corrected human-hours figure. A simplified form uses a single multiplicative constant of 2.08 with negligible impact on metrics. Calibration used 258 sessions from 126 users across enterprise customers; held-out r_log = 0.74 on 233 sessions; F(1,231) = 279.9, p < 10⁻⁵ (Cognition, 2026-06-04).
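The correction is small enough to write out directly. The sketch below uses only the constants quoted above; the function names are illustrative and the input validation is an added assumption, not part of the published calibration.

```python
# Constants as quoted from Cognition's calibration (2026-06-04).
SCALE = 2.28            # multiplicative term of the power-law fit
EXPONENT = 0.923        # exponent of the power-law fit
SIMPLE_CONSTANT = 2.08  # single-constant simplification


def calibrate(m: float) -> float:
    """Corrected human-hours h = 2.28 * m^0.923 for an uncalibrated estimate m."""
    if m <= 0:
        raise ValueError("uncalibrated estimate must be positive")
    return SCALE * m ** EXPONENT


def calibrate_simple(m: float) -> float:
    """Simplified form: a single multiplicative constant of 2.08."""
    if m <= 0:
        raise ValueError("uncalibrated estimate must be positive")
    return SIMPLE_CONSTANT * m
```

Because the fit is linear in log space, the same relation reads log h = log 2.28 + 0.923 · log m. The exponent is below 1, so the power-law form corrects long sessions slightly less, proportionally, than short ones.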
Why Code Volume Is Not the Metric
A naive regression of total lines changed against human-time estimates produces R²_log = 0.27 — code volume captures roughly a quarter of the variance in productive output (Cognition, 2026-06-04). This is the empirical case against task-completion-rate and PR-count metrics: they correlate weakly with the value the team is actually paying for. Under bottleneck migration the cheap part of the work — generation — is exactly what those metrics count.
Why It Works
Engineering value is already denominated in human time. Salaries, contractor rates, and project estimates all use hours; converting agent output back into hours makes ROI directly comparable to the instruments finance and headcount planning already run (Cognition, 2026-06-04). The mechanism is denominator alignment, not ground-truth measurement — the metric works because it speaks the language of the decisions it informs (renew the seat, raise the cap, hire instead).
The denominator is urgent right now. Agentic workloads carry 58.9% of token volume on Vercel's AI Gateway, up from 31.6% six months earlier — tool-using requests are ~2.6× more token-heavy than the rest (Vercel AI Gateway production index, 2026-05-12). Uber capped employees at $1,500/month per agentic coding tool after burning the annual AI budget in four months (TechCrunch, 2026-06-02). Token spend has a denominator; agent output, until now, did not.
When This Backfires
The metric estimates counterfactual human time. Every failure mode below traces back to that one property.
- High-context maintenance on familiar codebases. A randomized controlled trial of experienced open-source developers measured a 19% slowdown with AI tools while developers still reported a 20% speedup — a 39-point perception gap (METR, 2025-07-10). Cognition's model is calibrated against user reports and its corrected estimates still sit 1.4× below those reports (Cognition, 2026-06-04) — consistent with self-report inflation, not independent of it. Pair with an observed downstream signal; the productivity-experience paradox is the warning that perception and reality diverge here.
- Downstream cost can absorb the gain. AI-assisted teams complete 21% more tasks and merge 98% more PRs while PR review time rises 91% — the bottleneck migrates (Osmani, 2025). An hours-saved figure that ignores review-time-spent is half a ledger.
- Task selection bias inflates apparent value. Agents get assigned the tasks they are best at; the reference human is then estimated for tasks pre-selected to favour the agent. Compare baselines on stratified task mixes, not aggregate session counts.
- Small-team noise floor. At r_log = 0.74, ~50% of per-session estimates fall within a factor of 2. A 10-session month sits inside that band; reading a 30% month-over-month change as signal is reading noise.
- Greenfield work has no stable reference. "How long would a human have taken?" assumes a stable counterfactual. For novel problems where no comparable human baseline exists, the denominator is fabricated and the resulting hours figure is no more grounded than an opinion.
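The small-team noise floor lends itself to a crude guard. The heuristic below is an assumption built from the two figures quoted above (the factor-of-2 band and the ~20-session threshold); the function name and defaults are hypothetical, and it is not a substitute for a proper confidence interval.

```python
def delta_exceeds_noise_floor(
    prev_hours: float,
    curr_hours: float,
    n_sessions: int,
    min_sessions: int = 20,    # assumed from the ~20-session figure in the text
    band_factor: float = 2.0,  # assumed from the factor-of-2 band in the text
) -> bool:
    """Return True if a month-over-month change is worth reading as signal.

    Heuristic: with few sessions, treat any change inside the factor-of-2
    band as noise; with enough sessions, let the change through for review.
    """
    if prev_hours <= 0 or curr_hours <= 0:
        raise ValueError("hour totals must be positive")
    ratio = curr_hours / prev_hours
    outside_band = ratio > band_factor or ratio < 1.0 / band_factor
    return n_sessions >= min_sessions or outside_band
```

With these defaults, a 30% rise across a 10-session month returns `False`, matching the warning above that such a change is inside the band.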
Example
A platform team runs Devin and Claude Code across two months. They want to defend (or cancel) a $1,500/seat monthly agentic-coding budget, the level an Uber-style cap would impose.
Before — counting completions:
Month 1: 47 PRs merged, 12,400 lines changed, $9,200 spend
Month 2: 51 PRs merged, 11,800 lines changed, $11,400 spend
PR count rises while lines changed fall, and spend rises faster than PR count; the conversation stalls on whether 47 PRs are "worth" $9,200.
After — denominating in human-equivalent hours, with downstream signals:
Month 1: 47 PRs merged → 184 estimated human-hours
  PR review time: 38h spent; defect-rate flat
  Implied rate: $9,200 / 184h = $50/h
Month 2: 51 PRs merged → 201 estimated human-hours
  PR review time: 61h spent; defect-rate flat
  Implied rate: $11,400 / 201h = $57/h
The estimate is calibrated on PR-merged sessions only (Cognition's gate). The implied $/h is now directly comparable to the team's loaded hourly rate. Review time is the observed signal that anchors the estimate: it rose from 38h to 61h (about 61%) while estimated human-hours rose from 184 to 201 (about 9%). When review time rises faster than agent hours, as it does here, part of the gain is being absorbed downstream and the report must say so.
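The arithmetic in this example can be reproduced with a short script. The figures are the ones in the example above; the dictionary layout and the function names are illustrative assumptions.

```python
# Figures from the worked example above.
MONTHS = {
    "Month 1": {"spend": 9_200, "est_hours": 184, "review_hours": 38},
    "Month 2": {"spend": 11_400, "est_hours": 201, "review_hours": 61},
}


def implied_rate(spend: float, est_hours: float) -> float:
    """Dollars spent per estimated human-equivalent hour."""
    return spend / est_hours


def growth(prev: float, curr: float) -> float:
    """Fractional month-over-month change."""
    return (curr - prev) / prev


m1, m2 = MONTHS["Month 1"], MONTHS["Month 2"]
rate_m1 = implied_rate(m1["spend"], m1["est_hours"])
rate_m2 = implied_rate(m2["spend"], m2["est_hours"])
hours_growth = growth(m1["est_hours"], m2["est_hours"])
review_growth = growth(m1["review_hours"], m2["review_hours"])

# True when review time grows faster than estimated hours, i.e. the
# downstream signal says some of the gain is being absorbed in review.
review_outpaces_hours = review_growth > hours_growth
```

Comparing the two growth rates is the simplest way to make the paired signal explicit in a monthly report, rather than presenting the implied rate on its own.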
Key Takeaways
- The metric: how long would a human engineer have taken to produce the same output — chosen because engineering value is already denominated in hours (Cognition, 2026-06-04).
- Cognition's calibration: h = 2.28 × m^0.923 (or a simplified 2.08× constant), fit on 258 sessions across 126 users; r_log = 0.74 on held-out data; ~50% within a factor of 2.
- Code volume is empirically rejected as the metric: lines-changed regression returns R²_log = 0.27 against human estimates.
- Anti-gaming: include sessions only if a PR merges; for non-PR sessions, run an unproductive-session classifier (drops 1–20%).
- The signal inherits self-report inflation. Pair with an observed downstream signal (PR review time, defect rate) — METR's RCT measured a 19% slowdown while developers perceived a 20% speedup (METR, 2025-07-10).
- Apply only on PR-gated, multi-month aggregates with a paired downstream signal; small samples sit inside the noise floor, and greenfield work has no stable human reference.
Related
- The Productivity-Experience Paradox in AI-Assisted Development — perceived productivity can rise while experience declines; the inflation channel that hours-saved estimates inherit
- The Bottleneck Migration When Humans Supervise Agents — review time absorbs the generation gain; the downstream signal the metric must be paired with
- Copilot vs Claude Billing Semantics for Enterprise Teams — the cost-side denominator the agent-hours figure is being compared against
- Token-Cost Profiling and Reduction for Always-On Agentic Workflows — the spend-side instrumentation that anchors the ROI ratio
- Rigor Relocation: Engineering Discipline with AI Agents — verification cost shifts that show up only when the metric is paired with downstream signals