Agent Headcount as a Vanity Metric

"We have 11 AI agents in production" tells you nothing — agent count measures decomposition style, not capability or output.

Agent headcount is a vanity metric: it reports how many separate loops a team instantiated, not the work those loops ship. Boris Mann's framing, surfaced by Simon Willison, equates "11 AI agents" with "11 spreadsheets" or "11 browser tabs": grammatically a number, operationally a non-answer (Simon Willison, 2026-05-13).

The Pattern

Leadership decks, vendor case studies, conference talks, and team chats report agent populations as evidence of investment maturity ("we're up to 7 agents now"). The number is easy to count, easy to compare across teams, easy to put on a slide — and uncorrelated with whether the system ships more work, fewer bugs, or lower cost.

Why It Fails

The compute-controlled evidence is unambiguous. Once reasoning-token budget is held constant across architectures, single-agent systems "consistently match or outperform" multi-agent systems on multi-hop reasoning across the Qwen3, DeepSeek-R1, and Gemini 2.5 model families (Tran & Kiela, 2026, arxiv:2604.02460). Earlier multi-agent gains reflect "unaccounted computation and context effects rather than inherent architectural benefits".

Variance is the giveaway. Across 260 configurations and six benchmarks, the same multi-agent system swings from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, and tool-heavy tasks pay a 2–6x efficiency penalty in the multi-agent form (Kim et al., 2025, arxiv:2512.08296). "We have N agents" cannot distinguish the +80% configuration from the -70% one — architecture-task alignment is the explanatory variable, count is not.
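A minimal sketch of why the count has no explanatory power here. The two deltas are the extremes quoted above; the agent count of 5 and the record layout are assumptions made up for illustration, not figures from the paper.

```python
# Hypothetical run records. Only the two delta values come from the text above;
# agent_count=5 and the field names are illustrative assumptions.
runs = [
    {"agent_count": 5, "task": "decomposable financial reasoning", "delta_pct": 80.8},
    {"agent_count": 5, "task": "sequential planning", "delta_pct": -70.0},
]

def spread_by(key):
    """Group deltas by `key` and return {group: (min_delta, max_delta)}."""
    groups = {}
    for run in runs:
        groups.setdefault(run[key], []).append(run["delta_pct"])
    return {group: (min(vals), max(vals)) for group, vals in groups.items()}

by_count = spread_by("agent_count")  # one bucket holding both extremes
by_task = spread_by("task")          # each task type isolates its own outcome

print(by_count)
print(by_task)
```

Grouping by count puts the best and worst result in the same bucket; grouping by task type separates them, which is the sense in which task alignment explains the outcome and count does not.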

Production teams agree. Anthropic's research mode burns ~15x more tokens than chat, where a single agent burns ~4x, and earns the cost only on breadth-first parallel research; it is the wrong choice for most coding and for any domain that needs shared context (Anthropic, 2025). Shopify's ICML 2025 Sidekick lessons are blunter: "avoid multi-agent architectures early — simple single-agent systems can handle more complexity than you might expect" (Shopify Engineering, 2025).

Why It Works

It works as a metric — for the reporter, not the reader. The count is cheap, comparable, and brag-shaped, which are the canonical Goodhart conditions. It survives in team chats because nothing else has been instrumented: in the absence of pass-rate, revision-rate, or cost-per-merged-PR telemetry, count fills the vacuum. The mechanism by which it actively misleads is the variance above — count is downstream of topology choice, not a proxy for sophistication.

Substitute Metrics

Track outputs of the agent system, not the population of it:

  • Pass rate plotted against revision rate over time. Pass rate climbing while revision rate stays flat or declines is the healthy pattern; both climbing is acceleration whiplash (Digital Applied, 2026).
  • Cost per merged PR. Ties token spend to a unit of work the business cares about — count cannot.
  • Outcome rate, not completion rate. "Output is easy to measure and tells you almost nothing; outcome is harder to measure and tells you everything" (Digital Applied: AI Agent ROI, 2026).
  • Percentage of test/lint failures auto-remediated. Throughput per dollar against a fixed agent surface.
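The metrics in this list can be computed from per-PR records. The sketch below is one possible shape, not a standard: the `AgentPR` fields, the definition of pass rate (first CI run passes), and the definition of revision rate (merged PRs needing at least one revision round) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class AgentPR:
    """One agent-assisted pull request. Every field here is hypothetical."""
    cost_usd: float        # token spend attributed to this PR
    merged: bool           # did the PR land?
    passed_first_ci: bool  # did the first CI run pass?
    revision_rounds: int   # human-requested revision rounds
    human_edited: bool     # did a human edit the diff before merge?

def outcome_metrics(prs):
    """Compute outcome metrics under the assumed definitions above."""
    merged = [p for p in prs if p.merged]
    if not prs or not merged:
        raise ValueError("need at least one PR and at least one merged PR")
    return {
        "pass_rate": sum(p.passed_first_ci for p in prs) / len(prs),
        "revision_rate": sum(p.revision_rounds > 0 for p in merged) / len(merged),
        # Spend on PRs that never merged is still charged against merged work.
        "cost_per_merged_pr": sum(p.cost_usd for p in prs) / len(merged),
        "unedited_merge_share": sum(not p.human_edited for p in merged) / len(merged),
    }

# Made-up sample data, for illustration only.
sample = [
    AgentPR(1.00, True, True, 0, False),
    AgentPR(2.00, True, False, 2, True),
    AgentPR(0.50, False, False, 1, True),
    AgentPR(0.50, True, True, 0, False),
]
metrics = outcome_metrics(sample)
print(metrics)
```

Charging abandoned PRs to the merged total is a deliberate choice in this sketch: it keeps wasted spend visible in the cost figure instead of letting it disappear with the unmerged work.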

When This Backfires

  • Zero outcome telemetry. Retiring the headcount metric without standing up the dashboards above leaves a vacuum filled by even worse proxies — lines of AI-generated code, prompt counts, suggestion-acceptance ratios. Build the replacement before retiring the placeholder.
  • Genuinely decomposable parallel work. For breadth-first research, multi-vendor extraction, or batch document processing, more agents help up to the parallelisation limit (Kim et al., 2025); Anthropic's research mode is the canonical case. Headcount in that narrow class is a weak-but-positive signal. The anti-pattern is generalising from it.
  • Non-technical executive reporting. "8 production agents" lands on a board slide; "$0.72 cost per merged PR at 31% unedited acceptance" requires a briefing. Publish both; do not suppress either.

Example

Before — counting agents as a maturity signal:

Q1 platform update: We now run 11 agents in production
(planner, retriever, code-writer, reviewer, deployer, doc-writer,
test-generator, log-summarizer, incident-triager, PR-labeler, release-notes).
Up from 4 at end of Q4. Team velocity is accelerating.

The deck reports a count without an outcome. Two of those agents may drive 90% of merged work; three may be net-negative on cost. The number conceals which configuration is winning.

After — counting outputs of the agent system:

Q1 platform update:
- 312 PRs merged with agent assistance (Q4: 184), 41% unedited (Q4: 22%)
- Median cost per merged PR: $1.10 (Q4: $1.80)
- Auto-remediated lint/test failures: 1,840 (Q4: 620)
- Agent population reduced from 11 to 6; planner+code-writer+reviewer
  drove 87% of merged work, the other 8 were retired or consolidated into 3.

Throughput is up and cost per merged PR is down while agent count is down. The vanity metric would have flagged the consolidation as regression; the outcome metrics flag it as the win.
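The direction of each change can be checked with a few lines of arithmetic. The figures below are copied from the "After" update; the dictionary keys and the `pct_change` helper are illustrative names, not part of any reporting tool.

```python
# Figures from the "After" update above. The agent counts are the
# "11 to 6" reduction reported in that update.
before = {"merged_prs": 184, "cost_per_pr": 1.80, "auto_remediated": 620, "agents": 11}
after = {"merged_prs": 312, "cost_per_pr": 1.10, "auto_remediated": 1840, "agents": 6}

def pct_change(old, new):
    """Relative change from old to new, in percent, rounded to one decimal."""
    return round((new - old) / old * 100, 1)

changes = {name: pct_change(before[name], after[name]) for name in before}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
# merged_prs: +69.6%
# cost_per_pr: -38.9%
# auto_remediated: +196.8%
# agents: -45.5%
```

Merged PRs and auto-remediations rise while cost per merged PR and agent count both fall. The unedited share (22% to 41%) is left out because it is better reported as a 19-point gain than as a relative change.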

Key Takeaways

  • Counting agents is a category error — like counting spreadsheets or browser tabs, it reports a number without describing value
  • Compute-equalised benchmarks show single-agent systems matching or beating multi-agent systems on multi-hop reasoning; the same multi-agent system swings from +80.8% to -70.0% by task fit, not count
  • Track pass rate, revision rate, cost per merged PR, and unedited-merge percentage — headcount cannot distinguish winning configurations from losing ones
  • Multi-agent earns its token cost on a narrow class of breadth-first parallel work; outside that class, more agents adds coordination overhead without value