Agent Headcount as a Vanity Metric

"We have 11 AI agents in production" tells you nothing — agent count measures decomposition style, not capability or output.

Agent headcount is a vanity metric: it reports how many loops a team instantiated, not the work those loops ship. Boris Mann's framing, surfaced by Simon Willison on 2026-05-13, equates "11 AI agents" with "11 spreadsheets": a number, not an answer (Simon Willison, 2026-05-13).

The pattern

Decks, vendor case studies, and team chats report agent populations as a sign of investment maturity. The number is easy to count, compare, and put on a slide, yet it is uncorrelated with whether the system ships more work or costs less.

Why it fails

With reasoning-token budget held constant, single-agent systems "consistently match or outperform" multi-agent ones on multi-hop reasoning across the Qwen3, DeepSeek-R1, and Gemini 2.5 families (Tran & Kiela, 2026, arxiv:2604.02460). Earlier multi-agent gains reflect "unaccounted computation and context effects rather than inherent architectural benefits".

Variance is the giveaway: across 260 configurations and six benchmarks, the same system swings from +80.8% on decomposable financial reasoning to -70.0% on sequential planning, and tool-heavy tasks pay a 2–6x efficiency penalty (Kim et al., 2025, arxiv:2512.08296). "We have N agents" cannot distinguish the +80% configuration from the -70% one — architecture-task alignment is the explanatory variable.
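The point about the explanatory variable can be made concrete with a small sketch. It groups benchmark deltas first by agent count and then by task type, and compares the spread inside each group. Only the +80.8 and -70.0 figures come from the text above; the agent counts, the other two rows, and the names `RESULTS` and `spread_by` are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (agent_count, task_type, % change vs. single-agent baseline).
# Only +80.8 and -70.0 are taken from the text; every other value is invented.
RESULTS = [
    (4, "decomposable", 80.8),
    (4, "sequential", -70.0),
    (2, "decomposable", 45.0),
    (2, "sequential", -30.0),
]


def spread_by(results, key_index):
    """Return {group: max - min of the % change} for the chosen column."""
    groups = defaultdict(list)
    for row in results:
        groups[row[key_index]].append(row[2])
    return {key: round(max(vals) - min(vals), 1) for key, vals in groups.items()}


by_count = spread_by(RESULTS, 0)  # grouped by agent count
by_task = spread_by(RESULTS, 1)   # grouped by task type

print("spread by agent count:", by_count)
print("spread by task type:  ", by_task)
```

In this toy data the spread within a single agent count is far wider than the spread within a single task type, which is what "count does not explain the outcome" would look like in a results table.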

Anthropic's research mode burns ~15x more tokens than chat, earning the cost only on breadth-first parallel research — wrong for most coding (Anthropic, 2025). Shopify's ICML 2025 Sidekick lessons are blunter: "avoid multi-agent architectures early — simple single-agent systems can handle more complexity than you might expect" (Shopify Engineering, 2025).

Why it works

It works for the reporter, not the reader: the count is cheap, comparable, and brag-shaped — the canonical Goodhart conditions. It survives because nothing else is instrumented: absent pass-rate, revision-rate, or cost-per-merged-PR telemetry, count fills the vacuum — though it is downstream of topology choice, not a proxy for sophistication.

Substitute metrics

Track the agent system's outputs, not its population:

Pass rate against revision rate over time. Pass rate climbing while revision rate stays flat is healthy; both climbing is acceleration whiplash (Digital Applied, 2026).
Cost per merged PR. Ties token spend to a unit of work the business cares about — count cannot.
Outcome rate, not completion rate. "Output is easy to measure and tells you almost nothing; outcome is harder to measure and tells you everything" (Digital Applied: AI Agent ROI, 2026).
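As a minimal sketch of how the metrics above might be derived from per-task logs: the `TaskRecord` fields, the `summarize` helper, and the exact definitions (for example, counting spend on unmerged tasks in cost per merged PR) are assumptions for illustration, not definitions taken from the cited sources.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """One agent-assisted task. All field names are hypothetical."""
    passed_checks: bool     # output passed CI/review without rework
    human_revisions: int    # number of human edits before merge
    merged: bool            # the PR was eventually merged
    token_cost_usd: float   # token spend attributed to this task


def summarize(records):
    """Compute output-side metrics; the definitions here are assumptions."""
    if not records:
        raise ValueError("no records to summarize")
    merged = [r for r in records if r.merged]
    total_cost = sum(r.token_cost_usd for r in records)
    return {
        "pass_rate": sum(r.passed_checks for r in records) / len(records),
        # Share of merged PRs that needed at least one human edit.
        "revision_rate": (
            sum(r.human_revisions > 0 for r in merged) / len(merged)
            if merged else 0.0
        ),
        # Spend on abandoned tasks is included, so failed attempts raise the cost.
        "cost_per_merged_pr": total_cost / len(merged) if merged else float("inf"),
        # Merged work over all attempted work, not merely "task finished".
        "outcome_rate": len(merged) / len(records),
    }


sample = [
    TaskRecord(True, 0, True, 0.50),
    TaskRecord(True, 2, True, 1.00),
    TaskRecord(False, 0, False, 0.75),
    TaskRecord(True, 1, True, 0.75),
]
summary = summarize(sample)
print(summary)
```

Dividing total spend by merged PRs rather than by attempted tasks is a deliberate choice in this sketch: it keeps the cost of abandoned work visible in the number the business sees.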

When this backfires

Zero outcome telemetry. Retiring headcount without the dashboards above leaves a vacuum filled by worse proxies — AI-generated code lines, prompt counts, acceptance ratios. Build the replacement first.
Genuinely decomposable parallel work. For breadth-first research, multi-vendor extraction, or batch processing, more agents help up to the parallelization limit (Kim et al., 2025). Headcount there is a weak-but-positive signal; the anti-pattern is generalising from it.
Non-technical reporting. "8 production agents" lands on a board slide; "$0.72 cost per merged PR" needs a briefing. Publish both.

Example

Before — counting agents as a maturity signal:

Q1 platform update: We now run 11 agents in production
(planner, retriever, code-writer, reviewer, deployer, doc-writer,
test-generator, log-summarizer, incident-triager, PR-labeler, release-notes).
Up from 4 at end of Q4. Team velocity is accelerating.

The deck reports a count, not an outcome. Two agents may drive 90% of merged work while three are net-negative on cost — the number conceals which configuration is winning.

After — counting outputs of the agent system:

Q1 platform update:
- 312 PRs merged with agent assistance (Q4: 184), 41% unedited (Q4: 22%)
- Median cost per merged PR: $1.10 (Q4: $1.80)
- Auto-remediated lint/test failures: 1,840 (Q4: 620)
- Agent population reduced from 11 to 6; planner+code-writer+reviewer
  drove 87% of merged work, and 5 of the other 8 were retired or merged.

Throughput is up and cost per merged PR is down, even though agent count fell. The vanity metric flags the consolidation as regression; the outcome metrics flag the win.
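The quarter-over-quarter movement in the "After" update can be checked with a few lines of arithmetic. The figures are the ones quoted in the update; the dictionary keys and the `pct_change` helper are illustrative names, not part of any real reporting tool.

```python
def pct_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


# (Q4 value, Q1 value) pairs copied from the "After" update above.
metrics = {
    "prs_merged": (184, 312),
    "median_cost_per_merged_pr_usd": (1.80, 1.10),
    "auto_remediated_failures": (620, 1840),
    "agent_count": (11, 6),
}

changes = {name: round(pct_change(old, new), 1) for name, (old, new) in metrics.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Merged PRs rise by roughly 70% and median cost per merged PR falls by roughly 39%, while agent count drops by about 45%. A report that tracked only agent count would surface just the last of those three numbers.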

Key Takeaways

Counting agents is a category error — like counting spreadsheets or browser tabs, it reports a number without describing value
Compute-equalised benchmarks show single-agent systems match or beat multi-agent ones on multi-hop reasoning; the same system swings from +80.8% to -70.0% by task fit, not count
Track pass rate, revision rate, cost per merged PR, and unedited-merge percentage — headcount cannot distinguish winning configurations from losing ones
Multi-agent earns its token cost on a narrow class of breadth-first parallel work; outside that class, more agents add coordination overhead without value

The AI Adoption Footprint — segmented adoption is a more honest org-level metric than aggregate headcount
Cargo Cult Agent Setup — adding agents copied from elsewhere multiplies count without adding capability
The Copy-Paste Agent — duplicated agent definitions inflate count while diluting maintainability
Cross-Component Interference in Agent Scaffolds — maximally-equipped multi-component agents lose to smaller subsets in 30-50% of tasks
Multi-Agent Topology Taxonomy — the topology choice that makes agent count look high or low
Over-Orchestrated Agent Architecture — the per-task design error this metric measures the aggregate of: reaching for multi-agent before a single loop has been tried