Density-Normalized Quality Metrics Mask AI-Driven Code Growth

A density-normalized quality metric falls when AI adoption inflates the denominator faster than smells grow — the ratio reports code growth, not improvement.

Density-normalized quality metrics, such as architectural smells per KLOC, warnings per file, or complexity per method, appear to fall after AI coding assistants are adopted, and platform teams cite the drop as evidence that the tools improve code. A 151-repo causal study of Java codebases found the apparent improvement is arithmetic: smell counts stay flat (+1.1%, p = 0.82) while lines of code grow +12.8% (p = 0.003), which produces the headline −6.7% density figure (p = 0.004) without any net reduction in architectural smells (Larsen & Moghaddam, 2026, arxiv:2606.13298).

The Pattern

The metric ships as a single ratio — smell_count / loc, warnings / files, complexity / methods — and the period-over-period delta is presented as the quality signal. Adoption decks pair the falling ratio with the AI rollout date and infer causation. The numerator and denominator are rarely shown alongside the ratio, so readers cannot tell which moved.
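
A minimal sketch of how a single ratio hides which term moved. The helper names `density_per_kloc` and `pct_change`, and both snapshots, are hypothetical and chosen only to illustrate the arithmetic; they are not figures from the cited study.

```python
def density_per_kloc(smell_count: int, loc: int) -> float:
    """Smells per thousand lines of code."""
    if loc <= 0:
        raise ValueError("loc must be positive")
    return smell_count / (loc / 1000)


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


# Hypothetical snapshots: the smell count is identical, only LOC grows.
before = {"smells": 140, "loc": 100_000}
after = {"smells": 140, "loc": 110_000}

density_before = density_per_kloc(before["smells"], before["loc"])
density_after = density_per_kloc(after["smells"], after["loc"])

smell_delta = pct_change(before["smells"], after["smells"])
loc_delta = pct_change(before["loc"], after["loc"])
density_delta = pct_change(density_before, density_after)

# A dashboard that prints only the last line reports an "improvement".
print(f"smells:  {smell_delta:+.1f}%")
print(f"loc:     {loc_delta:+.1f}%")
print(f"density: {density_delta:+.1f}%")
```

With an unchanged smell count and 10% more code, the density falls by roughly 9%, even though nothing in the numerator moved.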

Why It Fails

A causal estimator can hold everything else constant and the ratio still misleads, because the denominator is part of the treatment. The Larsen & Moghaddam study used a staggered difference-in-differences design with the Borusyak imputation estimator across 1,811 monthly Arcan snapshots of 74 agentic-AI-adopting Java repos against 77 propensity-matched controls; pre-trends were flat (Wald p = 0.90) and wild cluster bootstrap, Lee bounds, and stale-observation checks all held (Larsen & Moghaddam, 2026). The clean design still cannot rescue the ratio — the authors' own warning is explicit: "density-normalized outcomes can mislead when treatment affects system size."
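
The warning follows from the ratio's own algebra. As a sketch, write density as $d = s / L$, with $s$ the smell count and $L$ the lines of code (symbols introduced here for illustration, not taken from the paper):

```latex
\[
  d = \frac{s}{L}
  \quad\Longrightarrow\quad
  \Delta \log d = \Delta \log s - \Delta \log L .
\]
\[
  \text{If } \Delta \log s \approx 0 \text{ and } \Delta \log L > 0,
  \text{ then } \Delta \log d \approx -\Delta \log L < 0 .
\]
```

Any treatment effect on $L$ enters the density with a negative sign even when $s$ is untouched. The identity is exact within a single repo; effects that are estimated separately and averaged across repos need not sum exactly.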

An independent MSR '26 DiD study of Cursor adoption reports a complementary pattern: a "statistically significant, large, but transient" velocity gain paired with a "substantial and persistent" rise in static-analysis warnings and complexity (Wang et al., 2025, arxiv:2511.04427). Taken together, the two papers point the same way: AI grows the codebase faster than it grows architectural debt, and reports that frame this as a quality win are reading the denominator.

Why It Works

The ratio survives because it is the canonical cross-repo comparator — without normalization, a 10k-LOC repo and a 100k-LOC repo cannot be compared at all, and pre-AI tooling correctly treated density as a quality measure when the denominator drifted slowly. AI adoption broke that assumption: when treatment inflates the denominator faster than the numerator, the ratio crosses from quality signal to artifact, and no internal property of the ratio flags the transition.

Substitute Metrics

Report the decomposition, not the ratio:

- Raw numerator and denominator alongside any density figure. Smell count and LOC, warning count and file count, complexity and method count: publish them together so the reader sees which moved. The Larsen & Moghaddam recommendation is "raw counts and explicit decomposition" (Larsen & Moghaddam, 2026).
- Period-over-period delta on the numerator alone. A flat or rising raw smell count is the quality signal; a falling density with a flat numerator is the artifact warning.
- Industry baselines for the denominator. GitClear's 2025 longitudinal analysis of 211M changed lines found AI-era refactor share fell from 25% to <10% while copy-paste share rose from 8.3% to 12.3%, denominator-inflating patterns documented at scale (GitClear, 2025). A density drop in a repo following the industry trend is presumptively artifact until decomposed.
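
One way the decomposition could be wired into a report is sketched below. The `Snapshot` type, the `decompose` function, the `flat_band_pct` tolerance, and the sample counts are all hypothetical; a real pipeline would choose its own threshold and ideally a statistical test rather than a fixed band.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    smells: int  # raw numerator
    loc: int     # raw denominator


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100


def decompose(before: Snapshot, after: Snapshot, flat_band_pct: float = 2.0) -> dict:
    """Report numerator, denominator, and density deltas together.

    flat_band_pct is an assumed tolerance: numerator moves within
    +/- this many percent are treated as flat.
    """
    smells_delta = pct_change(before.smells, after.smells)
    loc_delta = pct_change(before.loc, after.loc)
    density_delta = pct_change(before.smells / before.loc,
                               after.smells / after.loc)

    if density_delta >= 0:
        reading = "density did not fall"
    elif smells_delta < -flat_band_pct:
        reading = "numerator fell: candidate quality improvement"
    elif smells_delta > flat_band_pct:
        reading = "numerator rose: density fell only because LOC grew faster"
    else:
        reading = "numerator flat: presumptive denominator artifact"

    return {
        "smells_delta_pct": smells_delta,
        "loc_delta_pct": loc_delta,
        "density_delta_pct": density_delta,
        "reading": reading,
    }


# Hypothetical repo: five extra smells, 50k extra lines.
report = decompose(Snapshot(smells=500, loc=400_000),
                   Snapshot(smells=505, loc=450_000))
for key, value in report.items():
    print(f"{key}: {value:+.1f}" if isinstance(value, float) else f"{key}: {value}")
```

The design choice is that the density delta is never returned without the two raw deltas and a reading next to it, so a consumer cannot quote the ratio on its own.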

When This Backfires

- Net-deletion AI usage. Teams using AI for refactoring sweeps, dead-code removal, or migration consolidation may see LOC flat or shrinking; density changes there track real architectural movement and decomposition adds noise without value.
- High-baseline-smell repos. A codebase entering AI adoption with already-saturated absolute smell counts can show genuine density falls as new LOC accretes against a fixed numerator. The decomposition shows the same story but does not falsify the ratio.
- Tech stacks without mature smell detection. Arcan covers Java; for Go, Rust, or modern TypeScript-first stacks where architectural-smell tooling is weak, the numerator becomes noisy enough that neither density nor count is reliable. Decomposition does not rescue an untrustworthy numerator.

Example

Before — reporting a single ratio:

Q2 architecture review: AI adoption update
- Architectural smell density: -6.7% YoY (1.43 → 1.33 smells/KLOC)
- Statistically significant (p = 0.004)
- Conclusion: agentic AI adoption is improving architectural quality

The ratio fell, the p-value clears the significance threshold, and the conclusion appears to follow. But the numerator and denominator are absent, so the reader cannot tell that the smell count was flat and LOC grew about 13%.

After — reporting the decomposition:

Q2 architecture review: AI adoption update
- Total architectural smells: +1.1% YoY (n.s., p = 0.82)
- Lines of code: +12.8% YoY (p = 0.003)
- Derived smell density: -6.7% (denominator-driven; do not read as quality signal)
- Conclusion: smell count did not change; codebase grew. AI adoption is not
  improving architectural quality at the repo level over this window.

The data are the same, but the conclusion inverts. The decomposition exposes that the density delta is downstream of LOC growth, not architectural cleanup.

Key Takeaways

- A causal study of 151 Java repos shows agentic AI adoption leaves smell counts flat (+1.1%) while LOC grows +12.8%; the apparent −6.7% density "improvement" is denominator inflation, not architectural cleanup.
- Density-normalized metrics break as quality signals when treatment inflates the denominator faster than the numerator; the canonical pre-AI assumption that the denominator drifts slowly no longer holds.
- Always report the raw numerator and denominator alongside any quality density, and flag a density drop with a flat numerator as presumptively artifact until decomposed.
- The denominator artifact runs in both directions: single-ratio velocity, productivity, and quality dashboards built on X / LOC need the same decomposition discipline.

- Agent Headcount as a Vanity Metric: adjacent measurement failure where the easy-to-count number gets cited as outcome evidence
- Shadow Tech Debt: the architectural drift that flat smell counts can still understate when AI bypasses structural understanding
- LLM Code Review Overcorrection: companion misreading where the review signal is the artifact, not the code
- The Reasoning-Complexity Trade-off: stronger models produce more bloated and coupled code; corroborates the LOC-inflation half of this anti-pattern
- Vibe Coding: the consumption shape that drives the LOC-inflation denominator behind density artifacts