The Reasoning-Complexity Trade-off

The reasoning-complexity trade-off: more capable models produce more bloated, coupled code, not cleaner architecture — and output volume predicts the decay.

The Finding

Zhu, Tsantalis, and Rigby (2026) audited technical debt in AI-generated software across single-file tasks and agent-generated systems. Three findings:

  • Machine signature of defects — AI-generated code carries a distinct flaw pattern, not a smaller version of human flaws.
  • Reasoning-Complexity Trade-off — capability and architectural quality move in opposite directions.
  • Volume-Quality Inverse Law — code volume is a near-perfect predictor of structural degradation, the same bloat tracked in Abstraction Bloat.

Functional correctness does not predict maintainability. Detailed prompting does not produce smaller, less-coupled code (Zhu et al.).

Why It Matters

The default upgrade path, swapping to the next-generation model, buys capability and pays in maintenance debt. Independent measurements of AI-assisted repos point in the same direction: a 76% rise in LOC and a 39% rise in cognitive complexity, an 8x spike in duplicated blocks between 2021 and 2024, and a refactoring share that fell from 25% to under 10% of commits.

graph LR
    A[Weaker model] --> B[Smaller, simpler output]
    C[Stronger model] --> D[Larger, more comprehensive output]
    B --> E[Lower coupling]
    D --> F[Higher coupling per line]
    E --> G[Maintainable]
    F --> H[Architectural decay]

What Doesn't Fix It

Tests passing. Functional correctness does not predict structural quality (Zhu et al.). A green CI run can coexist with steeply declining maintainability.

Longer prompts. Detailed instructions do not reverse the trend at the model layer. The Fowler/Garg notification case study records a single-channel request returning rate limiting, analytics, and webhooks — features the prompt did not request.

Bigger models. This is the trade-off itself (Zhu et al.).

What Does Help

Workflow gates that operate above the prompt layer:

  • Architectural foresight before generation. Design-first collaboration gates implementation behind explicit approval — no code until the approach is agreed.
  • Volume as a quality signal. Treat output size as a leading indicator; if line count is high relative to the requirement, assume structural degradation until a review shows otherwise.
  • Post-generation cleanup. Entropy-reduction agents and scheduled garbage-collection runs (Fowler/Boeckeler) target bloat that prompt-time controls miss.
  • Deterministic enforcement. Cyclomatic complexity, function-length, and duplication thresholds catch what prompts cannot — see hooks for enforcement vs prompts for guidance.
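The deterministic-enforcement gate above can be sketched with Python's standard `ast` module. This is a minimal illustration, not a replacement for a dedicated linter: the thresholds `MAX_FUNCTION_LINES` and `MAX_BRANCHES` and the function name `check_source` are hypothetical, and the branch count is only a rough proxy for cyclomatic complexity.

```python
import ast

# Hypothetical thresholds (assumptions for illustration); tune per codebase.
MAX_FUNCTION_LINES = 40
MAX_BRANCHES = 10

# Node types counted as decision points: a rough proxy for cyclomatic complexity.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def check_source(source: str) -> list[str]:
    """Return one message per threshold that a function in `source` exceeds."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Function length in source lines, including the `def` line.
        length = node.end_lineno - node.lineno + 1
        # Decision points inside the function (nested functions are included).
        branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
        if length > MAX_FUNCTION_LINES:
            violations.append(f"{node.name}: {length} lines > {MAX_FUNCTION_LINES}")
        if branches + 1 > MAX_BRANCHES:
            violations.append(f"{node.name}: complexity ~{branches + 1} > {MAX_BRANCHES}")
    return violations
```

Run as a pre-commit hook or CI step, a check like this fails the build on a non-empty result regardless of what the prompt asked for, which is the point of placing it above the prompt layer.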

When This Doesn't Apply

The trade-off framing carries little weight where bloat has no maintenance cost:

  • Greenfield throwaway code. One-off scripts, demos, and prototypes are not maintained, so volume-quality drift has no observable cost surface.
  • Templated boilerplate. When LOC inflation comes from explicit scaffolding (CRUD, IaC, test fixtures), volume is not a structural signal — the inverse law's predictive power degrades.
  • Solo small repos. Without long-running maintenance horizons or shared ownership, structural degradation is local and tolerable.

Example

A team is choosing between two models for a billing-rules service. Both pass the test suite for the requested feature: apply a tiered discount across the free/pro/enterprise plans given a customer plan and order total.

Model A — smaller capability tier:

def apply_discount(plan: str, total: float) -> float:
    rates = {"free": 0.0, "pro": 0.05, "enterprise": 0.10}
    return total * (1 - rates.get(plan, 0.0))

Three lines. One function. One responsibility.

Model B — larger capability tier, same prompt:

from abc import ABC, abstractmethod

class DiscountStrategy(ABC):
    @abstractmethod
    def calculate(self, total: float) -> float: ...

class FreeDiscount(DiscountStrategy): ...
class ProDiscount(DiscountStrategy): ...
class EnterpriseDiscount(DiscountStrategy): ...

class DiscountCalculator:
    def __init__(self, strategy: DiscountStrategy): ...
    def apply(self, total: float) -> float: ...

class DiscountFactory:
    @staticmethod
    def create(plan: str) -> DiscountStrategy: ...

class DiscountAuditLog:
    def record(self, plan: str, total: float, applied: float): ...

Seven classes, including an abstract base and an unrequested audit log. Tests pass. The result satisfies the requirement and illustrates the Volume-Quality Inverse Law: the stronger model's output is larger and more coupled, so next sprint's rate-tier change touches three files instead of one.
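To treat volume as a measurable signal rather than an impression, the two outputs can be compared mechanically. The sketch below is an assumption-laden illustration: `volume_metrics` is a hypothetical helper, and the snippets are parsed with the standard `ast` module but never executed, so the missing `ABC` import in Model B's skeleton does not matter here.

```python
import ast


def volume_metrics(source: str) -> dict[str, int]:
    """Count non-blank lines, classes, and functions in a Python snippet."""
    nodes = list(ast.walk(ast.parse(source)))
    return {
        "lines": sum(1 for line in source.splitlines() if line.strip()),
        "classes": sum(isinstance(n, ast.ClassDef) for n in nodes),
        "functions": sum(
            isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) for n in nodes
        ),
    }


MODEL_A = '''
def apply_discount(plan: str, total: float) -> float:
    rates = {"free": 0.0, "pro": 0.05, "enterprise": 0.10}
    return total * (1 - rates.get(plan, 0.0))
'''

MODEL_B = '''
class DiscountStrategy(ABC):
    @abstractmethod
    def calculate(self, total: float) -> float: ...

class FreeDiscount(DiscountStrategy): ...
class ProDiscount(DiscountStrategy): ...
class EnterpriseDiscount(DiscountStrategy): ...

class DiscountCalculator:
    def __init__(self, strategy: DiscountStrategy): ...
    def apply(self, total: float) -> float: ...

class DiscountFactory:
    @staticmethod
    def create(plan: str) -> DiscountStrategy: ...

class DiscountAuditLog:
    def record(self, plan: str, total: float, applied: float): ...
'''

a, b = volume_metrics(MODEL_A), volume_metrics(MODEL_B)
print(a)  # {'lines': 3, 'classes': 0, 'functions': 1}
print(b)  # {'lines': 14, 'classes': 7, 'functions': 5}
```

Model B is still only a skeleton with elided bodies, so these counts understate the gap once the methods are filled in. Counts like these say nothing about correctness; they only flag output that deserves an architectural review before merge.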

Key Takeaways

  • Capability gains in LLMs do not transfer to architectural quality — they trade against it
  • Tests passing and detailed prompting are insufficient countermeasures; the fix is structural, not prompt-level
  • Treat output volume as a leading indicator of structural degradation and gate strong-model output behind architecture-aware checks