Demo-to-Production Gap: When Demos Hide Real Costs

Agent demos curate inputs and ignore edge cases. Production has to cope with scale, security constraints, partial context, and failing tools. Teams systematically underestimate the gap between the two.

Why Demos Mislead

Demos use curated inputs, full context, and reliable tools. Production exposes what demos hide (HumAI):

Demo condition	Production reality
Curated, well-formed inputs	Adversarial, malformed, unexpected inputs
Small scale, no cost pressure	Rate limiting, concurrency, cost management
Tools always succeed	Tool failures, timeouts, partial results
Full, fresh context	Partial, stale, or conflicting context
Single happy path	Edge cases, error recovery, rollback
An 80% success rate looks impressive	An 80% success rate means 1 in 5 requests fails (ODSC)

Compound Error Amplification

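Per-step accuracy compounds multiplicatively: at 90% per step, a 10-step chain succeeds end to end only about 35% of the time (0.9^10 ≈ 0.35). Each step's output becomes the next step's input, so a wrong intermediate result propagates forward, and without per-step validation it is never caught or corrected downstream. End-to-end accuracy therefore decays steeply in production workflows with more than a few steps. The diagram shows cumulative success after each step.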

graph LR
    A["Step 1<br/>90%"] --> B["Step 2<br/>81%"]
    B --> C["Step 3<br/>73%"]
    C --> D["Step 4<br/>66%"]
    D --> E["Step 5<br/>59%"]
    E --> F["Step 10<br/>~35%"]

    style A fill:#2d6a4f,color:#fff
    style B fill:#40916c,color:#fff
    style C fill:#52b788,color:#000
    style D fill:#95d5b2,color:#000
    style E fill:#d8f3dc,color:#000
    style F fill:#ffb3b3,color:#000
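The decay in the diagram can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step succeeds independently with the same probability and that no step recovers from an earlier error; the function names are illustrative and not taken from any library.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps and no recovery."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def max_steps_for_target(per_step: float, target: float) -> int:
    """Largest chain length whose end-to-end success stays at or above target."""
    # per_step must be strictly below 1, otherwise the chain never decays.
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    while end_to_end_success(per_step, steps + 1) >= target:
        steps += 1
    return steps


if __name__ == "__main__":
    # Same step counts as the diagram, at 90% per-step accuracy.
    for n in (1, 2, 3, 4, 5, 10):
        print(f"step {n:>2}: {end_to_end_success(0.9, n):.0%}")
```

Under the same assumptions, `max_steps_for_target(0.9, 0.5)` returns 6: at 90% per step, a seventh chained step pushes end-to-end success below 50%.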

Failure Modes

Production agents fail in patterns demos never exercise:

Doom loops. Agents fixate on a failed approach, making 10+ repetitive variations without reconsidering, consuming 10x expected cost (LangChain).
Context rot. Recall accuracy drops non-linearly as context fills. Compression causes objective drift where agents declare tasks complete or request unnecessary clarification (Anthropic).
Premature completion. Agents report "done" on partial work. Long-running tasks hit this reliably (Anthropic).
Tool output injection. User-provided data, web content, or logs can steer agent actions. The risk is greatest under the Lethal Trifecta: access to private data, exposure to untrusted content, and a channel for exfiltration (nibzard).

The Numbers

Metric	Value	Source
Bugs in AI PRs vs human PRs	1.7x more	Stack Overflow
Logic/correctness errors per 100 PRs (AI vs human)	75% more	Stack Overflow
Security vulnerabilities (AI vs human)	1.5-2x more	Stack Overflow
Teams citing quality as top blocker	32%	LangChain Survey
Agents in production with offline evals	52%	LangChain Survey

Engineering Countermeasures

The fix is harness engineering, not better prompts:

Loop detection. Monitor for repeated tool calls; force reconsideration on doom loops (LangChain).
Pre-completion checklists. Verify completion criteria before reporting done (Anthropic).
Deterministic validation. Use test suites, linters, and type checkers as ground truth (Simon Willison).
Production-representative evals. Include malformed inputs, tool failures, and adversarial content.
Cost guards. Per-task token budgets; kill sessions exceeding budget.
Bounded sessions. Checkpoint between steps; avoid unbounded execution.
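Two of the countermeasures above, loop detection and cost guards, can be sketched as a small guard object that the harness consults around every model and tool call. This is an illustration under stated assumptions, not any framework's API: `HarnessGuard`, `BudgetExceeded`, and the default thresholds are hypothetical names and values chosen for the example.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than its budget allows."""


@dataclass
class HarnessGuard:
    token_budget: int      # hypothetical per-task token budget
    max_repeats: int = 3   # identical calls tolerated before flagging a loop
    tokens_used: int = 0
    _calls: Counter = field(default_factory=Counter, repr=False)

    def record_tokens(self, count: int) -> None:
        """Add tokens spent by one model call; stop the task once over budget."""
        self.tokens_used += count
        if self.tokens_used > self.token_budget:
            raise BudgetExceeded(
                f"used {self.tokens_used} tokens, budget is {self.token_budget}"
            )

    def record_tool_call(self, name: str, args: dict) -> bool:
        """Return True once the same call has been made max_repeats times."""
        # Serialize arguments with sorted keys so key order does not matter.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        self._calls[key] += 1
        return self._calls[key] >= self.max_repeats
```

When `record_tool_call` returns `True`, the harness can inject a message asking the agent to reconsider its approach rather than retry. Exact-match counting only catches literally repeated calls; detecting near-identical variations would need a similarity measure, which this sketch leaves out.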

Example

A team demos a code-review agent on 5 clean PRs, and all pass. Per-step accuracy appears to be 95%. They then deploy it against 200 PRs/day.

Production reality: PRs include merge conflicts and binary files (tool failures), batch runs hit rate limits (concurrency), long PRs overflow context and the agent declares "no issues found" on truncated diffs (context rot), and a malicious PR description instructs the agent to approve all files unconditionally (tool output injection).

At 95% per-step accuracy over an 8-step workflow, end-to-end success is 0.95^8 ≈ 66%, so roughly one-third of reviews are wrong. At 200 PRs/day, that is about 67 flawed reviews every day. Fixes: evaluate on production-representative samples, add a pre-completion checklist verifying that all files were reviewed, and reject diffs that exceed a token budget.
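The last two fixes can be expressed as one deterministic check that runs before the agent is allowed to report a result. The sketch below assumes the harness already knows which files the PR touches, which files the agent actually read, and the token count of the diff; `ReviewResult`, `completion_problems`, and the budget value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ReviewResult:
    files_in_pr: set[str]      # every file the PR touches
    files_reviewed: set[str]   # files the agent actually inspected
    diff_tokens: int           # size of the diff handed to the agent


def completion_problems(result: ReviewResult, max_diff_tokens: int) -> list[str]:
    """Return reasons the review must not be reported as complete (empty if none)."""
    problems = []
    # An oversized diff is likely to be truncated, so "no issues found" is unreliable.
    if result.diff_tokens > max_diff_tokens:
        problems.append(
            f"diff of {result.diff_tokens} tokens exceeds budget of {max_diff_tokens}"
        )
    # Any file the agent never looked at means the review is partial.
    missing = sorted(result.files_in_pr - result.files_reviewed)
    if missing:
        problems.append("unreviewed files: " + ", ".join(missing))
    return problems
```

An empty list means the review can be reported. Anything else should block the "no issues found" verdict and route the PR to a human or to a chunked re-run.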

When This Backfires

Applying full harness engineering to simple, single-step, or heavily supervised workflows is over-engineering. Compound error decay applies only when steps chain automatically without per-step validation. Reserve these countermeasures for workflows that have (1) five or more sequential automated steps, (2) tool calls that depend on prior tool outputs, and (3) no mandatory human checkpoints between steps.
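The three criteria can be written down as a simple gate, for example in a design-review checklist or a project template. This is only a sketch of the rule of thumb stated above; the function and parameter names are hypothetical.

```python
def warrants_full_harness(
    sequential_automated_steps: int,
    tools_depend_on_prior_outputs: bool,
    human_checkpoints_between_steps: bool,
) -> bool:
    """True only when all three criteria from the text hold."""
    return (
        sequential_automated_steps >= 5
        and tools_depend_on_prior_outputs
        and not human_checkpoints_between_steps
    )
```

For instance, an 8-step unattended pipeline with chained tool calls qualifies, while the same pipeline with a mandatory human approval between steps does not.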