Demo-to-Production Gap: When Demos Hide Real Costs

Agent demos curate inputs and ignore edge cases. Production brings scale, security constraints, partial context, and failing tools. Teams systematically underestimate the gap.

Why Demos Mislead

Demos use curated inputs, full context, and reliable tools. Production exposes what demos hide (HumAI):

| Demo condition | Production reality |
|---|---|
| Curated, well-formed inputs | Adversarial, malformed, unexpected inputs |
| Small scale, no cost pressure | Rate limiting, concurrency, cost management |
| Tools always succeed | Tool failures, timeouts, partial results |
| Full, fresh context | Partial, stale, or conflicting context |
| Single happy path | Edge cases, error recovery, rollback |
| 80% success rate is impressive | 80% means 1 in 5 requests fails (ODSC) |

Compound Error Amplification

Per-step accuracy compounds multiplicatively. At 90% accuracy per step, a 10-step workflow succeeds end to end only about 35% of the time (0.9^10 ≈ 0.35), assuming step failures are independent. Each step's output becomes the next step's input, so a wrong intermediate result propagates forward; without per-step validation, later steps cannot recover from it. Production workflows with more than a few steps face steep end-to-end accuracy decay.

graph LR
    A["Step 1<br/>90%"] --> B["Step 2<br/>81%"]
    B --> C["Step 3<br/>73%"]
    C --> D["Step 4<br/>66%"]
    D --> E["Step 5<br/>59%"]
    E --> F["Step 10<br/>~35%"]

    style A fill:#2d6a4f,color:#fff
    style B fill:#40916c,color:#fff
    style C fill:#52b788,color:#000
    style D fill:#95d5b2,color:#000
    style E fill:#d8f3dc,color:#000
    style F fill:#ffb3b3,color:#000
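
The decay in the diagram can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step has the same accuracy and that steps fail independently; the function names `end_to_end_success` and `max_steps_for_target` are illustrative, not taken from any library.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


def max_steps_for_target(per_step_accuracy: float, target: float) -> int:
    """Largest chain length that keeps end-to-end success at or above `target`."""
    if not 0.0 < per_step_accuracy < 1.0:
        raise ValueError("per_step_accuracy must be strictly between 0 and 1")
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    steps = 0
    # Add steps one at a time until the next one would drop below the target.
    while end_to_end_success(per_step_accuracy, steps + 1) >= target:
        steps += 1
    return steps


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5, 10):
        print(f"{n:>2} steps at 90%: {end_to_end_success(0.9, n):.0%}")
```

Read the other way, `max_steps_for_target(0.9, 0.5)` shows how short a chain must stay before a 90%-accurate step pushes the workflow below even odds.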

Failure Modes

Production agents fail in patterns demos never exercise:

  • Doom loops. Agents fixate on a failed approach, making 10+ repetitive variations without reconsidering, consuming 10x expected cost (LangChain).

  • Context rot. Recall accuracy drops non-linearly as context fills. Compression causes objective drift where agents declare tasks complete or request unnecessary clarification (Anthropic).

  • Premature completion. Agents report "done" on partial work. Long-running tasks hit this reliably (Anthropic).

  • Tool output injection. User-provided data, web content, or logs can steer agent actions. The risk peaks under the Lethal Trifecta: private data, untrusted content, and an exfiltration path in the same session (nibzard).

The Numbers

| Metric | Value | Source |
|---|---|---|
| AI PRs: bug rate vs human PRs | 1.7x more bugs | Stack Overflow |
| Logic/correctness errors per 100 PRs | 75% more | Stack Overflow |
| Security vulnerabilities | 1.5-2x more | Stack Overflow |
| Teams citing quality as top blocker | 32% | LangChain Survey |
| Agents in production with offline evals | 52% | LangChain Survey |

Engineering Countermeasures

The fix is harness engineering, not better prompts:

  • Loop detection. Monitor for repeated tool calls; force reconsideration on doom loops (LangChain).
  • Pre-completion checklists. Verify completion criteria before reporting done (Anthropic).
  • Deterministic validation. Test suites, linters, and type checkers as ground-truth (Simon Willison).
  • Production-representative evals. Include malformed inputs, tool failures, and adversarial content.
  • Cost guards. Per-task token budgets; kill sessions exceeding budget.
  • Bounded sessions. Checkpoint between steps; avoid unbounded execution.
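
Two of the countermeasures above, loop detection and cost guards, can be sketched as a small guard object that the harness consults around every tool call. Everything here is hypothetical: `HarnessGuard`, `BudgetExceeded`, and the thresholds are illustrative names and values, not part of any agent framework. The loop check only catches identical repeated calls; near-duplicate variations would need fuzzier matching.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a task spends more tokens than its budget allows."""


@dataclass
class HarnessGuard:
    token_budget: int        # hypothetical per-task token ceiling
    max_repeats: int = 3     # identical calls tolerated before intervening
    tokens_used: int = 0
    _seen: Counter = field(default_factory=Counter, init=False, repr=False)

    def record_tokens(self, count: int) -> None:
        """Add to the running total and stop the session once over budget."""
        self.tokens_used += count
        if self.tokens_used > self.token_budget:
            raise BudgetExceeded(
                f"used {self.tokens_used} of {self.token_budget} tokens"
            )

    def is_looping(self, tool: str, args: dict) -> bool:
        """Return True once the same tool call has been made max_repeats times."""
        # Serialize args with sorted keys so argument order does not matter.
        key = (tool, json.dumps(args, sort_keys=True, default=str))
        self._seen[key] += 1
        return self._seen[key] >= self.max_repeats
```

A harness might call `is_looping` before dispatching each tool call and, when it returns True, inject a message asking the agent to reconsider its approach instead of retrying. `record_tokens` would run after each model response.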

Example

A team demos a code-review agent on 5 clean PRs — all pass. Per-step accuracy looks like 95%. They deploy to 200 PRs/day.

Production reality: PRs include merge conflicts and binary files (tool failures), batch runs hit rate limits (concurrency), long PRs overflow context and the agent declares "no issues found" on truncated diffs (context rot), and a malicious PR description instructs the agent to approve all files unconditionally (tool output injection).

At 95% per-step accuracy over an 8-step workflow, end-to-end success is 0.95^8 ≈ 66%, so roughly one review in three goes wrong somewhere. Fixes: run evals on production-representative samples, add a pre-completion checklist verifying that all files were reviewed, and reject oversized diffs that exceed a token budget.
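
The last two fixes might look like the sketch below. It rests on stated assumptions: `estimate_tokens` uses a crude four-characters-per-token heuristic as a stand-in for a real tokenizer, and `CheckResult`, `precheck_diff`, and `completion_check` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    accepted: bool
    reason: str


def estimate_tokens(text: str) -> int:
    """Rough token estimate. Assumption: about 4 characters per token."""
    return (len(text) + 3) // 4


def precheck_diff(diff_text: str, token_budget: int) -> CheckResult:
    """Reject a diff up front rather than reviewing a silently truncated one."""
    tokens = estimate_tokens(diff_text)
    if tokens > token_budget:
        return CheckResult(False, f"diff is ~{tokens} tokens, budget is {token_budget}")
    return CheckResult(True, "diff fits within budget")


def completion_check(files_in_pr: set[str], files_reviewed: set[str]) -> CheckResult:
    """Only allow a 'done' report when every file in the PR was reviewed."""
    missing = sorted(files_in_pr - files_reviewed)
    if missing:
        return CheckResult(False, "unreviewed files: " + ", ".join(missing))
    return CheckResult(True, "all files reviewed")
```

Both checks are deterministic and run outside the model, so a "no issues found" verdict on a truncated or partially reviewed PR is blocked by the harness rather than trusted.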

When This Backfires

Applying full harness engineering to simple, single-step, or heavily supervised workflows is over-engineering. Compound error decay only applies when steps chain automatically without per-step validation. Reserve these countermeasures for workflows with: (1) 5+ sequential automated steps, (2) tool calls that depend on prior tool outputs, and (3) no mandatory human checkpoints between steps.
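
The three criteria can be written down as an explicit gate so that the decision is recorded rather than argued case by case. This is only a sketch: `WorkflowProfile` and `needs_full_harness` are hypothetical names, and the threshold of five steps is taken directly from the criteria above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowProfile:
    sequential_automated_steps: int
    tools_depend_on_prior_outputs: bool
    has_mandatory_human_checkpoints: bool


def needs_full_harness(workflow: WorkflowProfile) -> bool:
    """True only when all three criteria from the text hold at once."""
    return (
        workflow.sequential_automated_steps >= 5
        and workflow.tools_depend_on_prior_outputs
        and not workflow.has_mandatory_human_checkpoints
    )
```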
