Completion Failure Taxonomy

Not every rejected completion is a model failure. About a quarter of real-world completion failures trace to integration problems: when the tool fires, what context it sends, and whether a suggestion was needed at all.

The three failure categories

Code4Me collected 600K+ real completions from 1,200+ developers across 12 languages. The researchers analyzed 8,312 failures and found three categories with stable proportions. [Source: Izadi et al., ICSE 2024]

pie title Completion Failure Distribution (n = 8,312)
    "Model-Oriented (66.3%)" : 66.3
    "Application-Oriented (24.4%)" : 24.4
    "User Override (9.3%)" : 9.3

Model-oriented errors (66.3%)

The model produced wrong output. Two sub-types:

Sub-type	Count	Examples
Token-level mistakes	3,835	Wrong variable name, incorrect function call, bad literal, wrong type
Statement-level errors	1,676	Wrong parameter count, incorrect semantics, early/late termination, rambling output

Better models directly reduce this category. An Accenture deployment of GitHub Copilot reported ~30% acceptance, versus the study's 4.91% on InCoder/UniXcoder/CodeGPT. [Source: GitHub/Accenture study]

Application-oriented errors (24.4%)

The integration layer caused the failure, not the model:

Sub-type	Count	Implication
Mid-token invocation	1,173	Completion triggered while the developer was mid-keystroke — the partial token corrupted the prompt
Insufficient context	482	The IDE sent too little surrounding code for the model to produce a useful completion
Redundant invocation	240	Completion fired when no suggestion was needed — wasting a round-trip and interrupting flow

Nearly one in four failures had nothing to do with model capability. This is the category agent builders can act on.

User overrides (9.3%)

The model output was acceptable but rejected:

Sub-type	Count	Meaning
Correct but rejected	605	Model predicted correctly; developer chose to type it themselves
Valid but unpreferred	112	Output was functionally correct but didn't match developer's style or intent

These are not true failures. They mark the irreducible gap between prediction and developer intent.

The benchmark gap

The study's main finding: offline evaluations substantially misrepresent real-world effectiveness.

Setting	Metric behavior
Offline (synthetic test sets)	Models score well on curated, clean inputs with full context
Online (real IDE usage)	4.91% average acceptance rate across all models and languages

Corroboration: LLMs achieve 84–89% on synthetic benchmarks but only 25–34% on real-world class-level tasks. [Source: arxiv 2510.26130]

The gap comes from:

Benchmark inputs are clean; real code has typos, partial expressions, and mid-edit states
Benchmarks provide full file context; real invocations often have truncated context
Benchmarks measure correctness; real usage also needs timing and style match

Practical implications for agent builders

1. Audit the integration layer, not just the model

If ~25% of failures are application-oriented, improving the model alone hits diminishing returns. Measure and improve:

Invocation timing: debounce triggers to avoid mid-token firing
Context assembly: include surrounding code, imports, and type information
Relevance gating: suppress completions when editing patterns suggest none is needed
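The three levers above can be combined into a single gate that runs before each completion request. What follows is a minimal Python sketch, not the study's implementation: `EditorState`, `should_invoke`, `build_context`, the 300 ms debounce, the statement-terminator heuristic, and the Python-style import detection are all illustrative assumptions. Type information is left out for brevity.

```python
import re
from dataclasses import dataclass

IDENT_CHAR = re.compile(r"[A-Za-z0-9_]")
# Hypothetical relevance heuristic: a line ending in one of these is "done".
STATEMENT_END = (";", "}")


@dataclass
class EditorState:
    text_before_cursor: str
    text_after_cursor: str
    ms_since_last_keystroke: int


def is_mid_token(state: EditorState) -> bool:
    """True when the cursor splits an identifier or number in two."""
    before = state.text_before_cursor[-1:]
    after = state.text_after_cursor[:1]
    return bool(IDENT_CHAR.match(before) and IDENT_CHAR.match(after))


def should_invoke(state: EditorState, debounce_ms: int = 300) -> bool:
    """Decide whether to request a completion for the current editor state."""
    # Invocation timing: the developer is still typing, so wait.
    if state.ms_since_last_keystroke < debounce_ms:
        return False
    # Mid-token guard: a split token would corrupt the prompt.
    if is_mid_token(state):
        return False
    # Relevance gating: the current line already looks complete.
    current_line = state.text_before_cursor.rsplit("\n", 1)[-1].rstrip()
    if current_line.endswith(STATEMENT_END):
        return False
    return True


def build_context(source: str, cursor: int, window: int = 2000) -> dict:
    """Assemble imports plus a window of code on both sides of the cursor."""
    imports = [
        line for line in source.splitlines()
        if line.lstrip().startswith(("import ", "from "))
    ]
    return {
        "imports": imports,
        "prefix": source[max(0, cursor - window):cursor],
        "suffix": source[cursor:cursor + window],
    }
```

The thresholds here are placeholders. In practice they would be tuned against logged acceptance data rather than fixed up front.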

2. Use real-world telemetry for evaluation

Synthetic benchmarks rank models but do not predict user acceptance. Track acceptance rate, time-to-accept, and rejection reasons from actual usage. RepoMasterEval confirms realistic benchmarks correlate with online acceptance rates. [Source: arxiv 2408.03519]
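As a rough illustration of the three signals named above, the sketch below summarizes a batch of logged completion events. The `CompletionEvent` schema and the `rejection_reason` labels are hypothetical. Real telemetry will have its own fields.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class CompletionEvent:
    accepted: bool
    ms_to_decision: int                      # time from display to accept/reject
    rejection_reason: Optional[str] = None   # hypothetical label, may be missing


def summarize(events: list[CompletionEvent]) -> dict:
    """Compute acceptance rate, median time-to-accept, and rejection reasons."""
    accepted = [e for e in events if e.accepted]
    rejected = [e for e in events if not e.accepted]
    return {
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "median_ms_to_accept": (
            median(e.ms_to_decision for e in accepted) if accepted else None
        ),
        # Unlabeled rejections are counted explicitly so gaps stay visible.
        "rejection_reasons": Counter(
            e.rejection_reason or "unknown" for e in rejected
        ),
    }
```

Counting unlabeled rejections as `"unknown"` rather than dropping them shows how much of the rejection data is actually usable.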

3. Treat user overrides as signal, not noise

Roughly 1 in 10 analyzed failures (9.3%) was an acceptable completion that the developer rejected anyway. Treat these as a signal of style mismatch and unmet intent that can drive personalization.
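One way to surface overrides in telemetry is to compare a rejected suggestion with what the developer typed next. The sketch below assumes that follow-up text is available. The `label_rejection` function, its labels, and the 0.8 similarity threshold are hypothetical, and a near match still needs review because it can be either a style difference or a small model error.

```python
from difflib import SequenceMatcher


def _normalize(code: str) -> str:
    """Collapse whitespace so formatting differences do not count as edits."""
    return " ".join(code.split())


def label_rejection(suggestion: str, typed_after: str,
                    near_threshold: float = 0.8) -> str:
    """Label a rejected suggestion by comparing it with what was typed next."""
    suggested = _normalize(suggestion)
    typed = _normalize(typed_after)
    if not typed:
        return "unknown"                 # nothing to compare against
    if suggested == typed:
        return "correct_but_rejected"    # developer typed the same code
    similarity = SequenceMatcher(None, suggested, typed).ratio()
    if similarity >= near_threshold:
        return "near_match"              # style mismatch or small error
    return "model_error"
```

For example, `label_rejection("return a + b", "return a  +  b")` returns `"correct_but_rejected"`, because the two strings differ only in whitespace.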

4. Language-specific performance varies sharply

InCoder led across the 12 languages, but acceptance was higher for mainstream languages (Python, Java) than for less common ones. Do not assume Python performance predicts Rust or Kotlin; evaluate per language.
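A per-language breakdown can be sketched in a few lines of Python. The `(language, accepted)` input format and the `min_samples` cutoff are assumptions for illustration. The cutoff keeps thinly sampled languages from being reported as if they were reliable.

```python
from collections import defaultdict
from typing import Iterable, Optional


def acceptance_by_language(
    events: Iterable[tuple[str, bool]], min_samples: int = 30
) -> dict[str, Optional[float]]:
    """Return the acceptance rate per language, or None when data is too thin."""
    shown: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for language, was_accepted in events:
        shown[language] += 1
        accepted[language] += int(was_accepted)
    return {
        language: (accepted[language] / count if count >= min_samples else None)
        for language, count in shown.items()
    }
```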

When this taxonomy backfires

The 66 / 24 / 9 split is a useful prior, not a fixed budget:

Ratios are model- and cohort-specific. The study used first-gen code LMs (InCoder, UniXcoder, CodeGPT). Better models shrink the model-oriented share and raise the relative weight of integration errors.
Integration gains plateau. Smart-invocation work raised acceptance from ~4.9% to ~18.6% [Source: Koohestani et al., arxiv 2405.14753]. Past that, gains come from model capability and context quality, not more timing heuristics.
Narrow cohorts may skip harness work. Single-language teams on recent models often clear the bar off-the-shelf; in that regime, the "just upgrade the model" argument holds.
Non-mainstream languages invert priorities. For Rust, Kotlin, or niche DSLs, thin training data dominates; invocation tuning cannot compensate.
Override data needs good instrumentation. If telemetry cannot separate "rejected because wrong" from "rejected because already typed", the 9.3% bucket is noise.
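The first caveat can be made concrete with a little arithmetic. The Python sketch below starts from the study's 66.3 / 24.4 / 9.3 split and assumes, purely hypothetically, that a better model removes half of the model-oriented failures while the other two categories keep the same absolute volume. Under that assumption the application-oriented share rises from 24.4% to about 36.5%.

```python
STUDY_SHARES = {
    "model_oriented": 66.3,
    "application_oriented": 24.4,
    "user_override": 9.3,
}


def reweight_shares(shares: dict[str, float],
                    model_error_reduction: float) -> dict[str, float]:
    """Re-normalize category shares after removing a fraction of model errors.

    Assumes (hypothetically) that the other categories are unaffected.
    """
    remaining = dict(shares)
    remaining["model_oriented"] *= 1.0 - model_error_reduction
    total = sum(remaining.values())
    return {category: value / total for category, value in remaining.items()}


# Hypothetical scenario: a stronger model halves model-oriented failures.
new_shares = reweight_shares(STUDY_SHARES, model_error_reduction=0.5)
for category, share in new_shares.items():
    print(f"{category}: {share:.1%}")
```

The halving figure is not a measured result. The point is only that the same integration failures take up a larger share of what remains once the model improves.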

Key Takeaways

Two-thirds of completion failures are model errors; one quarter are integration failures — fix both
Mid-token invocation is the single largest application-oriented failure mode (1,173 of 2,030 cases)
Offline benchmarks systematically overstate real-world completion quality — use telemetry-derived evals
~9% of analyzed failures were acceptable completions that the developer overrode; treat override data as a feedback signal, not waste

Benchmark-Driven Tool Selection for Code Generation — Why synthetic benchmarks hide language-specific and task-specific weaknesses
Instruction-Guided Code Completion — When models complete code correctly but ignore structural constraints
Demo-to-Production Gap — The systematic gap between curated demos and production reality
pass@k and pass^k Metrics — Separating capability from consistency in agent evaluation
RAG Agent Reliability Problem Map — Retrieval-specific failure taxonomy that extends completion-failure categories