Completion Failure Taxonomy
Not every rejected completion is a model failure. Roughly a quarter of real-world completion failures trace to integration problems: when the tool fires, what context it sends, and whether a suggestion was needed at all.
The three failure categories
Code4Me collected 600K+ real completions from 1,200+ developers across 12 languages. The researchers analyzed 8,312 failures and found three categories with stable proportions. [Source: Izadi et al., ICSE 2024]
Completion failure distribution (n = 8,312):

| Category | Share of failures |
|---|---|
| Model-oriented | 66.3% |
| Application-oriented | 24.4% |
| User override | 9.3% |
Model-oriented errors (66.3%)
The model produced wrong output. Two sub-types:
| Sub-type | Count | Examples |
|---|---|---|
| Token-level mistakes | 3,835 | Wrong variable name, incorrect function call, bad literal, wrong type |
| Statement-level errors | 1,676 | Wrong parameter count, incorrect semantics, early/late termination, rambling output |
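The 66.3% figure can be recovered from the two counts in the table and the 8,312 analyzed failures. The short sketch below only reproduces that arithmetic; the function and variable names are illustrative and do not come from the study.

```python
# Counts taken from the table above; the total is the 8,312 analyzed failures.
TOTAL_FAILURES = 8_312
MODEL_ORIENTED = {"token_level": 3_835, "statement_level": 1_676}


def share(count: int, total: int = TOTAL_FAILURES) -> float:
    """Return count as a percentage of all analyzed failures, to one decimal."""
    return round(100 * count / total, 1)


model_share = share(sum(MODEL_ORIENTED.values()))      # both sub-types combined
token_share = share(MODEL_ORIENTED["token_level"])     # token-level mistakes only
print(f"model-oriented: {model_share}%  token-level: {token_share}%")
```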
Better models directly reduce this category. An Accenture deployment of GitHub Copilot reported ~30% acceptance, versus the study's 4.91% on InCoder/UniXcoder/CodeGPT. [Source: GitHub/Accenture study]
Application-oriented errors (24.4%)
The integration layer caused the failure, not the model:
| Sub-type | Count | Implication |
|---|---|---|
| Mid-token invocation | 1,173 | Completion triggered while the developer was mid-keystroke — the partial token corrupted the prompt |
| Insufficient context | 482 | The IDE sent too little surrounding code for the model to produce a useful completion |
| Redundant invocation | 240 | Completion fired when no suggestion was needed — wasting a round-trip and interrupting flow |
Nearly one in four failures had nothing to do with model capability. This is the category agent builders can act on.
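As a rough illustration of how the three sub-types could be caught before a request is sent, the sketch below runs a pre-flight check on the text around the cursor. Everything in it is an assumption made for illustration: the `Invocation` shape, the `preflight` name, the three-line context threshold, and the heuristics themselves (for example, treating a cursor between two identifier characters as mid-token). None of it is the study's method.

```python
import re
from dataclasses import dataclass

IDENT_CHAR = re.compile(r"\w")  # letters, digits, underscore


@dataclass
class Invocation:
    prefix: str  # document text before the cursor (hypothetical shape)
    suffix: str  # document text after the cursor


def preflight(inv: Invocation, min_context_lines: int = 3) -> list[str]:
    """Return the application-oriented risks detected for this invocation."""
    risks = []
    before, after = inv.prefix[-1:], inv.suffix[:1]

    # Mid-token heuristic: the cursor sits between two identifier characters.
    if before and after and IDENT_CHAR.match(before) and IDENT_CHAR.match(after):
        risks.append("mid_token")

    # Insufficient-context heuristic: too few non-blank lines before the cursor.
    context_lines = [ln for ln in inv.prefix.splitlines() if ln.strip()]
    if len(context_lines) < min_context_lines:
        risks.append("insufficient_context")

    # Redundant-invocation heuristic: the current line already ends a statement
    # and nothing but whitespace follows the cursor on that line.
    current_line = inv.prefix.rsplit("\n", 1)[-1]
    if current_line.rstrip().endswith((";", "}")) and not after.strip():
        risks.append("redundant")

    return risks
```

A caller could skip or delay the request whenever the returned list is non-empty, and log the flags so that later telemetry can attribute failures to the integration layer.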
User overrides (9.3%)
The model output was acceptable but rejected:
| Sub-type | Count | Meaning |
|---|---|---|
| Correct but rejected | 605 | Model predicted correctly; developer chose to type it themselves |
| Valid but unpreferred | 112 | Output was functionally correct but didn't match developer's style or intent |
These are not true failures. They reflect the irreducible gap between prediction and developer intent.
The benchmark gap
The study's main finding: offline evaluations substantially misrepresent real-world effectiveness.
| Setting | Metric behavior |
|---|---|
| Offline (synthetic test sets) | Models score well on curated, clean inputs with full context |
| Online (real IDE usage) | 4.91% average acceptance rate across all models and languages |
Corroboration: LLMs achieve 84–89% on synthetic benchmarks but only 25–34% on real-world class-level tasks. [Source: arxiv 2510.26130]
The gap comes from:
- Benchmark inputs are clean; real code has typos, partial expressions, and mid-edit states
- Benchmarks provide full file context; real invocations often have truncated context
- Benchmarks measure correctness; real usage also needs timing and style match
Practical implications for agent builders
1. Audit the integration layer, not just the model
If ~25% of failures are application-oriented, improving the model alone hits diminishing returns. Measure and improve:
- Invocation timing: debounce triggers to avoid mid-token firing
- Context assembly: include surrounding code, imports, and type information
- Relevance gating: suppress completions when editing patterns suggest none is needed
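The invocation-timing item can be sketched as a small debouncer that allows a completion request only after the developer has paused. This is a minimal sketch under stated assumptions: the `Debouncer` class, its method names, and the 150 ms quiet period are all hypothetical, and a real editor integration would hook into its own event loop and clock.

```python
from typing import Optional


class Debouncer:
    """Allow a completion request only after a pause in typing."""

    def __init__(self, quiet_ms: float = 150.0) -> None:
        # quiet_ms is an arbitrary illustrative default, not a measured value.
        self.quiet_ms = quiet_ms
        self._last_keystroke_ms: Optional[float] = None

    def on_keystroke(self, now_ms: float) -> None:
        """Record the time of the most recent keystroke."""
        self._last_keystroke_ms = now_ms

    def should_fire(self, now_ms: float) -> bool:
        """True once at least quiet_ms has passed since the last keystroke."""
        if self._last_keystroke_ms is None:
            return False  # nothing typed yet, so nothing to complete
        return now_ms - self._last_keystroke_ms >= self.quiet_ms
```

Timestamps are passed in explicitly so the behavior can be tested without a real clock; the quiet period is the knob to tune against observed mid-token failures.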
2. Use real-world telemetry for evaluation
Synthetic benchmarks rank models but do not predict user acceptance. Track acceptance rate, time-to-accept, and rejection reasons from actual usage. RepoMasterEval confirms realistic benchmarks correlate with online acceptance rates. [Source: arxiv 2408.03519]
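A minimal sketch of the three metrics named above, computed from logged events. The `CompletionEvent` fields and the `summarize` function are hypothetical names chosen for illustration; real telemetry schemas will differ.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class CompletionEvent:
    accepted: bool
    time_to_decision_ms: float              # shown -> accepted or dismissed
    rejection_reason: Optional[str] = None  # only meaningful when rejected


def summarize(events: list[CompletionEvent]) -> dict:
    """Aggregate acceptance rate, time-to-accept, and rejection reasons."""
    accepted = [e for e in events if e.accepted]
    rejected = [e for e in events if not e.accepted]
    return {
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "median_time_to_accept_ms": (
            median([e.time_to_decision_ms for e in accepted]) if accepted else None
        ),
        # Rejections without a recorded reason are counted as "unknown".
        "rejection_reasons": Counter(e.rejection_reason or "unknown" for e in rejected),
    }
```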
3. Treat user overrides as signal, not noise
Roughly 1 in 10 failures involved output that was correct or valid but unwanted. That is a signal for style mismatches and intent that can drive personalization.
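One way to surface the "correct but rejected" case is to compare a dismissed suggestion with what the developer typed next. The sketch below assumes the telemetry captures both strings, and it uses a whitespace-insensitive prefix match as the test. The function name, the labels, and the matching rule are illustrative assumptions, and the approach cannot tell a wrong suggestion from a valid but unpreferred one.

```python
def _normalize(text: str) -> str:
    """Collapse all runs of whitespace so formatting differences are ignored."""
    return " ".join(text.split())


def label_rejection(suggestion: str, typed_afterwards: str) -> str:
    """Label a rejected suggestion by comparing it with what was typed next."""
    suggested = _normalize(suggestion)
    typed = _normalize(typed_afterwards)
    if not suggested:
        return "empty_suggestion"
    if typed.startswith(suggested):
        # The developer typed the same code by hand: a user override.
        return "correct_but_rejected"
    # Could be a wrong suggestion or a valid but unpreferred one.
    return "differs_from_typed"
```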
4. Language-specific performance varies sharply
InCoder led across 12 languages, but mainstream ones (Python, Java) scored higher than less common ones. Do not assume Python performance predicts Rust or Kotlin — evaluate per-language.
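Per-language evaluation can be as simple as splitting the acceptance rate by language before comparing models. The sketch below assumes events arrive as `(language, accepted)` pairs, which is a hypothetical shape chosen for brevity.

```python
from collections import defaultdict
from typing import Iterable


def acceptance_by_language(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the acceptance rate for each language seen in the events."""
    shown: dict[str, int] = defaultdict(int)
    accepted: dict[str, int] = defaultdict(int)
    for language, was_accepted in events:
        shown[language] += 1
        if was_accepted:
            accepted[language] += 1
    return {language: accepted[language] / shown[language] for language in shown}
```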
When this taxonomy backfires
The 66 / 24 / 9 split is a useful prior, not a fixed budget:
- Ratios are model- and cohort-specific. The study used first-gen code LMs (InCoder, UniXcoder, CodeGPT). Better models shrink the model-oriented share and raise the relative weight of integration errors.
- Integration gains plateau. Smart-invocation work raised acceptance from ~4.9% to ~18.6% [Source: Koohestani et al., arxiv 2405.14753]. Past that, gains come from model capability and context quality, not more timing heuristics.
- Narrow cohorts may skip harness work. Single-language teams on recent models often clear the bar off-the-shelf; in that regime, the argument for simply upgrading the model holds.
- Non-mainstream languages invert priorities. For Rust, Kotlin, or niche DSLs, thin training data dominates; invocation tuning cannot compensate.
- Override data needs good instrumentation. If telemetry cannot separate "rejected because wrong" from "rejected because already typed", the 9.3% bucket is noise.
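The first caveat is easy to see with a little arithmetic. The sketch below assumes, purely for illustration, that a better model removes a given fraction of model-oriented failures while the absolute number of application-oriented and override failures stays unchanged; the study does not make this claim, and the `reweight` function is a hypothetical helper.

```python
# Baseline shares from the study, in percent of analyzed failures.
BASELINE = {"model": 66.3, "application": 24.4, "override": 9.3}


def reweight(shares: dict[str, float], model_error_reduction: float) -> dict[str, float]:
    """Renormalize shares after removing a fraction of model-oriented failures.

    Assumes the other two categories keep their absolute size.
    """
    remaining = dict(shares)
    remaining["model"] *= 1 - model_error_reduction
    total = sum(remaining.values())
    return {name: round(100 * value / total, 1) for name, value in remaining.items()}


halved = reweight(BASELINE, model_error_reduction=0.5)
print(halved)
```

Under that assumption, halving model-oriented failures lifts the application-oriented share from 24.4% to about 36.5% of the remaining failures, which is why integration work matters more as models improve.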
Key Takeaways
- Two-thirds of completion failures are model errors; one quarter are integration failures — fix both
- Mid-token invocation is the single largest application-oriented failure mode (1,173 of 2,030 cases)
- Offline benchmarks systematically overstate real-world completion quality — use telemetry-derived evals
- ~10% of rejected completions were actually correct — user override data is a feedback signal, not waste
Related
- Benchmark-Driven Tool Selection for Code Generation — Why synthetic benchmarks hide language-specific and task-specific weaknesses
- Instruction-Guided Code Completion — When models complete code correctly but ignore structural constraints
- Demo-to-Production Gap — The systematic gap between curated demos and production reality
- pass@k and pass^k Metrics — Separating capability from consistency in agent evaluation
- RAG Agent Reliability Problem Map — Retrieval-specific failure taxonomy that extends completion-failure categories