
Completion Failure Taxonomy

Not every rejected completion is a model failure. A quarter of real-world completion failures trace to integration problems — when the tool fires, what context it sends, and whether the suggestion was even needed.

The three failure categories

Code4Me collected 600K+ real completions from 1,200+ developers across 12 languages. The researchers analyzed 8,312 failures and found three categories with stable proportions. [Source: Izadi et al., ICSE 2024]

```mermaid
pie title Completion Failure Distribution (n = 8,312)
    "Model-Oriented (66.3%)" : 66.3
    "Application-Oriented (24.4%)" : 24.4
    "User Override (9.3%)" : 9.3
```

Model-oriented errors (66.3%)

The model produced wrong output. Two sub-types:

| Sub-type | Count | Examples |
|---|---|---|
| Token-level mistakes | 3,835 | Wrong variable name, incorrect function call, bad literal, wrong type |
| Statement-level errors | 1,676 | Wrong parameter count, incorrect semantics, early/late termination, rambling output |
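
The 66.3% share follows directly from the two sub-type counts and the 8,312 analyzed failures. A quick arithmetic check, with illustrative variable names:

```python
# Counts are taken from the table above; the total is the number of
# analyzed failures reported for the study.
TOTAL_FAILURES = 8_312
model_oriented = {
    "token_level": 3_835,
    "statement_level": 1_676,
}

model_total = sum(model_oriented.values())
model_share = 100 * model_total / TOTAL_FAILURES

print(f"{model_total} of {TOTAL_FAILURES} = {model_share:.1f}%")
# prints: 5511 of 8312 = 66.3%
```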

Better models directly reduce this category. An Accenture deployment of GitHub Copilot reported ~30% acceptance, versus the study's 4.91% on InCoder/UniXcoder/CodeGPT. [Source: GitHub/Accenture study]

Application-oriented errors (24.4%)

The integration layer caused the failure, not the model:

| Sub-type | Count | Implication |
|---|---|---|
| Mid-token invocation | 1,173 | Completion triggered while the developer was mid-keystroke; the partial token corrupted the prompt |
| Insufficient context | 482 | The IDE sent too little surrounding code for the model to produce a useful completion |
| Redundant invocation | 240 | Completion fired when no suggestion was needed, wasting a round-trip and interrupting flow |

Nearly one in four failures had nothing to do with model capability. This is the category agent builders can act on.
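
One way to act on this is a client-side gate that decides whether a request should be sent at all. The sketch below is a minimal illustration, not the study's method: `TriggerState`, `skip_reason`, the 300 ms debounce value, and the "rest of the line already has code" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional

DEBOUNCE_MS = 300  # assumed pause length; tune against your own telemetry


@dataclass
class TriggerState:
    prefix: str   # text before the cursor
    suffix: str   # text after the cursor
    idle_ms: int  # time since the last keystroke


def _is_ident_char(ch: str) -> bool:
    return ch.isalnum() or ch == "_"


def skip_reason(state: TriggerState) -> Optional[str]:
    """Return why the request should be skipped, or None to send it."""
    before = state.prefix[-1:]
    after = state.suffix[:1]

    # Cursor sits inside an identifier: the token is split across
    # prefix and suffix, so the prompt would contain a broken token.
    if before and after and _is_ident_char(before) and _is_ident_char(after):
        return "mid_token"

    # The developer is still typing a token: wait for a pause first.
    if before and _is_ident_char(before) and state.idle_ms < DEBOUNCE_MS:
        return "mid_token"

    # Heuristic (assumption): if the rest of the line already holds code
    # beyond closing brackets and punctuation, a suggestion is likely unneeded.
    rest_of_line = state.suffix.split("\n", 1)[0]
    if rest_of_line.strip(" \t)]}\"';,:"):
        return "redundant"

    return None
```

For example, `skip_reason(TriggerState("resul", "", 50))` returns `"mid_token"`, while the same prefix after a 500 ms pause returns `None` and the request goes out.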

User overrides (9.3%)

The model output was acceptable but rejected:

| Sub-type | Count | Meaning |
|---|---|---|
| Correct but rejected | 605 | Model predicted correctly; developer chose to type it themselves |
| Valid but unpreferred | 112 | Output was functionally correct but didn't match developer's style or intent |

These are not true failures. They reflect the irreducible gap between a correct prediction and what the developer actually wanted.

The benchmark gap

The study's main finding: offline evaluations substantially misrepresent real-world effectiveness.

| Setting | Metric behavior |
|---|---|
| Offline (synthetic test sets) | Models score well on curated, clean inputs with full context |
| Online (real IDE usage) | 4.91% average acceptance rate across all models and languages |

Corroboration: LLMs achieve 84–89% on synthetic benchmarks but only 25–34% on real-world class-level tasks. [Source: arxiv 2510.26130]

The gap comes from:

  • Benchmark inputs are clean; real code has typos, partial expressions, and mid-edit states
  • Benchmarks provide full file context; real invocations often have truncated context
  • Benchmarks measure correctness; real usage also needs timing and style match

Practical implications for agent builders

1. Audit the integration layer, not just the model

If ~25% of failures are application-oriented, improving the model alone hits diminishing returns. Measure and improve:

  • Invocation timing: debounce triggers to avoid mid-token firing
  • Context assembly: include surrounding code, imports, and type information
  • Relevance gating: suppress completions when editing patterns suggest none is needed
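
As an illustration of the context-assembly point, the sketch below keeps import lines even when the window of nearby code does not reach the top of the file. It is a simplified example under stated assumptions: `assemble_context` is a hypothetical helper, the import detection only matches Python-style `import`/`from` lines, the budget is counted in lines rather than tokens, and type information is not handled.

```python
def assemble_context(source: str, cursor_line: int, max_lines: int = 40) -> str:
    """Build a prompt prefix from import lines plus the code nearest the cursor.

    cursor_line is the 0-based index of the line being edited; only the
    lines before it are included.
    """
    lines = source.splitlines()
    window_start = max(0, cursor_line - max_lines)

    # Imports usually sit far above the cursor, so collect them from the
    # part of the file that the nearby-lines window would otherwise drop.
    imports = [
        line for line in lines[:window_start]
        if line.startswith(("import ", "from "))
    ]
    nearby = lines[window_start:cursor_line]
    return "\n".join(imports + nearby)
```

A production harness would more likely budget by tokens and pull in definitions from other files; the point here is only that the window and the imports are assembled separately.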

2. Use real-world telemetry for evaluation

Synthetic benchmarks rank models but do not predict user acceptance. Track acceptance rate, time-to-accept, and rejection reasons from actual usage. RepoMasterEval confirms realistic benchmarks correlate with online acceptance rates. [Source: arxiv 2408.03519]
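
A minimal sketch of the three metrics named above, computed from logged events. The `CompletionEvent` schema and the reason strings are assumptions for illustration; real telemetry will have its own fields.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class CompletionEvent:
    accepted: bool
    ms_to_decision: int                     # display -> accept or dismiss
    rejection_reason: Optional[str] = None  # hypothetical labels, e.g. "dismissed"


def summarize(events: list[CompletionEvent]) -> dict:
    """Compute acceptance rate, median time-to-accept, and rejection reasons."""
    accepted = [e for e in events if e.accepted]
    rejected = [e for e in events if not e.accepted]
    return {
        "acceptance_rate": len(accepted) / len(events) if events else 0.0,
        "median_ms_to_accept": (
            median(e.ms_to_decision for e in accepted) if accepted else None
        ),
        "rejection_reasons": Counter(
            e.rejection_reason or "unknown" for e in rejected
        ),
    }
```

A large `"unknown"` count in `rejection_reasons` is itself useful: it shows how much of the rejection data the instrumentation cannot yet explain.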

3. Treat user overrides as signal, not noise

Roughly 1 in 10 analyzed failures was a correct or valid completion that the developer rejected anyway. That is a signal about style mismatch and intent, and it can drive personalization.
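
To use overrides as signal, the telemetry has to tell "rejected, then typed the same thing" apart from everything else. One possible approach, sketched under the assumption that the client can capture the text the developer typed after dismissing a suggestion (`classify_rejection` and its labels are hypothetical):

```python
def _normalize(text: str) -> str:
    # Collapse whitespace so formatting differences do not count as mismatches.
    return " ".join(text.split())


def classify_rejection(suggestion: str, typed_afterwards: str) -> str:
    """Label a rejected completion by comparing it with what was typed next."""
    suggested = _normalize(suggestion)
    typed = _normalize(typed_afterwards)
    if not suggested or not typed:
        return "unknown"               # nothing to compare; cannot judge
    if typed.startswith(suggested):
        return "correct_but_rejected"  # developer typed the same code by hand
    return "differs"                   # wrong, or valid but unpreferred
```

Text comparison alone cannot split `"differs"` into wrong versus valid-but-unpreferred; that distinction needs human labeling or a functional check.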

4. Language-specific performance varies sharply

InCoder led across 12 languages, but mainstream ones (Python, Java) scored higher than less common ones. Do not assume Python performance predicts Rust or Kotlin — evaluate per-language.
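
A per-language breakdown can be as simple as the grouping below. This is an illustrative sketch: the `(language, accepted)` event shape and the `MIN_SAMPLES` threshold are assumptions, and the threshold exists only to avoid reporting a rate from a handful of events.

```python
from collections import defaultdict

MIN_SAMPLES = 200  # assumed cutoff below which a rate is not reported


def acceptance_by_language(events, min_samples=MIN_SAMPLES):
    """events: iterable of (language, accepted) pairs.

    Returns {language: acceptance rate, or None when there is too little data}.
    """
    shown = defaultdict(int)
    accepted = defaultdict(int)
    for language, was_accepted in events:
        shown[language] += 1
        accepted[language] += int(was_accepted)
    return {
        language: (accepted[language] / count if count >= min_samples else None)
        for language, count in shown.items()
    }
```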

When this taxonomy backfires

The 66 / 24 / 9 split is a useful prior, not a fixed budget:

  • Ratios are model- and cohort-specific. The study used first-gen code LMs (InCoder, UniXcoder, CodeGPT). Better models shrink the model-oriented share and raise the relative weight of integration errors.
  • Integration gains plateau. Smart-invocation work raised acceptance from ~4.9% to ~18.6% [Source: Koohestani et al., arxiv 2405.14753]. Past that, gains come from model capability and context quality, not more timing heuristics.
  • Narrow cohorts may skip harness work. Single-language teams on recent models often clear the bar off-the-shelf; the "just upgrade the model" steelman holds in that regime.
  • Non-mainstream languages invert priorities. For Rust, Kotlin, or niche DSLs, thin training data dominates; invocation tuning cannot compensate.
  • Override data needs good instrumentation. If telemetry cannot separate "rejected because wrong" from "rejected because already typed", the 9.3% bucket is noise.
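
The first bullet can be made concrete with a little arithmetic. Suppose, purely hypothetically, that a better model halves model-oriented failures while the other two categories stay fixed in absolute terms; the shares then shift as follows (`rescale` and the 0.5 factor are illustrative, not a result from the study):

```python
# Study split, in percent of analyzed failures.
shares = {"model": 66.3, "application": 24.4, "override": 9.3}


def rescale(shares, model_factor):
    """Scale model-oriented failures by model_factor and renormalize to 100%."""
    adjusted = dict(shares, model=shares["model"] * model_factor)
    total = sum(adjusted.values())
    return {name: round(100 * value / total, 1) for name, value in adjusted.items()}


print(rescale(shares, 0.5))
# prints: {'model': 49.6, 'application': 36.5, 'override': 13.9}
```

Under that assumption, integration errors rise from about a quarter to more than a third of the remaining failures without any change in the harness.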

Key Takeaways

  • Two-thirds of completion failures are model errors; one quarter are integration failures — fix both
  • Mid-token invocation is the single largest application-oriented failure mode (1,173 of 2,030 cases)
  • Offline benchmarks systematically overstate real-world completion quality — use telemetry-derived evals
  • About 9% of analyzed failures were correct or valid completions that the developer rejected; user override data is a feedback signal, not waste