
Suggestion Gating: Fewer Completions, Better DX

Gating decides whether to suggest before deciding what to suggest. It targets the roughly 90% of AI completion inference that is generated but never ends in an accepted suggestion.

The waste problem

JetBrains measured its completion pipeline: only 31% of inferences produce a shown suggestion, and 31% of those are accepted, so roughly 10% of raw inference (0.31 × 0.31 ≈ 0.096) ends up as useful output. Every unwanted suggestion interrupts flow and erodes trust. This mirrors the alert-fatigue dynamic in AI code review: when signal-to-noise drops, developers ignore the signal.
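The funnel arithmetic can be checked in a few lines. This is only a worked calculation using the two rates quoted above; the function and constant names are illustrative.

```python
# Funnel rates quoted in the text: share of inferences that are shown,
# and share of shown suggestions that are accepted.
SHOWN_RATE = 0.31
ACCEPT_RATE = 0.31


def useful_fraction(shown_rate: float, accept_rate: float) -> float:
    """Fraction of raw inferences that end in an accepted suggestion."""
    return shown_rate * accept_rate


useful = useful_fraction(SHOWN_RATE, ACCEPT_RATE)
wasted = 1.0 - useful
print(f"useful: {useful:.1%}, wasted: {wasted:.1%}")
# prints: useful: 9.6%, wasted: 90.4%
```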

How gating works

Gating wraps the LLM in two lightweight classifiers, one before inference and one between the LLM and the developer:

flowchart LR
    K[Keystroke] --> T{Trigger model}
    T -->|Suppress| X[No inference]
    T -->|Allow| LLM[LLM generates completion]
    LLM --> F{Filter model}
    F -->|Low quality| D[Discard]
    F -->|High quality| S[Show to developer]

Trigger model — decides whether to invoke the LLM. It suppresses inference when it sees signs of an unwanted completion: mid-word typing, rapid deletion, or ambiguous scope.

Filter model — checks the completion before display. JetBrains runs this stage locally before showing a suggestion. It catches suggestions the LLM produced confidently but the developer would reject.

JetBrains' production filter compiles to 2.5 MB and predicts in 1–2 ms, running locally and adding no perceptible latency.
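The two-stage flow can be sketched as follows. This is a minimal illustration, not JetBrains' implementation: `EditorState`, the deletion cutoff, and the length-based `filter_score` are hypothetical stand-ins for trained classifiers.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class EditorState:
    mid_word: bool         # caret sits inside an identifier (assumed signal)
    recent_deletions: int  # deletions in the last few keystrokes (assumed signal)


def trigger_allows(state: EditorState) -> bool:
    """Stand-in trigger model: suppress on mid-word typing or rapid deletion."""
    return not state.mid_word and state.recent_deletions < 3


def filter_score(completion: str) -> float:
    """Toy stand-in for the filter model; a real one is a trained classifier."""
    text = completion.strip()
    return min(1.0, len(text) / 20)


def gated_completion(
    state: EditorState,
    generate: Callable[[EditorState], str],
    threshold: float = 0.5,
) -> Optional[str]:
    """Return a completion to show, or None if either gate rejects it."""
    if not trigger_allows(state):
        return None  # suppressed: the LLM is never called
    completion = generate(state)
    if filter_score(completion) < threshold:
        return None  # generated but discarded before display
    return completion
```

The ordering is the point of the design: a suppressed keystroke costs no inference at all, while the filter only pays off after the LLM has already run.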

Production evidence

JetBrains: CatBoost classifiers

A/B study across Java (n=278), Python (n=205), and Kotlin (n=157) with the filter active (de Moor et al., 2026):

| Metric | Change |
| --- | --- |
| Accept rate | +33% to +48% |
| Cancel rate | -16% to -37% |
| Ratio of completed code | -10% to -14% |

The trigger model, tested on Kotlin (n=3,511), reduced generations by 13.8% while raising accept rate by 2.7% and lowering cancel rate by 4.5%.

Cursor: reinforcement learning

Cursor trains the Tab model to avoid bad suggestions via online reinforcement learning: 21% fewer suggestions, 28% higher accept rate.

GitHub Copilot: logistic regression trigger

As of 2022, Copilot used a logistic regression with 11 features to decide when to invoke inference (Thakkar, 2022). The feature set may differ in current releases.
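A logistic regression trigger of this kind is small enough to sketch in full. The feature names, weights, and bias below are hypothetical and chosen only for illustration; they are not Copilot's actual 11 features or coefficients.

```python
import math
from typing import Mapping

# Hypothetical weights for illustration only.
WEIGHTS = {
    "ms_since_last_keystroke": 0.004,
    "caret_at_line_end": 1.2,
    "last_suggestion_accepted": 0.8,
    "chars_deleted_recently": -0.5,
}
BIAS = -1.5


def trigger_probability(features: Mapping[str, float]) -> float:
    """Logistic regression: sigmoid of a weighted sum of the features."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def should_invoke(features: Mapping[str, float], threshold: float = 0.5) -> bool:
    """Invoke the LLM only when the predicted usefulness clears the threshold."""
    return trigger_probability(features) >= threshold


paused_at_line_end = {
    "ms_since_last_keystroke": 500,
    "caret_at_line_end": 1,
    "last_suggestion_accepted": 1,
}
deleting_quickly = {"ms_since_last_keystroke": 50, "chars_deleted_recently": 4}

print(should_invoke(paused_at_line_end))  # True
print(should_invoke(deleting_quickly))    # False
```

A model this size evaluates in microseconds, which is why it can run on every keystroke before any inference is requested.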

GitHub NES: custom model suppression

GitHub's Next Edit Suggestions (NES) independently converged on the same principle: 24.5% fewer suggestions, 26.5% higher acceptance.

What the classifiers see

JetBrains uses ~120 features for the trigger and several hundred for the filter (de Moor et al., 2026):

  • Typing dynamics — speed, pause duration, deletion patterns
  • Caret context — scope depth, surrounding syntax, file structure
  • Code signals — imports, reference resolution, token-level scores
  • Session state — recent accept/reject history, time since last interaction

Gating beats simple confidence thresholds because the decision depends on developer state, not just completion quality.
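To see why developer state matters, consider a sketch in which two editor states carry the same completion confidence. The record below takes one field from each feature group; the field names, cutoffs, and the `state_aware` rule are hypothetical and stand in for a trained classifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatingFeatures:
    chars_per_second: float    # typing dynamics
    pause_ms: float            # typing dynamics
    scope_depth: int           # caret context
    mean_token_logprob: float  # code signal: completion confidence
    recent_accept_rate: float  # session state


def confidence_only(f: GatingFeatures, min_logprob: float = -0.5) -> bool:
    """Baseline: show whenever the completion itself looks confident."""
    return f.mean_token_logprob >= min_logprob


def state_aware(f: GatingFeatures, min_logprob: float = -0.5) -> bool:
    """Toy gate: also require that the developer appears receptive."""
    receptive = f.pause_ms >= 300 and f.chars_per_second < 6
    return receptive and f.mean_token_logprob >= min_logprob


paused = GatingFeatures(1.0, 600.0, 2, -0.2, 0.4)
typing_fast = GatingFeatures(9.0, 40.0, 2, -0.2, 0.4)

print(confidence_only(paused), confidence_only(typing_fast))  # True True
print(state_aware(paused), state_aware(typing_fast))          # True False
```

The confidence threshold cannot tell the two cases apart because both completions score identically; only the typing signals separate them.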

Language-specific behavior

Kotlin benefits more from post-generation filtering, while PHP benefits more from pre-generation triggering. Python and C# fall between. Per-language tuning beats a uniform threshold (de Moor et al., 2026).
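Per-language tuning can be expressed as a small lookup table. The threshold values here are invented for illustration and are not taken from the study; they only encode the direction described above (a stricter filter for Kotlin, a stricter trigger for PHP).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    trigger: float  # minimum trigger score needed to invoke the LLM
    filter: float   # minimum filter score needed to show the completion


# Hypothetical values; only their relative ordering reflects the text.
DEFAULT = GateThresholds(trigger=0.5, filter=0.5)
PER_LANGUAGE = {
    "kotlin": GateThresholds(trigger=0.4, filter=0.7),  # lean on post-generation filtering
    "php": GateThresholds(trigger=0.7, filter=0.4),     # lean on pre-generation triggering
}


def thresholds_for(language: str) -> GateThresholds:
    """Look up thresholds for a language, falling back to the uniform default."""
    return PER_LANGUAGE.get(language.lower(), DEFAULT)


print(thresholds_for("Kotlin"))
print(thresholds_for("python"))  # falls back to DEFAULT
```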

The perception gap

Experienced open-source developers believed AI tools had made them 20% faster while they actually took 19% longer to complete tasks (METR, 2025). Higher interruption rates from ungated completions may widen this gap. Gating is one way to realign perceived and actual productivity.

Implications for developers

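Acceptance rate matters more than volume. A tool that shows 40 suggestions at 45% acceptance beats one that shows 100 at 15%: the first yields 18 accepted suggestions against 22 dismissals, the second 15 accepted against 85 dismissals.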

Tune the settings before switching tools. Copilot and VS Code extensions expose sensitivity and trigger-delay settings. If you routinely dismiss suggestions, raise the thresholds first.

Context signals improve over time. Cursor's RL-based Tab model trains on your accept/reject history (Cursor, 2024). Tools that use online learning model your preferences better as you keep using them.

Key Takeaways

  • Four major tools (JetBrains, Cursor, Copilot, NES) independently converged on suggestion gating
  • Lightweight classifiers (2.5 MB, 1–2 ms) gate with no perceptible latency cost
  • Developers type more themselves, but acceptance rates improve 26–48% and interruptions drop

When this backfires

Classifiers trained on aggregate accept/reject data may not generalize to every developer:

  • Atypical coding patterns — narrow domains such as embedded work or novel DSLs diverge from the training distribution, so a filter calibrated on the majority suppresses high-value completions.
  • Exploratory sessions — learning a framework or prototyping lowers your natural accept rate. A filter tuned to production rates then suppresses completions exactly when they are most valuable.
  • Rapid style evolution — as your habits change, from verbose to terse or adopting new idioms, a slow-updating filter lags until it sees enough fresh signal to recalibrate. Online-learning models like Cursor's Tab close this gap faster than statically trained classifiers.

When gating hurts DX, the fix is exposure controls: loosen or temporarily disable the filter, let it gather fresh accept/reject data, then re-enable it.
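The loosen, gather, re-enable cycle can be sketched as a small state machine. This is an assumption about how such a control could work, not a description of any tool's settings; the class name, thresholds, and window size are hypothetical, and retraining on the collected outcomes is left out.

```python
from typing import List


class RecalibratingFilter:
    """Filter threshold that can be loosened while fresh feedback is collected."""

    def __init__(self, strict: float = 0.6, loose: float = 0.3, window: int = 50):
        self.strict = strict
        self.loose = loose
        self.window = window  # outcomes to collect before tightening again
        self.loosened = False
        self.fresh_outcomes: List[bool] = []

    @property
    def threshold(self) -> float:
        """Current minimum filter score required to show a suggestion."""
        return self.loose if self.loosened else self.strict

    def loosen(self) -> None:
        """Start an exposure period with the lower threshold."""
        self.loosened = True
        self.fresh_outcomes.clear()

    def record(self, accepted: bool) -> None:
        """Log an accept/reject outcome; tighten again once the window is full."""
        if not self.loosened:
            return
        self.fresh_outcomes.append(accepted)
        if len(self.fresh_outcomes) >= self.window:
            self.loosened = False  # fresh data gathered; restore strict gating
```

In this sketch `fresh_outcomes` is kept after tightening so that a separate recalibration step could consume it.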
