Training-Data Gravity: Agents Default to Deprecated APIs¶

Coding agents reach for deprecated APIs because pretraining-corpus frequency outweighs current docs; injecting current information narrows the gap but never closes it.

The Pattern¶

Coding agents ship stale code confidently — GenerativeModel() instead of the current google-genai client, npx create-react-app instead of Vite, an aws s3api flag retired two releases ago, a Pydantic v1 @validator inside a v2 project. The agent is not guessing; it is sampling from a prior where the old API has years of Stack Overflow answers and the replacement has a handful of release notes (Microsoft for Developers, Competing Against Yourself, 2026).

This is distinct from three adjacent failures. It is not Boring Technology Bias, which is recommendation bias when asked "what should I use?" — gravity fires even when the choice is implicit. It is not Pattern Replication Risk, where bias comes from in-repo examples. It is not Unversioned Scaffolding Commands Pull Stale Templates, a resolver fallback at scaffold time. Training-data gravity is generation-time bias on every call, with no codebase context, no scaffolder, and no recommendation question required.

The Failure Surface¶

Deprecated APIs. Across seven LLMs evaluated over 145 API mappings from eight popular Python libraries and 28,125 completion prompts, every model struggled to avoid deprecated-API usage, attributed to the presence of deprecated examples in training data and the absence of deprecation knowledge (Wang et al., ICSE'25, arxiv 2406.09834).
Version-specific idioms. GitChameleon 2.0 (328 Python completion problems conditioned on library version) puts enterprise frontier models at only 48–51% baseline success (Misra et al., arxiv 2507.12367).
Superseded libraries with generic names. Semantic collapse: a replacement reusing a generic label ("v2 CLI", "the new client") gets collapsed into the predecessor's concept and the model picks the one with more training signal. Distinctively-named replacements (Vite, Bun, Deno, Astro) carve their own slot and avoid the collision (Microsoft for Developers, 2026).

Why It Works (the mechanism)¶

Each generation samples from the model's pretrained conditional distribution, which is dominated by relative frequency in the pretraining corpus — a decade-old API with thousands of Stack Overflow answers receives more probability mass than a one-year-old replacement, even when training-time docs named the old API deprecated (Wang et al., 2024). Prompt conditioning shifts the distribution only within the support the prior already assigns non-trivial mass: the context-memory conflict finding measures the residual. Across 270 real-world API updates over 8 Python libraries and 11 models in 4 families, only 42.55% of generations executed without good docs in context; structured docs plus larger models raised that to 66.36% — not 100% (Ashik et al., arxiv 2604.09515). The reasoning trace exposes the mechanism on the surface: agents explicitly consider the new option, weigh the evidence, and decide for the established one — "I'm wondering if a standalone CLI tool has been released, but the standard approach remains using [established tool]" (Microsoft for Developers, 2026). It is the same prior-amplification shape documented for optimisation loops in Prior Dominance Over Feedback: feedback refines within the prior's support; it does not replace it.

Counter-Measures¶

Three controls move the needle, all rooted in shifting the conditional distribution at generation time rather than hoping the model overrides itself.

Inject current docs at the call site. RAG against the library's current API reference improves LLM performance on less-common libraries by 83% to ~220%, with example code contributing more than descriptive text or parameter lists (When LLMs Meet API Documentation, arxiv 2503.15231). Pin docs into the harness's retrieval surface for every library the project uses; relying on the model's parametric knowledge is the failure mode.
Encode deprecations as machine-readable steering. Add explicit "DEPRECATED: Do not use X. Use Y instead" rules to the project's instruction file ecosystem for every replacement the team cares about. The Microsoft writeup makes this the load-bearing control for newly-released tools (Microsoft for Developers, 2026).
Validate after generation. A linter, type checker, or pre-commit hook that flags known-deprecated calls catches the residual cases that retrieval and steering miss. APILOT — a system that maintains a continuously-updated outdated-API dataset and gates generation against it — reduces outdated-code recommendations by 89.42% on average while improving usability by 27.54% (Bai et al., arxiv 2409.16526). The take-away is that the deprecated-API set has to be a live artifact, not a frozen rule pack.

When This Backfires¶

Counter-measures cost prompt budget, docs maintenance, and pre-commit latency. They are wasted or net-negative when:

APIs are stable and slow-moving. POSIX, standard SQL, BSD sockets — the popular pattern is the current pattern. Steering for an API that has not changed in a decade ages the prompt for no benefit.
The replacement has a distinctive name and its own training mass. Bun, Deno, Vite, and Astro carved their own conceptual slot at launch; the model picks them unprompted and explicit "prefer X over Y" steering becomes dead weight (Microsoft for Developers, 2026).
The downstream verifier is fast and complete. When code is executed immediately and a failing call surfaces in a test or runtime error within one round, the gravity matters only when nothing re-checks.
The codebase is internal or proprietary. A DSL the model has never seen has no prior to fight — example coverage matters, anti-gravity steering does not.
Pinning the entire stack is a category error. As Boring Technology Bias §When This Backfires notes, blanket pinning trades selection bias for lock-in and stale guidance. Steer the specific deprecation transitions the team cares about; do not pin every dependency.

Example¶

Before — relying on the model's parametric knowledge:

# Prompt: "Call Gemini to summarise this document."
import google.generativeai as genai

genai.configure(api_key=API_KEY)
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(text)

The google-generativeai package was superseded by google-genai; the GenerativeModel() pattern still dominates training-data examples (googleapis/python-genai#1606, via Boring Technology Bias). The call executes, but the project is now coupled to a deprecated client and the agent will reproduce the pattern in every new file it touches.

After — current docs injected, deprecation made explicit, validator on the path:

# AGENTS.md (project root)

## Deprecations

- DEPRECATED: `google.generativeai` (the `google-generativeai` package). Use `google.genai` (the `google-genai` package) instead — see https://googleapis.github.io/python-genai/.
- DEPRECATED: `GenerativeModel()`. Use `genai.Client().models.generate_content(...)`.

# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: no-deprecated-genai
        name: Block deprecated google-generativeai imports
        language: pygrep
        entry: 'import google\.generativeai|from google\.generativeai'
        types: [python]

The three controls compose: current docs in the harness's retrieval, an explicit deprecation rule the agent reads on every turn, and a deterministic gate that fails the commit when the prior leaks through anyway.

Key Takeaways¶

Pretraining-corpus frequency dominates the model's conditional distribution over APIs and tools; the bias fires on every generation, not only on recommendation prompts (Wang et al., 2024; Microsoft for Developers, 2026).
Current-information injection narrows the gap — RAG over API docs improves performance 83–220% — but context-memory conflict means the prior still leaks; executable rates plateau around 66% even with structured docs and larger models (Ashik et al., arxiv 2604.09515; arxiv 2503.15231).
The three load-bearing counter-measures compose: inject current docs at the call site, encode deprecations as machine-readable steering, validate after generation against a live outdated-API set (APILOT, arxiv 2409.16526).
Semantic collapse — generic replacement names colliding with the predecessor's concept — amplifies the bias; distinctive names (Vite, Bun, Deno) avoid it at design time (Microsoft for Developers, 2026).

Boring Technology Bias — Sibling failure: bias toward popular tools when asked to recommend; training-data gravity covers the same prior at generation time, on every call.
Pattern Replication Risk — In-repo examples amplify whatever pattern the codebase contains; compounds training-data gravity once a deprecated call lands.
Prior Dominance Over Feedback — The same prior-amplification shape inside propose-evaluate-revise loops: feedback refines within the prior's support, it does not replace it.
Unversioned Scaffolding Commands Pull Stale Templates — Adjacent staleness mode at scaffold time via the npm resolver, distinct from the always-on generation-time bias here.
Stale AI Configuration Artifacts (Context Rot) — The companion failure on the steering side: if the deprecation rules themselves drift, the counter-measure stops working.