
LLM-Pinned Library Versions Carry Systemic CVE Exposure

LLM-pinned library versions routinely carry known CVEs because the model's prior favors popular but outdated releases. The bias is shared across models, so validate pins against external CVE data.

The Finding

Wang et al. (May 2026) evaluated 10 LLMs on PinTrace, a 1,000-task Python benchmark drawn from Stack Overflow, checking every generated requirements.txt, pyproject.toml, and inline pip install against the National Vulnerability Database (arXiv:2605.06279):

  • 36.70%-55.70% of tasks include at least one library at a version with a known CVE.
  • All ten models converge on the same small set of risky releases — the failure is systemic, not per-model.

Convergence rules out the "use a better model" remediation. The risk lives in the training distribution, not the weights.

How Often Models Specify Versions

Specification rate depends on what the model is asked to produce (arXiv:2605.06279):

Surface                                           Version-specified rate
------------------------------------------------  ----------------------
Direct prompt ("install X to do Y")               26.83%-95.18%
Manifest file (requirements.txt, pyproject.toml)  6.45%-59.19%

Manifest files — the surface that actually controls reproducible installs — get versions least often. A study of Developer-ChatGPT conversations found version constraints in only 9% of exchanges, almost always at the user's prompting (arXiv:2401.16340).
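A specification rate like the ones above can be measured on a single manifest. The following is a minimal sketch, assuming a plain requirements.txt and counting only exact `==` pins as "version-specified"; the helper name `pin_report` and the regular expressions are illustrative, not part of the benchmark's tooling.

```python
import re

# Exact pins only, e.g. "flask==2.0.3"; range specifiers (>=, ~=) are treated as unpinned here.
PIN_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)\s*(\[[^\]]*\])?\s*==\s*([^\s;#]+)")
NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def pin_report(requirements_text: str) -> dict:
    """Split requirement lines into exactly pinned and unpinned package names."""
    pinned, unpinned = {}, []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line or line.startswith("-"):  # skip blanks and pip options (-r, -e, ...)
            continue
        match = PIN_RE.match(line)
        if match:
            pinned[match.group(1).lower()] = match.group(3)
            continue
        name = NAME_RE.match(line)
        if name:
            unpinned.append(name.group(1).lower())
    total = len(pinned) + len(unpinned)
    rate = len(pinned) / total if total else 0.0
    return {"pinned": pinned, "unpinned": unpinned, "rate": rate}


sample = "flask==2.0.3\nrequests\npyyaml>=5.4  # loose range\n-r other.txt\n"
report = pin_report(sample)
print(report["pinned"], report["unpinned"], round(report["rate"], 2))
```

Unpinned entries are the ones a lock step has to resolve; pinned entries are the ones that need a CVE check.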

Severity and Disclosure Cutoff

When the model does pin a version, the CVE distribution is heavy on the dangerous end (arXiv:2605.06279):

Property                                           Range across 10 models
-------------------------------------------------  ----------------------
CVEs rated Critical or High                        62.75%-74.51%
CVEs disclosed before the model's training cutoff  72.27%-91.37%

The cutoff result is load-bearing: most of these CVEs were already public in CVE databases before the model's training cutoff. The model picked the affected versions anyway because its prior reflects historical co-occurrence in the corpus, not current vulnerability state.
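The cutoff metric is simple arithmetic: the share of CVEs whose disclosure date precedes the training cutoff. A minimal sketch follows; the advisory IDs, dates, and cutoff are hypothetical placeholders, not data from the paper.

```python
from datetime import date


def share_disclosed_before(disclosure_dates, cutoff):
    """Fraction of CVEs whose public disclosure predates the training cutoff."""
    if not disclosure_dates:
        return 0.0
    before = sum(1 for d in disclosure_dates.values() if d < cutoff)
    return before / len(disclosure_dates)


# Hypothetical advisory IDs and disclosure dates, for illustration only.
disclosures = {
    "CVE-EXAMPLE-1": date(2023, 5, 2),
    "CVE-EXAMPLE-2": date(2023, 5, 26),
    "CVE-EXAMPLE-3": date(2024, 5, 20),
    "CVE-EXAMPLE-4": date(2021, 2, 9),
}
training_cutoff = date(2024, 1, 1)  # hypothetical cutoff

share = share_disclosed_before(disclosures, training_cutoff)
print(f"{share:.2%} disclosed before cutoff")
```

A high share means the advisory data existed at training time and still did not influence the pin.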

The Versions Often Don't Even Install

Functional compatibility tracks vulnerability incidence — the same prior that picks vulnerable versions picks broken ones (arXiv:2605.06279):

Check                                                               Pass rate range
------------------------------------------------------------------  ---------------
Static install (pip install succeeds)                               19.70%-63.20%
Dynamic functional test (code runs and matches expected behaviour)  6.49%-48.62%

A version string that fails to install is loud and self-correcting. One that installs but carries CVE-2023-XXXXX is silent.
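The "loud" half of that contrast can be approximated without touching the environment. This is a sketch, not the benchmark's harness: it assumes pip 22.2 or newer, where `pip install --dry-run` resolves and downloads but does not install, and the helper names are illustrative. A clean exit here says nothing about CVEs.

```python
import subprocess
import sys


def dry_run_command(requirements_path: str) -> list:
    """Build a pip invocation that resolves the manifest without installing it."""
    return [sys.executable, "-m", "pip", "install", "--dry-run", "--quiet",
            "-r", requirements_path]


def resolves_cleanly(requirements_path: str) -> bool:
    """Return True when pip can resolve every pin in the manifest."""
    proc = subprocess.run(dry_run_command(requirements_path),
                          capture_output=True, text=True)
    return proc.returncode == 0
```

Calling `resolves_cleanly("requirements.txt")` needs network access to the package index, so it belongs in CI rather than in a unit test.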

Why Models Converge on Risky Versions

graph LR
    Train["Training corpus<br/>(Stack Overflow, blogs)"] --> Prior["Co-occurrence prior<br/>(library, version-string)"]
    Prior --> Output["Pinned version<br/>at generation time"]
    NVD["NVD / CVE feed<br/>(updates after cutoff)"] -. "no signal path" .-> Prior

    style NVD fill:#fee,stroke:#c33
    style Output fill:#ffd,stroke:#cc6

The model learns a co-occurrence prior over (library, version-string) pairs. Stack Overflow answers and blogs overrepresent the version current when the popular answer was written. The CVE feed has no signal path into this prior, so every model trained on the same corpus inherits the same bias — no prompt engineering reshapes the underlying statistics. The fix must come from outside the model.
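A toy version of that mechanism makes the missing signal path concrete. The sketch below is an analogy, not a model of any real LLM: the library name, version counts, and advisory set are all hypothetical, and "the prior" is reduced to a frequency count.

```python
from collections import Counter

# Hypothetical corpus: how often each (library, version) pair appears in tutorials.
corpus_mentions = (
    [("examplelib", "1.2.0")] * 6
    + [("examplelib", "1.4.1")] * 3
    + [("examplelib", "2.0.0")] * 1
)

# Hypothetical advisory data. The picker below never reads it.
known_vulnerable = {("examplelib", "1.2.0")}


def most_frequent_pin(mentions, library):
    """Pick the version that co-occurs most often with the library name."""
    counts = Counter(version for lib, version in mentions if lib == library)
    version, _ = counts.most_common(1)[0]
    return version


pick = most_frequent_pin(corpus_mentions, "examplelib")
print(pick, (pick_is_vulnerable := ("examplelib", pick) in known_vulnerable))
```

The picker returns the most-mentioned version whether or not it appears in `known_vulnerable`, because the advisory set is never an input. Any fix has to add that input from outside.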

What to Change

Externally anchored version constraints reduce both vulnerability exposure and compatibility failure (arXiv:2605.06279). Every effective anchor routes around the model's prior:

  • CVE-aware lookup at install time. Run pip-audit, npm audit, or Dependabot security updates as a blocking CI gate; the agent's manifest becomes a hint validated against current CVE state.
  • Curated allowlist or internal mirror. Artifactory or Nexus filters block known-vulnerable versions at install time, so the agent's pin is dead-on-arrival if it points at a blocked release.
  • Auto-bump after merge. Pair with Renovate or Dependabot so safe-at-merge versions get bumped as new CVEs land.
  • Lock-then-resolve workflow. Pipe the agent's requirements.txt through pip-compile, uv lock, or poetry lock in a clean environment — the same workflow that closes the missing-dependency gap (Dependency Gap Validation) surfaces vulnerable transitive pulls.
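The blocking-gate idea from the first bullet can be sketched as a small wrapper around pip-audit. This assumes pip-audit is on PATH and that `--format json` emits an object with a `dependencies` list whose entries carry `name`, `version`, and `vulns` (each with `id` and `fix_versions`), as recent releases do; the function names and exit codes are illustrative choices.

```python
import json
import subprocess
import sys


def findings_from_report(report_json: str) -> list:
    """Flatten a pip-audit JSON report into (name, version, vuln_id, fixes) tuples."""
    report = json.loads(report_json)
    # Assumed shape: {"dependencies": [...]}; older releases emitted a bare list.
    deps = report.get("dependencies", []) if isinstance(report, dict) else report
    findings = []
    for dep in deps:
        for vuln in dep.get("vulns", []):
            findings.append((dep["name"], dep["version"], vuln["id"],
                             tuple(vuln.get("fix_versions", []))))
    return findings


def audit_gate(requirements_path: str = "requirements.txt") -> int:
    """Run pip-audit on a manifest and return an exit code suitable for CI."""
    proc = subprocess.run(
        ["pip-audit", "-r", requirements_path, "--format", "json"],
        capture_output=True, text=True,
    )
    if not proc.stdout.strip():
        print(proc.stderr, file=sys.stderr)  # audit itself failed to run
        return 2
    findings = findings_from_report(proc.stdout)
    for name, version, vuln_id, fixes in findings:
        fix_text = ", ".join(fixes) or "none listed"
        print(f"BLOCKED {name}=={version}: {vuln_id} (fix: {fix_text})")
    return 1 if findings else 0
```

A CI step would call `sys.exit(audit_gate("requirements.txt"))` so that any finding fails the build; the agent's manifest is then only merged once the audit is clean.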

Example

An agent generates a Flask scraper and writes:

# requirements.txt
flask==2.0.3
requests==2.25.1
pyyaml==5.3.1

All three install cleanly. Static checks pass. CI is green.

$ pip-audit -r requirements.txt
Found 4 known vulnerabilities in 3 packages
Name     Version  ID                  Fix Versions
-------- -------- ------------------- ------------
flask    2.0.3    PYSEC-2023-62       2.2.5,2.3.2
requests 2.25.1   GHSA-j8r2-6x86-q33q 2.31.0
requests 2.25.1   GHSA-9wx4-h78v-vm56 2.32.0
pyyaml   5.3.1    GHSA-8q59-q68h-6hv4 5.4

The model picked the version of each library that dominates Stack Overflow tutorials from 2021-2022. Each carries a publicly disclosed CVE that landed before the model's training cutoff. A blocking pip-audit step in CI surfaces all four in seconds; the agent (or a follow-up bump PR) rewrites to flask==2.3.2, requests==2.32.0, pyyaml==6.0.1 and the pipeline continues. Neither the commit message nor the test suite would have caught this.

When This Backfires

  • Throwaway prototypes. A CVE-database step adds latency for code that will never leave a laptop.
  • Already-locked monorepos. When pip-compile / uv lock / poetry lock already runs in CI, the agent's pin is a hint resolved against existing lockfile policy; a second LLM-side check duplicates work.
  • Air-gapped or curated mirrors. When Artifactory or Nexus already blocks vulnerable versions at install time, an agent-side step is redundant.
  • Mature canonical libraries. For requests, numpy, pandas, the bias toward popular versions often selects safe-enough releases; CVE exposure concentrates in the long tail.

Key Takeaways

  • In 36.70%-55.70% of tasks, at least one LLM-specified library version carries a known CVE; 62.75%-74.51% of those CVEs are rated Critical or High (arXiv:2605.06279)
  • 72.27%-91.37% of those CVEs were disclosed before the model's training cutoff, so the prior ignores vulnerability data that already existed at training time
  • All 10 models tested converge on the same risky releases — this is a training-distribution effect, not a per-model defect; switching models does not help
  • Treat agent-written manifests as a hint, not a source of truth — validate against pip-audit, Dependabot, or a curated mirror as a blocking CI gate
  • Static install succeeds in only 19.70%-63.20% of cases, so 36.80%-80.30% fail to install; the same prior produces both insecure and broken pins, and install-time checks catch both classes

Sources
