
Programming Language Choice Still Shapes Agent Artefacts

Agents reach every language, but the language you pick still decides performance ceiling, run cost, and verification effort.

Language choice is no longer a feasibility check for AI coding agents — frontier agents produce working systems in any language, including ones with no prior open-source examples (Acher and Jézéquel, 2026). What it still decides is the artefact's shape along four dimensions: strength ceiling, run cost, engineering effort, and the human-verification work you inherit. Prefer well-represented languages when artefact quality matters; budget extra verification when something forces a long-tail target.

The Four Dimensions Language Choice Still Decides

Acher and Jézéquel (2026) prompted Claude Opus 4.6 and Codex (GPT-5.2) to build chess engines from scratch across 17 languages — chess admits external Elo strength assessment against Stockfish and feature-level inspection, so every artefact was measured the same way. Every category produced a working engine. The gaps were elsewhere:

| Dimension | Mainstream (Rust, C++, Java) | Specialised / Academic | Legacy / Esoteric |
| --- | --- | --- | --- |
| Playing-strength ceiling | ~1900–2200 Elo | ~1300–1700 Elo | 400–1500 Elo |
| Run cost per engine | $20–$110 | $30–$175 | $50–$474 |
| Prompt cycles required | 3–16 | moderate | 25–50 |
| Feature mix | bitboards, transposition tables, tapered evaluation | mostly present | material-only evaluation, no transposition tables |

Source: Acher and Jézéquel, 2026. The agents reproduced the same conceptual blueprint (search, evaluation, board representation) in every language but adapted feature selection to the language's idiom — a Rust engine and a COBOL engine diverged at sub-feature granularity even when the prompt and agent were identical.
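To make the Elo figures concrete, here is a minimal sketch of the standard Elo expected-score relation and its inverse, which is one way to turn a gauntlet result against a reference opponent of known rating into a strength estimate. The study's own rating procedure is not described here, so this is an illustration only; the function names, the 1800-rated reference, the match counts, and the 2200 self-estimate are all hypothetical.

```python
import math


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rating_gap_from_score(score: float) -> float:
    """Invert the Elo model: rating gap (A minus B) implied by A's score fraction."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def gauntlet_estimate(wins: int, draws: int, losses: int,
                      opponent_rating: float) -> float:
    """Estimate a rating from a match against one opponent of known rating."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games  # a draw counts as half a point
    return opponent_rating + rating_gap_from_score(score)


# Hypothetical gauntlet: 30 wins, 20 draws, 50 losses against an 1800-rated reference.
measured = gauntlet_estimate(30, 20, 50, opponent_rating=1800.0)
claimed = 2200.0  # hypothetical figure reported by the agent itself
print(f"measured ~{measured:.0f} Elo, self-estimate off by ~{claimed - measured:.0f}")
```

A 40% score against the reference implies a rating roughly 70 points below it, so the externally measured figure lands near 1730 regardless of what the agent reports about itself.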

The pattern does not rest on one paper. MultiPL-E reports pass@1 of 4.7–11.3 for Racket and 11.3–41.9 for Julia versus > 40 for Python on the same models. The chess study reproduces the same training-corpus asymmetry at task scale, where MultiPL-E measures it at function scale. The Wu et al. (2024) survey (111 papers, 2020–2024) names this gap "low-resource programming languages" and identifies data scarcity as the root cause.
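For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator introduced alongside the HumanEval benchmark: with n samples per problem of which c pass the tests, pass@k = 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. The sketch below assumes that estimator; MultiPL-E's exact sampling settings are not given here, and the function names and sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems; each entry is (n_samples, n_correct)."""
    if not results:
        raise ValueError("no problems supplied")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: three problems, 10 samples each.
score = benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
print(f"pass@1 = {score:.3f}")
```

Multiply by 100 to compare with percentage-style figures, assuming the numbers quoted above are percentages.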

Why It Works

Coding agents are next-token predictors trained on a corpus in which mainstream languages are over-represented by orders of magnitude. The asymmetry surfaces as shorter debug loops, fewer hallucinated library calls, and tighter feature selection in well-represented languages, and as the opposite in long-tail ones. Acher and Jézéquel (2026) measure it directly: debug-prompt fractions exceed 0.4 for legacy and esoteric runs versus under 0.2 for mainstream ones, and library-evasion attempts cluster in DSL targets, where the agent falls back on a library from a better-represented language (a CSS run silently imported python-chess until supervision caught it).

What to Do With This

Two coupled decisions sit behind any agent-heavy build:

Pick the language for the agent's training-corpus density when quality matters. If the artefact has a strength ceiling, longevity expectation, or production load, choose a mainstream, well-represented language. The Bun runtime's Zig→Rust migration ported 960 K lines in six days at 99.8% test pass once the target was Rust — language choice is downstream of where the agent can converge.

Budget extra verification when steering into a long-tail language. The work you inherit grows with exoticness:

  • Refuse agent self-evaluation. Agents over-estimated their engine's Elo by 200–1100 points versus an external gauntlet (Acher and Jézéquel, 2026). Run third-party benchmarks; don't trust the agent's verdict on its own output.
  • Watch for library-evasion. The CSS-imports-python-chess pattern is the canonical tell. Audit dependency manifests and runtime imports as part of acceptance.
  • Demand denser tests. Behavioural coverage anchors agent convergence (coding-agent reversibility covers the test-density mechanism); legacy and esoteric tiers need larger suites.
  • Account for the cost multiplier. Exotic targets cost 10–25× mainstream (Acher and Jézéquel, 2026).
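The library-evasion check in the list above can be partly automated. The following is a minimal sketch, assuming the artefact lives in one directory and that any file outside the target language's extensions is suspect; it covers static Python imports only, so imports triggered at runtime (dynamic loading, subprocess calls) still need a separate check. The function names and the banned-module set are hypothetical, apart from the fact that python-chess is imported under the name `chess`.

```python
import ast
from pathlib import Path

# python-chess is imported as `chess`; extend this set per project (assumption).
BANNED_MODULES = {"chess"}


def imported_modules(source: str) -> set[str]:
    """Top-level module names imported anywhere in a Python source string."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 means an absolute import; relative imports are skipped.
            names.add(node.module.split(".")[0])
    return names


def audit_tree(root: Path, allowed_suffixes: set[str],
               banned: set[str] = BANNED_MODULES) -> list[str]:
    """Flag files outside the target language and banned imports in Python files."""
    findings: list[str] = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root)
        if path.suffix not in allowed_suffixes:
            findings.append(f"unexpected file type: {rel}")
        if path.suffix == ".py":
            try:
                hits = imported_modules(path.read_text(encoding="utf-8")) & banned
            except SyntaxError:
                findings.append(f"unparseable Python: {rel}")
                continue
            findings.extend(f"banned import '{name}' in {rel}" for name in sorted(hits))
    return findings
```

Calling `audit_tree(Path("engine_dir"), {".css"})` on a hypothetical CSS-target directory would report any stray Python helper and any `chess` import inside it; an empty list is a necessary condition for acceptance, not a sufficient one.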

When This Backfires

The language-density framing breaks in four cases:

  • Throwaway artefacts. Prototypes and disposable code never hit the quality ceiling that the gap measures. Choose for team velocity instead.
  • Mainstream-only stack switches. Within Python ↔ TypeScript ↔ Go, the Elo and pass@1 gaps narrow sharply (MultiPL-E places all three near the top of its pass@1 distribution); reviewer fluency and ecosystem familiarity dominate (cross-tool translation).
  • Domain-mandatory languages. Embedded C, Solidity, ladder logic, hardware-description languages — the domain dictates the language. Apply the verification-budget half; skip the language-selection half.
  • Reviewer-bottlenecked teams. When reviewer expertise sits in one language and the team cannot review the higher-density alternative, switching shifts the bottleneck rather than removing it.

The "agentic AI is abstracting away code" argument applies inside these cases; it does not apply at the performance-ceiling tier the chess study measures.

Key Takeaways

  • Language choice is no longer about whether an agent can produce a working system — agents reach every language, including those with no prior open-source example (Acher and Jézéquel, 2026).
  • Language choice is still about strength ceiling, run cost, engineering effort, and feature mix, as quantified by the chess study and corroborated by MultiPL-E and the Wu et al. (2024) survey.
  • Agents over-estimate their own output by hundreds of Elo on long-tail languages — refuse self-evaluation, run external benchmarks.
  • Pick for density when quality matters; budget verification when forced long-tail. The framing breaks for throwaway artefacts, within-tier switches, domain-mandatory languages, and reviewer-bottlenecked teams.