Programming Language Choice Still Shapes Agent Artefacts
Agents reach every language, but the language you pick still decides performance ceiling, run cost, and verification effort.
Language choice is no longer a feasibility check for AI coding agents — frontier agents produce working systems in any language, including ones with no prior open-source examples (Acher and Jézéquel, 2026). What it still decides is the artefact's shape along four dimensions: strength ceiling, run cost, engineering effort, and the human-verification work you inherit. Prefer well-represented languages when artefact quality matters; budget extra verification when something forces a long-tail target.
The Four Dimensions Language Choice Still Decides
Acher and Jézéquel (2026) prompted Claude Opus 4.6 and Codex (GPT-5.2) to build chess engines from scratch across 17 languages — chess admits external Elo strength assessment against Stockfish and feature-level inspection, so every artefact was measured the same way. Every category produced a working engine. The gaps were elsewhere:
| Dimension | Mainstream (Rust, C++, Java) | Specialised / Academic | Legacy / Esoteric |
|---|---|---|---|
| Playing-strength ceiling | ~1900–2200 Elo | ~1300–1700 Elo | 400–1500 Elo |
| Run cost per engine | $20–$110 | $30–$175 | $50–$474 |
| Prompt cycles required | 3–16 | moderate | 25–50 |
| Feature mix | bitboards, transposition tables, tapered evaluation | mostly present | material-only evaluation, no transposition tables |
Source: Acher and Jézéquel, 2026. The agents reproduced the same conceptual blueprint (search, evaluation, board representation) in every language but adapted feature selection to the language's idiom — a Rust engine and a COBOL engine diverged at sub-feature granularity even when the prompt and agent were identical.
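The Elo figures above come from external assessment rather than agent self-report. As a minimal sketch of how a gauntlet result converts to a rating under the standard logistic Elo model: the function names and the opponent rating are hypothetical, and the source does not say whether the study computed its ratings exactly this way.

```python
import math


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        # A 0% or 100% score implies an unbounded rating difference.
        raise ValueError("score fraction must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def estimate_elo(wins: int, draws: int, losses: int, opponent_elo: float) -> float:
    """Estimate an engine's rating from a match against one rated opponent.

    opponent_elo is an assumed, externally known rating (hypothetical input).
    """
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    score = (wins + 0.5 * draws) / games  # a draw counts as half a point
    return opponent_elo + elo_difference(score)
```

An even score returns the opponent's rating unchanged. Short matches give wide error bars, so a real gauntlet would play many games against several rated opponents.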
The pattern does not rest on a single paper. MultiPL-E reports pass@1 (in percent) of 4.7–11.3 for Racket and 11.3–41.9 for Julia versus more than 40 for Python on the same models. This is the same training-corpus asymmetry the chess study reproduces, at function scale rather than task scale. The Wu et al. (2024) survey (111 papers, 2020–2024) names this gap "low-resource programming languages" and identifies data scarcity as the root cause.
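For readers unfamiliar with the metric, pass@k is commonly computed with the unbiased estimator 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the tests. The sketch below assumes that estimator and hypothetical function names; it is not taken from the MultiPL-E codebase.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, in percent; results holds (n, c) pairs."""
    if not results:
        raise ValueError("no problems given")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.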
Why It Works
Coding agents are next-token predictors over a training corpus where mainstream languages are over-represented by orders of magnitude. The asymmetry surfaces as shorter debug loops, fewer hallucinated library calls, and tighter feature selection in well-represented languages, and as the opposite in long-tail ones. Acher and Jézéquel (2026) measure it directly: debug-prompt fractions exceed 0.4 for legacy and esoteric runs versus under 0.2 for mainstream, and library-evasion attempts cluster in DSL targets, where the agent falls back on a library from a better-represented language (a CSS run silently imported python-chess until supervision caught it).
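The debug-prompt fraction is easy to track on your own runs. The sketch below assumes a hypothetical session log in which each prompt has been labelled by hand (the label strings are illustrative, not the paper's taxonomy) and reuses the 0.4 and 0.2 thresholds quoted above only as rough reference points.

```python
from collections import Counter


def debug_prompt_fraction(prompt_kinds: list[str]) -> float:
    """Share of prompts in a session that were labelled 'debug'."""
    if not prompt_kinds:
        raise ValueError("session has no prompts")
    counts = Counter(prompt_kinds)
    return counts["debug"] / len(prompt_kinds)


def session_profile(fraction: float) -> str:
    """Compare a session against the thresholds quoted in the text.

    The three profile names are hypothetical labels for illustration.
    """
    if fraction > 0.4:
        return "long-tail-like"
    if fraction < 0.2:
        return "mainstream-like"
    return "in-between"
```

A session with two debug prompts out of four has a fraction of 0.5 and would profile as long-tail-like, a signal to widen the verification budget.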
What to Do With This
Two coupled decisions sit behind any agent-heavy build:
Pick the language for the agent's training-corpus density when quality matters. If the artefact has a strength ceiling, longevity expectation, or production load, choose a mainstream, well-represented language. The Bun runtime's Zig→Rust migration ported 960 K lines in six days at 99.8% test pass once the target was Rust — language choice is downstream of where the agent can converge.
Budget extra verification when steering into a long-tail language. The work you inherit grows with exoticness:
- Refuse agent self-evaluation. Agents over-estimated their engine's Elo by 200–1100 points versus external gauntlet (Acher and Jézéquel, 2026). Run third-party benchmarks; don't trust the agent's verdict on its own output.
- Watch for library-evasion. The CSS-imports-`python-chess` pattern is the canonical tell. Audit dependency manifests and runtime imports as part of acceptance.
- Demand denser tests. Behavioural coverage anchors agent convergence (coding-agent reversibility covers the test-density mechanism); legacy and esoteric tiers need larger suites.
- Account for the cost multiplier. Exotic targets cost 10–25× mainstream (Acher and Jézéquel, 2026).
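The library-evasion check in particular lends itself to automation. As a minimal sketch, assuming the evasion shows up as Python files in a repository that should contain none: the helper names and the forbidden list are hypothetical, and static parsing will not catch dynamic imports, so it complements a runtime audit rather than replacing one.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Top-level module names imported anywhere in a Python source string."""
    modules: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports (level > 0); they stay inside the project.
            if node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


def forbidden_imports(source: str, forbidden: set[str]) -> set[str]:
    """Subset of the forbidden modules that the source imports."""
    return imported_modules(source) & forbidden


# python-chess is imported under the module name "chess".
FORBIDDEN = {"chess"}  # hypothetical deny-list for a from-scratch engine
```

Running `forbidden_imports` over every Python file an agent leaves behind, and failing acceptance on a non-empty result, turns the audit into a repeatable gate.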
When This Backfires
The language-density framing breaks in four cases:
- Throwaway artefacts. Prototypes and disposable code never hit the quality ceiling that the gap measures. Choose for team velocity instead.
- Mainstream-only stack switches. Within Python ↔ TypeScript ↔ Go, the Elo and pass@1 gaps narrow sharply (MultiPL-E places all three near the top of its pass@1 distribution); reviewer fluency and ecosystem familiarity dominate (cross-tool translation).
- Domain-mandatory languages. Embedded C, Solidity, ladder logic, hardware-description languages — the domain dictates the language. Apply the verification-budget half; skip the language-selection half.
- Reviewer-bottlenecked teams. When reviewer expertise sits in one language and the team cannot review the higher-density alternative, switching shifts the bottleneck rather than removing it.
The "agentic AI is abstracting away code" argument applies inside these cases; it does not apply at the performance-ceiling tier the chess study measures.
Key Takeaways
- Language choice is no longer about whether an agent can produce a working system — agents reach every language, including those with no prior open-source example (Acher and Jézéquel, 2026).
- Language choice is still about strength ceiling, run cost, engineering effort, and feature mix, as quantified by the chess study and corroborated by MultiPL-E and the Wu et al. (2024) survey.
- Agents over-estimate their own output by hundreds of Elo on long-tail languages — refuse self-evaluation, run external benchmarks.
- Pick for density when quality matters; budget verification when forced long-tail. The framing breaks for throwaway artefacts, within-tier switches, domain-mandatory languages, and reviewer-bottlenecked teams.
Related
- Coding-Agent Reversibility: Platform Choice as a Two-Way Door — the migration-decision twin; behavioural test coverage is the binding constraint when porting between languages.
- Cross-Tool Translation: Learning from Multiple AI Assistants — when team velocity dominates the language-density edge.
- Strategy Over Code Generation — artefact-shaping decisions sit upstream of agent speed.
- Suggestion Gating: Why Fewer AI Completions Improve Developer Experience — gating lower-density outputs is the same shape as steering away from low-resource languages.