The Model Preference Fallacy

Models don't have preferences — bare-chat tallies measure prompt framing and training-data distribution, not a stable model "preference."

The Belief

A genre of comparison content claims to surface which language, framework, or library a model "prefers." The methodology is consistent: run N prompts in an empty chat window, tabulate which option each model reaches for, publish a heatmap. Practitioners then let those tallies drive real decisions — model routing, stack selection, default scaffolding choices.

The implicit belief is that the tally measures a stable, transferable property of the model. It does not. [Source: Models don't have preferences, they have context]

Why It Fails

LLMs generate text conditioned on their input. Output token probabilities are a function of the prompt context, not of any internal preference state. When a model "prefers" React in an empty chat, the likeliest explanation is that React dominates its training corpus for questions phrased that way, so the most probable completion lands on React. Add a Svelte project file and the same model recommends Svelte. Same model, same question, different answer. [Source: Models don't have preferences, they have context]

The variance is not noise. Sclar et al. measured prompt-formatting changes producing performance swings of up to 76 accuracy points on LLaMA-2-13B and ~10 points on average across 50+ tasks — not eliminated by scale, instruction tuning, or few-shot examples. [Source: Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design] Multimodal-model work documents substantial format-driven accuracy variance as a stand-alone phenomenon. [Source: Promptception] ProSA introduces PromptSensiScore, a stand-alone sensitivity metric over semantically equivalent prompts — premised on a single-prompt evaluation being intrinsically misleading. [Source: ProSA] The genre's "what does model X prefer?" tally is precisely the single-prompt evaluation that line of work argues against.
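To make "sensitivity over semantically equivalent prompts" concrete, here is a minimal sketch that reports the accuracy spread (max minus min) for one model across several prompt formats. It is not the exact metric from Sclar et al. or ProSA; the function name `format_spread`, the format strings, and the accuracy numbers are all hypothetical and purely illustrative.

```python
from statistics import mean


def format_spread(accuracy_by_format: dict[str, float]) -> float:
    """Max minus min accuracy across prompt formats for one model and task."""
    scores = list(accuracy_by_format.values())
    return max(scores) - min(scores)


# Hypothetical accuracies (percent) for one model on one task under four
# semantically equivalent prompt formats. Illustrative numbers only.
accuracy_by_format = {
    "Q: {q}\nA:": 62.0,
    "Question: {q}\nAnswer:": 71.0,
    "QUESTION - {q} ANSWER -": 48.0,
    "{q}\n\nAnswer::": 66.0,
}

spread = format_spread(accuracy_by_format)
print(f"mean accuracy: {mean(accuracy_by_format.values()):.2f}")
print(f"spread across formats: {spread:.1f} points")
```

A comparison that reports only one of these four rows hides the spread entirely, which is the gap a single-prompt tally leaves open.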

How It Manifests

Reading a "model X prefers framework Y" heatmap and routing all framework-Y work to model X, without checking whether the comparison controlled for prompt framing.
Choosing a stack on the basis of which model "likes" it most in a bare-chat survey, then being surprised that the model performs worse than expected once project context is added.
Conflating training-distribution prior (the model has seen more of X) with capability advantage on your project (the model will produce better X code for your codebase).
Citing a sample of bare-chat tallies as evidence in a model-routing decision for a real workload.

A Reading Checklist

Before letting a "model X prefers Y" claim drive a routing or stack-selection decision, ask:

Question	What it screens for
What was varied?	If only the model changed and the prompt did not, the result confounds model behavior with prompt-framing artifacts.
What was held fixed?	A defensible comparison fixes the project context (files attached, system prompt, conversation history) across models. A bare-chat comparison fixes nothing meaningful.
Was framing controlled?	Look for explicit prompt-text disclosure and at least one paraphrase or format variant. PromptSensiScore-style variance reporting is the gold standard; one prompt is a red flag. [Source: ProSA]
Sample size and seeds?	Sampling temperature and a small N produce unstable tallies. A 100-prompt comparison without seeds or temperature reporting is closer to a vibe-check than a measurement.
Does the tally generalize to your context?	A bare-chat preference predicts your project outcome only when your project context is empty. If you have real files, the prior gets overridden. [Source: Models don't have preferences, they have context]
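The "held fixed" and "framing controlled" rows of the checklist can be sketched as a small harness that tallies a model's choices over every combination of context, prompt paraphrase, and seed. This is a sketch under stated assumptions: the `ask` callable and its signature are hypothetical, and `fake_ask` is a toy stub standing in for a real model call, so the output shows only the shape of a controlled tally, not real model behavior.

```python
from collections import Counter
from itertools import product
from typing import Callable

# Hypothetical signature: (model, context, prompt, seed) -> chosen option.
AskFn = Callable[[str, str, str, int], str]


def tally(ask: AskFn, model: str, contexts: list[str],
          prompts: list[str], seeds: list[int]) -> dict[str, Counter]:
    """Count choices per context, over every prompt variant and seed."""
    results: dict[str, Counter] = {c: Counter() for c in contexts}
    for context, prompt, seed in product(contexts, prompts, seeds):
        results[context][ask(model, context, prompt, seed)] += 1
    return results


def fake_ask(model: str, context: str, prompt: str, seed: int) -> str:
    """Toy stub: returns Svelte when the context mentions it, else React."""
    if "svelte" in context.lower():
        return "Svelte"
    return "React"  # plays the role of the bare-chat prior in this stub


# An empty context (bare chat) versus a context describing a Svelte project.
contexts = ["", "package.json lists svelte and @sveltejs/kit"]
prompts = [
    "Which framework should I use for this UI?",
    "Recommend a frontend framework for this project.",
]
counts = tally(fake_ask, "model-a", contexts, prompts, seeds=[0, 1, 2])
for context, counter in counts.items():
    print(repr(context), dict(counter))
```

Reporting the tally per context, rather than as one pooled number, makes it visible whether a "preference" survives once project context is attached.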

Why It Works

LLM outputs are conditional probability distributions over tokens given the prompt — there is no separate "preference" state to query. The bare-chat tally captures one slice of that distribution: the prior with minimal conditioning, dominated by training-corpus frequency. Sclar et al.'s 76-point format-driven swing is direct empirical evidence that the surface framing, not a stable preference, is doing the work. [Source: Sclar et al.] Add real project context and you condition on a much richer prefix — the prior is overridden, which is why the same model that "prefers" React in an empty chat recommends Svelte once Svelte files are attached. [Source: Models don't have preferences, they have context]
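The same point can be written in symbols. The notation is introduced here for illustration and is not from the cited sources: $q$ is the question, $c$ is the context (attached files, system prompt, history), and $P_\theta$ is the model's output distribution.

```latex
\hat{y}_{\text{bare}} = \arg\max_{y} P_\theta\left(y \mid q,\; c = \varnothing\right),
\qquad
\hat{y}_{\text{project}} = \arg\max_{y} P_\theta\left(y \mid q,\; c = c_{\text{project}}\right)
```

A bare-chat tally estimates the first quantity, while a routing decision depends on the second. Nothing in the model forces the two to agree.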

When This Backfires

Hard skepticism of every model-comparison claim imposes its own cost. The fallacy framing applies most strongly when:

You are routing for a real, project-contextual workload. Bare-chat tallies don't predict performance once project files dominate the context.
The candidate models score within a few points on the same prompt format. Within-noise tally differences are not actionable.
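One rough way to judge whether a tally gap is "within noise" is to put a confidence interval around each model's proportion and see whether the intervals overlap. The sketch below uses the standard Wilson score interval; the tally counts are hypothetical, and the overlap check is a conservative heuristic rather than a formal significance test. It also covers sampling noise only, not the prompt-format variance discussed earlier.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (roughly 95% when z=1.96) for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical tallies: times each model chose TypeScript out of 100 prompts.
model_a = wilson_interval(54, 100)
model_b = wilson_interval(61, 100)

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("overlap:", intervals_overlap(model_a, model_b))
```

With these illustrative counts the two intervals overlap, so a 54-versus-61 split out of 100 prompts would not by itself justify routing work to one model over the other.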

It applies less strongly when:

The decision is reversible and low-stakes. A five-person startup choosing between Copilot and Cursor often spends less by guessing from a tally and switching later than by running a full eval. Apply the checklist in proportion to the decision's reversibility.
The output mode is strictly constrained. Strict JSON schemas, typed function-calling, and tool-definition modes constrain the response space enough that bare-chat framing variance largely washes out — the structural constraint dominates the preference signal.
The "preference" comes from an evaluation method robust to surface variance. "Flaw or Artifact?" argues that much of the apparent prompt sensitivity in the literature stems from heuristic evaluation methods — log-likelihood scoring and rigid answer matching — that fail to recognise semantically equivalent responses; LLM-as-a-Judge evaluations show reduced variance and more consistent rankings. [Source: Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs] A comparison that uses an LLM judge tolerant to phrasing differences is on firmer ground than a strict-match heatmap.

Example

A team reads a heatmap claiming "Model A prefers Python, Model B prefers TypeScript" and routes all their TypeScript work to Model B. The comparison was run in fresh chat windows with no project context attached — a bare-chat tally.

What the team should have asked: what was varied, and what was held fixed? The comparison varied only the model and asked "what language would you use to build X?" without attaching any project files. The reported "preference" is therefore the model's training-distribution prior given an empty context; the likeliest explanation is simply that Model B saw more TypeScript text in training. It says nothing about which model produces better code in the team's existing TypeScript codebase, where the project files dominate the context and override the prior.

What an actual eval would show: when the team runs both models against 20 real PRs from their codebase, Model A — which "prefers" Python in the bare-chat tally — produces fewer type errors and follows the project's barrel-export convention more consistently. The bare-chat tally pointed in the wrong direction because the project context, not the prior, is what determined real performance. For the structured way to run that eval, see benchmark-driven tool selection.

Key Takeaways

Bare-chat tallies of what a model reaches for measure prompt framing × training-data distribution, not a stable "preference."
Prompt-formatting changes alone produce performance swings up to 76 accuracy points on standardised benchmarks; one-prompt evaluations are intrinsically misleading.
Before letting a "model X prefers Y" claim drive a routing or stack decision, run the reading checklist: what was varied, what was held fixed, was framing controlled, sample size, and does the result generalize to your context.
Training-distribution priors and project-contextual capability are different things — conflating them produces bad routing and stack choices.
The fallacy framing is strongest for real, project-contextual workloads; it softens for low-stakes reversible decisions and strictly constrained output modes.

Benchmark-Driven Tool Selection for Code Generation — once you've decoded the comparison claim, here's the constructive way to actually evaluate
LLM-Driven Benchmark Auditing — treat the comparison artefact itself as a software artefact and audit it before trusting the number
The Task Framing Irrelevance Fallacy — the same mechanism viewed from the prompt-engineering side: surface framing measurably changes output
The Consistent Capability Fallacy — capability on one task does not predict capability on another, even for the same model
Perceived Model Degradation — another case of mistaking prompt-context effects for stable model properties