Semantic Collapse Under Underspecified Prompts

An underspecified prompt does not make a coding model guess diversely: the model collapses onto one confident interpretation, and when that interpretation is wrong, resampling cannot expose it.

What it looks like

You hand a coding model an ambiguous or incomplete task. To check whether the spec was clear, you sample the model a few times and inspect the spread. The outputs agree, so you conclude the task was well specified and ship. In fact the model resolved the ambiguity the wrong way, and every sample repeated the same wrong choice. You read low variance as clarity; it was collapse.

The failure has a name: detrimental semantic collapse (Richter & Papadakis, 2026). A model collapses when repeated samples fall into a single semantic equivalence class. Collapse is benign when that class is correct and detrimental when it is wrong — and a detrimental collapse "provides no signal of incorrectness or underspecification." Before any ambiguity is injected, roughly 3% of HumanEval tasks, 10–16% of MBPP tasks, and 18–32% of LiveCodeBench tasks already collapse detrimentally (Richter & Papadakis, 2026).
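The definition above can be made concrete with a minimal sketch. It assumes that "semantic equivalence" is approximated by identical behavior on a fixed set of probe inputs, and that a reference implementation is available, as it would be on a benchmark. The helper names, the probes, and the three sample candidates are all hypothetical illustrations, not the cited study's tooling.

```python
from collections import defaultdict

def behavior_signature(fn, probes):
    """Fingerprint a candidate by its outputs on a fixed list of probe inputs."""
    sig = []
    for args in probes:
        try:
            sig.append(repr(fn(*args)))
        except Exception as exc:  # a crash is also observable behavior
            sig.append(f"raises {type(exc).__name__}")
    return tuple(sig)

def classify_collapse(candidates, reference, probes):
    """Group candidates by behavior, then label the whole sample set."""
    classes = defaultdict(list)
    for fn in candidates:
        classes[behavior_signature(fn, probes)].append(fn)
    if len(classes) > 1:
        return "diverse"  # samples disagree, so variance carries a signal
    only_signature = next(iter(classes))
    if only_signature == behavior_signature(reference, probes):
        return "benign collapse"
    return "detrimental collapse"

# Hypothetical samples: three surface forms, one meaning (sort descending).
samples = [
    lambda xs: sorted(xs, reverse=True),
    lambda xs: sorted(xs)[::-1],
    lambda xs: list(reversed(sorted(xs))),
]
intended = lambda xs: sorted(xs)  # assumed intent: ascending order
probes = [([3, 1, 2],), ([],), ([5, 5, 4],)]

print(classify_collapse(samples, intended, probes))
```

Outside a benchmark there is no `reference` to compare against, so the last two labels are indistinguishable from the samples alone. That is the sense in which a detrimental collapse gives no signal.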

Why collapse hides the ambiguity

The model does not treat an ambiguous spec as a branch point. It resolves the ambiguity deterministically toward one interpretation, then generates against it. Sampling perturbs surface tokens but lands in the same meaning class, so the output variance that disambiguation methods depend on never appears (Richter & Papadakis, 2026). More sampling does not rescue you: even at 25 samples per task, 23.8% to 61.6% of tasks stayed collapsed, and reaching 95% confidence in collapse detection would need about 300 samples — over $391 per MBPP run.
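A back-of-envelope calculation shows why sample counts of this size are plausible. It rests on a simplifying assumption that is not the cited study's derivation: each sample independently lands outside the dominant meaning class with a small probability `p_escape`, so all n samples agree with probability (1 - p_escape)^n. The function name and the probabilities tried are illustrative.

```python
import math

def samples_needed(p_escape, confidence=0.95):
    """Smallest n for which at least one of n independent samples leaves the
    dominant class with probability >= confidence."""
    if not 0.0 < p_escape < 1.0:
        raise ValueError("p_escape must be strictly between 0 and 1")
    # Solve (1 - p_escape) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_escape))

for p in (0.10, 0.05, 0.01):
    print(f"escape probability {p:.2f}: {samples_needed(p)} samples")
```

Under this toy model, a per-sample escape probability near 1% already pushes the requirement to roughly 300 samples, the same order as the figure quoted above. The more strongly the prior dominates, the smaller that probability and the higher the sampling bill.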

Because the wrong interpretation is internally consistent, it compiles, passes the tests the model wrote for it, and reads as confident. This is the upstream cause of assumption propagation: the collapse sets the wrong premise, and later work builds on it.

When this backfires

Reading variance as a signal is not always wrong; the trap is specific to underspecified coding tasks where priors dominate. Sampling-based checks stay useful when:

  • The model genuinely branches. In free-form question answering, entropy over meaning-clustered samples (semantic entropy) does detect uncertainty (Kuhn et al., 2023). Variance is informative where the model actually spreads probability.
  • Any interpretation is acceptable. When the ambiguity is a true "don't-care," collapse is harmless — only about 3% of HumanEval tasks collapse detrimentally, so specifying every prompt to the letter wastes effort.
  • The code is low-stakes or throwaway. A wrong interpretation you can discard in seconds is cheaper than a specify-and-validate loop on every task.
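The contrast in the first bullet can be sketched with a count-based entropy over meaning clusters. This is a simplification: it assumes the clustering step (deciding which samples mean the same thing) has already happened, and it estimates cluster probabilities from raw counts, which may differ from the estimator in the cited work. The cluster labels are hypothetical.

```python
import math
from collections import Counter

def cluster_entropy(cluster_labels):
    """Shannon entropy (in nats) of the empirical distribution over meaning clusters."""
    counts = Counter(cluster_labels)
    total = sum(counts.values())
    return sum(-(c / total) * math.log(c / total) for c in counts.values())

# Free-form QA: the model spreads probability across distinct answers.
qa_clusters = ["paris", "paris", "lyon", "paris", "marseille"]
# Underspecified coding task: every sample lands in one meaning class.
code_clusters = ["desc_by_score_top10"] * 5

print(cluster_entropy(qa_clusters))
print(cluster_entropy(code_clusters))
```

The question-answering samples yield positive entropy, which flags uncertainty. The collapsed coding samples yield zero, which is exactly what a clear, well-specified task would also produce.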

The recommendation — specify intent explicitly and validate behavior against it — earns its cost when a wrong direction is expensive to unwind, not on every keystroke.

Example

Before — sample agreement read as a clear spec:

Task: "Return the top results, sorted." A developer samples the model five times. All five return results sorted descending by score and truncated to ten. The agreement is read as confidence that the spec was unambiguous, and the code ships.

The spec never said ten, never said descending, never said by score. The model collapsed onto one reading of "top" and "sorted" and repeated it across every sample. The consumer expected the top three by recency. No amount of resampling would have surfaced the gap.

After — specify intent, then validate behavior:

Pin the intent in the prompt ("top 3 by recency; ascending is fine") and assert the expected behavior with a spec-derived test, rather than inferring clarity from sample agreement.
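A spec-derived check for this example might look like the sketch below. The record fields (`id`, `score`, `created`), the function names, and the sample data are assumptions made for illustration, and "ascending is fine" is read here as "the order of the three results does not matter".

```python
from datetime import datetime

def top_results_collapsed(items):
    """The reading every sample agreed on: top ten, descending by score."""
    return sorted(items, key=lambda r: r["score"], reverse=True)[:10]

def top_results_specified(items, n=3):
    """The pinned intent: the n most recent items."""
    return sorted(items, key=lambda r: r["created"], reverse=True)[:n]

def meets_spec(fn, items, expected_ids):
    """Check behavior against the written spec, not against other samples."""
    result = fn(items)
    return len(result) == len(expected_ids) and {r["id"] for r in result} == expected_ids

# Hypothetical data: high scores are old, recent items score low.
items = [
    {"id": i, "score": score, "created": datetime(2024, 1, day)}
    for i, (score, day) in enumerate(
        [(0.9, 1), (0.8, 2), (0.1, 10), (0.2, 11), (0.3, 12)]
    )
]
three_most_recent = {2, 3, 4}  # written down from the spec by hand

print(meets_spec(top_results_collapsed, items, three_most_recent))
print(meets_spec(top_results_specified, items, three_most_recent))
```

The expected ids come from the stated intent rather than from the model's output, so the check fails on the collapsed reading no matter how many samples agreed on it.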

Key Takeaways

  • Low output variance is not evidence of a clear spec: an underspecified coding prompt can collapse onto one confident, wrong interpretation instead of diverging.
  • Detrimental semantic collapse leaves no signal: the wrong reading compiles, passes model-written tests, and survives heavy resampling (23.8% to 61.6% of tasks still collapsed at 25 samples).
  • Sampling for variance still detects uncertainty where the model genuinely branches, such as free-form question answering with semantic entropy — the blind spot is underspecified code.
  • Specify intent explicitly and validate behavior against a spec-derived check; do not trust self-consistency to expose ambiguity.