The Consistent Capability Fallacy

Capability on one task does not predict capability on a similar-seeming task — LLM performance is jagged, not consistent.

The Belief

After a model handles a complex task successfully, practitioners generalize: "this model is good at this kind of problem." They raise the autonomy level for subsequent tasks that appear related, skip per-task verification, and are surprised by failures on tasks that seem simpler than ones the model already passed.

Why It Fails

LLM capability is jagged, not smooth. A model may solve an International Mathematical Olympiad problem yet fail at multi-digit long division: the former resembles patterns that are heavily represented in training data, while the latter requires executing an algorithm step by step, which the model only approximates.

The model's performance profile is shaped by its training data distribution, not by any generalizable "skill level." Perceived difficulty is therefore a poor predictor of model difficulty: tasks that look harder to a human are sometimes easier for the model, and vice versa.

Concrete evidence of the failure mode: minor prompt wording changes cause ~15% accuracy swings, so even semantically equivalent inputs do not yield stable output.
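
One way to check wording sensitivity on your own tasks is to score the same labeled cases under several phrasings and compare the results. The sketch below is a minimal illustration, not a benchmark harness: `accuracy_by_variant`, `accuracy_spread`, and `fake_model` are hypothetical names, and `fake_model` is a deliberately exaggerated stand-in for a real model call so the example runs offline.

```python
from typing import Callable, Sequence


def accuracy_by_variant(
    ask: Callable[[str], str],
    templates: Sequence[str],
    cases: Sequence[tuple[str, str]],
) -> dict[str, float]:
    """Score each prompt template (with a {q} slot) on the same labeled cases."""
    scores: dict[str, float] = {}
    for template in templates:
        correct = sum(
            ask(template.format(q=question)).strip() == expected
            for question, expected in cases
        )
        scores[template] = correct / len(cases)
    return scores


def accuracy_spread(scores: dict[str, float]) -> float:
    """Gap between the best and worst template; 0.0 means wording did not matter."""
    return max(scores.values()) - min(scores.values())


# Hypothetical stand-in for a real model call, so the sketch runs offline.
# It answers correctly for only one phrasing, which exaggerates the effect.
_ANSWERS = {"2 + 2": "4", "3 * 3": "9"}


def fake_model(prompt: str) -> str:
    question = prompt.split(": ", 1)[1]
    return _ANSWERS[question] if prompt.startswith("Compute") else "unsure"


scores = accuracy_by_variant(
    fake_model,
    templates=["Compute: {q}", "What is the value of: {q}"],
    cases=list(_ANSWERS.items()),
)
print(scores, accuracy_spread(scores))
```

In practice, `ask` would wrap whatever model client you use. A large spread across paraphrases suggests that a single successful run says little about reliability on that task.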

The Compounding Risk

The natural language interface masks failures. Models produce plausible-looking, confident outputs even when the underlying reasoning is wrong, which creates false confidence in both the current output and the model's general reliability. The primary danger is not a model failing obviously — it is practitioners overestimating capability based on visible success.

Example

A team delegates a complex architectural refactoring task to Claude Code. The model navigates it well, restructuring several services with correct dependency handling. Encouraged, the team delegates a "simpler" task the next sprint: updating multi-step data validation logic across a module. This fails silently — the model propagates an incorrect assumption through all updated paths, and the output looks plausible. No one checks because the model "already proved itself" on a harder task.

The architectural task was heavily represented in training patterns. The validation logic required algorithmic precision the model approximated badly. From the model's perspective, these were not similar tasks.

When This Backfires

Treating all tasks as independent capability questions imposes overhead. This becomes counterproductive when:

  • The task class is narrow and well-characterized — for highly repetitive, formulaic operations (e.g., generating boilerplate CRUD endpoints against a fixed schema), repeated success is evidence that the training distribution covers the pattern well. Per-task verification is still warranted, but autonomy calibration based on prior runs is reasonable.
  • The domain has high benchmark saturation — tasks that appear verbatim or structurally in widely used public benchmarks (standard algorithm implementations, common regex patterns) show more stable performance than tasks in unseen problem spaces. The jaggedness is real, but not uniform across all task types.
  • Verification cost exceeds failure cost — for low-stakes, easily reverted operations, verifying every output can slow delivery more than the occasional failure costs. Scale re-verification to risk, and weigh this pattern's guidance against the practical verification budget.
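
The trade-off in the last bullet can be written as a rough expected-cost comparison. This is a sketch under stated assumptions: `should_verify` and its parameters are hypothetical, the costs share one arbitrary unit (for example, minutes of engineer time), and `p_failure` is an estimate that, given jagged capability, has to come from the same task type rather than from a seemingly related one.

```python
def should_verify(
    p_failure: float,
    failure_cost: float,
    verification_cost: float,
    catch_rate: float = 1.0,
) -> bool:
    """Return True when the expected loss avoided exceeds the cost of checking.

    p_failure:         estimated chance the output is wrong (assumed, per task type)
    failure_cost:      cost of an undetected failure, including cleanup
    verification_cost: cost of reviewing one output
    catch_rate:        fraction of failures the review is assumed to catch
    """
    if not 0.0 <= p_failure <= 1.0 or not 0.0 <= catch_rate <= 1.0:
        raise ValueError("p_failure and catch_rate must be in [0, 1]")
    expected_loss_avoided = p_failure * catch_rate * failure_cost
    return expected_loss_avoided > verification_cost


# Easily reverted formatting change: a failure is cheap, so a review costs more.
low_stakes = should_verify(p_failure=0.05, failure_cost=10, verification_cost=2)

# Silent data-validation bug: the same failure rate, but a far costlier failure.
high_stakes = should_verify(p_failure=0.05, failure_cost=5000, verification_cost=30)
```

With these illustrative numbers, `low_stakes` is `False` and `high_stakes` is `True`. The numbers are invented; the point is that the decision turns on failure cost and a per-task failure estimate, not on how capable the model seemed last time.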

The fallacy is most dangerous for tasks that appear familiar but require compositional reasoning the model has not practiced in exactly that combination.

Key Takeaways

  • Capability on task A does not predict capability on task B, even when A and B appear related to a human observer.
  • Calibrate autonomy level per task — not per session or per model version.
  • Treat each new task type as an independent capability question: verify before raising autonomy.
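
A minimal sketch of per-task calibration, assuming verified outcomes are recorded per task type: `AutonomyLedger`, its thresholds, and the level names are all hypothetical and chosen only for illustration. Evidence gathered on one task type never raises autonomy on another, and the most permissive level is still spot-checking rather than no verification.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AutonomyLedger:
    """Tracks verified outcomes per task type; thresholds are illustrative."""

    min_verified: int = 5        # verified runs required before relaxing checks
    min_pass_rate: float = 0.9   # pass rate required to relax checks
    _history: dict[str, list[bool]] = field(
        default_factory=lambda: defaultdict(list)
    )

    def record(self, task_type: str, passed: bool) -> None:
        """Store the result of a human- or test-verified run."""
        self._history[task_type].append(passed)

    def level(self, task_type: str) -> str:
        """Autonomy level for this task type only; unseen types start strict."""
        runs = self._history.get(task_type, [])
        if len(runs) < self.min_verified:
            return "verify_every_output"
        pass_rate = sum(runs) / len(runs)
        if pass_rate >= self.min_pass_rate:
            return "spot_check"
        return "verify_every_output"


ledger = AutonomyLedger()
for _ in range(5):
    ledger.record("architectural_refactor", passed=True)

refactor_level = ledger.level("architectural_refactor")
validation_level = ledger.level("data_validation_logic")
```

Here `refactor_level` is `"spot_check"` while `validation_level` stays at `"verify_every_output"`, because success on the refactoring task is not counted as evidence about the validation task.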