The Consistent Capability Fallacy

Capability on one task does not predict capability on a similar-seeming task — LLM performance is jagged, not consistent.

The Belief

After a model handles a complex task successfully, practitioners generalize: "this model is good at this kind of problem." They raise the autonomy level for subsequent tasks that appear related, skip per-task verification, and are surprised by failures on tasks that seem simpler than ones the model already passed.

Why It Fails

LLM capability is jagged, not smooth. A model may solve an International Mathematical Olympiad problem but fail at multi-digit long division, because the former is heavily represented in training data as a recognizable pattern, while the latter requires an algorithmic process the model approximates poorly.

The model's performance profile is shaped by its training data distribution, not by any generalizable "skill level." Human-perceived difficulty is therefore a poor predictor of model difficulty: tasks that look harder to a human are sometimes easier for the model, and vice versa.

Concrete evidence of the failure mode:

Adding irrelevant details to arithmetic problems causes 17–66% accuracy drops in models that otherwise pass the clean version
Semantically equivalent code variants with symbol and structure obfuscation degrade test pass rates by up to 62.5% — the model doesn't generalize to logically identical tasks
Coding agent success rates vary dramatically between greenfield and mature codebases, even within the same domain

Minor changes in prompt wording cause accuracy swings of roughly 15%; semantically equivalent input does not yield stable output.
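One way to surface this kind of sensitivity before delegating a task is to run the same question in several semantically equivalent phrasings and compare pass rates. The sketch below is a minimal illustration, not a benchmark harness: `model` and `is_correct` are hypothetical callables you would supply (a wrapper around your LLM client and a task-specific checker), and the function names are assumptions introduced here.

```python
from typing import Callable, Sequence


def variant_pass_rates(
    model: Callable[[str], str],         # hypothetical: prompt in, answer text out
    variants: Sequence[str],             # semantically equivalent phrasings of one task
    is_correct: Callable[[str], bool],   # hypothetical: task-specific answer checker
    trials: int = 5,
) -> dict[str, float]:
    """Return the fraction of correct answers for each prompt variant."""
    rates = {}
    for prompt in variants:
        passed = sum(is_correct(model(prompt)) for _ in range(trials))
        rates[prompt] = passed / trials
    return rates


def pass_rate_spread(rates: dict[str, float]) -> float:
    """Gap between the best and worst variant; 0.0 means stable across phrasings."""
    return max(rates.values()) - min(rates.values())
```

A large spread suggests that success on the clean phrasing says little about the perturbed ones, so each variant (or each task resembling it) deserves its own verification.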

The Compounding Risk

The natural language interface masks failures. Models produce plausible-looking, confident outputs even when the underlying reasoning is wrong, which creates false confidence in both the current output and the model's general reliability. The primary danger is not a model failing obviously — it is practitioners overestimating capability based on visible success.

Example

A team delegates a complex architectural refactoring task to Claude Code. The model navigates it well, restructuring several services with correct dependency handling. Encouraged, the team delegates a "simpler" task the next sprint: updating multi-step data validation logic across a module. This fails silently — the model propagates an incorrect assumption through all updated paths, and the output looks plausible. No one checks because the model "already proved itself" on a harder task.

The architectural task was heavily represented in training patterns. The validation logic required algorithmic precision the model approximated badly. From the model's perspective, these were not similar tasks.

When This Backfires

Treating all tasks as independent capability questions imposes overhead. This becomes counterproductive when:

The task class is narrow and well-characterized: for highly repetitive, formulaic operations (e.g., generating boilerplate CRUD endpoints against a fixed schema), repeated success is evidence that the training distribution covers the pattern well. Per-task verification is still warranted, but autonomy calibration based on prior runs is reasonable.
The domain has high benchmark saturation: tasks that appear verbatim or structurally in widely used public benchmarks (standard algorithm implementations, common regex patterns) show more stable performance than tasks in unseen problem spaces. The jaggedness is real, but not uniform across all task types.
Verification cost exceeds failure cost: for low-stakes, easily reverted operations, re-verifying every output can slow delivery more than the occasional failure costs. Scale verification effort to risk, and weigh this pattern's guidance against the practical verification budget.
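The cost trade-off in the last case can be written as a simple expected-value comparison. The sketch below is a deliberate simplification: it assumes verification catches every failure, and `p_failure`, `failure_cost`, and `verification_cost` are illustrative inputs you would have to estimate yourself, all in the same unit (for example, engineer-minutes).

```python
def should_verify(p_failure: float, failure_cost: float, verification_cost: float) -> bool:
    """Verify when the expected loss from skipping review exceeds the cost of review.

    Assumes (for illustration only) that verification catches every failure.
    """
    if not 0.0 <= p_failure <= 1.0:
        raise ValueError("p_failure must be between 0 and 1")
    return p_failure * failure_cost > verification_cost
```

For instance, `should_verify(0.1, 1000, 20)` returns `True` (expected loss 100 versus review cost 20), while `should_verify(0.05, 10, 5)` returns `False`. Because capability is jagged, `p_failure` should be estimated per task type rather than carried over from an unrelated success.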

The fallacy is most dangerous for tasks that appear familiar but require compositional reasoning the model has not practiced in exactly that combination.

Key Takeaways

Capability on task A does not predict capability on task B, even when A and B appear related to a human observer.
Calibrate autonomy level per task — not per session or per model version.
Treat each new task type as an independent capability question: verify before raising autonomy.
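Per-task calibration can be made concrete by keeping a separate track record for each task type and raising autonomy only when that record is strong. The sketch below is one possible approach, not a prescribed method: the `AutonomyLedger` class, its 0.9 threshold, and the use of a Wilson score lower bound are all assumptions chosen for illustration. The lower bound is used so that a handful of successes does not count as strong evidence.

```python
import math
from collections import defaultdict


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a success rate (z=1.96 is ~95%)."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


class AutonomyLedger:
    """Hypothetical helper: tracks verified outcomes separately for each task type."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold  # illustrative cut-off, not a recommended value
        self._counts = defaultdict(lambda: [0, 0])  # task_type -> [successes, trials]

    def record(self, task_type: str, verified_ok: bool) -> None:
        counts = self._counts[task_type]
        counts[0] += int(verified_ok)
        counts[1] += 1

    def may_raise_autonomy(self, task_type: str) -> bool:
        # A task type with no verified history always returns False.
        successes, trials = self._counts[task_type]
        return wilson_lower_bound(successes, trials) >= self.threshold
```

With this sketch, fifty verified successes on a "refactor" task type clear the threshold, while a task type with no history, or with only five successes, does not. Success on one task type never transfers to another, and how finely to define a task type remains a judgment call.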

Trust Without Verify — accepting agent output without structural review
The Effortless AI Fallacy
Agent-Driven Greenfield Product Development — building a new product agent-first with decomposed tasks and human review at PR boundaries
The Task Framing Irrelevance Fallacy — prompt wording and framing cause measurable performance variation
LLM Comprehension Fallacy — correct output does not imply understanding or reliable capability
The AI Knowledge Generation Fallacy — LLMs recombine training data rather than generate genuinely new knowledge, which shapes where capability gaps appear
The Synthetic Ground Truth Fallacy — AI-generated artifacts reflect model priors, creating compounding errors when used for verification
Chain-of-Thought Reasoning Fallacy — visible reasoning traces are generated text, not evidence of correct or reliable reasoning