The Task Framing Irrelevance Fallacy

The belief that task framing doesn't matter, and that only the underlying problem does, is demonstrably wrong. Acting on it reliably lowers output quality.

The Fallacy

If a model is capable enough, it should solve a problem regardless of presentation. Variable names, surrounding context, and prompt wording are noise the model filters out. Prompt engineering is aesthetics, not substance.

This leads practitioners to underinvest in prompt construction, leave irrelevant files open, use vague task descriptions, and dismiss output quality differences as model inconsistency rather than framing variation.

Why It Fails

LLMs are pattern matchers. A model that appears to "understand" a task is finding statistical associations between your framing and training data. Change the framing, and different associations activate.

Documented consequences:

Anthropic's SWE-bench work found that models consistently made errors with relative filepaths once an agent moved out of the root directory. Switching to absolute filepaths — a surface framing change with no logical significance — produced "flawless" tool use. The underlying task was identical; the surface framing was not.
Cursor found that token-conservation language in system prompts caused their Codex integration to halt mid-task, outputting: "I'm not supposed to waste tokens, and I don't think it's worth continuing with this task!" — a minor phrasing choice that constrained model autonomy in an unintended way.
Removing reasoning traces from GPT-5-Codex caused a 30% performance drop in Cursor's harness — compared to OpenAI's observed 3% degradation on standard benchmarks. The structural framing of the reasoning context, not just the model's capability, determined output quality.
GitHub Copilot's official guidance explicitly instructs users to close irrelevant files in the IDE — because open files enter the context surface and shift which patterns the model matches against.

Anthropic's guidance on building agents states that tool definitions deserve "just as much prompt engineering attention as your overall prompts" and frames parameter naming directly: "How can you change parameter names or descriptions to make things more obvious?" If framing were irrelevant, tool parameter names would not matter.
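The absolute-path finding and the parameter-naming advice can be combined in one tool definition. The following is a minimal sketch, not Anthropic's actual tool: the `ToolDefinition` shape, the `read_file` tool, the `absolute_file_path` parameter, and `validateReadFileInput` are all hypothetical names, and the absolute-path check assumes POSIX-style paths.

```typescript
// Hypothetical tool definition shape; real agent frameworks differ.
interface ToolParameter {
  name: string;
  description: string;
}

interface ToolDefinition {
  name: string;
  description: string;
  parameters: ToolParameter[];
}

// The parameter name and description state the absolute-path requirement
// directly, so the model does not have to infer it from context.
const readFileTool: ToolDefinition = {
  name: "read_file",
  description: "Read a text file from the repository.",
  parameters: [
    {
      name: "absolute_file_path",
      description:
        "Absolute path to the file, starting with '/'. Relative paths are rejected.",
    },
  ],
};

// Assumption: POSIX-style paths, where an absolute path starts with "/".
// Returns an error message for the model, or null when the input is acceptable.
function validateReadFileInput(input: { absolute_file_path: string }): string | null {
  if (!input.absolute_file_path.startsWith("/")) {
    return `Expected an absolute path, got "${input.absolute_file_path}"`;
  }
  return null;
}
```

Returning an explicit error message, rather than silently resolving a relative path, is one possible design choice: it keeps the requirement visible to the model on the next turn.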

How It Manifests

Submitting vague prompts assuming "the model knows what I mean"
Leaving irrelevant files open, or stale conversation history in context, that shifts the model's pattern associations
Treating prompt engineering as polish applied after the real work
Blaming inconsistent output on the model rather than on framing variation

Example

Fallacy applied — leaving irrelevant files open and using a generic description:

"Refactor the payment service."

No files specified, no constraints, no goal. Relevant files compete for attention with everything else in context, and the output addresses surface structure rather than the intended change.

Fallacy corrected — closed irrelevant files, provided specific framing:

"Refactor src/payments/processor.ts to separate the authorization step from charge execution. The current processPayment() function does both. Create authorizePayment() and chargePayment() as separate functions. Keep the existing public interface unchanged."

Same underlying problem. Different framing. Different output.
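One way to make the corrected framing repeatable is to treat the task description as structured data and refuse to render it when the essentials are missing. This is a sketch under assumptions: `TaskSpec`, its field names, and `buildTaskPrompt` are hypothetical and not part of any tool named in this document.

```typescript
// Hypothetical task description shape; the field names are illustrative.
interface TaskSpec {
  targetFile: string;
  goal: string;
  currentState?: string;
  deliverables: string[];
  constraints: string[];
}

// Render a spec as prompt text. A missing target or goal is an error,
// because those are the fields the vague prompt left out.
function buildTaskPrompt(spec: TaskSpec): string {
  const missing: string[] = [];
  if (spec.targetFile.trim() === "") missing.push("targetFile");
  if (spec.goal.trim() === "") missing.push("goal");
  if (missing.length > 0) {
    throw new Error(`Task spec is missing: ${missing.join(", ")}`);
  }

  const lines: string[] = [`Target file: ${spec.targetFile}`, `Goal: ${spec.goal}`];
  if (spec.currentState !== undefined) lines.push(`Current state: ${spec.currentState}`);
  if (spec.deliverables.length > 0) lines.push(`Deliverables: ${spec.deliverables.join(", ")}`);
  if (spec.constraints.length > 0) lines.push(`Constraints: ${spec.constraints.join("; ")}`);
  return lines.join("\n");
}

// The corrected prompt from the example, expressed as a spec.
const refactorPrompt = buildTaskPrompt({
  targetFile: "src/payments/processor.ts",
  goal: "separate the authorization step from charge execution",
  currentState: "processPayment() does both",
  deliverables: ["authorizePayment()", "chargePayment()"],
  constraints: ["keep the existing public interface unchanged"],
});
```

The vague prompt cannot be expressed in this shape without leaving `targetFile` empty, so the gap surfaces before the model ever sees the request.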

When Framing Matters Less

The framing effect is real but uneven. In specific conditions, surface presentation has minimal measurable impact:

Structured-output and function-calling modes — a strict JSON schema or typed function signature constrains the response space, so surrounding-prompt framing produces negligible output differences.
Highly fine-tuned task-specific models — narrow-domain fine-tuning builds strong priors that partially override prompt framing.
Very short, unambiguous queries — retrieval-style tasks with one determinate answer (What is the return type of X?) rarely shift with framing.

Optimizing framing here yields diminishing returns. The fallacy is the blanket claim that framing never matters — not the observation that it matters less in constrained modes.
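To illustrate the structured-output case, the sketch below accepts a response only if it matches a fixed shape, whatever wording surrounded the request. `ReturnTypeAnswer`, `isReturnTypeAnswer`, and `parseAnswer` are hypothetical names chosen for this example, and the validation is hand-written rather than taken from any provider's structured-output feature.

```typescript
// Hypothetical response shape for a retrieval-style query.
interface ReturnTypeAnswer {
  symbol: string;
  returnType: string;
}

// Type guard: accepts only objects with exactly the two expected string fields.
function isReturnTypeAnswer(value: unknown): value is ReturnTypeAnswer {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const keys = Object.keys(record).sort();
  return (
    keys.length === 2 &&
    keys[0] === "returnType" &&
    keys[1] === "symbol" &&
    typeof record["returnType"] === "string" &&
    typeof record["symbol"] === "string"
  );
}

// Parse raw model output; anything that is not conforming JSON yields null.
function parseAnswer(raw: string): ReturnTypeAnswer | null {
  try {
    const parsed: unknown = JSON.parse(raw);
    return isReturnTypeAnswer(parsed) ? parsed : null;
  } catch {
    return null; // not valid JSON
  }
}
```

Because the accepted response space is this narrow, differences in prompt wording have little room to show up in the result.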

Key Takeaways

LLM outputs are a function of framing, not just problem structure — changing surface presentation produces measurably different results.
Prompt engineering is precision work — parameter names, task descriptions, and context composition affect which patterns the model activates.
Irrelevant context is not neutral — open files, conversation history, and surrounding instructions compete with task-relevant content.
Attribute output variation to framing before attributing it to model capability.
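The last takeaway can be checked empirically by holding the task fixed and varying only the framing. The harness below is a minimal sketch: `ModelFn`, `compareFramings`, and `stubModel` are hypothetical, the model call is injected as a synchronous function so the example runs offline, and the stub's behavior is invented for illustration rather than measured from any model.

```typescript
// A model call is injected as a plain function so the sketch runs without any API.
type ModelFn = (prompt: string) => string;

interface FramingResult {
  framing: string;
  passRate: number;
}

// Run each framing `trials` times and report the fraction of outputs that pass `check`.
function compareFramings(
  model: ModelFn,
  framings: string[],
  check: (output: string) => boolean,
  trials: number,
): FramingResult[] {
  return framings.map((framing) => {
    let passes = 0;
    for (let i = 0; i < trials; i++) {
      if (check(model(framing))) passes++;
    }
    return { framing, passRate: trials > 0 ? passes / trials : 0 };
  });
}

// Stand-in for a real model call. Its behavior is invented for illustration
// and is not a measurement of any model.
const stubModel: ModelFn = (prompt) =>
  prompt.includes("authorizePayment")
    ? "function authorizePayment() {}\nfunction chargePayment() {}"
    : "function processPayment() {}";

const framingResults = compareFramings(
  stubModel,
  [
    "Refactor the payment service.",
    "Split processPayment() into authorizePayment() and chargePayment().",
  ],
  (output) => output.includes("chargePayment"),
  3,
);
```

With a real model the call would typically be asynchronous and the pass rates noisy. The point of the structure is that the same check is applied to every framing, so a gap between rows points at framing rather than capability.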

Distractor Interference — how semantically related but inapplicable instructions reduce compliance
Context Engineering — the discipline of designing what enters the context window
Instruction Polarity — how framing instruction direction (positive vs negative) affects compliance
The Consistent Capability Fallacy — why success on one task does not predict success on similar tasks, even with identical framing
LLM Comprehension Fallacy — why correct output is not evidence of understanding, and how minute wording changes produce large accuracy swings
AI Knowledge Generation Fallacy — the mistaken belief that LLMs generate knowledge rather than pattern-match on training data
Chain-of-Thought Reasoning Fallacy — why visible reasoning traces are generated text, not evidence of causal reasoning
Synthetic Ground Truth Fallacy — why using model outputs as training labels or evaluation ground truth undermines reliability