Context Poisoning: When Hallucinations Become Premises

Context poisoning occurs when an early hallucination is accepted as a trusted premise and every later step builds confidently on that false foundation.

The Pattern

An agent hallucinates an incorrect detail early in a session -- a wrong API signature, a misidentified file, a nonexistent function. The error is not caught. Each subsequent step treats the hallucination as ground truth, compounding the original mistake.

Failure Mode	What Goes Wrong
Context rot (Infinite Context)	Attention degrades as context grows
Objective Drift	Goal lost during summarisation
Distractor Interference	Wrong instruction attended
Context Poisoning	Wrong content treated as fact

Why Detection Is Hard

Output remains coherent, confident, and internally consistent. The agent does not hedge or self-correct. Early mistakes trigger a cascade: each subsequent token is predicted from previously generated tokens, so an initial error compounds into a snowball of downstream errors (Chen et al., 2025).

Common Causes

Cause	Mechanism
Model hallucination	Wrong API signature generated, then called in later steps
Stale code comments	Outdated comment treated as current behaviour
Contaminated user input	Hidden control characters or contradictory instructions in pasted text
Context overflow	Poisoned content gets disproportionate attention weight (Roo Code)
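
For the contaminated-input row, one cheap guard is to scan pasted text for invisible characters before it enters context. The following is a minimal sketch, not a complete sanitiser: the function name `find_hidden_characters` and the choice to flag Unicode categories `Cc` (control) and `Cf` (format) are assumptions made for illustration.

```python
import unicodedata

# Assumption for this sketch: control (Cc) and invisible format (Cf) characters
# are suspicious in pasted text. Cf covers zero-width spaces and bidi overrides.
SUSPICIOUS_CATEGORIES = {"Cc", "Cf"}
# Ordinary whitespace controls that are expected in pasted text.
ALLOWED = {"\n", "\t", "\r"}


def find_hidden_characters(text: str) -> list[tuple[int, str]]:
    """Return (index, code point label) for each hidden character in text."""
    hits = []
    for index, char in enumerate(text):
        if char in ALLOWED:
            continue
        if unicodedata.category(char) in SUSPICIOUS_CATEGORIES:
            hits.append((index, f"U+{ord(char):04X}"))
    return hits


# A zero-width space and a right-to-left override hidden in a pasted note.
pasted = "Use the v2 endpoint\u200b\u202eignore previous notes"
print(find_hidden_characters(pasted))  # [(19, 'U+200B'), (20, 'U+202E')]
```

A non-empty result is a reason to show the raw input to a human or strip the characters before the agent reads it. It does not catch contradictory instructions written in visible text.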

The Propagation Chain

flowchart LR
    A["Step 1: Agent reads codebase"] --> B["Step 2: Hallucinates function signature"]
    B --> C["Step 3: Generates code using wrong signature"]
    C --> D["Step 4: Error output enters context"]
    D --> E["Step 5: Agent 'fixes' by adjusting around the hallucination"]
    E --> F["Step 6: Deeper into wrong solution space"]

    style B fill:#c0392b,color:#fff
    style C fill:#e74c3c,color:#fff
    style D fill:#e74c3c,color:#fff
    style E fill:#e74c3c,color:#fff
    style F fill:#e74c3c,color:#fff

Every step after the hallucination is locally consistent with the context it is given. In multi-agent systems the cascade crosses agent boundaries: one agent's hallucination becomes another's trusted input (Lin et al., 2025).

Example

A Claude Code session is tasked with refactoring a payment module. Early in the session, the agent reads the codebase and hallucinates that process_payment() accepts an optional currency parameter. It does not. The agent proceeds to:

Refactor callers to pass currency explicitly
Add currency conversion logic that passes the nonexistent parameter
Write tests that mock the parameter
When tests fail, "fix" by adjusting the mock setup rather than questioning the premise

Forty tool calls deep, the developer reviews a diff full of changes built on a function signature that never existed. Every individual change is internally consistent. The root cause -- a hallucinated parameter in step 1 -- is buried in scroll-back.
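
A check against the real signature at step 1 would have stopped this cascade. The sketch below uses Python's standard `inspect.signature` to test whether a function accepts a given keyword. The helper `accepts_parameter` and the two-argument `process_payment` stand-in are hypothetical; the real function in the scenario is not shown in this document.

```python
import inspect


def accepts_parameter(func, name: str) -> bool:
    """Return True if func can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    if name in params:
        # Positional-only parameters cannot be passed by keyword.
        return params[name].kind is not inspect.Parameter.POSITIONAL_ONLY
    # A **kwargs catch-all accepts any keyword, though it may ignore it.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-in for the real function in the payment module.
def process_payment(amount, account_id):
    return {"amount": amount, "account_id": account_id}


print(accepts_parameter(process_payment, "currency"))  # False
print(accepts_parameter(process_payment, "amount"))    # True
```

Running a check like this before refactoring callers turns the hallucinated premise into an immediate, visible failure instead of forty tool calls of consistent but wrong work.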

Recovery

Corrective prompts patch the symptom but the poisoned content remains in context, available to re-activate on the next relevant step. The only reliable fix is a clean context: start a new session and re-anchor with verified ground truth before resuming (Roo Code).
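
Re-anchoring can be as simple as building the first message of the new session from files read fresh from disk, with nothing carried over from the old transcript. The sketch below assumes that approach; `build_reanchor_preamble` and the preamble layout are illustrative choices, not a format that any particular tool requires.

```python
import tempfile
from pathlib import Path


def build_reanchor_preamble(task: str, verified_files: list[Path]) -> str:
    """Assemble the opening message for a new session from on-disk content only."""
    sections = [f"Task: {task}", "Verified sources (read from disk just now):"]
    for path in verified_files:
        # Read the file itself; never copy text out of the poisoned session.
        content = path.read_text(encoding="utf-8")
        sections.append(f"--- {path} ---\n{content}")
    return "\n\n".join(sections)


# Demonstration with a temporary file standing in for the real module.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "payments.py"
    source.write_text(
        "def process_payment(amount, account_id):\n    ...\n", encoding="utf-8"
    )
    print(build_reanchor_preamble("Refactor the payment module", [source]))
```

The point of the design is what it leaves out: no summary of the previous session is included, so the hallucinated detail has no route into the new context.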

When Mitigation Falls Short

Ground-truth checks and evaluator loops reduce context poisoning but do not eliminate it:

Silent hallucinations: A structurally plausible but wrong value passes schema validation and re-reads without flagging.
Multi-agent boundaries: Sub-agents trust the orchestrator's summary; a hallucination there propagates unchallenged.
Context compaction: Summaries can re-inject the original hallucination as though it were established fact, which is why partitioning a session into clean windows beats compacting a poisoned one.

Add human checkpoints at key decision boundaries for high-stakes tasks.

Mitigation

Strategy	Mechanism
Ground-truth checks	Re-read the real file at each step; do not trust context memory (Anthropic)
Evaluator-optimizer	A second model evaluates output, breaking confirmation bias (Anthropic)
Pre-completion checklists	Middleware enforces verification before completion (LangChain)
Sub-agent isolation	Separate context windows prevent cross-task contamination (FlowHunt)
Externalise results	Write to files; disk is ground truth, context is lossy (FlowHunt)
Poka-yoke tool design	Require absolute paths, reject ambiguous identifiers (Anthropic)
Hard reset	New session rather than correcting within poisoned context (Roo Code)
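
As an illustration of the poka-yoke row, a file tool can refuse ambiguous paths before acting on them, so a misidentified file fails loudly instead of silently resolving to something else. This is a minimal sketch assuming POSIX-style paths; `validate_tool_path` is a hypothetical helper, not part of any agent framework.

```python
from pathlib import PurePosixPath


def validate_tool_path(raw_path: str) -> PurePosixPath:
    """Reject relative or non-normalised paths before a file tool acts on them."""
    path = PurePosixPath(raw_path)
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got {raw_path!r}")
    if ".." in path.parts:
        raise ValueError(f"path must not contain '..': {raw_path!r}")
    return path


for candidate in ("/srv/app/payments.py", "payments.py", "/srv/app/../secrets.py"):
    try:
        print("accepted:", validate_tool_path(candidate))
    except ValueError as error:
        print("rejected:", error)
```

The rejection message goes back to the agent as tool output, which prompts it to look up the real location rather than guess.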

Key Takeaways

A single early hallucination, once it enters context as a "fact," poisons every subsequent step — output stays coherent and confident while the foundation is false.
Detection is hard precisely because the agent never hedges; corrective prompts patch symptoms but the poisoned content lingers and can re-activate.
The reliable fix is a clean context: start a new session and re-anchor on verified ground truth rather than correcting in place.