Context Poisoning: When Hallucinations Become Premises
Context poisoning is when an early hallucination becomes a trusted premise, and every later step builds confidently on that false foundation.
The Pattern
An agent hallucinates an incorrect detail early in a session -- a wrong API signature, a misidentified file, a nonexistent function. The error is not caught. Each subsequent step treats the hallucination as ground truth, compounding the original mistake.
How It Differs from Related Failures
| Failure Mode | What Goes Wrong |
|---|---|
| Context rot (Infinite Context) | Attention degrades as context grows |
| Objective Drift | Goal lost during summarisation |
| Distractor Interference | Wrong instruction attended |
| Context Poisoning | Wrong content treated as fact |
Why Detection Is Hard
Output remains coherent, confident, and internally consistent. The agent does not hedge or self-correct. Early mistakes trigger a cascade: each subsequent token is predicted from previously generated tokens, so an initial error compounds into a snowball of downstream errors (Chen et al., 2025).
Common Causes
| Cause | Mechanism |
|---|---|
| Model hallucination | Wrong API signature generated, then called in later steps |
| Stale code comments | Outdated comment treated as current behaviour |
| Contaminated user input | Hidden control characters or contradictory instructions in pasted text |
| Context overflow | Poisoned content gets disproportionate attention weight (Roo Code) |
The Propagation Chain
flowchart LR
A["Step 1: Agent reads codebase"] --> B["Step 2: Hallucinates function signature"]
B --> C["Step 3: Generates code using wrong signature"]
C --> D["Step 4: Error output enters context"]
D --> E["Step 5: Agent 'fixes' by adjusting around the hallucination"]
E --> F["Step 6: Deeper into wrong solution space"]
style B fill:#c0392b,color:#fff
style C fill:#e74c3c,color:#fff
style D fill:#e74c3c,color:#fff
style E fill:#e74c3c,color:#fff
style F fill:#e74c3c,color:#fff
Each step after the hallucination is locally reasonable given what is in context, which is why no single step looks wrong. In multi-agent systems the cascade crosses agent boundaries -- one agent's hallucination becomes another's trusted input (Lin et al., 2025).
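One way to reason about the chain above is to track which claims were verified and which were only derived from earlier claims. The sketch below is illustrative only: the `Claim` class and `is_grounded` function are hypothetical names, not part of any agent framework, and real context contents are not structured this neatly.

```python
from dataclasses import dataclass, field


@dataclass
class Claim:
    """A statement in the agent's context (hypothetical structure for illustration)."""
    text: str
    verified: bool = False  # True only if checked against a real source
    derived_from: list = field(default_factory=list)


def is_grounded(claim):
    """A claim is grounded if it was verified directly, or if it has
    premises and every one of them is grounded."""
    if claim.verified:
        return True
    return bool(claim.derived_from) and all(is_grounded(p) for p in claim.derived_from)


file_read = Claim("payments.py defines process_payment(amount, account_id)", verified=True)
hallucination = Claim("process_payment accepts an optional currency parameter")
refactor = Claim("callers should pass currency", derived_from=[hallucination])
tests = Claim("tests should mock currency", derived_from=[refactor, file_read])

print(is_grounded(file_read))  # True
print(is_grounded(tests))      # False: one premise traces back to an unverified claim
```

A single ungrounded premise is enough to make everything downstream ungrounded, even when other premises are verified.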
Example
A Claude Code session is tasked with refactoring a payment module. Early in the session, the agent reads the codebase and hallucinates that `process_payment()` accepts an optional `currency` parameter. It does not. The agent proceeds to:
- Refactor callers to pass `currency` explicitly
- Add currency conversion logic that relies on the nonexistent parameter
- Write tests that mock the parameter
- When tests fail, "fix" by adjusting the mock setup rather than questioning the premise
Forty tool calls deep, the developer reviews a diff full of changes built on a function signature that never existed. Every individual change is internally consistent. The root cause -- a hallucinated parameter in step 1 -- is buried in scroll-back.
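A cheap check at step 1 would have exposed the false premise before anything was built on it. The following is a minimal sketch, assuming the module is importable in Python; `process_payment` here is a hypothetical stand-in for the real function, and `declares_parameter` is an illustrative helper, not an existing API.

```python
import inspect


def process_payment(amount, account_id):
    """Hypothetical stand-in for the real function: no `currency` parameter."""
    return {"amount": amount, "account_id": account_id}


def declares_parameter(func, name):
    """Return True if `func` explicitly declares a parameter called `name`.

    Note: a **kwargs catch-all is not treated as declaring the parameter.
    """
    return name in inspect.signature(func).parameters


print(declares_parameter(process_payment, "amount"))    # True
print(declares_parameter(process_payment, "currency"))  # False
```

The point of the design is that the answer comes from the code itself, not from what the agent remembers about the code.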
Recovery
Corrective prompts patch the symptom, but the poisoned content remains in context, where it can re-activate on the next relevant step. The only reliable fix is a clean context: start a new session and re-anchor with verified ground truth before resuming (Roo Code).
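Re-anchoring can be as simple as building the new session's opening message from files read off disk rather than from anything the old session said. The sketch below assumes a plain-text prompt format; `build_anchor_prompt` and its layout are hypothetical, not a feature of any particular tool.

```python
from pathlib import Path


def build_anchor_prompt(task, verified_paths):
    """Assemble the opening message for a fresh session.

    Every file is read from disk at call time, so nothing from the
    poisoned session's context is carried over.
    """
    sections = [f"Task: {task}", "Verified ground truth (read from disk just now):"]
    for path in verified_paths:
        text = Path(path).read_text(encoding="utf-8")
        sections.append(f"--- {path} ---\n{text}")
    return "\n\n".join(sections)
```

A caller would pass the task description and the handful of files the developer has personally confirmed, then paste the result as the first message of the new session.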
When Mitigation Falls Short
Ground-truth checks and evaluator loops reduce context poisoning but do not eliminate it:
- Silent hallucinations: A structurally plausible but wrong value passes schema validation and re-reads without flagging.
- Multi-agent boundaries: Sub-agents trust the orchestrator's summary; a hallucination there propagates unchallenged.
- Context compaction: Summaries can re-inject the original hallucination as settled fact, so the error survives compaction. This is why partitioning work into clean sessions beats compacting a poisoned one.
Add human checkpoints at key decision boundaries for high-stakes tasks.
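The silent-hallucination case can be illustrated with a value that passes a shape check but fails a check against the real source. This is a sketch under stated assumptions: the source is Python, and `passes_schema`, `defined_in_source`, and the sample names are hypothetical.

```python
import ast


def passes_schema(value):
    """Shape-only check: a string that is a valid Python identifier."""
    return isinstance(value, str) and value.isidentifier()


def defined_in_source(name, source_text):
    """Ground-truth check: is a function with this name defined in the source?"""
    tree = ast.parse(source_text)
    return any(
        isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name
        for node in ast.walk(tree)
    )


source = "def process_payment(amount, account_id):\n    return amount\n"
claimed = "convert_currency"  # plausible-looking name the agent made up

print(passes_schema(claimed))              # True
print(defined_in_source(claimed, source))  # False
```

Schema validation answers "is this well-formed?", which a hallucination usually is. Only a comparison against the source answers "is this real?".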
Mitigation
| Strategy | Mechanism |
|---|---|
| Ground-truth checks | Re-read the real file each step; do not trust context memory (Anthropic) |
| Evaluator-optimizer | A second model evaluates output, breaking confirmation bias (Anthropic) |
| Pre-completion checklists | Middleware enforces verification before completion (LangChain) |
| Sub-agent isolation | Separate context windows prevent cross-task contamination (FlowHunt) |
| Externalise results | Write to files; disk is ground truth, context is lossy (FlowHunt) |
| Poka-yoke tool design | Require absolute paths, reject ambiguous identifiers (Anthropic) |
| Hard reset | New session rather than correcting within poisoned context (Roo Code) |
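As an illustration of the poka-yoke row, a tool wrapper can refuse inputs that leave room for guessing. The sketch below assumes POSIX-style paths; `validate_tool_path` is a hypothetical helper, not taken from any of the cited sources.

```python
from pathlib import PurePosixPath


def validate_tool_path(raw_path):
    """Accept only absolute paths with no parent-directory hops.

    Rejecting ambiguous input forces the agent to name the exact file
    instead of relying on a remembered working directory.
    """
    path = PurePosixPath(raw_path)
    if not path.is_absolute():
        raise ValueError(f"Path must be absolute, got: {raw_path!r}")
    if ".." in path.parts:
        raise ValueError(f"Path must not contain '..', got: {raw_path!r}")
    return str(path)


print(validate_tool_path("/repo/src/payments.py"))  # /repo/src/payments.py
```

A rejected call returns an error to the agent immediately, which is cheaper than discovering many steps later that it edited the wrong file.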
Key Takeaways
- A single early hallucination, once it enters context as a "fact," poisons every subsequent step: output stays coherent and confident while the foundation is false.
- Detection is hard precisely because the agent never hedges; corrective prompts patch symptoms, but the poisoned content lingers and can re-activate.
- The reliable fix is a clean context: start a new session and re-anchor on verified ground truth rather than correcting in place.