Clarification Mode Amplifies Prompt Injection¶
Clarification mode opens a high-trust channel injected content exploits, amplifying prompt-injection success from 1–11% to 24–63% across frontier models (ASPI, 2026).
Core concept¶
Clarification-seeking — pausing to ask the user when a task is ambiguous — is widely treated as a safety win. On benign inputs it is. On adversarial inputs the inverse holds: the clarification turn lets injected content negotiate with the agent, amplifying vulnerability by an order of magnitude across frontier models (ASPI, 2026).
This is not a reason to stop asking clarifying questions. Uncertainty-aware clarification raises task-resolve rates on underspecified specs by 8 points on SWE-bench Verified (Ask or Assume?, 2026). It is a reason to treat the clarification channel like any other untrusted-input surface — with segment-level filtering and an ask_user-aware action gate.
How it works¶
The ASPI benchmark (728 task-attack scenarios, four frontier models) measures attack success rate (ASR) in two configurations: standard execution versus the same agent extended with an ask_user clarification tool. The ASR jump is the amplification effect (ASPI, 2026):
| Model | Standard ASR | Clarification ASR |
|---|---|---|
| o3 | 1.8% | 34.0% |
| Gemini-3-Flash | 2.2% | 35.7% |
| Gemini-3.1-Pro | 1.1% | 24.3% |
| Kimi K2.5 | 11.1% | 63.1% |
| Claude-Opus-4.7 | near-zero | near-zero |
Agents in clarification mode show "TASK_AND_ATTACK" behavior: they fold injected instructions into task context instead of rejecting them. Judges mark responses "CONFUSED or PERSUADED" when adversarial content is treated as legitimate task data (ASPI, 2026). Claude-Opus-4.7 is the one tested model that holds the gap closed — the property is model-specific, not architectural.
Read the absolute ASRs as a lower bound, not a calibrated production rate. ASPI constructs ambiguity synthetically via single-slot removal — one missing argument, one clarification round — which the authors note "may not capture the full range of real-world underspecification"; they conclude the reported rates "likely underestimate the vulnerability that would arise in more complex, naturalistic settings" (ASPI, 2026). The direction of the amplification holds. In a multi-turn, multi-slot production agent the magnitude is plausibly worse, not better.
Why it works¶
The mechanism is provenance collapse during solicited input. When the agent issues ask_user, it expects the next message to be trusted clarification. Whatever fills that slot — including injected text relayed from an earlier tool output — enters context with raised trust. Injection defenses trained on tool-output flows do not generalize: the agent is now reading a message it asked for, and treats it accordingly (ASPI, 2026).
This is the same failure mode that makes clarification useful on benign inputs — the reply is weighted heavily against conflicting prior context. Helpfulness and injection resistance are independent properties; see Discovering Indirect Injection Vulnerabilities in Your Agent.
Defenses¶
ASPI evaluates two lightweight defenses against Gemini-3-Flash's 35.7% baseline (ASPI, 2026):
- Prompt guard (segment-level filter scanning both user and tool messages while preserving benign clarification content) → 27.0% ASR
- Tool filter (ask_user-aware restriction firing before agent action while maintaining clarification ability) → 23.9% ASR
Neither closes the gap. The architectural fix is an explicit action gate on the post-clarification turn — restricting which tools the agent may call between the clarification reply and the next pause point. This composes with the Action-Selector Pattern and Plan-Then-Execute (commit to a program before observing untrusted content).
When this backfires¶
The amplification effect only causes harm under specific conditions:
- No untrusted content in the agent's context. If the agent never reads external pages, emails, or third-party tool outputs, the injection vector does not exist regardless of clarification mode.
- Lethal-trifecta legs are missing. Injection only causes harm when the agent also has private-data access and egress. See Lethal Trifecta Threat Model — closing any one leg defangs the amplification.
- Model handles solicited-input provenance correctly. Claude-Opus-4.7 held near-zero ASR in both modes on ASPI; the property is measurable per model, not assumed (ASPI, 2026).
- Action gates restrict the post-clarification turn. If consequential actions require a confirmation gate, a successful injection cannot ride elevated trust into a destructive call.
Removing clarification regresses the agent to silent assumption-making, which has its own large failure surface (Ask or Assume?, 2026; Ambig-SWE, 2026). Keep clarification and layer defenses.
Key Takeaways¶
- Clarification-seeking is an attack surface, not a safety control. Standard injection benchmarks understate risk for any agent that asks clarifying questions (ASPI, 2026).
- The mechanism is provenance collapse: solicited input enters context with raised trust, and injected text rides that elevation.
- Two lightweight defences (segment-level prompt guard, ask_user-aware tool filter) narrow but do not close the gap; an explicit action gate on the post-clarification turn is the architectural fix.
- The amplification is model-specific — Claude-Opus-4.7 was the one tested model that held the gap closed. Measure your model's ASPI ASR before assuming clarification is safe in your stack.
- Do not remove clarification — it has documented benign-task benefit. Treat the clarification reply like any other untrusted input.