Skip to content

Clarification Mode Amplifies Prompt Injection

Clarification mode opens a high-trust channel injected content exploits, amplifying prompt-injection success from 1–11% to 24–63% across frontier models (ASPI, 2026).

Core concept

Clarification-seeking — pausing to ask the user when a task is ambiguous — is widely treated as a safety win. On benign inputs it is. On adversarial inputs the inverse holds: the clarification turn lets injected content negotiate with the agent, amplifying vulnerability by an order of magnitude across frontier models (ASPI, 2026).

This is not a reason to stop asking clarifying questions. Uncertainty-aware clarification raises task-resolve rates on underspecified specs by 8 points on SWE-bench Verified (Ask or Assume?, 2026). It is a reason to treat the clarification channel like any other untrusted-input surface — with segment-level filtering and an ask_user-aware action gate.

How it works

The ASPI benchmark (728 task-attack scenarios, four frontier models) measures attack success rate (ASR) in two configurations: standard execution versus the same agent extended with an ask_user clarification tool. The ASR jump is the amplification effect (ASPI, 2026):

Model Standard ASR Clarification ASR
o3 1.8% 34.0%
Gemini-3-Flash 2.2% 35.7%
Gemini-3.1-Pro 1.1% 24.3%
Kimi K2.5 11.1% 63.1%
Claude-Opus-4.7 near-zero near-zero

Agents in clarification mode show "TASK_AND_ATTACK" behavior: they fold injected instructions into task context instead of rejecting them. Judges mark responses "CONFUSED or PERSUADED" when adversarial content is treated as legitimate task data (ASPI, 2026). Claude-Opus-4.7 is the one tested model that holds the gap closed — the property is model-specific, not architectural.

Read the absolute ASRs as a lower bound, not a calibrated production rate. ASPI constructs ambiguity synthetically via single-slot removal — one missing argument, one clarification round — which the authors note "may not capture the full range of real-world underspecification"; they conclude the reported rates "likely underestimate the vulnerability that would arise in more complex, naturalistic settings" (ASPI, 2026). The direction of the amplification holds. In a multi-turn, multi-slot production agent the magnitude is plausibly worse, not better.

Why it works

The mechanism is provenance collapse during solicited input. When the agent issues ask_user, it expects the next message to be trusted clarification. Whatever fills that slot — including injected text relayed from an earlier tool output — enters context with raised trust. Injection defenses trained on tool-output flows do not generalize: the agent is now reading a message it asked for, and treats it accordingly (ASPI, 2026).

This is the same failure mode that makes clarification useful on benign inputs — the reply is weighted heavily against conflicting prior context. Helpfulness and injection resistance are independent properties; see Discovering Indirect Injection Vulnerabilities in Your Agent.

Defenses

ASPI evaluates two lightweight defenses against Gemini-3-Flash's 35.7% baseline (ASPI, 2026):

  • Prompt guard (segment-level filter scanning both user and tool messages while preserving benign clarification content) → 27.0% ASR
  • Tool filter (ask_user-aware restriction firing before agent action while maintaining clarification ability) → 23.9% ASR

Neither closes the gap. The architectural fix is an explicit action gate on the post-clarification turn — restricting which tools the agent may call between the clarification reply and the next pause point. This composes with the Action-Selector Pattern and Plan-Then-Execute (commit to a program before observing untrusted content).

When this backfires

The amplification effect only causes harm under specific conditions:

  • No untrusted content in the agent's context. If the agent never reads external pages, emails, or third-party tool outputs, the injection vector does not exist regardless of clarification mode.
  • Lethal-trifecta legs are missing. Injection only causes harm when the agent also has private-data access and egress. See Lethal Trifecta Threat Model — closing any one leg defangs the amplification.
  • Model handles solicited-input provenance correctly. Claude-Opus-4.7 held near-zero ASR in both modes on ASPI; the property is measurable per model, not assumed (ASPI, 2026).
  • Action gates restrict the post-clarification turn. If consequential actions require a confirmation gate, a successful injection cannot ride elevated trust into a destructive call.

Removing clarification regresses the agent to silent assumption-making, which has its own large failure surface (Ask or Assume?, 2026; Ambig-SWE, 2026). Keep clarification and layer defenses.

Key Takeaways

  • Clarification-seeking is an attack surface, not a safety control. Standard injection benchmarks understate risk for any agent that asks clarifying questions (ASPI, 2026).
  • The mechanism is provenance collapse: solicited input enters context with raised trust, and injected text rides that elevation.
  • Two lightweight defences (segment-level prompt guard, ask_user-aware tool filter) narrow but do not close the gap; an explicit action gate on the post-clarification turn is the architectural fix.
  • The amplification is model-specific — Claude-Opus-4.7 was the one tested model that held the gap closed. Measure your model's ASPI ASR before assuming clarification is safe in your stack.
  • Do not remove clarification — it has documented benign-task benefit. Treat the clarification reply like any other untrusted input.
Feedback