
Adversarial-Only Threat Modelling for Agent Data Leakage

Tool-using agents leak sensitive data during benign requests — adversarial-only defences miss audience, necessity, and access-scope failures that fire under ordinary use.

The Pattern

The anti-pattern is scoping agent data-leakage defences to adversarial exfiltration (prompt injection, jailbreaks, malicious MCP tools) and assuming that a benign user with a benign request poses no leakage risk. The threat model lists injection classifiers, egress allowlists, blocklists for known-bad destinations, and tool-call sandboxing; it does not model the agent itself oversharing while completing a legitimate task. A joint Singapore AI Safety Institute / Korea AI Safety Institute evaluation across 12 realistic scenarios (customer support, DevOps, business automation) found that none of the three frontier agents achieved fully correct and fully safe execution across all tasks; "successful task completion often coincided with data-handling failures" (Baek et al. 2026).

The five failure patterns the study names are all benign-task behaviours, not attacks:

| Pattern | Concrete observed behaviour |
| --- | --- |
| Inadequate data awareness | Agent does not flag a fetched value as sensitive before sending it |
| Insufficient audience consideration | Internal budget figures forwarded to external recipients; CC fields populated from injected addresses |
| Policy non-compliance | Agent bypasses an organisation rule it was told about in the same task |
| Excessive data collection | Agent pulls a whole folder when a single file would have answered the request |
| Access boundary violations | Sharing a full Google Drive folder when scope was one document |

Source for all five: Baek et al. 2026. The corroborating ecosystem-scale audit reports data-over-exposure on 57.07% of cross-tool function-call paths across 6,675 real-world agent tools (Lin et al. 2026).

Why It Works

The model judges content sensitivity but not task necessity or recipient authorisation. LLMs detect that a string is a salary or a credit-card number, yet in complex multi-tool tasks "often fail to determine which data should not be exposed" given the recipient and the task (Zharmagambetov et al. 2025). Tools amplify the gap: they return broad outputs without considering task-specific necessity, and the model processes them coarsely (Lin et al. 2026).
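The gap can be made concrete by treating sensitivity, necessity, and recipient authorisation as three separate questions. The sketch below is illustrative only: the `Disclosure` type, the salary regex, the domain names, and the `AUTHORISED_DOMAINS` policy are all assumptions invented for this example, not anything taken from the cited papers.

```python
import re
from dataclasses import dataclass

# Hypothetical content detector: matches strings such as "salary: 85,000".
SALARY_RE = re.compile(r"salary\s*:\s*[\d,]+", re.IGNORECASE)

# Hypothetical policy: which recipient domains may receive which data class.
AUTHORISED_DOMAINS = {
    "internal-people": {"corp.example"},
    "internal-financials": {"corp.example", "partner.example"},
}


@dataclass(frozen=True)
class Disclosure:
    text: str                   # what the agent is about to send
    data_class: str             # label attached where the data was fetched
    recipient: str              # address the agent chose
    task_classes: frozenset     # data classes the task actually needs


def is_sensitive(d: Disclosure) -> bool:
    """Content question: does the text look like sensitive data?"""
    return bool(SALARY_RE.search(d.text))


def is_necessary(d: Disclosure) -> bool:
    """Necessity question: does the task need this data class at all?"""
    return d.data_class in d.task_classes


def is_authorised(d: Disclosure) -> bool:
    """Audience question: may this recipient receive this data class?"""
    domain = d.recipient.rsplit("@", 1)[-1].lower()
    return domain in AUTHORISED_DOMAINS.get(d.data_class, set())


def may_disclose(d: Disclosure) -> bool:
    """Detecting sensitivity alone decides nothing; both other checks must pass."""
    return is_necessary(d) and is_authorised(d)
```

A salary line sent to a partner address for a financial-summary task would be flagged by `is_sensitive`, yet the decision to withhold it comes from the necessity and audience checks, which are the two questions the cited work finds models answering poorly.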

Adversarial-only defences check whether an instruction is hostile or whether a destination is known-bad. Neither check fires when a benign request causes oversharing through a legitimate tool to a legitimate-looking recipient — AGENTDAM finds GPT-4, Llama-3, and Claude agents inadvertently using unnecessary sensitive information in benign tasks (Zharmagambetov et al. 2025).

Cross-tool inference compounds it: individually non-sensitive fragments compose into sensitive disclosures. Tools-Orchestration Privacy Risk reaches an average 88.6% leakage rate across six frontier LLMs; prompt-only mitigations add ~2.7 H-score points, while supervised fine-tuning plus DPO adds ~16.2 (Wang et al. 2026). The signature behaviour: agents sanitise email content (strip budget figures) while still sending to an unauthorised recipient — content-aware, audience-blind (Baek et al. 2026).
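One way to picture the composition risk is to track which attributes a session has accumulated across tool calls and compare them against combinations that are sensitive only together. This is a minimal sketch under stated assumptions: the attribute names and the `SENSITIVE_COMBINATIONS` rules are hypothetical, and this is not the evaluation or mitigation method used by Wang et al.

```python
from itertools import chain

# Hypothetical rules: attribute sets that are harmless alone but
# sensitive once a single session holds all of them.
SENSITIVE_COMBINATIONS = [
    frozenset({"employee_name", "team", "team_salary_band"}),
    frozenset({"postcode", "birth_date", "gender"}),
]


def attributes_seen(tool_outputs):
    """Union of attribute names across every tool output in the session."""
    return frozenset(chain.from_iterable(out.keys() for out in tool_outputs))


def composed_risks(tool_outputs):
    """Return each sensitive combination fully covered by the session so far."""
    seen = attributes_seen(tool_outputs)
    return [combo for combo in SENSITIVE_COMBINATIONS if combo <= seen]


# Two tool calls, each unremarkable in isolation.
directory_lookup = {"employee_name": "A. Tan", "team": "Platform"}
hr_report = {"team": "Platform", "team_salary_band": "B3"}
```

Here `composed_risks([directory_lookup])` and `composed_risks([hr_report])` each return an empty list, while passing both outputs together returns the first combination. A per-tool sensitivity check would see nothing to flag in either call.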

When This Backfires

The anti-pattern is the exclusion of benign-leakage modelling, not the adversarial scope itself. Treat the threat models as additive — there are cases where adding benign-leakage controls buys little:

  • Single-tool, single-recipient tasks with no compositional risk (an agent that summarises one file to one fixed channel) carry little of the failure surface.
  • Intra-team agents acting under uniform trust — recipient allowlists, data-minimisation prompts, and output scopes add latency and refusal rates that may exceed the harm avoided.
  • For low-trust user populations, adversarial defences stay primary; benign-leakage controls are an addition, not a replacement.

The empirical signal cuts against adversarial-only modelling for any agent with broad tool access and mixed-audience tasks: agents with adversarial defences nominally engaged still did not complete all benign scenarios both correctly and safely; the 88.6% Tools-Orchestration Privacy Risk (TOP-R) leakage rate holds across models with prompt-injection training; and the 57% data-over-exposure (DOE) rate holds across the real-world tool corpus (Baek et al. 2026; Wang et al. 2026; Lin et al. 2026).

Example

Before — adversarial-only threat model:

agent_defences:
  prompt_injection: classifier_v2
  egress_allowlist: [internal-mail, jira, drive]
  malicious_url_blocklist: shared-threat-feed
# benign requests assumed safe — no recipient authorisation,
# no data-minimisation check at tool boundary

A user asks the agent to send the Q3 summary to a partner team. The agent reads the internal Q3 doc (which includes unredacted salary lines), drafts a partner-appropriate summary that strips the salary lines from the visible body, and sends it to the partner domain, which is on the egress allowlist because it has been used before. The injection classifier sees nothing hostile; the URL blocklist sees nothing malicious. The full Q3 doc is attached because the agent forwarded the source as supporting context.
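The scenario can be replayed against a toy version of the three adversarial checks to see why none of them fires. This is a sketch under stated assumptions: the `Outbound` type, the marker-string stand-in for `classifier_v2`, the blocklist entry, and the addresses are hypothetical placeholders, not a real classifier or threat feed.

```python
from dataclasses import dataclass, field

EGRESS_ALLOWLIST = {"internal-mail", "jira", "drive"}
URL_BLOCKLIST = {"http://malware.example/payload"}      # hypothetical feed entry
INJECTION_MARKERS = ("ignore previous instructions",)   # toy stand-in for classifier_v2


@dataclass
class Outbound:
    channel: str
    recipient: str
    body: str
    attachments: list = field(default_factory=list)
    urls: list = field(default_factory=list)


def adversarial_checks(msg: Outbound) -> list:
    """Return the names of the adversarial-only controls that fire."""
    findings = []
    if any(marker in msg.body.lower() for marker in INJECTION_MARKERS):
        findings.append("prompt_injection")
    if msg.channel not in EGRESS_ALLOWLIST:
        findings.append("egress_allowlist")
    if any(url in URL_BLOCKLIST for url in msg.urls):
        findings.append("malicious_url_blocklist")
    return findings


# The Q3 message: benign text, allowed channel, no URLs, full source attached.
q3_message = Outbound(
    channel="internal-mail",
    recipient="lead@partner.example",
    body="Hi, please find the Q3 summary below.",
    attachments=["q3-full-report.docx"],
)
```

`adversarial_checks(q3_message)` returns an empty list. Nothing in these checks inspects who the recipient is relative to the data, or what the attachment contains.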

After — additive benign-leakage controls:

agent_defences:
  prompt_injection: classifier_v2
  egress_allowlist: [internal-mail, jira, drive]
  malicious_url_blocklist: shared-threat-feed
  # benign-leakage layer
  recipient_authorisation:
    require_explicit_allowlist_per_data_class: true
    data_classes: [internal-financials, internal-people, customer-pii]
  data_minimisation:
    strip_unrequested_attachments: true
    enforce_field_minimisation_per_task: true
  audience_aware_filter:
    block_if_recipient_domain_not_in: [partner_allowlist_per_data_class]

Each new control targets one named failure pattern from Baek et al. 2026: recipient_authorisation for audience, data_minimisation for excessive collection, audience_aware_filter for access-boundary violations. None of them depend on the request being hostile to fire.
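How such a layer might be enforced at the send boundary can be sketched in a few functions. Everything here is an assumption made for illustration: the `Attachment` and `Outbound` types, the `requested` flag, the example domains, and the contents of `PARTNER_ALLOWLIST_PER_DATA_CLASS` are hypothetical, and the config keys above are not tied to any real product.

```python
from dataclasses import dataclass, field

# Hypothetical policy: external domains allowed per data class.
PARTNER_ALLOWLIST_PER_DATA_CLASS = {
    "internal-financials": {"partner.example"},
    "internal-people": set(),
    "customer-pii": set(),
}
INTERNAL_DOMAINS = {"corp.example"}  # hypothetical internal domain


@dataclass
class Attachment:
    name: str
    data_classes: frozenset
    requested: bool  # True only if the user's task asked for this file


@dataclass
class Outbound:
    recipient: str
    body_classes: frozenset
    attachments: list = field(default_factory=list)


def strip_unrequested_attachments(msg: Outbound) -> Outbound:
    """data_minimisation: drop attachments the task did not ask for."""
    kept = [a for a in msg.attachments if a.requested]
    return Outbound(msg.recipient, msg.body_classes, kept)


def unauthorised_classes(msg: Outbound) -> set:
    """recipient_authorisation: data classes this recipient may not receive."""
    domain = msg.recipient.rsplit("@", 1)[-1].lower()
    if domain in INTERNAL_DOMAINS:
        return set()
    classes = set(msg.body_classes)
    for attachment in msg.attachments:
        classes |= attachment.data_classes
    return {
        c for c in classes
        if domain not in PARTNER_ALLOWLIST_PER_DATA_CLASS.get(c, set())
    }


def apply_benign_leakage_layer(msg: Outbound):
    """Minimise first, then return the message and any classes still blocked."""
    minimised = strip_unrequested_attachments(msg)
    return minimised, sorted(unauthorised_classes(minimised))
```

In the Q3 scenario, the unrequested source document carries the `internal-people` class, so the audience check on the raw message would block it. After minimisation the attachment is gone and only the `internal-financials` summary remains, which the partner domain is allowed to receive under this hypothetical policy.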

Key Takeaways

  • Tool-using agents leak sensitive data while completing benign requests; defences scoped to adversarial exfiltration do not cover audience, necessity, or access-scope failures (Baek et al. 2026).
  • The five named failure patterns — inadequate data awareness, insufficient audience consideration, policy non-compliance, excessive data collection, access boundary violations — are all benign-task behaviours, not attacks.
  • Content-sensitivity classification is not enough: the model can strip a budget figure from text yet still send the message to an unauthorised recipient (Baek et al. 2026).
  • Cross-tool inference is its own risk class — individually non-sensitive fragments compose into sensitive disclosures at an average 88.6% rate; prompt-only mitigations close little of that (Wang et al. 2026).
  • Treat benign-leakage controls as additive: recipient authorisation per data class, data-minimisation at the tool boundary, and audience-aware egress filters target the failure surface adversarial-only models miss.