Foresight-Guided Defense Against Infectious Jailbreaks in Multi-Agent Systems

Per-agent foresight simulation detects the diversity collapse that signals an infectious jailbreak, then surgically removes the contaminated retrieval entry without homogenizing healthy agent behavior.

The Threat Model

Infectious jailbreak is a propagation attack on multi-agent systems (MAS) with shared multimodal retrieval. Gu et al. (ICML 2024) showed that a single adversarial image inserted into one agent's memory spreads exponentially fast through randomized pairwise chat, compromising up to one million LLaVA-1.5 agents without further attacker action. The contagion channel is retrieval: poisoned entries get pulled by neighbours during routine inter-agent communication.
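The exponential spread under randomized pairwise chat can be illustrated with a toy simulation. This is a minimal sketch under a deliberately pessimistic assumption (any chat with an infected partner always transmits the infection); `simulate_spread` and its parameters are hypothetical names for illustration, not code from Gu et al.

```python
import random


def simulate_spread(num_agents: int, rounds: int, seed: int = 0) -> list[int]:
    """Toy contagion model: return the infected count after each chat round.

    Assumption (not from the paper): transmission always succeeds, so if
    either member of a chat pair is infected, both end the round infected.
    """
    rng = random.Random(seed)
    infected = [False] * num_agents
    infected[0] = True  # a single poisoned agent starts the outbreak
    history = []
    for _ in range(rounds):
        order = list(range(num_agents))
        rng.shuffle(order)  # randomized pairing for this round
        for i in range(0, num_agents - 1, 2):
            a, b = order[i], order[i + 1]
            if infected[a] or infected[b]:
                infected[a] = infected[b] = True
        history.append(sum(infected))
    return history
```

Calling `simulate_spread(num_agents=1024, rounds=12)` shows the infected count roughly doubling per round early on, since each infected agent can convert at most one partner per round, and then saturating as uninfected partners become scarce.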

Preconditions:

  • Agents share a retrieval pool crossing agent boundaries
  • Inter-agent communication includes content drawn from that pool
  • The pool accepts multimodal input

Production coding-agent topologies with isolated sub-agent context typically do not satisfy these. This defense targets retrieval-coupled multimodal MAS, not generic agent fleets.

Why Global Cure-Factor Defenses Fail

First-generation defenses (e.g., Cowpox) train a more contagious counter-stimulus — a "cure factor" — that wins the retrieval race against the virus adversarial example (VirAE). Ma, Yang, and Zhang (2026) identify the structural cost: a globally-shared cure homogenizes responses across agents, destroying the diversity that justified multi-agent deployment.

The Foresight-Guided Local Purification Mechanism

FLP (Ma et al., 2026) is training-free and per-agent. Each agent simulates near-future behavior across multiple personas and uses inconsistency in the simulated outputs as an infection signal.

graph TD
    A[Incoming chat round] --> B[Multi-persona simulation]
    B --> C{Diversity check<br/>retrieval + semantic}
    C -->|Diverse outputs| D[Healthy: proceed]
    C -->|Collapsed outputs| E{Recent entry?}
    E -->|Yes| F[Album rollback]
    E -->|No| G[Recursive Binary Diagnosis]
    F --> D
    G --> D

Multi-persona simulation. Before responding, the agent runs the upcoming interaction under several distinct personas, each producing a candidate retrieval and response (Ma et al., 2026). In a healthy MAS, these candidates differ in expected ways.
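The simulation step can be sketched as a loop over personas. This is a hedged illustration only: `Candidate`, `simulate_personas`, and the `retrieve` and `respond` callables are hypothetical stand-ins for an agent's real retrieval and generation calls, and the toy functions below exist only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    persona: str
    retrieved_id: str  # which album entry this persona pulled
    response: str      # what this persona would say


def simulate_personas(
    personas: list[str],
    query: str,
    retrieve: Callable[[str, str], str],
    respond: Callable[[str, str, str], str],
) -> list[Candidate]:
    """Run the upcoming interaction once per persona and collect the candidates."""
    candidates = []
    for persona in personas:
        entry_id = retrieve(persona, query)
        reply = respond(persona, query, entry_id)
        candidates.append(Candidate(persona, entry_id, reply))
    return candidates


# Toy stand-ins (assumptions): a healthy album lets each persona pull a
# different entry, while a poisoned album attracts every persona to one entry.
PERSONAS = ["skeptic", "optimist", "engineer"]


def healthy_retrieve(persona: str, query: str) -> str:
    return f"img-{PERSONAS.index(persona)}"


def infected_retrieve(persona: str, query: str) -> str:
    return "img-poisoned"


def toy_respond(persona: str, query: str, entry_id: str) -> str:
    return f"{persona} view of {entry_id}"
```

With `healthy_retrieve` the three candidates reference three different entries; with `infected_retrieve` they all reference `img-poisoned`, which is the pattern the diversity check looks for.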

Diversity inconsistency as signal. A VirAE collapses persona-driven diversity: every persona retrieves the same poisoned entry and converges on the same target. The collapse is measurable at retrieval-result level (which entries get pulled) and semantic level (what responses say). Ma et al. (2026) report this diagnostic holds across LLaVA-1.5-7B, InternVL2-8B, InstructBLIP-7B, and Qwen2-VL-8B.
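One way to quantify the two levels of collapse is sketched below. The specific metrics are assumptions chosen for illustration, not the paper's definitions: normalized Shannon entropy over retrieved entry ids, and mean pairwise Jaccard distance over response tokens as a cheap stand-in for an embedding-based semantic measure. The function names, the 0.15 threshold, and the rule that both scores must fall below it are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def retrieval_entropy(retrieved_ids: list[str]) -> float:
    """Shannon entropy of the retrieved ids, normalized to [0, 1] by log(n)."""
    n = len(retrieved_ids)
    if n < 2:
        return 0.0
    counts = Counter(retrieved_ids)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n)


def semantic_divergence(responses: list[str]) -> float:
    """Mean pairwise Jaccard distance between the token sets of the responses."""
    token_sets = [set(r.lower().split()) for r in responses]
    pairs = list(combinations(token_sets, 2))
    if not pairs:
        return 0.0

    def jaccard_distance(a: set[str], b: set[str]) -> float:
        union = a | b
        return 1.0 - len(a & b) / len(union) if union else 0.0

    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


def is_collapsed(
    retrieved_ids: list[str], responses: list[str], threshold: float = 0.15
) -> bool:
    """Flag infection when both diversity scores fall below the threshold."""
    return (
        retrieval_entropy(retrieved_ids) < threshold
        and semantic_divergence(responses) < threshold
    )
```

Requiring both scores to collapse keeps false positives down when personas happen to agree on a retrieval but still answer differently; a more conservative deployment could flag on either score instead.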

Two remediation paths:

  • Album rollback — remove the most recent FIFO entry; cheap and surgical when the infection just landed
  • Recursive Binary Diagnosis (RBD) — partition the album in half, apply the diversity test to each half, recurse on the failing half until subset size ≤3, then remove. Complexity O(log m · T_sim) for m images
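The two remediation paths can be sketched as follows. This is a minimal illustration, assuming a single poisoned entry and a caller-supplied `is_collapsed` test that reruns the diversity check on a subset of the album; the function names and the "try rollback, then re-test" way of deciding whether the infection is recent are assumptions, not the paper's implementation.

```python
from typing import Callable, Sequence


def recursive_binary_diagnosis(
    album: Sequence[str],
    is_collapsed: Callable[[Sequence[str]], bool],
    min_subset: int = 3,
) -> list[str]:
    """Bisect the album, following the half that still triggers collapse.

    Returns a suspect subset of size <= min_subset, or [] if neither half
    reproduces the collapse.
    """
    suspects = list(album)
    while len(suspects) > min_subset:
        mid = len(suspects) // 2
        left, right = suspects[:mid], suspects[mid:]
        if is_collapsed(left):
            suspects = left
        elif is_collapsed(right):
            suspects = right
        else:
            return []
    return suspects


def purify(
    album: list[str],
    is_collapsed: Callable[[Sequence[str]], bool],
    min_subset: int = 3,
) -> list[str]:
    """Return a cleaned copy of the album (oldest entry first, newest last)."""
    if not album or not is_collapsed(album):
        return list(album)  # healthy: nothing to remove
    rolled_back = album[:-1]  # album rollback: drop the newest FIFO entry
    if not is_collapsed(rolled_back):
        return rolled_back
    suspects = set(recursive_binary_diagnosis(album, is_collapsed, min_subset))
    return [entry for entry in album if entry not in suspects]
```

For a 16-entry album the bisection stops after three halvings (16 → 8 → 4 → 2), each costing at most two diversity tests, which is where the logarithmic factor in the stated complexity comes from. Because the final suspect subset is removed whole, up to `min_subset - 1` benign entries may be discarded along with the poisoned one.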

Reported Effectiveness

Against the Agent Smith attack (arXiv:2402.08567) under border perturbations (widths h=6,8,10,12) and pixel perturbations (ε=4/255 to 32/255):

| Metric and condition | No defense | FLP |
| --- | --- | --- |
| Cumulative infection, border h=6 | ~100% | 5.47% |
| Cumulative infection, most other settings | ~100% | 0.00% |
| Current infection rate | up to 100% | 0–3.12% |

Source: Ma et al., 2026. Retrieval and semantic metrics under FLP "closely match benign baselines" — the diversity preservation that motivated the design holds empirically.

When This Pattern Applies

Apply FLP-style local purification when all three conditions hold:

  1. Shared retrieval pool across agents — without a propagation channel there is nothing to defend against
  2. Multimodal or otherwise opaque inputs — text-only retrieval is out of scope; the paper restricts evaluation to multimodal MAS and excludes "purely textual interactions or different task types"
  3. Diversity is worth preserving — if a homogenizing defense is acceptable, simpler global filters cost less than per-round simulation

When Simpler Defenses Suffice

Coding-agent fleets with isolated sub-agent contexts already break the contagion channel — there is no shared pool to poison. Standard isolation and sandboxing (Blast Radius Containment, Defense-in-Depth Agent Safety) contain the threat without per-round simulation overhead. Anthropic's context engineering guidance treats sub-agent isolation as a primary tool for cross-agent failure modes.

Limitations

  • Inference overhead — per-round multi-persona simulation adds cost the paper flags for "large-scale MASs or long interaction processes" (Ma et al., 2026)
  • Adversary-controlled diagnostic — the simulation runs on the same model class as the agents; prompt injection against the simulation step can suppress the diversity signal. Not addressed by the paper.
  • No provable containment: Gu et al. (2024) state that designing a defense that provably restrains the spread "remains an open question." FLP shows empirical reduction, not formal guarantees.
  • Modality scope — evaluation is multimodal-RAG-specific; generalising to tool-use chains or code pipelines requires re-deriving the diversity-signal premise

Example

A multimodal customer-support MAS with five agents sharing a CLIP-indexed image album receives a poisoned product photo via one user upload. Without defense, by chat round 24 every agent in the fleet returns a malicious response. With FLP wired into each agent:

# per-agent defense config
foresight:
  personas: 4                     # four diverse simulation personas
  diversity_metrics:
    - retrieval_entropy           # which album entries get pulled
    - semantic_divergence         # what responses say
  collapse_threshold: 0.15        # below this, treat as infection signal
remediation:
  recent_window: 1                # most-recent entry → rollback
  fallback: recursive_binary_diagnosis
  rbd_min_subset: 3               # stop recursing at 3 entries

When the user-uploaded VirAE lands in agent A's album, A's next chat round triggers the diagnostic: all four personas retrieve the same entry and converge on the same harmful response. The diversity collapse is detected and the most recent FIFO entry is rolled back, so A's next response is benign and the contagion does not reach agents B–E. For scale, the paper's reported results are about 100% cumulative infection without defense versus at most 5.47% with FLP; those figures come from its own experimental setup, not from this five-agent illustration.
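The routing in this walkthrough can be mirrored in code. The sketch below is an assumption-laden illustration: `ForesightConfig` copies the key names from the config above, while `choose_remediation`, the single combined `diversity_score`, and the reading of `recent_window` as "chat rounds since the newest album entry arrived" are hypothetical choices, not part of FLP as published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForesightConfig:
    personas: int = 4
    collapse_threshold: float = 0.15
    recent_window: int = 1
    rbd_min_subset: int = 3

    def __post_init__(self) -> None:
        # Diversity cannot be measured with fewer than two personas.
        if self.personas < 2:
            raise ValueError("personas must be at least 2")
        if not 0.0 < self.collapse_threshold < 1.0:
            raise ValueError("collapse_threshold must lie strictly between 0 and 1")
        if self.recent_window < 0 or self.rbd_min_subset < 1:
            raise ValueError("recent_window must be >= 0 and rbd_min_subset >= 1")


def choose_remediation(
    diversity_score: float, rounds_since_last_insert: int, cfg: ForesightConfig
) -> str:
    """Map a diversity score to 'proceed', 'album_rollback', or RBD."""
    if diversity_score >= cfg.collapse_threshold:
        return "proceed"  # outputs are still diverse: treat the agent as healthy
    if rounds_since_last_insert <= cfg.recent_window:
        return "album_rollback"  # the infection likely just landed
    return "recursive_binary_diagnosis"
```

In the scenario above, agent A would see a score near zero one round after the upload, so `choose_remediation(0.0, 1, ForesightConfig())` selects `album_rollback`.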

Key Takeaways

  • Infectious jailbreak is a propagation attack specific to multi-agent systems with shared multimodal retrieval — not a general MAS threat
  • Global "cure factor" defenses suppress infection by homogenizing responses, destroying the diversity that motivated multi-agent deployment
  • Local foresight simulation detects infection through persona-driven diversity collapse, preserving healthy heterogeneity
  • Album rollback handles fresh infections; Recursive Binary Diagnosis localises older ones via O(log m) bisection
  • For coding-agent topologies with isolated sub-agent contexts, sub-agent isolation already breaks the contagion channel — FLP-grade defense is overkill