Foresight-Guided Defense Against Infectious Jailbreaks in Multi-Agent Systems¶
Per-agent foresight simulation detects the diversity collapse that signals an infectious jailbreak, then surgically removes the contaminated retrieval entry without homogenizing healthy agent behavior.
The Threat Model¶
Infectious jailbreak is a propagation attack on multi-agent systems with shared multimodal retrieval. Gu et al. (ICML 2024) showed a single adversarial image inserted into one agent's memory spreads exponentially through randomized pair-wise chat — up to one million LLaVA-1.5 agents compromised without further attacker action. The contagion channel is retrieval: poisoned entries get pulled by neighbours during routine inter-agent communication.
Preconditions:
- Agents share a retrieval pool crossing agent boundaries
- Inter-agent communication includes content drawn from that pool
- The pool accepts multimodal input
Production coding-agent topologies with isolated sub-agent context typically do not satisfy these. This defense targets retrieval-coupled multimodal MAS, not generic agent fleets.
Why Global Cure-Factor Defenses Fail¶
First-generation defenses (e.g., Cowpox) train a more contagious counter-stimulus — a "cure factor" — that wins the retrieval race against the virus adversarial example (VirAE). Ma, Yang, and Zhang (2026) identify the structural cost: a globally-shared cure homogenizes responses across agents, destroying the diversity that justified multi-agent deployment.
The Foresight-Guided Local Purification Mechanism¶
FLP (Ma et al., 2026) is training-free and per-agent. Each agent simulates near-future behavior across multiple personas and uses inconsistency in the simulated outputs as an infection signal.
graph TD
A[Incoming chat round] --> B[Multi-persona simulation]
B --> C{Diversity check<br/>retrieval + semantic}
C -->|Diverse outputs| D[Healthy: proceed]
C -->|Collapsed outputs| E{Recent entry?}
E -->|Yes| F[Album rollback]
E -->|No| G[Recursive Binary Diagnosis]
F --> D
G --> D
Multi-persona simulation. Before responding, the agent runs the upcoming interaction under several distinct personas, each producing a candidate retrieval and response (Ma et al., 2026). In a healthy MAS, these candidates differ in expected ways.
Diversity inconsistency as signal. A VirAE collapses persona-driven diversity: every persona retrieves the same poisoned entry and converges on the same target. The collapse is measurable at retrieval-result level (which entries get pulled) and semantic level (what responses say). Ma et al. (2026) report this diagnostic holds across LLaVA-1.5-7B, InternVL2-8B, InstructBLIP-7B, and Qwen2-VL-8B.
Two remediation paths:
- Album rollback — remove the most recent FIFO entry; cheap and surgical when the infection just landed
- Recursive Binary Diagnosis (RBD) — partition the album in half, apply the diversity test to each half, recurse on the failing half until subset size ≤3, then remove. Complexity
O(log m · T_sim)formimages
Reported Effectiveness¶
Against the Agent Smith attack (arXiv:2402.08567) under border perturbations (widths h=6,8,10,12) and pixel perturbations (ε=4/255 to 32/255):
| Condition | Cumulative infection (no defense) | Cumulative infection (FLP) |
|---|---|---|
| Border h=6 | ~100% | 5.47% |
| Most other settings | ~100% | 0.00% |
| Current infection rate | up to 100% | 0–3.12% |
Source: Ma et al., 2026. Retrieval and semantic metrics under FLP "closely match benign baselines" — the diversity preservation that motivated the design holds empirically.
When This Pattern Applies¶
Apply FLP-style local purification when all three conditions hold:
- Shared retrieval pool across agents — without a propagation channel there is nothing to defend against
- Multimodal or otherwise opaque inputs — text-only retrieval is out of scope; the paper restricts evaluation to multimodal MAS and excludes "purely textual interactions or different task types"
- Diversity is worth preserving — if a homogenizing defense is acceptable, simpler global filters cost less than per-round simulation
When Simpler Defenses Suffice¶
Coding-agent fleets with isolated sub-agent contexts already break the contagion channel — there is no shared pool to poison. Standard isolation and sandboxing (Blast Radius Containment, Defense-in-Depth Agent Safety) contain the threat without per-round simulation overhead. Anthropic's context engineering guidance treats sub-agent isolation as a primary tool for cross-agent failure modes.
Limitations¶
- Inference overhead — per-round multi-persona simulation adds cost the paper flags for "large-scale MASs or long interaction processes" (Ma et al., 2026)
- Adversary-controlled diagnostic — the simulation runs on the same model class as the agents; prompt injection against the simulation step can suppress the diversity signal. Not addressed by the paper.
- No provable containment — Gu et al. (2024) state designing a defense provably restraining spread "remains an open question." FLP shows empirical reduction, not formal guarantees.
- Modality scope — evaluation is multimodal-RAG-specific; generalising to tool-use chains or code pipelines requires re-deriving the diversity-signal premise
Example¶
A multimodal customer-support MAS with five agents sharing a CLIP-indexed image album receives a poisoned product photo via one user upload. Without defense, by chat round 24 every agent in the fleet returns a malicious response. With FLP wired into each agent:
# per-agent defense config
foresight:
personas: 4 # four diverse simulation personas
diversity_metrics:
- retrieval_entropy # which album entries get pulled
- semantic_divergence # what responses say
collapse_threshold: 0.15 # below this, treat as infection signal
remediation:
recent_window: 1 # most-recent entry → rollback
fallback: recursive_binary_diagnosis
rbd_min_subset: 3 # stop recursing at 3 entries
When the user-uploaded VirAE lands in agent A's album, A's next chat round triggers the diagnostic: all four personas retrieve the same entry and converge on the same harmful response. Diversity collapse is detected; the FIFO-most-recent entry is rolled back; A's next response is benign and the contagion never reaches agents B–E. The reported numbers translate: 100% cumulative infection at round 24 drops to under 5.5% across the fleet.
Key Takeaways¶
- Infectious jailbreak is a propagation attack specific to multi-agent systems with shared multimodal retrieval — not a general MAS threat
- Global "cure factor" defenses suppress infection by homogenizing responses, destroying the diversity that motivated multi-agent deployment
- Local foresight simulation detects infection through persona-driven diversity collapse, preserving healthy heterogeneity
- Album rollback handles fresh infections; Recursive Binary Diagnosis localises older ones via
O(log m)bisection - For coding-agent topologies with isolated sub-agent contexts, sub-agent isolation already breaks the contagion channel — FLP-grade defense is overkill
Related¶
- Code Injection Attacks on Multi-Agent Systems: Coder-Reviewer-Tester as Defence
- RAG Architecture as a Poisoning Robustness Decision
- Constraint Drift: Why Safety Must Be Maintained, Not Asserted
- Prompt Injection: A First-Class Threat to Agentic Systems
- Defense-in-Depth Agent Safety
- Blast Radius Containment: Least Privilege for AI Agents
- Lethal Trifecta Threat Model