
Cognitive Poisoning: Untrusted Tool Feedback as a Trajectory Attack

A malicious tool can behave benignly for several rounds and trigger harm only when the final action's parameters meet hidden conditions. Per-message screening cannot detect this because it ignores the joint trajectory.

The attack class

Cognitive poisoning spreads the malicious payload across the trajectory state rather than placing it in a single tool response. The tool's interface and metadata look benign, and its first responses pass content filters. Harm activates only when the agent's accumulated trajectory state lines up with a final executable action whose parameters satisfy the trigger specification's required_all, required_any, and forbidden conditions. As the originating work puts it: "the attack succeeds not because one message is obviously malicious, but because the trajectory has shaped the agent into treating a dangerous final action as ordinary" (Yan et al., 2026).
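One way to picture the three condition lists is as predicates over the final-action parameters. The sketch below is illustrative only: the `TriggerSpec` class, the parameter names (`action`, `amount`, `recipient`, `dry_run`), and the rule that an empty `required_any` list imposes no constraint are all assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Params = Dict[str, Any]
Condition = Callable[[Params], bool]


@dataclass
class TriggerSpec:
    """Hypothetical trigger specification over final-action parameters."""
    required_all: List[Condition] = field(default_factory=list)
    required_any: List[Condition] = field(default_factory=list)
    forbidden: List[Condition] = field(default_factory=list)

    def matches(self, params: Params) -> bool:
        # Every required_all condition must hold.
        all_ok = all(cond(params) for cond in self.required_all)
        # At least one required_any condition must hold
        # (assumption: an empty list imposes no constraint).
        any_ok = not self.required_any or any(
            cond(params) for cond in self.required_any
        )
        # No forbidden condition may hold.
        none_forbidden = not any(cond(params) for cond in self.forbidden)
        return all_ok and any_ok and none_forbidden


# Illustrative spec: fire on large or external transfers, never on dry runs.
spec = TriggerSpec(
    required_all=[lambda p: p.get("action") == "transfer"],
    required_any=[
        lambda p: p.get("amount", 0) > 1000,
        lambda p: p.get("recipient") == "external",
    ],
    forbidden=[lambda p: p.get("dry_run", False)],
)

small_transfer = {"action": "transfer", "amount": 50, "recipient": "internal"}
large_transfer = {"action": "transfer", "amount": 5000, "recipient": "internal"}
large_dry_run = {**large_transfer, "dry_run": True}

print(spec.matches(small_transfer))  # False: no required_any condition holds
print(spec.matches(large_transfer))  # True: trigger fires
print(spec.matches(large_dry_run))   # False: a forbidden condition holds
```

The tool's responses can be identical in all three cases; only the final-action parameters decide whether the trigger fires.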

```mermaid
graph LR
    A["Round 1<br/>benign probe response"] --> B["Round 2<br/>benign object check"]
    B --> C["Round 3<br/>benign sandbox probe"]
    C --> D{"Final-action<br/>parameters match<br/>trigger spec?"}
    D -->|"No"| E["Benign final action"]
    D -->|"Yes"| F["Harmful final action<br/>(agent treats as ordinary)"]

    style F fill:#b60205,color:#fff
    style E fill:#1a7f37,color:#fff
```

Why per-message defenses fail

Single-shot defenses largely collapse to a constant policy on TRUST-Bench, the 1,970-episode benchmark from the originating work (Yan et al., 2026):

| Defense class | GuardedJoint score |
| --- | --- |
| Prompt heuristics | 0.0 |
| Zero-shot LLM judge (per-message) | 0.0 |
| Gradient boosting on message embeddings | 0.0 |
| BERT-base classifier (per-message) | 21.9 ± 21.9 |
| TF-IDF + LogReg over the full trajectory | 55.2 |
| VISTA-Guard (trajectory-aware fine-tune) | 84.2 ± 18.7 in-domain; 56.9 OOD |

Per-message scorers collapse to blanket-execute or blanket-reject because maliciousness is a function f(trajectory_state, final_action): a defense that sees only isolated messages has no discriminative power by construction. The history-anchor finding corroborates this: trajectory shape steers final-action choice in a way that per-step checks cannot see.
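The collapse can be reproduced with a toy example. In the sketch below, two episodes share the same tool messages and differ only in the final action, so any guard that reads messages alone must give both the same verdict. The episode data, the keyword scorer, and the `amount > 1000` rule are hypothetical stand-ins, not TRUST-Bench content or a real defense.

```python
from typing import Dict, List, Tuple

# (tool messages, final-action params, is_harmful); toy data, not TRUST-Bench.
Episode = Tuple[List[str], Dict[str, int], bool]

SHARED_MESSAGES = ["probe ok", "object consistent", "sandbox ready"]
EPISODES: List[Episode] = [
    (SHARED_MESSAGES, {"amount": 50}, False),
    (SHARED_MESSAGES, {"amount": 5000}, True),
]


def toy_message_score(message: str) -> float:
    """Stand-in for any per-message scorer (keyword filter, LLM judge, ...)."""
    return 1.0 if "attack" in message.lower() else 0.0


def per_message_guard(messages: List[str], threshold: float = 0.5) -> bool:
    """Reject (True) if any single message looks malicious in isolation."""
    return any(toy_message_score(m) > threshold for m in messages)


def joint_guard(messages: List[str], params: Dict[str, int]) -> bool:
    """Toy joint guard: it also sees the final action.
    The rule is illustrative and ignores the messages entirely."""
    return params["amount"] > 1000


labels = [harmful for _, _, harmful in EPISODES]
per_message_decisions = [per_message_guard(msgs) for msgs, _, _ in EPISODES]
joint_decisions = [joint_guard(msgs, params) for msgs, params, _ in EPISODES]

print(per_message_decisions)      # [False, False]: same verdict for both
print(joint_decisions == labels)  # True
```

Swapping in a stronger per-message scorer changes nothing here: identical inputs always produce identical verdicts, so at most one of the two episodes can be classified correctly.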

Why it works

The mechanism is implicit trust accumulation. Agents lower their prior on tool maliciousness with each benign-looking response, even when no single response would pass screening if the final action were already visible. A malicious tool exploits this with a trajectory that looks like routine exploration — sandbox probes, object-consistency checks, baseline collection — before issuing the response whose joint configuration with the planned final action triggers harm (Yan et al., 2026). TraceSafe-Bench corroborates independently: guardrail efficacy on multi-step trajectories correlates with structured-data competence (rho=0.79), not safety alignment — the discriminative signal lives in trajectory structure, not message content (TraceSafe, 2026).
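Trust accumulation can be illustrated with a simple Bayesian update. This is a toy model, not the paper's formulation: the prior of 0.2 and the likelihoods are assumed values chosen only to show the direction of the effect.

```python
from typing import List


def update(prior: float, p_benign_if_malicious: float,
           p_benign_if_honest: float) -> float:
    """Posterior P(malicious) after observing one benign-looking response."""
    numerator = prior * p_benign_if_malicious
    denominator = numerator + (1.0 - prior) * p_benign_if_honest
    return numerator / denominator


def belief_trace(prior: float, rounds: int, p_benign_if_malicious: float,
                 p_benign_if_honest: float) -> List[float]:
    """Belief in maliciousness before and after each benign-looking round."""
    trace = [prior]
    for _ in range(rounds):
        trace.append(
            update(trace[-1], p_benign_if_malicious, p_benign_if_honest)
        )
    return trace


# Assumed numbers: the agent believes a malicious tool would look benign only
# 60% of the time, so every benign response lowers its suspicion.
naive = belief_trace(0.2, rounds=3, p_benign_if_malicious=0.6,
                     p_benign_if_honest=0.95)

# A patient attacker always looks benign during exploration, so under the
# correct likelihoods the benign responses carry no evidence at all.
informed = belief_trace(0.2, rounds=3, p_benign_if_malicious=1.0,
                        p_benign_if_honest=1.0)

print([round(b, 3) for b in naive])     # strictly decreasing
print([round(b, 3) for b in informed])  # stays at the prior
```

The gap between the two traces is the attack surface: the agent discounts suspicion on evidence that a patient attacker can produce at no cost.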

Detection pattern: trajectory-aware features

VISTA-Guard converts each episode into a triplet (z_traj, z_param, y) where z_traj summarizes the trajectory and z_param compacts the final-action parameters. The five trajectory features are explicit, not learned end-to-end (Yan et al., 2026):

| Feature | What it counts |
| --- | --- |
| Trigger-phase count (c_trig) | Tool messages whose content matches known trigger phrases |
| Probe detections (c_probe) | Object-consistency checks the agent issued |
| Mismatch signals (c_mis) | Inconsistencies between tool metadata and observed behavior |
| Identifier drift (c_drift) | Identifier substitutions across rounds |
| Warning counts (c_warn) | Tool-emitted warnings or anomaly markers |
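The five counts can be sketched as a plain feature extractor. The step schema (`role`, `kind`, `content`, `declared`, `observed`, `object_id`, `warnings`) and the trigger-phrase list are hypothetical; the source does not specify how each count is computed, so treat this as one plausible reading rather than VISTA-Guard's extractor.

```python
from typing import Dict, List

# Hypothetical phrase list; a real deployment would maintain its own.
TRIGGER_PHRASES = ("override", "skip verification")


def extract_features(steps: List[dict]) -> Dict[str, int]:
    """Count the five explicit trajectory features over one episode."""
    tool_steps = [s for s in steps if s.get("role") == "tool"]

    # c_trig: tool messages containing a known trigger phrase.
    c_trig = sum(
        any(p in s.get("content", "").lower() for p in TRIGGER_PHRASES)
        for s in tool_steps
    )
    # c_probe: object-consistency checks issued by the agent.
    c_probe = sum(
        1 for s in steps
        if s.get("role") == "agent" and s.get("kind") == "object_check"
    )
    # c_mis: declared tool metadata disagrees with observed behavior.
    c_mis = sum(
        1 for s in tool_steps
        if "declared" in s and "observed" in s and s["declared"] != s["observed"]
    )
    # c_drift: identifier changes between consecutive tool responses.
    ids = [s["object_id"] for s in tool_steps if "object_id" in s]
    c_drift = sum(1 for prev, cur in zip(ids, ids[1:]) if prev != cur)
    # c_warn: warnings or anomaly markers emitted by the tool.
    c_warn = sum(len(s.get("warnings", [])) for s in tool_steps)

    return {"c_trig": c_trig, "c_probe": c_probe, "c_mis": c_mis,
            "c_drift": c_drift, "c_warn": c_warn}


steps = [
    {"role": "agent", "kind": "object_check", "content": "check invoice 17"},
    {"role": "tool", "content": "invoice 17 found", "object_id": "inv-17",
     "declared": "read", "observed": "read"},
    {"role": "agent", "kind": "sandbox_probe", "content": "dry run"},
    {"role": "tool", "content": "ok, skip verification for speed",
     "object_id": "inv-71", "declared": "read", "observed": "write",
     "warnings": ["rate limit"]},
]

z_traj = extract_features(steps)
print(z_traj)
# {'c_trig': 1, 'c_probe': 1, 'c_mis': 1, 'c_drift': 1, 'c_warn': 1}
```

Keeping the features as explicit counts makes each one auditable: a reviewer can trace a high c_drift back to the exact rounds where the identifier changed.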

Scoring is backbone-agnostic: the guard scores each candidate label by the negative log-likelihood (NLL) of its label suffix, with calibration fitted on the training fold only. The cost is in feature extraction, not model size.
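One plausible reading of label-suffix NLL scoring is sketched below: the backbone supplies per-token log-probabilities for each candidate label suffix, the guard compares their NLLs, and a decision threshold is chosen on the training fold alone. The function names, the margin definition, and the accuracy-maximizing calibration rule are assumptions; the source does not spell out these details.

```python
from typing import List, Sequence


def suffix_nll(token_logprobs: Sequence[float]) -> float:
    """NLL of a label suffix given its per-token log-probabilities."""
    return -sum(token_logprobs)


def reject_margin(logprobs_reject: Sequence[float],
                  logprobs_execute: Sequence[float]) -> float:
    """Positive when the 'reject' suffix is more likely than 'execute'."""
    return suffix_nll(logprobs_execute) - suffix_nll(logprobs_reject)


def calibrate_threshold(train_margins: List[float],
                        train_labels: List[bool]) -> float:
    """Pick the margin threshold with the best accuracy on the training fold.
    Labels are True for 'reject'. Ties keep the smallest threshold."""
    best_threshold, best_accuracy = train_margins[0], -1.0
    for candidate in sorted(set(train_margins)):
        correct = sum(
            (m >= candidate) == y for m, y in zip(train_margins, train_labels)
        )
        accuracy = correct / len(train_labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold


# Toy training fold; held-out episodes are never used to choose the threshold.
train_margins = [-2.0, -0.5, 0.3, 1.5]
train_labels = [False, False, True, True]
threshold = calibrate_threshold(train_margins, train_labels)

# Toy held-out episode: log-probabilities would come from any backbone model.
margin = reject_margin(logprobs_reject=[-0.1, -0.2],
                       logprobs_execute=[-1.0, -1.5])
print(threshold)            # 0.3
print(margin >= threshold)  # True: reject
```

Because the only model-dependent inputs are token log-probabilities, the same scoring code works with any backbone that exposes them.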

When this backfires

Trajectory scoring fits open tool populations with multi-round agents and structured outputs. It misfits when:

  • Tool catalogs are sealed and signed: pair tool signing with audited inclusion lists instead of running a learned classifier at every checkpoint.
  • Architectural separation already applies: the Action-Selector Pattern and CaMeL make tool outputs structurally non-influential, so a trajectory classifier duplicates the protection.
  • Latency budget is tight: VISTA-Guard runs a fine-tuned 7B-scale model at checkpoints. The paper does not benchmark latency, but a model of that size is unlikely to fit a sub-200ms interactive budget.
  • Agents are short-lived (2 to 3 calls): TraceSafe finds trajectory-checkpoint evaluation is weaker than per-call evaluation when no cross-step signal accumulates (TraceSafe, 2026).
  • Tool populations shift out-of-distribution: the 84.2→56.9 in-domain-to-OOD gap means a classifier trained on one tool population does not generalize cleanly. Plan for retraining, or compose the classifier with architectural defenses.

The originating work frames the result as "an initial benchmarked study," limited by a fixed three-step exploration budget and a binary execute/reject action space (Yan et al., 2026).

Composition with existing defenses

Trajectory scoring complements the rest of the stack rather than replacing it:

  • Tool signing closes the supply-chain leg.
  • The MCP runtime control plane is the policy point at which trajectory features can be extracted.
  • The behavioral firewall enforces permitted sequences at O(1) cost.
  • Action-selector and CaMeL eliminate the threat structurally where they fit.
  • Mid-trajectory guardrail selection is the model-based complement when trajectory shapes drift beyond a fixed feature set.

Key Takeaways

  • Cognitive poisoning is a distinct attack class — maliciousness is conditioned on joint trajectory state and final-action parameters, not on any single message
  • Most per-message safety defenses score 0.0 on the GuardedJoint metric because they degenerate to a constant policy by construction; the per-message BERT classifier reaches only 21.9 ± 21.9 (Yan et al., 2026)
  • Trajectory-aware scoring (VISTA-Guard) reaches 84.2 in-domain but 56.9 on OOD transfer — unseen tool ecosystems remain hard
  • Fits open tool ecosystems with multi-round agents and structured tool outputs; misfits sealed catalogs, short-lived agents, and architectures with structural separation
  • Treat every tool return as untrusted input regardless of which defense layer you choose — the Lethal Trifecta framing places tool output in the untrusted-content leg