Skip to content

Oracle Poisoning: Knowledge Graph Corruption Against Tool-Using Agents

Oracle poisoning corrupts the knowledge graph an agent queries via tool-use, carrying its payload on the data path rather than the instruction path.

Related lesson: The Payload That Waits — this concept features in a hands-on lesson with quizzes.

The Threat Model

Oracle Poisoning corrupts a structured knowledge graph the agent queries at runtime through a tool-use protocol — MCP, SDK tool calls, or any typed retrieval surface. Instructions are untouched; the data the agent reasons over is the attack vector (Kereopa-Yorke et al., 2026; arxiv:2605.09822).

Unlike prompt injection — which adds instructions to consumed content — Oracle Poisoning leaves reasoning correct and corrupts the premises.

The paper evaluates the attack against a production 42-million-node code knowledge graph queried via real SDK tool-use, across nine models from three providers, N=30 trials per model.

The Sophistication Gradient

Attacker capability forms a discrete L0-L4 ladder. L2 — maintaining corruption undetected over time — is the break-even point: every tested model trusts L2-grade poison at 100%.

Level Capability
L0 Basic write access, no graph awareness
L1 Understands graph structure and query patterns
L2 Maintains poison undetected over time (100% trust threshold)
L3 Models the agent's decision process and exploits its priors
L4 Coordinated multi-vector campaigns

L2 sits well below "advanced persistent threat" — a contractor with sustained write access or an ingestion pipeline accepting third-party submissions can reach it.

Six Attack Scenarios

The paper demonstrates six distinct corruption shapes:

  • Fact injection — insert false nodes the agent queries directly.
  • Relationship manipulation — rewrite edges so the agent traverses to the wrong target.
  • Authority spoofing — plant high-credibility provenance on attacker-controlled facts, defeating naive provenance-aware auditing.
  • Cascading corruption — poison foundational nodes downstream queries depend on.
  • Semantic drift — shift definitions just enough to evade inconsistency checks.
  • Query hijacking — target the queries the agent runs for a given workflow.

The Delivery-Mode Confound

How the poisoned fact reaches the model controls trust. GPT-5.1 shows 0% trust on inline poison and 100% trust when the same fact arrives via simulated or real agentic tool-use.

Delivery Mode GPT-5.1 Trust
Inline prompt 0%
Simulated tool-use 100%
Real agentic tool-use 100%

Inline-only red-team evaluations mis-measure real risk: models weight tool-delivered facts (DB rows, API responses, file contents) as ground truth — the very property the attack exploits.

Prompt Framing Also Moves Trust

Directed queries — "what does the KG say about X?" — hit maximum trust. Open-ended prompts that ask the agent to reason over multiple sources drop trust to 3-55% on the same poison.

graph TD
    A[Poisoned KG] --> B{Delivery mode?}
    B -->|Inline prompt| C[Some models reject<br/>0% trust observed]
    B -->|Tool-use| D{Query shape?}
    D -->|Directed| E[~100% trust]
    D -->|Open-ended| F[3-55% trust]

Workflows that decompose tasks into sub-queries inherit a partial mitigation by accident; pass-through workflows take the full blast.

What Actually Defends

The paper evaluates five defenses. Only one is fully effective.

Defense Effectiveness
Read-only access control on the KG Full — eliminates the mutation vector
Independent multi-source corroboration Partial, model-dependent
Provenance signatures on graph entries Partial, model-dependent
Confidence thresholds and uncertainty quantification Partial, model-dependent
Canary facts to detect tampering Partial — detection only

Read-only access removes the prerequisite. Every other defense fights mid-flight, where the property that makes MCP and SDK tool-use useful — the agent trusting structured outputs — makes the defense leaky.

When Your Architecture Is Exposed

The attack lands when three conditions hold:

  • The agent consumes the knowledge graph via a tool-use protocol (not inline context).
  • The graph has a writable path — directly, via third-party ingestion, or via a shared write API.
  • Queries are directed enough that the agent does not triangulate — open-ended decomposition drops trust to 3-55%.
graph TD
    A[Knowledge graph<br/>in your agent stack?] -->|No| B[Not in scope]
    A -->|Yes| C{Writable path<br/>by anyone outside trust boundary?}
    C -->|No, read-only| D[Defended<br/>by access control]
    C -->|Yes| E{Tool-use delivery?}
    E -->|No, inline only| F[Lower risk<br/>some models reject]
    E -->|Yes| G{Directed queries?}
    G -->|Yes| H[Full exposure<br/>100% trust observed]
    G -->|Open-ended| I[Partial exposure<br/>3-55% trust]

A private code KG built at CI time from your own monorepo has no surface. A shared graph fed by user submissions, package metadata, or scraped docs has it by construction.

Relationship to Adjacent Attacks

Retrieval-side poisoning is analogous: RAG Architecture as a Poisoning Robustness Decision finds 24.4%-81.9% attack success across four RAG architectures under PoisonedRAG (Zou et al., USENIX Security 2025). Graph-theoretic edit localization via centrality is in A Few Words Can Distort Graphs (arxiv:2508.04276). Memory-side counterparts: MemoryGraft (arxiv:2512.16962), AgentPoison (arxiv:2407.12784). All share the mechanism — the agent trusts tool-delivered facts more than inline ones — and differ only in the carrier data structure.

Example

A code knowledge graph stores a node for requests==2.28.0 with cve_status: clean. An attacker with L2 write access flips it to cve_status: clean, vendor_signed: true while the actual CVE record remains in the security DB.

A developer asks the agent: "Is requests==2.28.0 safe to pin?" The agent calls the KG tool, retrieves the node, observes vendor_signed: true, and answers yes with confidence. Reasoning is correct; the premise is false.

The same query through inline context — pasting the node text into the prompt — would have triggered some models to reject the claim outright. The tool-use delivery is what makes the poison persuasive.

Key Takeaways

  • Oracle Poisoning is the data-path sibling of prompt injection: correct reasoning, corrupted premises.
  • L2 attacker sophistication is sufficient for 100% trust across nine models — well below "advanced persistent threat".
  • Tool-use delivery is the key risk multiplier; the same fact inline can produce 0% trust on the same model.
  • Directed queries maximise the attack; open-ended decomposition drops trust to 3-55%.
  • Read-only KG access is the only fully effective defense the paper measures. Everything else is partial and model-dependent.
Feedback