Skip to content

Dynamic Tool Fetching Breaks KV Cache

Loading tool definitions dynamically per step seems like good context management but destroys the single most impactful cost optimization available: prompt caching.

The Intuition Trap

Fewer tools means fewer tokens, so fetching only needed tools per step seems optimal. It is not: savings from removing tools are dwarfed by breaking prompt cache continuity.

Why It Fails

Tool definitions sit at the top of the cache hierarchy. The prefix is computed in order: toolssystemmessages. Any change to tool definitions invalidates every subsequent level.

graph LR
    A["tools (top)"] --> B["system"] --> C["messages"]
    style A fill:#d32f2f,color:#fff
    style B fill:#f57c00,color:#fff
    style C fill:#fbc02d,color:#000

Cached tokens cost 10x less than uncached — Claude Sonnet 4's cache-read rate is $0.30/MTok against a $3/MTok base input (Anthropic prompt caching). A single cache break per turn erases all savings from fewer tools.

Approach Tools in context Cache hit rate Effective cost
Stable tool set (30 tools) 30 every turn High Low
Dynamic RAG per step 5-15, varying Near zero High
Deferred loading (stable prefix) 8-10 core + search High Lowest

The Subtle Variant: Non-Deterministic Serialization

Languages like Swift and Go randomize dictionary key ordering during JSON serialization, so the cache sees a different byte sequence even when tools are identical — the same anti-pattern triggered accidentally.

Fix: sort keys deterministically before serialization.

The Correct Alternative: Deferred Tool Loading

Anthropic's Tool Search Tool achieves the same goal without breaking the cache prefix. Tools marked defer_loading: true are excluded from the prompt; the agent discovers them on demand.

Anthropic's evaluations:

Metric All tools loaded Deferred + search
Token usage ~55K ~8.7K
Accuracy (Opus) 49% 74%

The cache prefix stays identical across turns; deferred tools load into message history, invalidating nothing.

Anthropic's advanced tool use guidance recommends stratifying tools by access frequency:

Level Contents Cache impact
Core tools (3–5) Most-used, always loaded Cached prefix, never changes
General utilities bash, code execution Part of stable prefix
Specialized tools Domain-specific, MCP servers Deferred; loaded via search on demand

When This Backfires

Deferred loading adds a tool search round-trip per undiscovered tool. It provides no benefit when:

  • Tool library is small (<10 tools): Upfront loading costs less than repeated search overhead.
  • All tools are needed every request: Deferring tools you always use forces a search penalty with no savings.
  • Latency is the primary constraint: Real-time pipelines may not tolerate extra inference passes for tool discovery.
  • Tool search accuracy is low: Poor search hits cause missed tools, degrading task completion more than cache breaks cost.

When This Doesn't Apply

Stable tool sets are the right default for multi-turn agents, but there are cases where dynamic selection is fine:

  • Single-turn, cold-start requests: if every invocation is a fresh session with no prior cache to preserve, there is no accumulated prefix to protect. Cache continuity only pays off across turns.
  • Local inference without shared KV cache: some self-hosted backends (e.g., llama.cpp, Ollama) do not implement cross-request KV cache reuse. The 10x cost differential disappears entirely.
  • Very small tool sets (<5 tools, <500 tokens total): when tool definitions are negligible relative to message history, the absolute savings from cache hits may not justify the added complexity of a deferred-loading architecture.

In all other cases — multi-turn agents, API-hosted models, or any setup with repeated context — the cost asymmetry dominates and dynamic per-step fetching is counterproductive.

Key Takeaways

  • Any change to tool definitions invalidates the entire KV cache — continuity matters more than minimizing tool count.
  • Prefer deferred loading with a stable core set over dynamic RAG on tool definitions.
  • Audit JSON serialization for non-deterministic key ordering — an accidental cache-breaker.

Example

Anti-pattern — tool definitions change each turn, breaking the cache:

# BAD: tool list rebuilt per step — cache prefix changes every call
for step in plan:
    tools = fetch_tools_for_step(step)          # different subset each time
    response = client.messages.create(
        model="claude-sonnet-4-5",
        tools=tools,                             # cache invalidated every turn
        messages=history,
    )

Fix — stable core tools, deferred discovery via Tool Search:

# GOOD: stable prefix; agent discovers specialized tools on demand
CORE_TOOLS = load_core_tools()                  # same every call

response = client.messages.create(
    model="claude-sonnet-4-5",
    tools=CORE_TOOLS,                           # never changes → cache hits
    messages=history,
)
# Specialized tools are fetched inside message history via Tool Search Tool,
# invalidating nothing above the messages layer.

Sorting tool keys deterministically also prevents accidental cache breaks in languages with non-deterministic dict ordering:

import json

def stable_tool_schema(tool: dict) -> dict:
    return json.loads(json.dumps(tool, sort_keys=True))

CORE_TOOLS = [stable_tool_schema(t) for t in load_core_tools()]
Feedback