Dynamic Tool Fetching Breaks KV Cache¶

Loading tool definitions dynamically per step seems like good context management but destroys the single most impactful cost optimization available: prompt caching.

The Intuition Trap¶

Fewer tools means fewer tokens, so fetching only needed tools per step seems optimal. It is not: savings from removing tools are dwarfed by breaking prompt cache continuity.

Why It Fails¶

Tool definitions sit at the top of the cache hierarchy. The prefix is computed in order: tools → system → messages. Any change to tool definitions invalidates every subsequent level.

graph LR
    A["tools (top)"] --> B["system"] --> C["messages"]
    style A fill:#d32f2f,color:#fff
    style B fill:#f57c00,color:#fff
    style C fill:#fbc02d,color:#000

Cached tokens cost 10x less than uncached — Claude Sonnet 4's cache-read rate is $0.30/MTok against a $3/MTok base input (Anthropic prompt caching). A single cache break per turn erases all savings from fewer tools.

Approach	Tools in context	Cache hit rate	Effective cost
Stable tool set (30 tools)	30 every turn	High	Low
Dynamic RAG per step	5-15, varying	Near zero	High
Deferred loading (stable prefix)	8-10 core + search	High	Lowest

The Subtle Variant: Non-Deterministic Serialization¶

Languages like Swift and Go randomize dictionary key ordering during JSON serialization, so the cache sees a different byte sequence even when tools are identical — the same anti-pattern triggered accidentally.

Fix: sort keys deterministically before serialization.

The Correct Alternative: Deferred Tool Loading¶

Anthropic's Tool Search Tool achieves the same goal without breaking the cache prefix. Tools marked defer_loading: true are excluded from the prompt; the agent discovers them on demand.

Anthropic's evaluations:

Metric	All tools loaded	Deferred + search
Token usage	~55K	~8.7K
Accuracy (Opus)	49%	74%

The cache prefix stays identical across turns; deferred tools load into message history, invalidating nothing.

Recommended Tool Architecture¶

Anthropic's advanced tool use guidance recommends stratifying tools by access frequency:

Level	Contents	Cache impact
Core tools (3–5)	Most-used, always loaded	Cached prefix, never changes
General utilities	bash, code execution	Part of stable prefix
Specialized tools	Domain-specific, MCP servers	Deferred; loaded via search on demand

When This Backfires¶

Deferred loading adds a tool search round-trip per undiscovered tool. It provides no benefit when:

Tool library is small (<10 tools): Upfront loading costs less than repeated search overhead.
All tools are needed every request: Deferring tools you always use forces a search penalty with no savings.
Latency is the primary constraint: Real-time pipelines may not tolerate extra inference passes for tool discovery.
Tool search accuracy is low: Poor search hits cause missed tools, degrading task completion more than cache breaks cost.

When This Doesn't Apply¶

Stable tool sets are the right default for multi-turn agents, but there are cases where dynamic selection is fine:

Single-turn, cold-start requests: if every invocation is a fresh session with no prior cache to preserve, there is no accumulated prefix to protect. Cache continuity only pays off across turns.
Local inference without shared KV cache: some self-hosted backends (e.g., llama.cpp, Ollama) do not implement cross-request KV cache reuse. The 10x cost differential disappears entirely.
Very small tool sets (<5 tools, <500 tokens total): when tool definitions are negligible relative to message history, the absolute savings from cache hits may not justify the added complexity of a deferred-loading architecture.

In all other cases — multi-turn agents, API-hosted models, or any setup with repeated context — the cost asymmetry dominates and dynamic per-step fetching is counterproductive.

Key Takeaways¶

Any change to tool definitions invalidates the entire KV cache — continuity matters more than minimizing tool count.
Prefer deferred loading with a stable core set over dynamic RAG on tool definitions.
Audit JSON serialization for non-deterministic key ordering — an accidental cache-breaker.

Example¶

Anti-pattern — tool definitions change each turn, breaking the cache:

# BAD: tool list rebuilt per step — cache prefix changes every call
for step in plan:
    tools = fetch_tools_for_step(step)          # different subset each time
    response = client.messages.create(
        model="claude-sonnet-4-5",
        tools=tools,                             # cache invalidated every turn
        messages=history,
    )

Fix — stable core tools, deferred discovery via Tool Search:

# GOOD: stable prefix; agent discovers specialized tools on demand
CORE_TOOLS = load_core_tools()                  # same every call

response = client.messages.create(
    model="claude-sonnet-4-5",
    tools=CORE_TOOLS,                           # never changes → cache hits
    messages=history,
)
# Specialized tools are fetched inside message history via Tool Search Tool,
# invalidating nothing above the messages layer.

Sorting tool keys deterministically also prevents accidental cache breaks in languages with non-deterministic dict ordering:

import json

def stable_tool_schema(tool: dict) -> dict:
    return json.loads(json.dumps(tool, sort_keys=True))

CORE_TOOLS = [stable_tool_schema(t) for t in load_core_tools()]

Prompt Caching as Architectural Discipline
Token-Efficient Tool Design
Tool Minimalism
Advanced Tool Use: Scaling Agent Tool Libraries — full documentation of deferred tool loading and the Tool Search Tool
Infinite Context Anti-Pattern
Token Preservation Backfire
Cost-Aware Agent Design
Context Engineering
Static Content First: Maximizing Prompt Cache Hits
Disable Attribution Headers to Preserve KV Cache in Local Inference
MCP: The Open Protocol Connecting Agents to External Tools
Filesystem-Based Tool Discovery