Dynamic Tool Fetching Breaks KV Cache¶
Loading tool definitions dynamically per step seems like good context management but destroys the single most impactful cost optimization available: prompt caching.
The Intuition Trap¶
Fewer tools means fewer tokens, so fetching only needed tools per step seems optimal. It is not: savings from removing tools are dwarfed by breaking prompt cache continuity.
Why It Fails¶
Tool definitions sit at the top of the cache hierarchy. The prefix is computed in order: tools → system → messages. Any change to tool definitions invalidates every subsequent level.
graph LR
A["tools (top)"] --> B["system"] --> C["messages"]
style A fill:#d32f2f,color:#fff
style B fill:#f57c00,color:#fff
style C fill:#fbc02d,color:#000
Cached tokens cost 10x less than uncached — Claude Sonnet 4's cache-read rate is $0.30/MTok against a $3/MTok base input (Anthropic prompt caching). A single cache break per turn erases all savings from fewer tools.
| Approach | Tools in context | Cache hit rate | Effective cost |
|---|---|---|---|
| Stable tool set (30 tools) | 30 every turn | High | Low |
| Dynamic RAG per step | 5-15, varying | Near zero | High |
| Deferred loading (stable prefix) | 8-10 core + search | High | Lowest |
The Subtle Variant: Non-Deterministic Serialization¶
Languages like Swift and Go randomize dictionary key ordering during JSON serialization, so the cache sees a different byte sequence even when tools are identical — the same anti-pattern triggered accidentally.
Fix: sort keys deterministically before serialization.
The Correct Alternative: Deferred Tool Loading¶
Anthropic's Tool Search Tool achieves the same goal without breaking the cache prefix. Tools marked defer_loading: true are excluded from the prompt; the agent discovers them on demand.
Anthropic's evaluations:
| Metric | All tools loaded | Deferred + search |
|---|---|---|
| Token usage | ~55K | ~8.7K |
| Accuracy (Opus) | 49% | 74% |
The cache prefix stays identical across turns; deferred tools load into message history, invalidating nothing.
Recommended Tool Architecture¶
Anthropic's advanced tool use guidance recommends stratifying tools by access frequency:
| Level | Contents | Cache impact |
|---|---|---|
| Core tools (3–5) | Most-used, always loaded | Cached prefix, never changes |
| General utilities | bash, code execution | Part of stable prefix |
| Specialized tools | Domain-specific, MCP servers | Deferred; loaded via search on demand |
When This Backfires¶
Deferred loading adds a tool search round-trip per undiscovered tool. It provides no benefit when:
- Tool library is small (<10 tools): Upfront loading costs less than repeated search overhead.
- All tools are needed every request: Deferring tools you always use forces a search penalty with no savings.
- Latency is the primary constraint: Real-time pipelines may not tolerate extra inference passes for tool discovery.
- Tool search accuracy is low: Poor search hits cause missed tools, degrading task completion more than cache breaks cost.
When This Doesn't Apply¶
Stable tool sets are the right default for multi-turn agents, but there are cases where dynamic selection is fine:
- Single-turn, cold-start requests: if every invocation is a fresh session with no prior cache to preserve, there is no accumulated prefix to protect. Cache continuity only pays off across turns.
- Local inference without shared KV cache: some self-hosted backends (e.g., llama.cpp, Ollama) do not implement cross-request KV cache reuse. The 10x cost differential disappears entirely.
- Very small tool sets (<5 tools, <500 tokens total): when tool definitions are negligible relative to message history, the absolute savings from cache hits may not justify the added complexity of a deferred-loading architecture.
In all other cases — multi-turn agents, API-hosted models, or any setup with repeated context — the cost asymmetry dominates and dynamic per-step fetching is counterproductive.
Key Takeaways¶
- Any change to tool definitions invalidates the entire KV cache — continuity matters more than minimizing tool count.
- Prefer deferred loading with a stable core set over dynamic RAG on tool definitions.
- Audit JSON serialization for non-deterministic key ordering — an accidental cache-breaker.
Example¶
Anti-pattern — tool definitions change each turn, breaking the cache:
# BAD: tool list rebuilt per step — cache prefix changes every call
for step in plan:
tools = fetch_tools_for_step(step) # different subset each time
response = client.messages.create(
model="claude-sonnet-4-5",
tools=tools, # cache invalidated every turn
messages=history,
)
Fix — stable core tools, deferred discovery via Tool Search:
# GOOD: stable prefix; agent discovers specialized tools on demand
CORE_TOOLS = load_core_tools() # same every call
response = client.messages.create(
model="claude-sonnet-4-5",
tools=CORE_TOOLS, # never changes → cache hits
messages=history,
)
# Specialized tools are fetched inside message history via Tool Search Tool,
# invalidating nothing above the messages layer.
Sorting tool keys deterministically also prevents accidental cache breaks in languages with non-deterministic dict ordering:
import json
def stable_tool_schema(tool: dict) -> dict:
return json.loads(json.dumps(tool, sort_keys=True))
CORE_TOOLS = [stable_tool_schema(t) for t in load_core_tools()]
Related¶
- Prompt Caching as Architectural Discipline
- Token-Efficient Tool Design
- Tool Minimalism
- Advanced Tool Use: Scaling Agent Tool Libraries — full documentation of deferred tool loading and the Tool Search Tool
- Infinite Context Anti-Pattern
- Token Preservation Backfire
- Cost-Aware Agent Design
- Context Engineering
- Static Content First: Maximizing Prompt Cache Hits
- Disable Attribution Headers to Preserve KV Cache in Local Inference
- MCP: The Open Protocol Connecting Agents to External Tools
- Filesystem-Based Tool Discovery