Skip to content

MCP alwaysLoad: Classifying Servers as Eager or Just-in-Time

Classify each MCP server as eager (alwaysLoad) or just-in-time by weighing always-paid context tax against on-demand discovery cost.

Related lesson: Tool-Call Cost & Latency Budgeting — this concept features in a hands-on lesson with quizzes.

The Decision Surface

Claude Code 2.1.121 added an alwaysLoad option to MCP server config: when true, all tools from that server skip tool-search deferral and are always available (Claude Code changelog). The flag is the server-level inverse of the API's per-tool defer_loading: true parameter (Tool search tool docs).

Without alwaysLoad, a server's tools sit behind tool search — the model invokes a search step that returns 3-5 matching tool_reference blocks, which expand into full definitions inline. With alwaysLoad: true, every tool from that server is in the system-prompt prefix from the first turn.

The pattern generalises. Any host with deferred MCP discovery — current Claude Code, future MCP-aware Cursor or Copilot — faces the same per-server choice: which servers are core enough to deserve unconditional context residence, and which can be paid for only on demand? Other harnesses push the deferral axis further than per-server: OpenAI Codex added per-thread MCP executor activation with plugin discovery, so executors activate per conversation thread rather than globally (OpenAI Codex changelog) — narrowing the load decision to a per-conversation scope.

What Each Side Costs

Eager loading burns prefix tokens every turn whether the tool is used or not. A typical multi-server setup (GitHub, Slack, Sentry, Grafana, Splunk) consumes ~55K tokens in tool definitions before the conversation starts (Tool search tool docs). It also dilutes selection accuracy: Anthropic reports tool selection degrades significantly past 30-50 visible tools (Tool search tool docs).

JIT loading pays a search round-trip on first reference — a server_tool_use call before the actual tool call — and depends on the model's ability to phrase a search query that retrieves the right tool. Tool descriptions that don't match natural-language tasks fail silently; the agent reports "no tool available" when one exists.

JIT loading does not break prompt cache at the API design level. The docs are explicit: "Deferred tools are not included in the system-prompt prefix. When the model discovers a deferred tool through tool search, the tool definition is appended inline as a tool_reference block in the conversation. The prefix is untouched, so prompt caching is preserved" (Tool search tool docs). This distinguishes deferred loading from manual per-step tool swapping, which does invalidate cache (see Dynamic Tool Fetching Cache Break).

The caveat is harness-level, not API-level: a deferred tool definition and cache_control are mutually exclusive on the same tool. The API rejects setting both — "Tools with defer_loading cannot use prompt caching" — and a Claude Code build that applied both flags unconditionally surfaced this as a hard error once 3+ MCP servers were configured (claude-code#30920). The prefix-level cache benefit still holds, but do not assume an individual deferred tool block is itself cacheable, and verify your host doesn't double-set the flags.

Classification Rubric

Four signals decide each server:

Signal Eager (alwaysLoad: true) JIT (default)
Hit rate Used in most sessions Used in <20% of sessions
Definition size Small relative to budget Large enough to dominate
Latency tolerance Cold-call round-trip is felt Search step is acceptable
Description quality Trustworthy under search Keywords don't match task language

The Anthropic docs operationalise the first two as a floor: tool search is worth enabling at 10+ tools or >10K tokens of total tool definitions (Tool search tool docs). Below that, eager-load every server. Above it, the per-server alwaysLoad decision matters.

The same docs give the explicit eager-load recommendation: "Keep your 3-5 most frequently used tools as non-deferred for optimal performance." alwaysLoad is the server-level mechanism for satisfying this guidance when those tools live behind an MCP server.

Failure Modes

  • Silent context bloat — eager-loading a rarely-used MCP server pays the full token cost every turn for ambient capability the agent never invokes. Detect with per-source context attribution (see Context Usage Attribution).
  • Cold-start prompt churn — JIT-loading the common-case server adds a search round-trip to the first turn that needs it, costing latency and a discovery prompt that crowds the first-turn budget.
  • Selection cliff — eager-loading too many servers past the ~30-50 tool threshold degrades the agent's ability to pick correctly even when the right tool is visible.
  • Search-miss invisibility — JIT-loading a server whose tool descriptions use jargon the model doesn't search for produces "tool not available" errors when the tool is right there. Audit description craft before deferring.

Example

A team runs five MCP servers in Claude Code: a project-specific search server (used every turn), a GitHub server (used in ~80% of sessions), a Linear server (used in ~15% of sessions for ticket lookup), and Sentry plus Grafana (used during incidents only).

{
  "mcpServers": {
    "project-search": {
      "command": "mcp-server-search",
      "alwaysLoad": true
    },
    "github": {
      "command": "mcp-server-github",
      "alwaysLoad": true
    },
    "linear": {
      "command": "mcp-server-linear"
    },
    "sentry": {
      "command": "mcp-server-sentry"
    },
    "grafana": {
      "command": "mcp-server-grafana"
    }
  }
}

Project search and GitHub stay in the prefix every turn. Linear, Sentry, and Grafana load only when the model issues a tool search for "ticket", "error", or "metric" — paying the round-trip on the rare turns those capabilities are needed instead of the eager token tax on every turn.

Key Takeaways

  • alwaysLoad: true is the server-level switch that bypasses tool-search deferral for MCP servers in Claude Code 2.1.121
  • Below 10 tools or 10K tokens of definitions, eager-load everything; above it, classify per server using hit rate, size, latency tolerance, and description quality
  • Deferred loading preserves prompt cache because tool definitions append inline rather than mutate the prefix
  • Eager-load the 3-5 most frequently used tools per the Anthropic guidance; defer the rest
  • Audit descriptions before deferring — search-miss is the silent failure mode that turns JIT into "tool not available"
Feedback