Skip to content

Per-Call Budget Hints on Tool Invocations

Let the caller raise the reasoning or returned-token cap on a single tool invocation — narrowly, when the call is infrequent and information-dense — instead of re-tuning the global default.

A coding agent's tool calls have uneven cost-quality curves. A grep that returns ten matches needs no extra reasoning; a deep web search across a regulatory corpus does. A per-call budget hint flags one invocation as "spend more here" without raising the budget for every other call. The hint pays off only when the call is infrequent, information-dense, and the model or tool can spend the lifted ceiling productively. Apply it uniformly and quality drops while cost rises.

The Three Shapes the Hint Takes

The contract has converged on three shapes:

Shape Set as Example
Binary opt-in default / unlimited OpenAI Responses web_search tool: return_token_budget: "unlimited" (OpenAI changelog 2026-05-11, web search tool guide)
Categorical tier low / medium / high / xhigh OpenAI reasoning.effort (reasoning guide); Anthropic effort for adaptive thinking (Anthropic effort docs)
Numeric ceiling budget_tokens: N Anthropic thinking.budget_tokens (deprecated on Sonnet 4.6, removed on Opus 4.7 in favour of adaptive effort) (extended thinking docs)

The shapes are not interchangeable. The binary opt-in attaches to a tool definition and shapes what the tool returns. The categorical tier and numeric ceiling attach to the message and shape how much the model thinks between tool calls.

When the Hint Pays Off

Liu et al. (2025) tested raising tool-call budgets on web-search agents and found that simply enlarging the budget "fails to improve agent performance" because the agent lacks awareness of remaining resources and "quickly hits a performance ceiling." The hint pays off only when three conditions hold together:

  • The call is infrequent. Routine high-volume tools — file read, grep, status checks — should run on the default. Lifting their ceiling compounds spend across every step with no per-call quality return.
  • The call is information-dense. Deep web search, long-document analysis, multi-page evaluation — calls where the answer's value scales with how much the tool can examine before stopping.
  • The model or tool can spend the headroom productively. OpenAI documents return_token_budget as applying only to GPT-5+ reasoning web search — not web_search_preview, Chat Completions search models, or non-reasoning web search (web search guide).

A useful default: apply the hint to the one or two calls a caller would name in a sentence, and leave every other call untouched.

Why It Works

Uniform global defaults misallocate compute. One max_tokens set high enough for the longest call inflates every short call; one set low enough for the shortest truncates the longest. Per-call hints move the allocation decision to where the most information exists — the call site, where the caller knows whether this is a deep-research run or a routine lookup. Liu et al. (2025, §3) frame this as a resource-awareness problem: the hint sets the upper bound and an in-context budget tracker provides the spend signal; both are needed to allocate efficiently within the lifted ceiling. OpenAI's return_token_budget is the simplest possible shape — binary default/unlimited — but the causal structure is the same (OpenAI changelog 2026-05-11).

When This Backfires

The pattern's failure surface is larger than its success surface.

  • Routine high-frequency calls. A hint applied to a tool called dozens of times per session compounds into substantial overspend with no quality benefit. The "lift ceiling for important calls" framing collapses to "lift ceiling for every call" the moment the caller's classifier is fuzzy.
  • Higher reasoning is not monotonically better. Su et al. (2026) report that "trajectories with higher PTE [prefill-token-equivalent] costs tend to have lower reasoning correctness" — more reasoning and more tool use do not equal better answers. Anthropic's own migration guidance: on Sonnet 4.6, "if you're not setting effort, you'll see higher latency and may notice the model overthinking. Start with effort set to medium and adjust from there" (Anthropic effort docs).
  • Budget-blind tools. If the underlying tool is a black box returning a fixed-shape result, the agent cannot spend the lifted ceiling intelligently. The hint primarily affects what comes back, not how the agent uses it.
  • Tier-name aliasing across models. OpenAI's reasoning.effort: high and Anthropic's effort: high are calibrated per-model and not equivalent — "the effort scale is calibrated per model, so the same level name does not represent the same underlying value across models" (Claude Code model config). A multi-provider harness that sets effort: high uniformly produces inconsistent compute spend.
  • API-surface churn. Anthropic's removal of budget_tokens on Opus 4.7 in favour of adaptive effort (extended thinking docs) proves that any specific numeric-budget surface is on a deprecation timer. Categorical tiers are the more portable shape.
  • Misclassified callers. A caller who labels every research call "high-effort" — easy under prompt pressure to "be thorough" — converts the optional hint into a default. The cost discipline the pattern was supposed to enable disappears.

Relation to Adjacent Patterns

Pattern Allocates Set by Per
Reasoning budget allocation Reasoning compute by phase Caller Workflow phase
Heuristic effort scaling Agent counts and tool-call ceilings System prompt Task tier
Interactive effort sliders Reasoning tier Human operator Turn
Dual-budget control Remaining tool calls + tokens Agent Candidate action
Per-call budget hint (this page) Returned-token or thinking-token ceiling Caller (agent or harness) Tool call
Effort-aware hooks Hook-side gate strictness Hook Tool call (read-side)

Lifting a per-call ceiling does not replace any of these. It composes with them: a reasoning sandwich allocates by phase, a heuristic scales effort by task tier, and a per-call hint lifts the ceiling on the one or two deepest invocations within that phase.

Example

A research agent runs a sequence of tool calls during a regulatory analysis. Most are routine. One — a web search across a multi-document corpus — needs to inspect many pages without stopping at the standard returned-token cap. The hint goes on that one call.

Before — global ceiling set conservatively, deep-research call truncates:

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "high"},
    tools=[{"type": "web_search"}],
    input="Research the economic impact of semaglutide on global healthcare systems...",
)

After — global ceiling kept conservative, deep call lifts its own ceiling:

response = client.responses.create(
    model="gpt-5.5",
    reasoning={"effort": "xhigh"},
    tools=[
        {
            "type": "web_search",
            "return_token_budget": "unlimited",
        },
    ],
    input="Research the economic impact of semaglutide on global healthcare systems...",
)

The hint is set in the tool definition, not the message, so it scopes precisely to that tool invocation (OpenAI web search guide). Every other call in the session continues to run on the default budget. For long-running multi-search tasks, the same call can be run in background mode (background: true).

The Anthropic equivalent for the reasoning-side hint, on a model that still supports manual mode:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=20000,
    thinking={"type": "enabled", "budget_tokens": 16000},
    messages=[{"role": "user", "content": "Plan the audit of the auth surface..."}],
)

On Opus 4.7, the equivalent is thinking: {type: "adaptive"} with effort: "xhigh" — the manual numeric budget returns a 400 error (Anthropic extended thinking docs). New code targeting current models should prefer the categorical tier over the numeric ceiling.

Key Takeaways

  • A per-call budget hint is a caller-side knob that lifts the reasoning or returned-token ceiling on a single tool invocation without raising the global default.
  • The hint takes three shapes — binary opt-in, categorical tier, numeric ceiling — scoped to either the tool definition or the message. They are not interchangeable.
  • The hint pays off only when the call is infrequent, information-dense, and the model or tool can spend the headroom productively (Liu et al. 2025).
  • Higher is not monotonically better. Su et al. (2026) report that higher tool-integrated reasoning cost correlates with lower correctness; Anthropic's own guidance warns of overthinking on effort: high.
  • Numeric-budget surfaces are on a deprecation timer. Anthropic removed budget_tokens on Opus 4.7 in favour of adaptive effort — categorical tiers are the more portable shape.
  • Apply the hint to the top one or two calls in a session. Misclassified callers convert the optional hint into a default and defeat the cost discipline it was meant to enable.
Feedback