Reasoning Budget Allocation: The Reasoning Sandwich
Allocate maximum reasoning compute to planning and verification phases, reduced compute to execution — rather than using a fixed level throughout.
The Pattern
Not all steps in an agent workflow require the same depth of reasoning. Planning and verification are high-stakes; execution is largely mechanical.
LangChain's deep agent experiments tested a "reasoning sandwich" — extra-high at planning, high at execution, extra-high at verification (xhigh-high-xhigh). It scored highest on Terminal Bench 2.0 (66.5%), beating both continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).
graph LR
A[Planning<br/>Extra-high compute] --> B[Execution<br/>High compute]
B --> C[Verification<br/>Extra-high compute]
Phase Breakdown
Planning — extra-high compute. Map the problem space: requirements, approach, risks. Errors here propagate through every subsequent step.
Execution — high compute. Follow the plan: writing code, running commands. Reduced compute handles mechanical steps while lowering per-step cost.
Verification — extra-high compute. Check output against requirements, run tests. A missed failure produces false completion.
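The phase breakdown above can be sketched as a small lookup table. This is a minimal illustration, not taken from the LangChain experiments: the `Phase` enum, the `SANDWICH` table, and the effort label strings are assumed names that you would map to whatever reasoning setting your model API accepts.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTION = "execution"
    VERIFICATION = "verification"


# Effort labels follow the xhigh-high-xhigh sandwich.
# The label strings are illustrative placeholders, not API values.
SANDWICH = {
    Phase.PLANNING: "xhigh",
    Phase.EXECUTION: "high",
    Phase.VERIFICATION: "xhigh",
}


def effort_for(phase: Phase) -> str:
    """Return the reasoning effort label for a workflow phase."""
    return SANDWICH[phase]


def schedule(phases: list[Phase]) -> list[tuple[str, str]]:
    """Pair each phase in a trajectory with its effort label."""
    return [(p.value, effort_for(p)) for p in phases]


if __name__ == "__main__":
    for name, effort in schedule(list(Phase)):
        print(f"{name}: {effort}")
```

Keeping the mapping in one table means a uniform-high baseline is a one-line change, which makes it easy to compare the two configurations on your own tasks.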
Dual-Mode Operation
The OPENDEV paper implements the sandwich architecturally through two modes (Bui, 2026 §2.2.2):
- Plan Mode: planning delegated to a Planner subagent whose schema contains only read-only tools (subagent schema-level tool filtering) — eliminating state machine complexity
- Normal Mode: full tool access for implementation
Mode switching triggers via explicit command (/plan) or planning-intent heuristics. This maps to the sandwich: Plan Mode (extra-high compute) → Normal Mode execution (high) → verification (extra-high).
An optional thinking phase adds a separate inference call using a dedicated Thinking model before action selection (Bui, 2026 §2.2.6) — amplifying any phase where deeper reasoning is needed.
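The mode switch and schema-level tool filtering described above might look roughly like the sketch below. It is not the OPENDEV implementation: the tool names, the `PLANNING_CUES` word list, and the `select_mode` and `tool_schema` helpers are hypothetical, and a real planning-intent heuristic would likely be richer than a keyword match.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool


# Hypothetical tool registry; the names are placeholders.
ALL_TOOLS = [
    Tool("read_file", read_only=True),
    Tool("search_code", read_only=True),
    Tool("write_file", read_only=False),
    Tool("run_command", read_only=False),
]

# Hypothetical planning-intent cues, matched as whole words.
PLANNING_CUES = {"plan", "design", "outline", "architect"}


def select_mode(user_input: str) -> str:
    """Return 'plan' for an explicit /plan command or a planning cue, else 'normal'."""
    text = user_input.strip().lower()
    if text.startswith("/plan"):
        return "plan"
    words = set(re.findall(r"[a-z]+", text))
    return "plan" if words & PLANNING_CUES else "normal"


def tool_schema(mode: str) -> list[str]:
    """Plan mode exposes only read-only tools; normal mode exposes every tool."""
    return [t.name for t in ALL_TOOLS if t.read_only or mode == "normal"]


if __name__ == "__main__":
    for prompt in ("/plan migrate the database", "Fix the failing test"):
        mode = select_mode(prompt)
        print(prompt, "->", mode, tool_schema(mode))
```

Because the planner never receives write tools in its schema, there is no runtime check to bypass, which is the point of filtering at the schema level instead of guarding each call.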
Extended Thinking Budget Triggers
Extended thinking allocates a dedicated reasoning budget before the model generates its response — distinct from the think tool, which reasons mid-stream between tool calls.
In Claude Code, including "ultrathink" in a skill's content enables extended thinking, allocating maximum thinking tokens for that skill.
Maximum Thinking as a Cost-Performance Tradeoff
A community analysis positions maximum thinking on a balanced model as an alternative to a model tier upgrade. Exhausting the thinking budget on a cheaper model can cost less than switching tiers, a tradeoff worth evaluating before moving to a higher-cost tier for reasoning-heavy tasks.
This stacks with other techniques:
- Extended thinking — maximum reasoning tokens via trigger keyword
- Plan mode — structured planning before execution
- Iterative critique — systematic self-review cycles to catch edge cases
Each layer adds cost; combine them when the task warrants the investment.
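The tier tradeoff described in this section can be checked with simple arithmetic before committing to either option. The sketch below uses hypothetical prices and token counts chosen only for illustration, and it assumes thinking tokens are billed at the output rate. Substitute your provider's current pricing and your own measured token usage.

```python
def call_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
              input_price: float, output_price: float) -> float:
    """Dollar cost of one call; prices are dollars per million tokens.

    Assumes thinking tokens are billed at the output rate.
    """
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


# Hypothetical prices and token counts, not real pricing data.
balanced_max_thinking = call_cost(
    20_000, 2_000, 30_000, input_price=3.0, output_price=15.0
)
higher_tier_light_thinking = call_cost(
    20_000, 2_000, 4_000, input_price=15.0, output_price=75.0
)

if __name__ == "__main__":
    print(f"balanced model, maximum thinking: ${balanced_max_thinking:.2f}")
    print(f"higher tier, light thinking:      ${higher_tier_light_thinking:.2f}")
```

The comparison covers cost only. With a different price ratio or a larger thinking budget the result can flip, and it says nothing about whether the cheaper model reaches the same quality.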
Applying Budget Triggers
- Claude Code skills: include "ultrathink" in SKILL.md content to enable extended thinking
- Claude API: set the thinking budget parameter (budget_tokens) per call: high for planning/verification, standard for execution
- Any tool with model routing: route planning and verification to a capable model, execution to a cheaper one
For tools without per-call configuration, approximate through prompt structure: deep reasoning guidance in planning prompts, less in execution.
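For tools without per-call configuration, the prompt-structure approximation could be sketched as follows. The guidance strings and the `build_prompt` helper are illustrative assumptions, not wording from any cited source. The idea is only that planning and verification prompts ask for explicit deliberation while execution prompts do not.

```python
# Hypothetical per-phase guidance; tune the wording for your own tool.
REASONING_GUIDANCE = {
    "planning": (
        "Before answering, list the requirements, compare at least two "
        "approaches, and name the main risks."
    ),
    "execution": "Follow the plan step by step. Do not revisit design decisions.",
    "verification": (
        "Check each requirement against the output one by one and report "
        "any requirement that is not met."
    ),
}


def build_prompt(phase: str, task: str) -> str:
    """Prefix the task with the reasoning guidance for the given phase."""
    if phase not in REASONING_GUIDANCE:
        raise ValueError(f"unknown phase: {phase!r}")
    return f"{REASONING_GUIDANCE[phase]}\n\nTask: {task}"


if __name__ == "__main__":
    print(build_prompt("planning", "Add retry logic to the HTTP client"))
```

This only nudges how much the model deliberates. It does not guarantee a reasoning budget the way a per-call parameter does.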
Why It Works
Different phases impose structurally different cognitive demands (Bui, 2026 §2.2.5). Planning requires exploring the possibility space and accounting for requirements, edge cases, and risks, and errors here propagate downstream. Execution follows a decided plan and is largely mechanical. Verification must compare output against requirements precisely, because a missed failure produces false completion. Applying uniform maximum compute to execution wastes budget on mechanical steps and, as the LangChain benchmark showed, causes timeouts that degrade completion rates. Concentrating compute where ambiguity is highest balances cost against quality.
When to Apply
The sandwich pays off when:
- Tasks have discrete planning, execution, and verification phases
- Planning mistakes are costly to repair after implementation begins
- Verification failures would be falsely reported as success
Single-step tasks and independent parallel tool calls see no benefit from added reasoning overhead.
When This Backfires
The 2.9-percentage-point gap between the sandwich (66.5%) and uniform high (63.6%) does not always justify the added harness complexity. The sandwich is worse than uniform compute when:
- Phases are not cleanly separable. Exploratory debugging and interleaved planning/execution force misclassified routing — the sandwich degrades to noisy uniform compute with routing overhead.
- Mode-switching adds more bugs than it prevents. Teams without the budget for reliable planner/executor/verifier routing fare better with a single tier at uniform high reasoning.
- Verification is cheap relative to planning. When correctness is checked by tests or types, extra-high model-based verification duplicates what the test harness already does.
- Execution dominates the trajectory. Bulk refactors and migrations spend most tokens in execution, so planning and verification contribute only a small share of cost and the sandwich differs little from uniform high reasoning.
Key Takeaways
- Planning and verification warrant extra-high reasoning compute; execution warrants high.
- The sandwich achieved the highest completion rate (66.5%) in LangChain benchmarks, outperforming continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).
- Extended thinking triggers (e.g., "ultrathink" in Claude Code skills) front-load reasoning before generation — distinct from mid-stream think tool reasoning.
- Maximum-thinking on a balanced model may cost less than a tier upgrade for reasoning-heavy tasks — evaluate before switching tiers.
- Stack extended thinking with plan mode and iterative critique for tasks that warrant the added cost.
- Dual-mode operation (plan/normal) enforces the sandwich architecturally by restricting tool access per phase.
Example
A Claude API implementation routing by phase:
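import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def final_text(response) -> str:
    # With thinking enabled, thinking blocks precede the text block,
    # so select text blocks by type rather than by position.
    return "".join(block.text for block in response.content if block.type == "text")

def run_sandwich(task: str) -> str:
    # Planning: extra-high thinking budget (max_tokens must exceed budget_tokens)
    plan = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=[{"role": "user", "content": f"Plan: {task}"}],
    )
    # Execution: standard thinking budget
    result = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": 2000},
        messages=[{"role": "user", "content": f"Execute plan:\n{final_text(plan)}\nTask: {task}"}],
    )
    # Verification: extra-high thinking budget; include the task so the
    # verifier can see the requirements it is checking against
    verdict = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=[{"role": "user", "content": f"Verify the result meets the task requirements.\nTask: {task}\nResult:\n{final_text(result)}"}],
    )
    return final_text(verdict)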
In Claude Code skills, add ultrathink to the skill content for planning and verification skills, and omit it for execution skills.
Related
- Discrete Phase Separation
- Heuristic-Based Effort Scaling
- Cost-Aware Agent Design
- Code-Health-Gated LLM Tier Routing — pre-generation model tier selection via code health metrics
- Know When Not to Add Structured Reasoning
- Cognitive Reasoning vs Execution: A Two-Layer Agent
- Think Tool
- Harness Engineering