Reasoning Budget Allocation: The Reasoning Sandwich

Allocate maximum reasoning compute to planning and verification phases, reduced compute to execution — rather than using a fixed level throughout.

The Pattern

Not all steps in an agent workflow require the same depth of reasoning. Planning and verification are high-stakes; execution is largely mechanical.

LangChain's deep agent experiments tested a "reasoning sandwich" — extra-high at planning, high at execution, extra-high at verification (xhigh-high-xhigh). It scored highest on Terminal Bench 2.0 (66.5%), beating both continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).

graph LR
    A[Planning<br/>Extra-high compute] --> B[Execution<br/>High compute]
    B --> C[Verification<br/>Extra-high compute]

Phase Breakdown

Planning — extra-high compute. Map the problem space: requirements, approach, risks. Errors here propagate through every subsequent step.

Execution — high compute. Follow the plan: writing code, running commands. Reduced compute handles mechanical steps while lowering per-step cost.

Verification — extra-high compute. Check output against requirements, run tests. A missed failure produces false completion.
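As a minimal sketch of the phase-to-budget mapping above, the following uses a hypothetical `Phase` enum and `EFFORT_BY_PHASE` table. The level names mirror the xhigh-high-xhigh labels; none of these identifiers come from LangChain's harness.

```python
from enum import Enum


class Phase(Enum):
    PLANNING = "planning"
    EXECUTION = "execution"
    VERIFICATION = "verification"


# Hypothetical mapping; level names follow the xhigh-high-xhigh labels above.
EFFORT_BY_PHASE = {
    Phase.PLANNING: "xhigh",
    Phase.EXECUTION: "high",
    Phase.VERIFICATION: "xhigh",
}


def effort_for(phase: Phase) -> str:
    """Return the reasoning-effort label assigned to a workflow phase."""
    return EFFORT_BY_PHASE[phase]


def sandwich_schedule() -> list[str]:
    """Effort labels in workflow order: plan, execute, verify."""
    return [effort_for(phase) for phase in Phase]
```

Keeping the mapping in one table means a harness can change a single phase's level without touching the code that runs the phases.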

Dual-Mode Operation

The OPENDEV paper implements the sandwich architecturally through two modes (Bui, 2026 §2.2.2):

Plan Mode: planning delegated to a Planner subagent whose schema contains only read-only tools (subagent schema-level tool filtering) — eliminating state machine complexity
Normal Mode: full tool access for implementation

Mode switching triggers via explicit command (/plan) or planning-intent heuristics. This maps to the sandwich: Plan Mode (extra-high compute) → Normal Mode execution (high) → verification (extra-high).
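One way to picture schema-level tool filtering is the sketch below. The `Tool` dataclass, its `read_only` flag, and the tool names are hypothetical illustrations, not the OPENDEV implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    read_only: bool


# Hypothetical tool registry, for illustration only.
ALL_TOOLS = [
    Tool("read_file", read_only=True),
    Tool("search_code", read_only=True),
    Tool("write_file", read_only=False),
    Tool("run_command", read_only=False),
]


def tools_for_mode(mode: str) -> list[str]:
    """Return the tool names exposed in the given mode's schema."""
    if mode == "plan":
        # The planner's schema simply omits write-capable tools,
        # so there is nothing to block at call time.
        return [tool.name for tool in ALL_TOOLS if tool.read_only]
    if mode == "normal":
        return [tool.name for tool in ALL_TOOLS]
    raise ValueError(f"unknown mode: {mode}")
```

Because the restriction lives in the schema handed to the planner, the planner cannot select a write tool in the first place, which is what removes the need for a runtime state machine guarding each call.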

An optional thinking phase adds a separate inference call using a dedicated Thinking model before action selection (Bui, 2026 §2.2.6) — amplifying any phase where deeper reasoning is needed.

Extended Thinking Budget Triggers

Extended thinking allocates a dedicated reasoning budget before the model generates its response — distinct from the think tool, which reasons mid-stream between tool calls.

In Claude Code, including "ultrathink" in a skill's content enables extended thinking, allocating maximum thinking tokens for that skill.

Maximum Thinking as a Cost-Performance Tradeoff

A community analysis positions maximum thinking on a balanced model as an alternative to model tier upgrades. Exhausting the thinking budget on a cheaper model can cost less than switching tiers, so the tradeoff is worth evaluating before moving to a higher-cost tier for reasoning-heavy tasks.

This stacks with other techniques:

Extended thinking — maximum reasoning tokens via trigger keyword
Plan mode — structured planning before execution
Iterative critique — systematic self-review cycles to catch edge cases

Each layer adds cost; combine them when the task warrants the investment.
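The cost tradeoff in this section can be checked with simple arithmetic. The sketch below assumes thinking tokens are billed at the output-token rate. All prices and token counts are placeholders chosen for illustration, not figures from any price list or from the community analysis.

```python
def call_cost(input_tokens: int, output_tokens: int, thinking_tokens: int,
              price_in: float, price_out: float) -> float:
    """Cost in dollars for one call; prices are per million tokens.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * price_in + billed_output * price_out) / 1_000_000


# Placeholder prices (USD per million tokens), not taken from any price list.
BALANCED = {"price_in": 2.0, "price_out": 10.0}
HIGHER_TIER = {"price_in": 10.0, "price_out": 50.0}

# Same task on both tiers: a large thinking budget on the balanced model,
# a small one on the higher tier.
balanced_max_thinking = call_cost(5_000, 1_000, 10_000, **BALANCED)
higher_tier_light_thinking = call_cost(5_000, 1_000, 2_000, **HIGHER_TIER)
```

With these placeholder numbers the balanced model comes out cheaper, but a larger thinking budget or a smaller price gap can reverse the result, so the comparison is worth rerunning with real prices and measured token counts.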

Applying Budget Triggers

Claude Code skills: include "ultrathink" in SKILL.md content to enable extended thinking
Claude API: set the thinking budget (budget_tokens in the thinking parameter) per call, high for planning/verification and standard for execution
Any tool with model routing: route planning and verification to a capable model, execution to a cheaper one

For tools without per-call configuration, approximate through prompt structure: deep reasoning guidance in planning prompts, less in execution.
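The prompt-structure approximation might look like the sketch below. The `REASONING_GUIDANCE` strings and the `build_prompt` helper are hypothetical; the wording is illustrative and has not been evaluated as a prompt.

```python
# Hypothetical guidance strings; wording is illustrative, not a tested prompt.
REASONING_GUIDANCE = {
    "planning": (
        "Before answering, reason step by step about requirements, "
        "approach, and risks."
    ),
    "execution": "Follow the plan directly and keep reasoning brief.",
    "verification": (
        "Before answering, check each requirement against the output "
        "one by one."
    ),
}


def build_prompt(phase: str, body: str) -> str:
    """Prefix the phase-specific reasoning guidance to the prompt body."""
    if phase not in REASONING_GUIDANCE:
        raise ValueError(f"unknown phase: {phase}")
    return f"{REASONING_GUIDANCE[phase]}\n\n{body}"
```

This does not control token budgets directly; it only nudges the model toward more or less deliberation, so it is a weaker substitute for per-call configuration.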

Why It Works

Different phases impose structurally different cognitive demands (Bui, 2026 §2.2.5). Planning requires exploring the possibility space and accounting for requirements, edge cases, and risks, and errors here propagate downstream. Execution follows a decided plan and is largely mechanical. Verification must compare output against requirements precisely, because a missed failure produces false completion. Applying uniform maximum compute to execution wastes budget on mechanical steps and, as the LangChain benchmark showed, causes timeouts that degrade completion rates. Concentrating compute where ambiguity is highest balances cost against quality.

When to Apply

The sandwich pays off when:

Tasks have discrete planning, execution, and verification phases
Planning mistakes are costly to repair after implementation begins
Verification failures would be falsely reported as success

Single-step tasks and independent parallel tool calls see no benefit from added reasoning overhead.

When This Backfires

The 2.9-percentage-point gap between the sandwich (66.5%) and uniform high (63.6%) does not always justify harness complexity. The sandwich is worse than uniform compute when:

Phases are not cleanly separable. Exploratory debugging and interleaved planning/execution force misclassified routing — the sandwich degrades to noisy uniform compute with routing overhead.
Mode-switching adds more bugs than it prevents. Teams without the budget for reliable planner/executor/verifier routing fare better with a single tier at uniform high reasoning.
Verification is cheap relative to planning. When correctness is checked by tests or types, extra-high model-based verification duplicates what the test harness already does.
Execution dominates the trajectory. Bulk refactors and migrations spend most tokens in execution; reducing compute there saves little while planning/verification contribute a small share of cost.

Key Takeaways

Planning and verification warrant extra-high reasoning compute; execution warrants high.
The sandwich achieved the highest completion rate (66.5%) in LangChain benchmarks, outperforming continuous maximum reasoning (53.9%, penalized by timeouts) and uniform high reasoning (63.6%).
Extended thinking triggers (e.g., "ultrathink" in Claude Code skills) front-load reasoning before generation — distinct from mid-stream think tool reasoning.
Maximum-thinking on a balanced model may cost less than a tier upgrade for reasoning-heavy tasks — evaluate before switching tiers.
Stack extended thinking with plan mode and iterative critique for tasks that warrant the added cost.
Dual-mode operation (plan/normal) enforces the sandwich architecturally by restricting tool access per phase.

Example

A Claude API implementation routing by phase:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-5"


def text_of(message) -> str:
    # With thinking enabled, thinking blocks precede the text blocks,
    # so select blocks by type instead of by index.
    return "".join(block.text for block in message.content if block.type == "text")


def ask(prompt: str, budget_tokens: int) -> str:
    message = client.messages.create(
        model=MODEL,
        max_tokens=budget_tokens + 4000,  # max_tokens must exceed budget_tokens
        thinking={"type": "enabled", "budget_tokens": budget_tokens},
        messages=[{"role": "user", "content": prompt}],
    )
    return text_of(message)


def run_sandwich(task: str) -> str:
    # Planning: extra-high thinking budget
    plan = ask(f"Plan: {task}", budget_tokens=10000)

    # Execution: standard thinking budget
    result = ask(f"Execute plan:\n{plan}\nTask: {task}", budget_tokens=2000)

    # Verification: extra-high thinking budget; include the task so the
    # verifier can see the requirements it is checking against
    return ask(
        f"Verify result meets requirements:\nTask: {task}\nResult:\n{result}",
        budget_tokens=10000,
    )

In Claude Code skills, add ultrathink to the skill content for planning and verification skills, and omit it for execution skills.

Discrete Phase Separation
Heuristic-Based Effort Scaling
Cost-Aware Agent Design
Code-Health-Gated LLM Tier Routing — pre-generation model tier selection via code health metrics
Know When Not to Add Structured Reasoning
Cognitive Reasoning vs Execution: A Two-Layer Agent
Think Tool
Harness Engineering