Cost-Aware Agent Design: Route by Complexity, Not Habit¶

Cost-aware agent design routes each task to the cheapest model that meets its complexity, escalating tier only when validation fails.

The Routing Principle¶

Model cost scales with token volume and tier. Top-tier models on every task waste compute; cheap models on complex tasks produce rework — FrugalGPT quantifies this as higher error rates when under-powered models handle tasks requiring multi-step reasoning.

Claude Code supports per-agent model selection via the model field, routing each task type to the appropriate tier.

Task Type	Characteristics	Model Tier
File search, exploration	High volume, low reasoning	Fast (e.g., Haiku)
Code implementation	Balanced capability and speed	Balanced (e.g., Sonnet)
Architecture, complex refactoring	Deep reasoning required	Powerful (e.g., Opus)

Agent Weight Taxonomy¶

Agent initialization cost — tokens consumed by system prompt, tool definitions, and loaded skills — directly affects composability. A community analysis classifies agents by initialization weight:

Weight Class	Token Budget	Composability
Lightweight	<3k tokens	High — fast startup, frequent delegation viable
Medium	10–15k tokens	Moderate — noticeable context cost per invocation
Heavy	25k+ tokens	Low — bottleneck in multi-agent workflows

Treat initialization cost as a performance budget. Lightweight agents get composed more because they fit within context constraints. Heavy agents are justified only when the task requires specialization that cannot be decomposed — otherwise, splitting into lighter sub-agents is cheaper and more composable.

big.LITTLE Multi-Model Orchestration¶

Borrowed from CPU architecture: powerful cores for demanding work, efficient cores for background tasks. Claude Code's Explore subagent implements this — Haiku handles read-only exploration while the main model reasons. A community analysis reports 2–2.5x cost reduction at 85–95% quality on mixed workloads.

Model rotation: start with the cheaper model, escalate only on validation failure. This works when validation is cheap and deterministic — test suites, linters, type checkers.

Cascade routing: FrugalGPT demonstrated up to 98% cost reduction by querying cheaper models first and escalating only when confidence is low. No coding tool implements this natively — the cascade pattern remains a manual or custom implementation. Approximate it with a two-pass pattern: fast model first, deterministic gate (tests, linter, type checker), escalate to capable model on failure.

Beyond cutting average spend, routing also affects the predictability of spend — escalation paths make per-task cost variable. LangChain describes techniques for making a coding agent's token spend predictable, treating bounded cost as an explicit design goal rather than a side effect of tier choice (LangChain — making coding-agent spend predictable).

Roo Code: Mode-Level Routing¶

Roo Code assigns different models to different modes — Architect, Code, Ask, Orchestrator — switching the model automatically when the mode switches. The recommended pattern pairs strong reasoning models with Architect mode and cheaper models with Code mode: the same cost segmentation as per-subagent assignment, but at the mode level.

Role-Based Multi-Model Routing¶

Complexity routing decides which tier; role routing decides which capability. The OPENDEV paper defines five model roles, each independently configurable with fallback chains for graceful degradation (Bui, 2026 §2.2.5):

Role	Purpose	Fallback
Action	Primary execution model for tool-based reasoning	Default for all workloads
Thinking	Extended reasoning without tool access	Action model
Critique	Self-evaluation of output (selective, not every turn)	Thinking model, then Action model
Vision	Vision-language model for screenshots and images	Action model (if vision-capable)
Compact	Fast summarization during context compaction	Action model

Provider abstraction separates role assignment from model identity — swap providers without touching agent code (Bui, 2026 §2.2.5). Clients initialize lazily (only models used in a session), and capabilities are cached locally with TTL refresh for offline startup.

Premium Request Economics¶

In GitHub Copilot, premium request multipliers make model choice a direct economic decision:

Model Tier	Examples	Multiplier
Budget	Claude Haiku 4.5, Grok Code Fast 1	0.25x–0.33x
Included	GPT-5 mini, GPT-4.1, GPT-4o	0x (no premium cost)
Standard	Claude Sonnet 4/4.5/4.6, Gemini 2.5 Pro	1x
Premium	Claude Opus 4.5/4.6	3x
Ultra	Claude Opus 4.6 (fast)	30x

Using a 1x model for tasks that a 0.33x model handles equally well consumes three times the premium request budget for no quality gain. GitHub Copilot's Auto mode provides a 10% multiplier discount and lets users override the selected model at any time (GitHub Changelog: Auto Model Selection).

Model Deprecation Awareness¶

Model IDs have finite lifespans — claude-3-haiku-20240307 is deprecated and retires April 19, 2026, with Haiku 4.5 as the migration path (Anthropic models page). Hardcoded IDs break silently at retirement; the API returns an error rather than routing to a successor. Use display names (haiku, sonnet, opus) where possible, and pin full IDs only when reproducibility requires it.

Tool SEO: Optimizing Agent Descriptions¶

Claude Code uses the description field for delegation decisions. A community analysis terms this "Tool SEO."

Activation keywords that improve delegation reliability:

"use PROACTIVELY" — triggers delegation without explicit request
"Use immediately after [action]" — ties delegation to workflow events
"[domain] specialist" — narrows matching to relevant tasks

Effective descriptions combine activation triggers, domain scope, and temporal context — features that help the orchestrator match tasks to agents.

When This Backfires¶

Validation gates are slow or absent. Cascade routing depends on deterministic, cheap validators (tests, linters, type checkers). If the validation step takes longer than the cost difference between tiers, the cascade adds latency without saving money. Measure gate cost before committing to escalation-based routing.

Single-task pipelines. A three-tier routing system — whether by task type or by code health — adds configuration and coordination overhead. For pipelines with one task type and low invocation volume, a single capable model at a fixed tier is simpler and often cheaper when amortized over setup and maintenance cost.

Frequently-updated model rosters. Role-based routing breaks when a provider deprecates or renames a model tier. Teams without automated model-ID management (display names like haiku, capability caching with TTL) spend engineering time on breakage rather than shipping features.

High task interdependency. When tasks cannot be cleanly separated by complexity — for example, a refactor that requires reasoning at every file edit — routing exploration to a fast model and implementation to a capable one creates friction: the fast model's findings must be re-ingested by the capable model, adding tokens and latency.

Anti-Patterns¶

Default everything to the top tier. Safe but wasteful at scale.

Heavy agents for simple tasks. Decompose into lightweight agents.

Vague agent descriptions. "Helps with code" gives no delegation signal.

Example¶

The following Claude Code sub-agent configuration routes file exploration to a fast model while reserving the balanced model for implementation tasks. The agent descriptions use activation keywords to guide orchestrator delegation.

{
  "agents": [
    {
      "name": "explorer",
      "model": "claude-haiku-4-5",
      "description": "Use PROACTIVELY for any read-only codebase exploration: searching files, reading source, tracing call paths, listing directories. Use immediately after receiving a new task to understand the codebase before implementing.",
      "tools": ["Read", "Glob", "Grep"],
      "system": "You are a read-only exploration agent. Never write or modify files. Return findings as structured summaries."
    },
    {
      "name": "implementer",
      "model": "claude-sonnet-4-5",
      "description": "Use for writing, editing, or refactoring code once exploration is complete. Implementation specialist — invoke after the explorer agent has mapped relevant files.",
      "tools": ["Read", "Edit", "Write", "Bash"]
    },
    {
      "name": "architect",
      "model": "claude-opus-4-5",
      "description": "Use for architectural decisions, complex refactors spanning more than 5 files, or tasks requiring deep cross-cutting reasoning. Invoke sparingly.",
      "tools": ["Read", "Edit", "Write", "Bash"]
    }
  ]
}

The explorer description combines "Use PROACTIVELY" with "Use immediately after receiving a new task" — activation keywords that push the orchestrator to delegate exploration automatically, keeping Haiku on high-volume read-only work while Sonnet and Opus stay reserved for tasks that justify their cost.

Key Takeaways¶

Route tasks to models by complexity — exploration to fast, implementation to balanced, architecture to powerful.
Assign models by role (action, thinking, critique, vision, compact) with independent fallback chains for graceful degradation.
Classify agents by initialization token weight: lightweight (<3k), medium (10–15k), heavy (25k+). Prefer lightweight.
Craft agent descriptions with activation keywords to improve orchestrator delegation accuracy.
Use display names (haiku, sonnet, opus) rather than pinned model IDs to avoid silent breakage at model retirement.
Cascade routing (cheap model first, escalate on validation failure) approximates FrugalGPT-style savings without native tooling support.

Reasoning Budget Allocation — budget reasoning effort the same way you budget tier choice
Heuristic-Based Effort Scaling — heuristics that scale model effort by task signals
Cross-Vendor Competitive Routing — extend tier routing across providers
Code-Health-Gated LLM Tier Routing — route by file-level code health metrics rather than task type
Cognitive Reasoning vs Execution: A Two-Layer Agent Architecture — role split that complements tier routing
Claude Code Sub-Agents — per-agent model selection mechanic
Token-Efficient Tool Design — pair tier routing with lean tool surfaces
Minimum-Sufficient Control Ladder — orthogonal axis: this page escalates model tier by task complexity; the ladder escalates control mechanism by named failure mode