Skip to content

Cost-Aware Agent Design: Route by Complexity, Not Habit

Cost-aware agent design routes each task to the cheapest model that meets its complexity, escalating tier only when validation fails.

The Routing Principle

Model cost scales with token volume and tier. Top-tier models on every task waste compute; cheap models on complex tasks produce rework — FrugalGPT quantifies this as higher error rates when under-powered models handle tasks requiring multi-step reasoning.

Claude Code supports per-agent model selection via the model field, routing each task type to the appropriate tier.

Task Type Characteristics Model Tier
File search, exploration High volume, low reasoning Fast (e.g., Haiku)
Code implementation Balanced capability and speed Balanced (e.g., Sonnet)
Architecture, complex refactoring Deep reasoning required Powerful (e.g., Opus)

Agent Weight Taxonomy

Agent initialization cost — tokens consumed by system prompt, tool definitions, and loaded skills — directly affects composability. A community analysis classifies agents by initialization weight:

Weight Class Token Budget Composability
Lightweight <3k tokens High — fast startup, frequent delegation viable
Medium 10–15k tokens Moderate — noticeable context cost per invocation
Heavy 25k+ tokens Low — bottleneck in multi-agent workflows

Treat initialization cost as a performance budget. Lightweight agents get composed more because they fit within context constraints. Heavy agents are justified only when the task requires specialization that cannot be decomposed — otherwise, splitting into lighter sub-agents is cheaper and more composable.

big.LITTLE Multi-Model Orchestration

Borrowed from CPU architecture: powerful cores for demanding work, efficient cores for background tasks. Claude Code's Explore subagent implements this — Haiku handles read-only exploration while the main model reasons. A community analysis reports 2–2.5x cost reduction at 85–95% quality on mixed workloads.

Model rotation: start with the cheaper model, escalate only on validation failure. This works when validation is cheap and deterministic — test suites, linters, type checkers.

Cascade routing: FrugalGPT demonstrated up to 98% cost reduction by querying cheaper models first and escalating only when confidence is low. No coding tool implements this natively — the cascade pattern remains a manual or custom implementation. Approximate it with a two-pass pattern: fast model first, deterministic gate (tests, linter, type checker), escalate to capable model on failure.

Beyond cutting average spend, routing also affects the predictability of spend — escalation paths make per-task cost variable. LangChain describes techniques for making a coding agent's token spend predictable, treating bounded cost as an explicit design goal rather than a side effect of tier choice (LangChain — making coding-agent spend predictable).

Roo Code: Mode-Level Routing

Roo Code assigns different models to different modes — Architect, Code, Ask, Orchestrator — switching the model automatically when the mode switches. The recommended pattern pairs strong reasoning models with Architect mode and cheaper models with Code mode: the same cost segmentation as per-subagent assignment, but at the mode level.

Role-Based Multi-Model Routing

Complexity routing decides which tier; role routing decides which capability. The OPENDEV paper defines five model roles, each independently configurable with fallback chains for graceful degradation (Bui, 2026 §2.2.5):

Role Purpose Fallback
Action Primary execution model for tool-based reasoning Default for all workloads
Thinking Extended reasoning without tool access Action model
Critique Self-evaluation of output (selective, not every turn) Thinking model, then Action model
Vision Vision-language model for screenshots and images Action model (if vision-capable)
Compact Fast summarization during context compaction Action model

Provider abstraction separates role assignment from model identity — swap providers without touching agent code (Bui, 2026 §2.2.5). Clients initialize lazily (only models used in a session), and capabilities are cached locally with TTL refresh for offline startup.

Premium Request Economics

In GitHub Copilot, premium request multipliers make model choice a direct economic decision:

Model Tier Examples Multiplier
Budget Claude Haiku 4.5, Grok Code Fast 1 0.25x–0.33x
Included GPT-5 mini, GPT-4.1, GPT-4o 0x (no premium cost)
Standard Claude Sonnet 4/4.5/4.6, Gemini 2.5 Pro 1x
Premium Claude Opus 4.5/4.6 3x
Ultra Claude Opus 4.6 (fast) 30x

Using a 1x model for tasks that a 0.33x model handles equally well consumes three times the premium request budget for no quality gain. GitHub Copilot's Auto mode provides a 10% multiplier discount and lets users override the selected model at any time (GitHub Changelog: Auto Model Selection).

Model Deprecation Awareness

Model IDs have finite lifespans — claude-3-haiku-20240307 is deprecated and retires April 19, 2026, with Haiku 4.5 as the migration path (Anthropic models page). Hardcoded IDs break silently at retirement; the API returns an error rather than routing to a successor. Use display names (haiku, sonnet, opus) where possible, and pin full IDs only when reproducibility requires it.

Tool SEO: Optimizing Agent Descriptions

Claude Code uses the description field for delegation decisions. A community analysis terms this "Tool SEO."

Activation keywords that improve delegation reliability:

  • "use PROACTIVELY" — triggers delegation without explicit request
  • "Use immediately after [action]" — ties delegation to workflow events
  • "[domain] specialist" — narrows matching to relevant tasks

Effective descriptions combine activation triggers, domain scope, and temporal context — features that help the orchestrator match tasks to agents.

When This Backfires

Validation gates are slow or absent. Cascade routing depends on deterministic, cheap validators (tests, linters, type checkers). If the validation step takes longer than the cost difference between tiers, the cascade adds latency without saving money. Measure gate cost before committing to escalation-based routing.

Single-task pipelines. A three-tier routing system — whether by task type or by code health — adds configuration and coordination overhead. For pipelines with one task type and low invocation volume, a single capable model at a fixed tier is simpler and often cheaper when amortized over setup and maintenance cost.

Frequently-updated model rosters. Role-based routing breaks when a provider deprecates or renames a model tier. Teams without automated model-ID management (display names like haiku, capability caching with TTL) spend engineering time on breakage rather than shipping features.

High task interdependency. When tasks cannot be cleanly separated by complexity — for example, a refactor that requires reasoning at every file edit — routing exploration to a fast model and implementation to a capable one creates friction: the fast model's findings must be re-ingested by the capable model, adding tokens and latency.

Anti-Patterns

Default everything to the top tier. Safe but wasteful at scale.

Heavy agents for simple tasks. Decompose into lightweight agents.

Vague agent descriptions. "Helps with code" gives no delegation signal.

Example

The following Claude Code sub-agent configuration routes file exploration to a fast model while reserving the balanced model for implementation tasks. The agent descriptions use activation keywords to guide orchestrator delegation.

{
  "agents": [
    {
      "name": "explorer",
      "model": "claude-haiku-4-5",
      "description": "Use PROACTIVELY for any read-only codebase exploration: searching files, reading source, tracing call paths, listing directories. Use immediately after receiving a new task to understand the codebase before implementing.",
      "tools": ["Read", "Glob", "Grep"],
      "system": "You are a read-only exploration agent. Never write or modify files. Return findings as structured summaries."
    },
    {
      "name": "implementer",
      "model": "claude-sonnet-4-5",
      "description": "Use for writing, editing, or refactoring code once exploration is complete. Implementation specialist — invoke after the explorer agent has mapped relevant files.",
      "tools": ["Read", "Edit", "Write", "Bash"]
    },
    {
      "name": "architect",
      "model": "claude-opus-4-5",
      "description": "Use for architectural decisions, complex refactors spanning more than 5 files, or tasks requiring deep cross-cutting reasoning. Invoke sparingly.",
      "tools": ["Read", "Edit", "Write", "Bash"]
    }
  ]
}

The explorer description combines "Use PROACTIVELY" with "Use immediately after receiving a new task" — activation keywords that push the orchestrator to delegate exploration automatically, keeping Haiku on high-volume read-only work while Sonnet and Opus stay reserved for tasks that justify their cost.

Key Takeaways

  • Route tasks to models by complexity — exploration to fast, implementation to balanced, architecture to powerful.
  • Assign models by role (action, thinking, critique, vision, compact) with independent fallback chains for graceful degradation.
  • Classify agents by initialization token weight: lightweight (<3k), medium (10–15k), heavy (25k+). Prefer lightweight.
  • Craft agent descriptions with activation keywords to improve orchestrator delegation accuracy.
  • Use display names (haiku, sonnet, opus) rather than pinned model IDs to avoid silent breakage at model retirement.
  • Cascade routing (cheap model first, escalate on validation failure) approximates FrugalGPT-style savings without native tooling support.
Feedback