Token Engineering¶
Token engineering gets the same result for fewer, cheaper tokens — routing to the right model and trimming each call, without degrading output.
Cost is now a first-order constraint on agentic coding, but the techniques that control it are scattered — model routing, effort budgets, prompt compression, caching discipline, token-efficient output, small-model offload, batch scheduling. Token engineering is the name for that cluster: the deliberate practice of spending fewer expensive tokens at the wrong time, while holding output quality fixed.
It cuts across the site. The canonical treatment of each technique still lives in its home discipline — context engineering, agent design, tool engineering, observability. This section owns the pages whose primary subject is token cost, and crosswalks the rest under one frame so you can navigate "how do I cut token spend without losing quality?" as a single topic.
What token engineering is — and isn't¶
- It is the optimisation goal: same task outcome, fewer/cheaper tokens. Every technique below carries an implicit "without degrading the end result" clause.
- It is not context engineering. Context engineering decides what information enters the window for quality and reliability; token engineering is the cost-and-efficiency lens over those decisions. They overlap (lean context is cheaper) but answer different questions.
- It is not generic cost-performance. The
cost-performancetag spans latency, throughput, and infra; token engineering is specifically about the token as the unit of spend. - The quality constraint is the whole point. Cutting tokens can backfire — see Token Preservation Backfire, the guardrail every technique here must respect.
The crosswalk¶
The frame is four "rights" — the right model, the right token, the right cache, at the right time — plus three supporting levers (effort scaling, small-model offload, measurement).
Right model — routing¶
Send each task to the cheapest model and tier that still passes, escalating only on failure.
- Routing Decision Framework — the selection map over the routing pages below: pick by dominant signal (complexity, blast radius, latency, cost)
- Cost-Aware Agent Design: Route by Complexity, Not Habit — the cornerstone: match model capability to task complexity, escalate on validation failure
- Gateway Model Routing — one gateway knob drives both the inference target and the model picker
- Auto Model Selection — hand per-task model choice to the harness
- Cross-Vendor Competitive Routing — race competing vendors, gate on the winner
- Model-Neutral Agent Architecture — keep the agent portable so routing stays a config decision
- Multi-Shape BYOK Provider — bring-your-own-key routing across provider shapes
- Parsimonious Agent Routing — one delegation plan that jointly optimises decompose, worker, and budget
- Self-Healing Tool Routing — route around failing tools before they burn retries
Right token — lean context and output¶
Shrink what each call has to carry, on both the input and output sides.
- Token-Efficient Tool Design — each tool call injects the minimum tokens for the next decision
- Token-Efficient Code Generation — idiomatic structure beats "be concise" prompting
- Tokenizer Swap Tax — budgeting for migrations that change token counts under flat per-token pricing
- Prompt Compression — maximise signal per token in instructions
- Semantic Density Optimization — raise task-relevant tokens per byte in the codebase
- Context Budget Allocation — treat context as a finite budget across sources
Right cache — caching discipline¶
Structure prompts so the cacheable prefix stays stable and hits.
- Prompt Caching: Architectural Discipline for Agents — design for cache hits and cross-provider cache economics
- Static Content First to Maximize Cache Hits — order stable content first so the prefix caches
- Exclude Dynamic System Prompt Sections — move per-machine context out so fleets share one cache entry
- KV Cache Invalidation in Local Inference — attribution headers that silently break the KV cache
Right time — temporal routing¶
Route non-urgent work into cheaper capacity windows. Batch APIs are the concrete cost primitive: Anthropic's Message Batches and OpenAI's Batch API both run jobs asynchronously at a 50% discount, completing within 24 hours — typically under an hour for Anthropic (Anthropic — Message Batches; OpenAI — Batch API). Work that can wait — overnight evals, doc refreshes, bulk refactors, research passes — belongs in those windows.
- Temporal Token Routing: Batch and Flex Tiers for Non-Urgent Work — the right-time decision: which workload class belongs in batch, flex, or the synchronous tier
- Idle-Time Speculative Planning — use idle compute to pre-plan likely next steps
- Background TODO Agent — defer non-urgent work to a background agent
- Programmatic Cloud Agent Dispatch — schedule deferred agent runs into cheaper capacity
This axis is the freshest and least covered today — see the spin-off issues for deeper pages on eval-gated scheduling.
Effort and budget scaling¶
Spend reasoning compute in proportion to task difficulty, not uniformly.
- Reasoning Budget Allocation — the reasoning sandwich: heavy planning and verification, light execution
- Heuristic-Based Effort Scaling — encode effort rules in the system prompt
- Per-Call Budget Hints on Tool Invocations — raise the cap on one dense, infrequent call
- Per-Tool Extended Reasoning Opt-In — tool-call-scoped reasoning budgets
Small-model offload¶
Push verbose intermediate work to a cheaper model and return a compact result.
- Specialized Small Language Models as Agent Sub-Tools — an SLM absorbs raw bytes; the orchestrator never sees them
- Compositional Skill Routing — route across a large skill library without loading it all
Measurement and visibility¶
You cannot reduce what you do not measure — instrument spend before cutting it.
- Token-Cost Profiling and Reduction for Always-On Agentic Workflows — the instrument-attribute-fix-verify loop
- Cost-Quality Pareto Measurement for Agent Configurations — plot each configuration on the (cost, quality) frontier so quality-trading downgrades are visible
- Code Cleanliness as an Agent Cost Lever — cleaner code cut token use 7-8% with no pass-rate change
- Per-Plugin Token-Cost Attribution — attribute spend down to the plugin
- BYOK Model Token Visibility — in-IDE token and context telemetry for BYOK routes
Related¶
- Concept Map — all site content grouped by theme
- Context Engineering — the canonical home for lean-context techniques
- Agent Design — routing, effort, and offload patterns live here
- Token Preservation Backfire — the quality guardrail this section must respect