Token Engineering¶

Token engineering gets the same result for fewer, cheaper tokens — routing to the right model and trimming each call, without degrading output.

Cost is now a first-order constraint on agentic coding, but the techniques that control it are scattered — model routing, effort budgets, prompt compression, caching discipline, token-efficient output, small-model offload, batch scheduling. Token engineering is the name for that cluster: the deliberate practice of spending fewer expensive tokens at the wrong time, while holding output quality fixed.

It cuts across the site. The canonical treatment of each technique still lives in its home discipline — context engineering, agent design, tool engineering, observability. This section owns the pages whose primary subject is token cost, and crosswalks the rest under one frame so you can navigate "how do I cut token spend without losing quality?" as a single topic.

What token engineering is — and isn't¶

It is the optimisation goal: same task outcome, fewer/cheaper tokens. Every technique below carries an implicit "without degrading the end result" clause.
It is not context engineering. Context engineering decides what information enters the window for quality and reliability; token engineering is the cost-and-efficiency lens over those decisions. They overlap (lean context is cheaper) but answer different questions.
It is not generic cost-performance. The cost-performance tag spans latency, throughput, and infra; token engineering is specifically about the token as the unit of spend.
The quality constraint is the whole point. Cutting tokens can backfire — see Token Preservation Backfire, the guardrail every technique here must respect.

The crosswalk¶

The frame is four "rights" — the right model, the right token, the right cache, at the right time — plus three supporting levers (effort scaling, small-model offload, measurement).

Right model — routing¶

Send each task to the cheapest model and tier that still passes, escalating only on failure.

Routing Decision Framework — the selection map over the routing pages below: pick by dominant signal (complexity, blast radius, latency, cost)
Cost-Aware Agent Design: Route by Complexity, Not Habit — the cornerstone: match model capability to task complexity, escalate on validation failure
Gateway Model Routing — one gateway knob drives both the inference target and the model picker
Auto Model Selection — hand per-task model choice to the harness
Cross-Vendor Competitive Routing — race competing vendors, gate on the winner
Model-Neutral Agent Architecture — keep the agent portable so routing stays a config decision
Multi-Shape BYOK Provider — bring-your-own-key routing across provider shapes
Parsimonious Agent Routing — one delegation plan that jointly optimises decompose, worker, and budget
Self-Healing Tool Routing — route around failing tools before they burn retries

Right token — lean context and output¶

Shrink what each call has to carry, on both the input and output sides.

Token-Efficient Tool Design — each tool call injects the minimum tokens for the next decision
Token-Efficient Code Generation — idiomatic structure beats "be concise" prompting
Tokenizer Swap Tax — budgeting for migrations that change token counts under flat per-token pricing
Prompt Compression — maximise signal per token in instructions
Semantic Density Optimization — raise task-relevant tokens per byte in the codebase
Context Budget Allocation — treat context as a finite budget across sources

Right cache — caching discipline¶

Structure prompts so the cacheable prefix stays stable and hits.

Prompt Caching: Architectural Discipline for Agents — design for cache hits and cross-provider cache economics
Static Content First to Maximize Cache Hits — order stable content first so the prefix caches
Exclude Dynamic System Prompt Sections — move per-machine context out so fleets share one cache entry
KV Cache Invalidation in Local Inference — attribution headers that silently break the KV cache

Right time — temporal routing¶

Route non-urgent work into cheaper capacity windows. Batch APIs are the concrete cost primitive: Anthropic's Message Batches and OpenAI's Batch API both run jobs asynchronously at a 50% discount, completing within 24 hours — typically under an hour for Anthropic (Anthropic — Message Batches; OpenAI — Batch API). Work that can wait — overnight evals, doc refreshes, bulk refactors, research passes — belongs in those windows.

Temporal Token Routing: Batch and Flex Tiers for Non-Urgent Work — the right-time decision: which workload class belongs in batch, flex, or the synchronous tier
Idle-Time Speculative Planning — use idle compute to pre-plan likely next steps
Background TODO Agent — defer non-urgent work to a background agent
Programmatic Cloud Agent Dispatch — schedule deferred agent runs into cheaper capacity

This axis is the freshest and least covered today — see the spin-off issues for deeper pages on eval-gated scheduling.

Effort and budget scaling¶

Spend reasoning compute in proportion to task difficulty, not uniformly.

Reasoning Budget Allocation — the reasoning sandwich: heavy planning and verification, light execution
Heuristic-Based Effort Scaling — encode effort rules in the system prompt
Per-Call Budget Hints on Tool Invocations — raise the cap on one dense, infrequent call
Per-Tool Extended Reasoning Opt-In — tool-call-scoped reasoning budgets

Small-model offload¶

Push verbose intermediate work to a cheaper model and return a compact result.

Specialized Small Language Models as Agent Sub-Tools — an SLM absorbs raw bytes; the orchestrator never sees them
Compositional Skill Routing — route across a large skill library without loading it all

Measurement and visibility¶

You cannot reduce what you do not measure — instrument spend before cutting it.

Token-Cost Profiling and Reduction for Always-On Agentic Workflows — the instrument-attribute-fix-verify loop
Cost-Quality Pareto Measurement for Agent Configurations — plot each configuration on the (cost, quality) frontier so quality-trading downgrades are visible
Code Cleanliness as an Agent Cost Lever — cleaner code cut token use 7-8% with no pass-rate change
Per-Plugin Token-Cost Attribution — attribute spend down to the plugin
BYOK Model Token Visibility — in-IDE token and context telemetry for BYOK routes

Concept Map — all site content grouped by theme
Context Engineering — the canonical home for lean-context techniques
Agent Design — routing, effort, and offload patterns live here
Token Preservation Backfire — the quality guardrail this section must respect