Skip to content

Token Engineering

Token engineering gets the same result for fewer, cheaper tokens — routing to the right model and trimming each call, without degrading output.

Cost is now a first-order constraint on agentic coding, but the techniques that control it are scattered — model routing, effort budgets, prompt compression, caching discipline, token-efficient output, small-model offload, batch scheduling. Token engineering is the name for that cluster: the deliberate practice of spending fewer expensive tokens at the wrong time, while holding output quality fixed.

It cuts across the site. The canonical treatment of each technique still lives in its home discipline — context engineering, agent design, tool engineering, observability. This section owns the pages whose primary subject is token cost, and crosswalks the rest under one frame so you can navigate "how do I cut token spend without losing quality?" as a single topic.

What token engineering is — and isn't

  • It is the optimisation goal: same task outcome, fewer/cheaper tokens. Every technique below carries an implicit "without degrading the end result" clause.
  • It is not context engineering. Context engineering decides what information enters the window for quality and reliability; token engineering is the cost-and-efficiency lens over those decisions. They overlap (lean context is cheaper) but answer different questions.
  • It is not generic cost-performance. The cost-performance tag spans latency, throughput, and infra; token engineering is specifically about the token as the unit of spend.
  • The quality constraint is the whole point. Cutting tokens can backfire — see Token Preservation Backfire, the guardrail every technique here must respect.

The crosswalk

The frame is four "rights" — the right model, the right token, the right cache, at the right time — plus three supporting levers (effort scaling, small-model offload, measurement).

Right model — routing

Send each task to the cheapest model and tier that still passes, escalating only on failure.

Right token — lean context and output

Shrink what each call has to carry, on both the input and output sides.

Right cache — caching discipline

Structure prompts so the cacheable prefix stays stable and hits.

Right time — temporal routing

Route non-urgent work into cheaper capacity windows. Batch APIs are the concrete cost primitive: Anthropic's Message Batches and OpenAI's Batch API both run jobs asynchronously at a 50% discount, completing within 24 hours — typically under an hour for Anthropic (Anthropic — Message Batches; OpenAI — Batch API). Work that can wait — overnight evals, doc refreshes, bulk refactors, research passes — belongs in those windows.

This axis is the freshest and least covered today — see the spin-off issues for deeper pages on eval-gated scheduling.

Effort and budget scaling

Spend reasoning compute in proportion to task difficulty, not uniformly.

Small-model offload

Push verbose intermediate work to a cheaper model and return a compact result.

Measurement and visibility

You cannot reduce what you do not measure — instrument spend before cutting it.

Feedback