Token-Cost Profiling and Reduction for Always-On Agentic Workflows

An instrument-attribute-fix-verify loop that turns recurring agentic workflows into a measurable cost surface, with named levers and frequency-weighted preconditions.

When the Loop Pays Back

The instrument-attribute-fix-verify loop is worth the engineering hours under three preconditions; outside them, "accept the API bill" is the rational default.

  • High-frequency runs. In GitHub's published case study, a 62% reduction on Auto-Triage Issues (6.8 runs/day) outweighed a 19% reduction on Daily Compiler Quality (once a day) in absolute dollars: frequency is the multiplier, not per-run cost (GitHub Blog: Improving token efficiency in GitHub Agentic Workflows).
  • Stable prompts and tool sets. Profiling before the workflow stabilises optimises a moving target; each prompt edit or MCP-server upgrade invalidates prior attribution. The public report from GitHub's token auditor for 2026-03-02 recorded a 37% aggregate drop alongside a per-run rise from 430–715K to 1.39M tokens: workflows kept getting more complex, swamping the optimisation.
  • Downstream measurement of output behaviour. Input-side optimisations that ignore output can backfire: in a pre-registered trial of prompt compression for task orchestration, aggressive compression at r=0.2 reduced input tokens 62% but raised total cost 1.8% because the model compensated with longer responses (Prompt Compression in Production Task Orchestration, 2026).

If a workflow runs less than once a day, the prompt is in active iteration, or there is no output-side metric, prefer the cheaper moves: switch to a cheaper model class for the obvious wins, enable prompt caching on the static prefix, and revisit when the workflow stabilises.

The Three Structural Costs

Always-on workflows accumulate three costs that aren't visible at the per-invocation level. Each fix in the loop targets one of these mechanisms.

| Cost mechanism | What it looks like | Lever that addresses it |
| --- | --- | --- |
| Tool-definition payload re-sent every turn | 5 MCP servers × 30 tools ≈ 30–60K tokens of metadata per turn, 15–30% of a 200K-token context (Junia AI: MCP Context Window Problem; upstream claude-code #20421) | Prune the manifest; load tools lazily |
| Deterministic data-gathering inside the LLM loop | gh issue view, label scans, diff retrieval: each requires an LLM round-trip to decide, call, and receive | Move to a pre-agentic CLI step that writes a workspace artifact |
| Frequency-multiplied small inefficiencies | A 5% per-run waste on 100 runs/day is 5 runs/day of pure overhead | Cost-weight every metric by runs/day before prioritising |

The Loop

flowchart LR
    A[Instrument] --> B[Attribute]
    B --> C[Fix]
    C --> D[Verify]
    D --> B

Layer 1: Instrument

Capture every API call in a normalized JSONL artifact regardless of agent framework. GitHub's implementation routes all provider traffic through an API proxy that writes a token-usage.jsonl per run containing input tokens, output tokens, cache-read tokens, cache-write tokens, model, provider, and timestamps (GitHub Blog). The proxy matters because each agent framework exposes usage in a different shape — a per-call schema lets one auditor read across them.
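A normalized per-call log of this kind could be read back along the following lines. This is a minimal sketch, not GitHub's implementation: the source lists which fields are captured but not the exact schema, so the key names used here (`input_tokens`, `output_tokens`, `cache_read_tokens`, `cache_write_tokens`, `model`, `provider`, `timestamp`) are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class UsageRecord:
    # Key names are hypothetical; the published description names the
    # captured fields but not the exact JSONL schema.
    model: str
    provider: str
    timestamp: str
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0


def parse_usage_lines(lines: Iterable[str]) -> Iterator[UsageRecord]:
    """Yield one UsageRecord per valid JSONL line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            raw = json.loads(line)
            yield UsageRecord(
                model=raw["model"],
                provider=raw["provider"],
                timestamp=raw["timestamp"],
                input_tokens=int(raw.get("input_tokens", 0)),
                output_tokens=int(raw.get("output_tokens", 0)),
                cache_read_tokens=int(raw.get("cache_read_tokens", 0)),
                cache_write_tokens=int(raw.get("cache_write_tokens", 0)),
            )
        except (json.JSONDecodeError, KeyError, TypeError, ValueError, AttributeError):
            # A bad line is dropped rather than failing the whole audit.
            continue
```

Passing an open file handle (for example `parse_usage_lines(open(path))`) works because a text file iterates line by line. Skipping malformed lines is a design choice for this sketch; a stricter auditor might count and report them instead.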

Where teams already run OpenTelemetry for AI Agent Observability, the same data lands on gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.operation.name with parent/child span trees attaching tool calls to the LLM call that triggered them (OpenTelemetry GenAI semantic conventions). OTel is the cross-vendor surface when Claude Code, Copilot, and Cursor run side by side.
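To illustrate how those attributes and the parent/child tree fit together, the sketch below summarizes one LLM-call span from already-exported span data. The three `gen_ai.*` attribute names come from the text above; the span dictionary shape (`span_id`, `parent_id`, `attributes`) is a hypothetical export format chosen for the example, not an OpenTelemetry SDK type.

```python
from collections import defaultdict
from typing import Any, Dict, List, Mapping, Sequence

INPUT_TOKENS = "gen_ai.usage.input_tokens"
OUTPUT_TOKENS = "gen_ai.usage.output_tokens"
OPERATION_NAME = "gen_ai.operation.name"

# Hypothetical exported-span shape: {"span_id", "parent_id", "attributes"}.
Span = Mapping[str, Any]


def index_children(spans: Sequence[Span]) -> Dict[str, List[Span]]:
    """Group spans by parent id so child spans can be attached to their parent."""
    children: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        parent_id = span.get("parent_id")
        if parent_id is not None:
            children[parent_id].append(span)
    return children


def summarize_call(span: Span, children: Mapping[str, List[Span]]) -> Dict[str, Any]:
    """Return token usage for one span plus the operations of its direct children."""
    attrs = span.get("attributes", {})
    triggered = children.get(span["span_id"], [])
    return {
        "operation": attrs.get(OPERATION_NAME),
        "input_tokens": int(attrs.get(INPUT_TOKENS, 0)),
        "output_tokens": int(attrs.get(OUTPUT_TOKENS, 0)),
        "child_operations": [
            child.get("attributes", {}).get(OPERATION_NAME) for child in triggered
        ],
    }
```

Only direct children are attached here; a deeper tree would need a recursive walk.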

Layer 2: Attribute

Raw token counts mislead because output tokens cost roughly 4× input and models differ in price per token. GitHub's Effective Tokens (ET) metric folds both effects into a single number:

ET = m × (1.0 × input + 0.1 × cache_read + 4.0 × output)
where m = 0.25 (Haiku), 1.0 (Sonnet), 5.0 (Opus)

The 4× output weight matches API pricing; the 0.1× cache-read weight matches the 90% discount on prompt-cache reads (GitHub Blog). The same auditor then aggregates by workflow, flags anomalous runs, and surfaces the most expensive jobs — the 2026-03-02 Daily Copilot Token Consumption Report tracks high-cost workflows, run frequency, process inefficiency, and operational overhead as four categories.
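The ET formula above translates directly into a small function. This is a sketch under two assumptions: the model-class keys (`"haiku"`, `"sonnet"`, `"opus"`) are labels chosen here for illustration, and `input_tokens` is taken to mean uncached input, since the formula weights cache reads separately. Cache-write tokens do not appear in the published formula and are therefore left out.

```python
# Multipliers and weights are the ones quoted in the ET formula above.
MODEL_MULTIPLIER = {"haiku": 0.25, "sonnet": 1.0, "opus": 5.0}

INPUT_WEIGHT = 1.0
CACHE_READ_WEIGHT = 0.1
OUTPUT_WEIGHT = 4.0


def effective_tokens(
    model_class: str,
    input_tokens: int,
    cache_read_tokens: int,
    output_tokens: int,
) -> float:
    """Compute ET = m * (1.0*input + 0.1*cache_read + 4.0*output)."""
    m = MODEL_MULTIPLIER[model_class]  # raises KeyError for unknown classes
    return m * (
        INPUT_WEIGHT * input_tokens
        + CACHE_READ_WEIGHT * cache_read_tokens
        + OUTPUT_WEIGHT * output_tokens
    )


# Illustrative numbers only: 1,000 input + 10,000 cache-read + 500 output
# tokens on the Sonnet class comes to about 4,000 ET (1,000 + 1,000 + 2,000).
example_et = effective_tokens("sonnet", 1_000, 10_000, 500)
```

The same call with `"opus"` scales the result by 5, which is why a model swap alone can move a workflow up or down the ranking.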

Prioritise by ET/run × runs/day, not ET/run. The published cuts on incremental indexing (70–80% projected) and CI Failure Doctor deduplication (40–60% projected) ranked highest precisely because both workflows ran many times a day (gh-aw Discussion #19197).
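The ranking rule can be sketched in a few lines. The workflow names and ET figures below are invented for illustration; only the run frequencies (6.8 and 1 per day) echo the case study.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WorkflowStats:
    name: str
    et_per_run: float
    runs_per_day: float

    @property
    def et_per_day(self) -> float:
        # Frequency-weighted cost: the quantity to prioritise by.
        return self.et_per_run * self.runs_per_day


def rank_by_daily_cost(workflows: Iterable[WorkflowStats]) -> List[WorkflowStats]:
    """Sort workflows by ET/run x runs/day, most expensive first."""
    return sorted(workflows, key=lambda w: w.et_per_day, reverse=True)


# Hypothetical ET figures: the once-a-day job costs more per run,
# but the frequent job costs more per day and therefore ranks first.
ranked = rank_by_daily_cost(
    [
        WorkflowStats("daily-quality-check", et_per_run=900_000, runs_per_day=1.0),
        WorkflowStats("issue-triage", et_per_run=200_000, runs_per_day=6.8),
    ]
)
```

Sorting on `et_per_run` alone would put the once-a-day job first, which is the mistake the rule guards against.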

Layer 3: Fix

Five levers, with the reductions reported in the GitHub case study where available:

  • MCP tool pruning. Tool manifests add 10–15 KB per turn even when unused. GitHub's Smoke Claude went 40 → 13 tools and dropped 59% combined with a Haiku swap (GitHub Blog). Cross-reference the manifest against the actual call log — if a tool never appears in token-usage.jsonl, it shouldn't be in the manifest. The tool-output-token-cost audit runbook gives the per-tool sizing heuristic.
  • Pre-agentic CLI substitution. Move deterministic reads out of the LLM loop. Auto-Triage saved 62% by running gh commands before the agent started and writing the result to a workspace file the agent read directly — no decide-call-receive round-trip (GitHub Blog).
  • Relevance gating. Skip the LLM entirely for inputs the workflow doesn't apply to. Security Guard dropped 43% by adding a cheap upstream check that bypasses the model for non-security PRs (GitHub Blog).
  • Cheaper-model routing for narrow steps. Per Cost-Aware Agent Design, steps that are cheap to validate start on a fast model and escalate to a stronger one only when a deterministic gate fails. Combine with prompt caching: cache writes cost 1.25× and cache reads 0.1× the base input price, so a 10K-token static prefix written once and reused 10 times costs 22,500 input-token equivalents versus 110,000 for 11 uncached sends, a reduction of about 79.5% (min prefix 1,024–4,096 tokens depending on model; 5-min TTL refreshed at no cost on each hit) (Anthropic Prompt Caching docs).
  • Configuration repair. One GitHub workflow hit a 64-turn fallback loop because bash patterns blocked the tool it needed (GitHub Blog). Misconfiguration shows up in the auditor as anomalously high per-run cost — investigate before optimising the average.
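The prompt-caching arithmetic in the list above can be checked with a short calculation. This sketch uses only the 1.25× write and 0.1× read multipliers quoted there and expresses cost in input-token equivalents. The function names are illustrative, and "reused N times" is read as one cache write followed by N cache reads.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # first send writes the prefix to the cache
CACHE_READ_MULTIPLIER = 0.1    # each later send reads it back


def cached_prefix_cost(prefix_tokens: int, reuses: int) -> float:
    """Cost of one cache write plus `reuses` cache reads, in input-token equivalents."""
    return prefix_tokens * (CACHE_WRITE_MULTIPLIER + CACHE_READ_MULTIPLIER * reuses)


def uncached_prefix_cost(prefix_tokens: int, reuses: int) -> float:
    """Cost of sending the prefix at full price on the first call and every reuse."""
    return float(prefix_tokens * (1 + reuses))


def caching_reduction(prefix_tokens: int, reuses: int) -> float:
    """Fractional saving from caching; negative when caching costs more."""
    uncached = uncached_prefix_cost(prefix_tokens, reuses)
    return 1.0 - cached_prefix_cost(prefix_tokens, reuses) / uncached


# The worked example from the text: a 10K-token prefix reused 10 times.
cached = cached_prefix_cost(10_000, 10)      # 22,500
uncached = uncached_prefix_cost(10_000, 10)  # 110,000
saving = caching_reduction(10_000, 10)       # about 0.795
```

With zero reuses the cached path costs 1.25× the uncached one, so caching only pays once the prefix is actually read back within the TTL.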

Layer 4: Verify

A fix that lowers input tokens but raises output tokens has not saved money. Re-run the workflow set after every change and confirm that ET trends down both at the workflow level and in aggregate. The pre-registered orchestration trial showed light compression (r=0.8) raising costs 14.1% from output expansion alone; without an output-side metric the regression is invisible (Prompt Compression in Production Task Orchestration).
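A before/after comparison of this kind might look like the sketch below. It assumes per-run ET and per-run input-token counts have already been collected for the same workflow on both sides of the change. The function name and the `hidden_regression` flag are illustrative, not part of any published auditor.

```python
from statistics import mean
from typing import Dict, Sequence


def compare_runs(
    before_et: Sequence[float],
    after_et: Sequence[float],
    before_input: Sequence[float],
    after_input: Sequence[float],
) -> Dict[str, object]:
    """Compare mean ET and mean input tokens before and after a fix."""
    et_change = (mean(after_et) - mean(before_et)) / mean(before_et)
    input_change = (mean(after_input) - mean(before_input)) / mean(before_input)
    return {
        "et_change": et_change,        # negative means the fix saved money
        "input_change": input_change,  # negative means fewer input tokens
        # Input went down but weighted cost went up: the case an
        # input-only metric would report as a win.
        "hidden_regression": input_change < 0 and et_change > 0,
    }
```

Comparing means is the simplest possible check; with few runs per side, the difference may be noise, so a longer window or a variance-aware test is the safer choice.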

GitHub closes the loop with two agentic workflows: a Daily Token Usage Auditor that aggregates and ranks, and a Daily Token Optimiser that reads the source plus recent logs and opens a GitHub issue proposing a specific fix (GitHub Blog). The optimiser is itself an always-on workflow, so apply the same preconditions before running it.

Triggers and Constraints

  • Auditor: daily schedule, read-only access to token-usage.jsonl archives and source workflow files. Authority bound to opening GitHub issues only — no write access to production workflow configs.
  • Optimiser: triggered by an auditor-flagged issue. May read logs and propose source changes as a PR. Authority bound to one PR per issue; merging is human-gated to catch regressions the optimiser cannot see (quality on the routed-cheap path, for example).
  • Proxy / OTel exporter: always-on alongside the workflow itself. Failure must not block the workflow — the loop tolerates missing data points, not blocked runs.
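The "must not block the workflow" constraint on the proxy or exporter can be expressed as a fail-open write. The sketch below is one way to do that, assuming a caller-supplied `write_line` callable (a hypothetical name) that appends one line to wherever usage records are stored.

```python
import json
import logging
from typing import Callable, Mapping

logger = logging.getLogger("token_usage")


def record_usage_safely(
    write_line: Callable[[str], None], record: Mapping[str, object]
) -> bool:
    """Try to persist one usage record; never raise into the calling workflow.

    Returns True if the record was written, False if it was dropped.
    """
    try:
        write_line(json.dumps(dict(record)))
        return True
    except Exception:
        # Fail open: a missing data point is acceptable, a blocked run is not.
        logger.warning("dropped token-usage record", exc_info=True)
        return False
```

Catching the broad `Exception` is deliberate here, since any telemetry failure (serialization, disk, network) should cost a data point rather than a run. The boolean return lets the caller count dropped records so the auditor can tell sparse data from low usage.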

Why It Works

The three structural costs are invisible inside one run and only emerge against aggregated history; the proxy, normalized log, and ET metric close that attribution gap. Each named lever targets a specific cost mechanism, following the same just-in-time-loading and stable-prefix-reuse pattern that the broader context-engineering literature describes for long-running agents, applied at the workflow loop rather than the per-call boundary (Anthropic: Effective Context Engineering; Anthropic Prompt Caching). The loop converges because the optimiser closes the same data path the auditor opened: any regression surfaces on the next day's report.

When This Backfires

The three preconditions above name the dominant failure modes; four additional traps surface during execution.

  • Cheaper-model routing without a quality gate. Smoke Claude saved 59% with a Haiku swap, but if the task starts failing, retries plus human triage exceed the saving. Pair every routing change with a deterministic check — see Cost-Aware Agent Design for the cascade-and-validate pattern.
  • Sparse data, noisy attribution. The auditor's anomaly detection needs enough runs per workflow to separate genuine waste from variance. On workflows with fewer than ~30 runs per week, anomaly flags are likely false positives — increase the aggregation window or skip the workflow.
  • Tool-pruning past the floor. Removing tools the agent actually needs causes failures or wrong-tool selection from a similar-named remaining set (Junia AI: MCP Context Window Problem). Drive pruning from the actual call log, not intuition about what "should" be unused.
  • Frontier-model price compression. Provider prices fall meaningfully year over year; this year's 19% saving may be smaller than next year's price drop. The loop pays back when the workflow set is large enough that even price-adjusted savings dominate engineering cost.
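The call-log-driven pruning recommended in the list above can be sketched as a cross-reference between the tool manifest and the tools that actually appear in the log. The tool names and the `min_calls` threshold in this example are hypothetical; how tool calls are extracted from the log depends on the schema in use.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def prune_candidates(
    manifest_tools: Sequence[str],
    called_tools: Iterable[str],
    min_calls: int = 1,
) -> Tuple[List[str], List[str], List[str]]:
    """Split the manifest into tools to keep and tools that are candidates to drop.

    Also returns tools that were called but are missing from the manifest,
    which usually points at a logging or configuration mismatch.
    """
    counts = Counter(called_tools)
    keep = [tool for tool in manifest_tools if counts[tool] >= min_calls]
    drop = [tool for tool in manifest_tools if counts[tool] < min_calls]
    unknown = sorted(set(counts) - set(manifest_tools))
    return keep, drop, unknown


# Hypothetical manifest and call log for illustration.
keep, drop, unknown = prune_candidates(
    manifest_tools=["issue_read", "issue_label", "repo_search", "wiki_edit"],
    called_tools=["issue_read", "issue_label", "issue_read", "pr_diff"],
)
```

Treat the `drop` list as candidates rather than a verdict: a tool used only on rare inputs may not show up in a short log window, which is the "pruning past the floor" risk described above.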

Key Takeaways

  • The loop pays back only on high-frequency, stable workflows with output-side measurement; below that bar, accept the API bill.
  • Three structural costs drive always-on workflow spend: tool-definition payload, deterministic LLM round-trips, and frequency-multiplied small inefficiencies.
  • Prioritise by ET/run × runs/day, not ET/run — frequency is the multiplier in every published case.
  • Verify every fix against a metric that includes output tokens; input compression that ignores output regresses cost.

Multi-tool Coverage

The instrumentation surface differs by tool; the loop is tool-agnostic.
