Delegation Threshold Calibration for Orchestrator Agents
Calibrate when an orchestrator hands work to a sub-agent versus finishes it inline — handoff cost and review tax can swamp the parallelism gain.
The delegation threshold is the orchestrator's standing rule for when a sub-task is worth handing off. Raise it and the orchestrator finishes more work directly; lower it and it fans out earlier. Moving it is never free: every handoff carries fixed coordination cost and a review tax on whatever the sub-agent returns, and Anthropic reports that multi-agent systems use about 15× more tokens than chats. Delegation pays back only when the parallelism or context-isolation gain exceeds those costs. Calibrating the threshold is a separate concern from dispatch mechanics (see async non-blocking sub-agent dispatch) and from the prior human-to-agent decision (see the delegation decision).
The cost side of delegation
Every sub-agent invocation adds three costs the orchestrator does not pay if it handles the work inline:
- Coordination overhead — context packing, output unpacking, and the wait between dispatch and return. GitHub's Copilot CLI team puts it directly: "every handoff adds coordination overhead, tool calls, and wait time".
- Token cost — Anthropic reports that "multi-agent systems use about 15× more tokens than chats" on their research workload. That multiplier is the cost of duplicating context across sub-agents and producing summaries the orchestrator can re-ingest.
- Review tax — every sub-agent output needs review, the same fixed cost the delegation decision names. Lowering the threshold multiplies the review tax; raising it spreads it across fewer, larger units.
Tran and Kiela's equal-budget study attributes the cost to the Data Processing Inequality: under fixed reasoning-token budgets, splitting a task across context-isolated sub-agents partitions the available information, and the partition is strictly lossy. Apparent multi-agent wins, they argue, often come from uncontrolled compute and context inflation rather than architectural superiority.
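The trade-off described in this section can be sketched as a break-even check. This is a minimal illustration under stated assumptions, not a measured model: `HandoffEstimate`, its fields, and both functions are hypothetical names, and every cost is an estimate the caller supplies in one consistent unit (for example seconds of wall-clock time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffEstimate:
    """Hypothetical per-sub-task estimates, all in the same unit (e.g. seconds)."""
    inline_cost: float          # cost if the orchestrator does the work itself
    coordination_cost: float    # context packing, output unpacking, wait time
    subagent_cost: float        # the sub-agent's own execution cost
    review_cost: float          # reviewing whatever the sub-agent returns
    overlap_gain: float = 0.0   # orchestrator work completed in parallel meanwhile


def delegation_cost(e: HandoffEstimate) -> float:
    """Net cost of delegating: everything paid, minus work overlapped in parallel."""
    return e.coordination_cost + e.subagent_cost + e.review_cost - e.overlap_gain


def delegation_pays_back(e: HandoffEstimate) -> bool:
    """Delegate only when the net delegated cost is strictly below the inline cost."""
    return delegation_cost(e) < e.inline_cost
```

With `overlap_gain` left at zero the handoff behaves like a pause, and delegation only wins if the sub-agent is much cheaper than the orchestrator. Token cost is deliberately left out of this sketch; a caller who wants to account for it would need a separate budget check.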
Signals that should move the threshold
GitHub's framing after their tuning project is operational: "keep simple discovery-and-edit tasks in the main agent, and reserve subagents for work that is broader, cross-cutting, or naturally parallelizable". That collapses to four signals you can read off a candidate sub-task.
| Signal | Lower the threshold (delegate more) | Raise the threshold (handle inline) |
|---|---|---|
| Parallelisability | Independent sub-tasks run concurrently | Sequential — sub-task B needs A's output |
| Isolatability | Context fits in one handoff, returns one self-contained answer | Hidden state shared with siblings or the orchestrator |
| Context already present | Orchestrator lacks the relevant files or facts | Orchestrator's context already holds the answer |
| Review cost vs execution cost | Verifying the sub-agent's output is cheaper than redoing it | Quicker to read the file and act than to write a handoff and verify a reply |
The "context already present" row matters because GitHub's tuning project found "overuse of exploration subagents when the handoff already contains enough context" as a top symptom of over-delegation. The same source flags "sequential delegation, where the main agent waits for a subagent instead of treating delegation as an opportunity for parallel work" — sub-agents should be "a parallelism tool, not a pause button".
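The four signals in the table can be read as a small predicate. The sketch below is one possible encoding, not GitHub's implementation: `SubTaskSignals` and `should_delegate` are hypothetical names, and combining the signals with a strict conjunction is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubTaskSignals:
    """Hypothetical yes/no readings of the four signals for one candidate sub-task."""
    parallelisable: bool            # can run concurrently with other work
    isolatable: bool                # fits in one handoff, returns one self-contained answer
    context_already_present: bool   # orchestrator already holds the relevant files or facts
    review_cheaper_than_redo: bool  # verifying the output costs less than redoing it


def should_delegate(s: SubTaskSignals) -> bool:
    """Assumed rule: context already in hand vetoes delegation; otherwise all signals must agree."""
    if s.context_already_present:
        return False  # re-delegating a question the orchestrator can already answer
    return s.parallelisable and s.isolatable and s.review_cheaper_than_redo
```

Treating "context already present" as a veto mirrors the over-delegation symptom quoted above; a softer design could weight the signals instead of requiring all of them.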
Anthropic's coarse scaling rule sets a useful baseline: "Simple fact-finding requires just 1 agent with 3-10 tool calls, direct comparisons might need 2-4 subagents with 10-15 calls each, and complex research might use more than 10 subagents". Fact-finding sits below the threshold; complex research sits above it.
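The quoted scaling rule can be written down as a lookup table. This is a sketch only: the task-kind keys, `EffortBudget`, and `should_fan_out` are hypothetical names, the numbers are taken from the quote, and values the quote does not state are left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class EffortBudget:
    agents: Tuple[int, Optional[int]]           # (min, max); max None means open-ended
    calls_per_agent: Optional[Tuple[int, int]]  # None means the quote gives no range


# Numbers come from the quoted rule; the key names are illustrative.
SCALING_RULE = {
    "fact_finding": EffortBudget(agents=(1, 1), calls_per_agent=(3, 10)),
    "direct_comparison": EffortBudget(agents=(2, 4), calls_per_agent=(10, 15)),
    # "more than 10 subagents": minimum of 11, no stated upper bound or call range.
    "complex_research": EffortBudget(agents=(11, None), calls_per_agent=None),
}


def should_fan_out(task_kind: str) -> bool:
    """A task kind sits above the threshold when the rule allots more than one agent."""
    return SCALING_RULE[task_kind].agents[0] > 1
```

Classifying a real task into one of these kinds is the hard part and is not attempted here.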
Structural preconditions
The threshold is the wrong primitive until two structural conditions hold, both drawn from Cognition's follow-up on what actually works in multi-agent systems:
- Writes stay single-threaded. Cognition found multi-agent systems work in practice only when "writes stay single-threaded" and additional agents contribute intelligence rather than actions. If sibling sub-agents can both write, threshold calibration will not save you.
- Context is shared, not split. Cognition's original critique still applies: "every time you split a task across multiple agents, you introduce a communication boundary and context gets lost between handoffs". Share the full agent trace, not a summary; if you cannot, raise the threshold sharply.
Without these, lowering the threshold amplifies error rather than throughput. DeepMind's 180-configuration sweep across five architectures and three model families found "unstructured multi-agent networks amplify errors up to 17.2×" compared with a single-agent baseline. Threshold tuning operates above this floor; it does not substitute for it.
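The two preconditions can be checked before any threshold logic runs. The following is a minimal sketch; `TopologyCheck` and `unmet_preconditions` are hypothetical names, and how a real system would detect concurrent writers or trace sharing is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TopologyCheck:
    """Hypothetical description of the agent topology under consideration."""
    concurrent_writers: int   # number of agents allowed to write at the same time
    shares_full_trace: bool   # sub-agents receive the full agent trace, not a summary


def unmet_preconditions(t: TopologyCheck) -> List[str]:
    """Return the structural preconditions that do not hold (empty list means both hold)."""
    unmet = []
    if t.concurrent_writers > 1:
        unmet.append("writes are not single-threaded")
    if not t.shares_full_trace:
        unmet.append("context is split rather than shared")
    return unmet
```

A caller might skip threshold tuning entirely when the list is non-empty, or raise the threshold sharply when only the shared-context condition fails.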
Why It Works
The mechanism behind the cost is information-theoretic. Tran and Kiela attribute it to the Data Processing Inequality: under fixed reasoning-token budgets, partitioning a task across context-isolated sub-agents is "strictly lossy", so single-agent systems are more information-efficient when budgets are normalised. Calibrating the threshold is how the orchestrator stays on the right side of that trade — delegate only when parallelism gain or context-isolation gain visibly exceeds the lossy partition.
The production evidence is GitHub's A/B test on Copilot CLI after raising the delegation threshold and tightening handoff criteria: "a 23% reduction in tool failures per session, a 27% reduction in search tool failures, an 18% reduction in edit tool failures, a 5% improvement in P95 wait time", with "no quality regression". The trajectory analysis adds "15% reduction in failed subagent search calls" and "12% lower average subagent LLM duration per user". The win came from removing low-benefit handoffs, not from removing delegation.
When This Backfires
- Stateful coupling between sibling sub-tasks. If delegated work shares hidden state — open files, environment, intermediate decisions — the orchestrator pays handoff cost twice and still produces conflicting intermediate decisions. Cognition's "context gets lost between handoffs" critique applies in full.
- Small total task budget. When the total work takes under a minute, the fixed handoff overhead described in GitHub's tuning post dominates any parallel speedup, and the orchestrator is faster doing the work inline.
- Context already loaded in the orchestrator. "Overuse of exploration subagents when the handoff already contains enough context" was a top symptom in GitHub's tuning post — re-delegating a question whose answer is already in the orchestrator's context is pure overhead.
- Equal-budget regime where the orchestrator is competent. Tran and Kiela show that under matched token budgets, single-agent systems match or beat multi-agent on multi-hop reasoning across Qwen3, DeepSeek-R1-Distill-Llama, and Gemini 2.5. Lowering the threshold here gives away the equal-budget single-agent advantage.
- Unstructured fan-out without a centralised synthesis plane. DeepMind's 17.2× error amplification finding holds for flat-topology delegation. Lowering the threshold without a verifying orchestrator amplifies error, not throughput.
Example
GitHub described their Copilot CLI tuning in operational terms: symptoms first, then the threshold change, then the measured outcome. The symptoms were five:
- "unnecessary handoffs for simple tasks that the main agent could complete faster on its own"
- "overuse of exploration subagents when the handoff already contains enough context"
- "repeated or overlapping searches across the main agent and subagents"
- "sequential delegation, where the main agent waits for a subagent instead of treating delegation as an opportunity for parallel work"
- "failure-prone subagent paths, including stale file paths, moved files, incorrect relative paths, and workspace mismatches"
The threshold change was a single rule, restated for the orchestrator's planner: keep simple discovery-and-edit tasks in the main agent; reserve sub-agents for work that is broader, cross-cutting, or naturally parallelisable; treat handoffs as opportunities for parallelism, not as pauses. Every handoff carries a specification: "what the user asked, what is already known, what the subagent owns".
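The three-part handoff specification can be represented as a small record. This is an illustrative sketch, not GitHub's format: `HandoffSpec`, its field names, and the rendered layout are assumptions; only the three parts themselves come from the quote.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HandoffSpec:
    """Hypothetical container for the three parts of a handoff named in the quote."""
    user_request: str         # what the user asked
    already_known: List[str]  # what is already known
    subagent_owns: str        # what the subagent owns


def render_handoff(spec: HandoffSpec) -> str:
    """Render the spec as plain text for the sub-agent's prompt (layout is illustrative)."""
    known = "\n".join(f"- {fact}" for fact in spec.already_known) or "- (nothing yet)"
    return (
        f"User asked: {spec.user_request}\n"
        f"Already known:\n{known}\n"
        f"You own: {spec.subagent_owns}"
    )
```

Passing along what is already known is what lets the sub-agent avoid repeating searches the main agent has already run.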
The measured outcome on the A/B was "a 23% reduction in tool failures per session, a 27% reduction in search tool failures, an 18% reduction in edit tool failures, a 5% improvement in P95 wait time" with no quality regression — a production result for raising the threshold once the symptom set was diagnosed.
Key Takeaways
- The delegation threshold is the orchestrator's calibration of when fan-out pays back — raise it when the orchestrator already holds context or the work is sequential, lower it when the work is broader, cross-cutting, or naturally parallelisable (GitHub Blog, 2026).
- Handoff cost is real and measurable: coordination overhead, an Anthropic-reported multiplier of about 15× tokens for multi-agent systems relative to chats, and a review tax. The threshold therefore sits above a fixed cost floor.
- Calibration only matters above structural preconditions: single-threaded writes and shared context. Without them, lowering the threshold can amplify error up to 17.2×.
- This page sits above async non-blocking sub-agent dispatch (mechanics) and beside the delegation decision (human-to-agent) — orchestrator-to-sub-agent threshold tuning is the missing layer.