Skip to content

Over-Orchestrated Agent Architecture (Prefer the Simplest That Works)

Multi-agent architecture adopted before a single loop is tried trades reliability for moving parts; handoff context loss usually costs more than coordination buys.

Over-orchestration is the per-task design error of reaching for an orchestrator, sub-agents, or a multi-agent topology before the simplest loop that meets the requirement has been tried and found wanting. The default should invert: start with one agent holding the full context, and add structured complexity only on the narrow class of tasks where the benchmarks show it pays back its token cost (Anthropic Engineering, Building Effective Agents).

When Multi-Agent Earns Its Complexity

Three boundary conditions justify the orchestration cost. Outside these, default to a single loop.

  • Breadth-first parallel research with multiple independent sub-queries that genuinely fit a fan-out. Anthropic's research multi-agent system outperformed single-agent Claude Opus 4 by 90.2% on their internal eval — but only on breadth-first queries pursuing independent directions simultaneously, and at roughly 15× the token cost of chat (Anthropic Engineering, Multi-Agent Research System).
  • Information exceeds a single context window so parallel sub-agents are buying capacity the monolith cannot. Anthropic names this as a precondition: multi-agent excels at "tasks that involve heavy parallelization, information that exceeds single context windows, and interfacing with numerous complex tools" (Anthropic Engineering, Multi-Agent Research System).
  • Evaluator-optimizer loops with explicit pass/fail criteria where a separate critic measurably tightens output. Anthropic's foundational guide recommends adding this kind of structured complexity "only when it demonstrably improves outcomes" (Anthropic Engineering, Building Effective Agents).

Coding is named as a poor fit even by the multi-agent advocates: "most coding tasks involve fewer truly parallelizable tasks than research" (Anthropic Engineering, Multi-Agent Research System).

The Pattern

A team has a working single-agent loop or has not yet built one, and adds an orchestrator with sub-agents for triage, planning, retrieval, and execution because the design feels more "production-grade". Each sub-agent owns part of the task and hands context forward through a summary. Coordination logic, retries, and message schemas accumulate. The system passes its happy-path demos and then fails unpredictably in production, usually with the symptom that one sub-agent acts on stale or incomplete context from another.

The architecture is often shaped by org structure rather than task structure. As Sierra's Zack Reneau-Wedeen puts it: "If you want a multi-agent system so that one team can work on one agent and one team can work on another agent, then you're shipping your org chart" (LangChain Blog, 2026-06-25).

Why It Works

Multi-agent fails on most tasks because every handoff loses the context the receiving agent did not see. The receiver has to reconstruct it (expensive and lossy) or act on a degraded picture (incorrect). Cognition's "Don't Build Multi-Agents" post names the mechanism: in naive multi-agent setups "sub-agents have no context of each other's work", and "failure generally boils down to missing context within the system" (Cognition, Don't Build Multi-Agents, June 2025).

Compute-equalised benchmarks confirm the mechanism. With reasoning-token budget held constant, single-agent systems "consistently match or outperform" multi-agent ones on multi-hop reasoning across the Qwen3, DeepSeek-R1, and Gemini 2.5 families; the earlier multi-agent gains reflect "unaccounted computation and context effects rather than inherent architectural benefits" (Tran & Kiela, 2026, arxiv:2604.02460). Shopify's ICML 2025 Sidekick lessons reach the same conclusion from production: "avoid multi-agent architectures early — simple single-agent systems can handle more complexity than you might expect" (Shopify Engineering, 2025).

Signals of Over-Engineering

  • Sub-agents that mostly hand off. Each step is "summarise and pass forward" with no distinct capability that the lead agent lacks.
  • Coordination code outweighs task code. Message schemas, retry policy, and arbitration take more lines than the work the system does.
  • The topology matches the org chart. Each agent maps to a team or vendor rather than a task that genuinely decomposes.
  • Variance dominates accuracy. Across 260 multi-agent configurations and six benchmarks, the same architecture swings from +80.8% on decomposable financial reasoning to -70.0% on sequential planning (Kim et al., 2025, arxiv:2512.08296) — a configuration that wobbles by 150 points across tasks is being run on the wrong task.

When This Backfires

  • Genuinely breadth-first research at production scale. Inside the Anthropic envelope (independent sub-queries, breadth-first, information exceeding one context window), defaulting to single-agent leaves the documented 90.2% accuracy lift on the table (Anthropic Engineering, Multi-Agent Research System).
  • Decomposable parallel work above the coordination threshold. Multi-agent topologies show measured gains specifically on decomposable financial reasoning, multi-vendor extraction, and batch processing (Kim et al., 2025, arxiv:2512.08296).
  • Single-agent baselines already low. Kim et al. observe that coordination yields diminishing returns once single-agent baselines exceed roughly 45% on the benchmarks studied — below that threshold, the multi-agent topology can earn its overhead.
  • Evaluator-optimizer loops with measurable criteria. When iterative refinement provides "measurable value" against explicit pass/fail criteria, separating critic from executor improves outcomes (Anthropic Engineering, Building Effective Agents).

Example

Before — orchestrator-shaped by the org chart, not the task:

Customer-support agent system, v2:
- Triage agent: classifies intent, summarises into a TaskPlan
- Knowledge agent: retrieves docs against the TaskPlan summary
- Task agent: executes the action, given the TaskPlan and retrieved docs
- Response agent: composes the final reply from the task outcome
4 LLM calls minimum per turn. Each agent runs from a summary written
by the previous one.

Three teams own three agents. A bug reported as "task agent ignores customer's stated time zone" turns out to be triage agent dropping the time zone from the summary. The task agent never saw it.

After — one agent holds the conversation, tools do the specialised work:

Customer-support agent system, v3:
- Single agent per brand: full conversation history, the brand voice,
  every available tool (lookup, action, escalation)
- Tools: docs.search(), order.update(), human.escalate(), ...
1 LLM call per turn. The agent sees the whole conversation; no summary
sits between it and the customer's stated time zone.

Sierra runs this architecture across Fortune-20 customer agents and frames it explicitly: "the agent is the brand's voice and knows the full customer history, the full context of the conversation, and the full set of things it can do" (LangChain Blog, 2026-06-25).

Key Takeaways

  • Default to the simplest loop; reach for multi-agent only on breadth-first parallel research, tasks that exceed a single context window, or evaluator-optimizer loops with measurable criteria.
  • Context loss at every handoff is the mechanism — sub-agents acting on summaries instead of the underlying conversation produce internally consistent but wrong output.
  • A topology that mirrors the org chart is the signal that org structure, not task structure, shaped the architecture.
  • Compute-equalised benchmarks find single-agent matches or beats multi-agent on most reasoning tasks; pre-equalised wins reflect unaccounted compute, not architecture.
Feedback