
Cross-Component Interference in Agent Scaffolds

Stacking planning, memory, retrieval, and self-reflection on top of tool use rarely wins: a full-factorial study shows the maximally-equipped agent losing to smaller subsets, with planning and memory the worst offenders.

The Default That Loses

Liu (2026) ran a full factorial over all 32 subsets of {Planning, Tools, Memory, Self-Reflection, Retrieval} on HotpotQA, GSM8K, and SWE-bench Lite. The "All-In" agent bundling every component is consistently suboptimal:

  • HotpotQA at 8B: single-tool agent beats All-In by 32% (F1 0.233 vs 0.177, p=0.023).
  • GSM8K: a 3-component subset beats All-In by 79% (0.43 vs 0.24, p=0.010).
  • 30-50% of larger configurations underperform smaller subsets.
  • Submodularity is violated in 56.3% of cases, so greedy "add until the marginal gain turns negative" selection has no guarantee of finding the best subset.
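To make the full-factorial idea concrete, the sketch below enumerates all 32 subsets of five components and counts how often diminishing returns fail. The component labels, `toy_score`, and every number in it are invented for illustration; they are not data from Liu (2026), and the study's own violation metric may be defined differently.

```python
from itertools import combinations

COMPONENTS = ("planning", "tools", "memory", "reflection", "retrieval")

def all_subsets(items):
    """Yield every subset of items as a frozenset (2**len(items) in total)."""
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def submodularity_violation_rate(score, items, tol=1e-12):
    """Fraction of (S, T, x) triples, S a proper subset of T and x outside T,
    where adding x helps the larger set T more than the smaller set S."""
    subsets = list(all_subsets(items))
    violations = checked = 0
    for s in subsets:
        for t in subsets:
            if not s < t:  # keep proper subsets only
                continue
            for x in items:
                if x in t:
                    continue
                gain_small = score[s | {x}] - score[s]
                gain_large = score[t | {x}] - score[t]
                checked += 1
                if gain_large > gain_small + tol:
                    violations += 1
    return violations / checked

def toy_score(subset):
    """Made-up scores for illustration only, NOT results from the study."""
    value = 0.10
    if "tools" in subset:
        value += 0.12
    value -= 0.02 * len(subset - {"tools"})  # flat context cost per extra component
    if {"tools", "reflection", "retrieval"} <= subset:
        value += 0.05  # one synergistic triple
    return value

scores = {s: toy_score(s) for s in all_subsets(COMPONENTS)}
rate = submodularity_violation_rate(scores, COMPONENTS)
best = max(scores, key=scores.get)
print(f"{len(scores)} configurations, violation rate {rate:.1%}, best = {sorted(best)}")
```

In this toy table a greedy search that starts from tools alone stops immediately, because every single addition lowers the score, and it never reaches the better three-component subset. That is the kind of failure a submodularity violation allows.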

Worst Offenders

Per-component disruption rate across cross-component interference (CCI) tasks (Liu, 2026):

| Component | Disrupts CCI tasks | Shapley value |
|---|---|---|
| Planning | 84% | -0.029 (95% CI [-0.055, -0.003]), significantly negative |
| Memory | 68% | -0.016 on HotpotQA |
| Retrieval | 68% | task-dependent |
| Self-Reflection | 58% | task-dependent |
| Tool Use | (not given) | captures 70% of total scaffold value |

Planning and memory are suspect by default. Tool use is the only component that pays for itself across tasks.
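For readers unfamiliar with Shapley values, the following sketch computes exact per-component values from a complete score table, which a full factorial provides. The `toy_score` function and its coefficients are hypothetical and purely additive, so the expected answers are easy to check by eye; nothing here reproduces the paper's numbers or its estimation procedure.

```python
from itertools import combinations
from math import factorial

COMPONENTS = ("planning", "tools", "memory", "reflection", "retrieval")

def shapley_values(score, players):
    """Exact Shapley value per player, given a score for every subset."""
    n = len(players)
    values = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for r in range(n):
            # Weight of a coalition of size r that does not contain p.
            weight = factorial(r) * factorial(n - r - 1) / factorial(n)
            for combo in combinations(others, r):
                s = frozenset(combo)
                total += weight * (score[s | {p}] - score[s])
        values[p] = total
    return values

def toy_score(subset):
    """Hypothetical additive scores, NOT data from the study."""
    effects = {"tools": 0.12, "planning": -0.03, "memory": -0.02}
    return 0.10 + sum(effects.get(c, 0.0) for c in subset)

scores = {
    frozenset(combo): toy_score(frozenset(combo))
    for r in range(len(COMPONENTS) + 1)
    for combo in combinations(COMPONENTS, r)
}
phi = shapley_values(scores, COMPONENTS)
for name, value in sorted(phi.items(), key=lambda kv: kv[1]):
    print(f"{name:<12}{value:+.3f}")
```

A negative value means the component lowers the score on average across all the orders in which it could be added, which is how the Planning row in the table should be read.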

Why It Happens

Components share one substrate: the model's context window and attention budget. Each injects its own tokens — planning traces, retrieved passages, reflection notes, memory excerpts — competing for attention with task-relevant content. Same mechanism as attention dilution.

A main-effects model fits the results with R^2=0.916, beating pairwise interaction models (Liu, 2026), so most of the damage is per-component context cost rather than destructive pairs. One positive triple exists (Tool Use + Self-Reflection + Retrieval), so interactions do occur, but they are the exception.
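A main-effects fit can be reproduced on any complete on/off table without a statistics library. In a balanced full factorial the ±1-coded columns are orthogonal, so each least-squares coefficient is just a signed average. The sketch below assumes three hypothetical components and invented scores. It shows R^2 equal to 1 for a purely additive table and slightly below 1 once a single synergistic triple is added.

```python
from itertools import product

def main_effects_fit(score, k):
    """Least-squares additive fit on a full 2**k factorial.

    score maps each tuple of k 0/1 flags to a measured value.
    Returns (coefficients under +/-1 coding, R^2 of the additive model).
    """
    configs = list(product((0, 1), repeat=k))
    n = len(configs)
    grand_mean = sum(score[c] for c in configs) / n

    def sign(flag):
        return 1 if flag else -1

    # Orthogonal design: each coefficient is an independent signed average.
    betas = [sum(score[c] * sign(c[i]) for c in configs) / n for i in range(k)]

    def predict(c):
        return grand_mean + sum(b * sign(c[i]) for i, b in enumerate(betas))

    ss_res = sum((score[c] - predict(c)) ** 2 for c in configs)
    ss_tot = sum((score[c] - grand_mean) ** 2 for c in configs)
    return betas, 1.0 - ss_res / ss_tot

def additive(c):
    """Hypothetical scores with per-component effects only."""
    tools, planning, memory = c
    return 0.20 + 0.10 * tools - 0.03 * planning - 0.02 * memory

def with_triple(c):
    """Same table plus a bonus when all three components are on."""
    return additive(c) + (0.04 if all(c) else 0.0)

configs = list(product((0, 1), repeat=3))
betas_add, r2_add = main_effects_fit({c: additive(c) for c in configs}, 3)
betas_int, r2_int = main_effects_fit({c: with_triple(c) for c in configs}, 3)
print(f"additive table: R^2 = {r2_add:.3f}")
print(f"with one triple: R^2 = {r2_int:.3f}")
```

The effect of switching a component on equals twice its coefficient under this coding. A high R^2 from such a fit is what supports reading the damage as mostly additive.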

graph TD
    A[Add a component] --> B[More tokens injected]
    B --> C[Attention budget split]
    C --> D[Less weight on task-critical content]
    D --> E[Performance drops below smaller subset]

Scale Qualifies, Does Not Eliminate

The All-In gap shrinks with model strength — 32% at 8B, 19% at 70B, ~0% at Claude Haiku — but All-In still never beats the best subset at any tested scale (Liu, 2026). Frontier models tolerate over-stacking; they do not benefit from it.

The scaffold is the dominant lever, which makes it the dominant way to lose: harness changes alone swing Terminal Bench 2.0 by 14 points with no model swap (LangChain), and on SWE-bench Pro the scaffold drives a 22+ point swing versus ~1 point for model swaps (particula.tech).

The optimal component count k* varies by task: k*=1 on HotpotQA, k*=3 on GSM8K. There is no universal right number.
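The gap percentages and per-task optima above follow from simple arithmetic on a results table. The sketch below shows one way to compute them. `relative_gap` is applied to the reported HotpotQA and GSM8K scores, while the `results` table, its task names, and its scores are hypothetical placeholders for whatever an ablation run produces.

```python
def relative_gap(best_score, all_in_score):
    """Relative improvement of the best subset over the All-In agent."""
    return (best_score - all_in_score) / all_in_score

def best_subset_per_task(results):
    """For each task, find the top-scoring subset, its size k*, and the All-In gap."""
    summary = {}
    for task, scores in results.items():
        best = max(scores, key=scores.get)
        all_in = max(scores, key=len)  # the largest subset is the All-In agent
        summary[task] = {
            "best": best,
            "k_star": len(best),
            "gap": relative_gap(scores[best], scores[all_in]),
        }
    return summary

# Reported figures from the text: HotpotQA at 8B and GSM8K.
hotpot_gap = relative_gap(0.233, 0.177)
gsm8k_gap = relative_gap(0.43, 0.24)

ALL_IN = frozenset({"planning", "tools", "memory", "reflection", "retrieval"})

# Hypothetical partial results table, NOT data from the study.
results = {
    "retrieval_like": {
        frozenset({"tools"}): 0.30,
        frozenset({"tools", "retrieval"}): 0.27,
        ALL_IN: 0.22,
    },
    "math_like": {
        frozenset({"tools"}): 0.25,
        frozenset({"tools", "reflection", "retrieval"}): 0.40,
        ALL_IN: 0.24,
    },
}
summary = best_subset_per_task(results)
for task, row in summary.items():
    print(f"{task}: k*={row['k_star']}, All-In gap={row['gap']:.0%}")
```

Keeping the table keyed by task makes it obvious when no single subset is best everywhere.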

When Over-Stacking Is Defensible

  • Frontier model, no ablation budget: when a 32-cell ablation is infeasible, ship All-In and prune later; the gap is small at Haiku scale and above.
  • Heterogeneous task distributions — traffic mixing math-like (k=3) and retrieval-like (k=1) tasks cannot be served by one fixed minimal subset; per-task routing may dominate.
  • Binary failure mode — if missing a component means task impossible rather than suboptimal, keep it even at average performance cost.

These are exceptions. The default failure mode is scaffold inflation that nobody measured.

Example

Before — maximally-equipped HotpotQA agent at 8B:

# All-In: planning + tools + memory + self-reflection + retrieval
agent = Agent(
    model="llama-3.1-8b",
    components=[Planner(), Tools(), Memory(), SelfReflection(), Retrieval()],
)
# F1 = 0.177 on HotpotQA (Liu, 2026, Table 2)

After — single-component agent on the same task:

# Tools-only — beats All-In by 32%
agent = Agent(
    model="llama-3.1-8b",
    components=[Tools()],
)
# F1 = 0.233 on HotpotQA, p=0.023 vs All-In (Liu, 2026)

Four components were removed (Planning, Memory, Self-Reflection, Retrieval) and F1 rose 32%. The win is not a clever combination; it comes from removing components, including the two worst offenders, Planning and Memory, which disrupted 84% and 68% of CCI tasks (Liu, 2026).

How To Avoid It

  • Ablate before shipping. At minimum run a leave-one-out sweep. One measured component per release beats four at once.
  • Default-suspect Planning and Memory. Worst disruption rates; require positive evidence to include.
  • Anchor on Tool Use. It captures 70% of scaffold value; build outward from it.
  • Measure on hard tasks. Easy tasks have high baseline accuracy that hides interference.
  • Re-ablate per model. Components harmful at 8B can help at 70B; pin the scaffold to the model and re-run on swaps.
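A leave-one-out sweep, the minimum recommended above, can be sketched in a few lines. The `evaluate` callable is an assumption: it stands in for whatever harness scores an agent built from a given set of components. `toy_evaluate` and its numbers are invented so the example runs on its own.

```python
def leave_one_out(evaluate, components):
    """Score the full scaffold, then rescore with each component removed.

    Returns (full_score, deltas) where deltas[c] = full_score - score_without_c.
    A positive delta means c helps; a negative delta means removing c improves the score.
    """
    full_set = frozenset(components)
    full_score = evaluate(full_set)
    deltas = {c: full_score - evaluate(full_set - {c}) for c in components}
    return full_score, deltas

def prune_candidates(deltas, min_gain=0.0):
    """Components whose measured contribution falls below min_gain."""
    return sorted(c for c, delta in deltas.items() if delta < min_gain)

def toy_evaluate(subset):
    """Hypothetical stand-in for a real benchmark run, NOT study data."""
    effects = {
        "tools": 0.12,
        "planning": -0.03,
        "memory": -0.02,
        "reflection": 0.005,
        "retrieval": 0.01,
    }
    return 0.10 + sum(effects[c] for c in subset)

components = ("planning", "tools", "memory", "reflection", "retrieval")
full_score, deltas = leave_one_out(toy_evaluate, components)
to_prune = prune_candidates(deltas)
print(f"All-In score: {full_score:.3f}")
print("prune candidates:", to_prune)
```

Leave-one-out needs only one extra run per component, but it measures each component against the full scaffold only. It can miss subsets that are better for reasons it never tests, so it is a floor, not a substitute for the full factorial when that is affordable.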

Key Takeaways

  • The maximally-equipped agent is rarely the optimum — 30-50% of larger configurations lose to smaller subsets in a full-factorial study
  • Planning and memory are the worst offenders, disrupting 84% and 68% of cross-component-interference tasks
  • The mechanism is per-component additive context cost, not specific destructive pairs — main-effects models fit the data with R^2=0.916
  • The All-In gap shrinks at frontier scale but never inverts — frontier models tolerate over-stacking, they do not benefit from it
  • Optimal component count is task-dependent (k=1 to k=3 in this study); there is no universal "right number"
  • Default to ablation before shipping, treat Planning and Memory as suspect-by-default, and re-ablate per model