Skip to content

Decomposing Agent Output Variability by Layer (Sampling vs Orchestration State)

Run-to-run agent variability has at least three distinct layers — separate them before picking a mitigation, because lowering temperature does not fix an orchestration-state cascade.

This pattern pays off only when three conditions hold: the agent runs a real multi-step orchestration loop (not a single model call), you control enough of the stack to act on the attribution, and the mitigation cost is justified by the resulting decision. Single-shot or hosted-API-only workloads should skip to pass@k and pass^k metrics and accept the aggregate spread.

When those conditions hold, decomposition is the framework Hydari & Iqbal (2026) propose for validating non-deterministic coding agents: trace the variability to the layer where the mitigation lives, instead of tuning whichever knob is most visible [Source: Hydari & Iqbal, The Token Not Taken].

The Three Layers

Layer What varies What fixes it
Intrinsic (token sampling) Per-step stochastic selection over the next-token distribution at temperature > 0 Lower temperature, fix the seed (where supported), greedy decode
Extrinsic (infrastructure) Floating-point reduction order across hardware, kernels, and server batch size — drifts even at temperature 0 Batch-invariant kernels, pinned hardware, fixed-batch inference
Orchestration state Across-step accumulation of tool outputs, errors, and context that conditions every subsequent step Tighter system prompt, deterministic tools, narrower state surface, retries with reset

The intrinsic layer is the one practitioners reach for first because temperature is the most visible knob. The extrinsic layer is invisible from the API surface: Thinking Machines (2025) showed that even at temperature 0 hosted endpoints drift because the forward pass is not batch-invariant — server batch size changes the floating-point reduction order in normalisation, matmul, and attention. The orchestration-state layer is the one that compounds: each step's intrinsic noise becomes the next step's deterministic input.

Why It Works

A single trajectory cannot distinguish the layers because each step's intrinsic noise becomes the next step's deterministic state. The paper's mechanism: "sampling introduces stochasticity at each token; when agents iterate, this compounds across steps because state encodes all prior decisions" [Source: Hydari & Iqbal]. Holding one layer constant while varying the others gives you the attribution:

  • Isolate intrinsic: fix the system prompt and the tool sequence; vary the seed (or run many times at fixed temperature). Variance that remains is sampling-driven.
  • Isolate extrinsic: fix the prompt, run at temperature 0, and vary the server-side batch size (or compare different inference backends). Variance that remains is infrastructure-driven.
  • Isolate orchestration state: fix the prompt and use greedy decoding; perturb the tool outputs or context order between runs. Variance that remains is state-driven.

The independent agentic-eval study found single-run pass@1 standard deviations exceeding 1.5 percentage points even at temperature 0, "because trajectories diverge early, often within the first few percent of tokens, and these small differences cascade into different solution strategies" [Source: arXiv:2602.07150]. That cascade is the orchestration-state layer; the early divergence is the intrinsic layer that seeds it.

When This Backfires

The decomposition adds cost — at least one extra controlled run per layer you isolate, plus the infrastructure to vary one layer at a time. It does not pay off in several conditions:

  • Hosted-API consumers with no infrastructure control: Anthropic, OpenAI, and Google do not expose batch size, stable seeds, or kernel variants to API clients in 2026. Intrinsic and extrinsic attribution is operationally indistinguishable; the only available mitigation is multi-run characterisation regardless of layer.
  • Single-step or short-horizon agents: when the agent emits one model call with no tool-loop iteration, the orchestration-state layer is empty. Layer attribution adds engineering effort without changing what you can do about the variance.
  • Latency-critical or single-shot tasks: CI gating, code completions, and one-shot migration scripts cannot pay the multi-run cost the framework implies. Tighten guardrails so any single run is acceptable; do not decompose its variance.
  • External non-determinism dominant: when tool outputs are themselves stochastic (web search, third-party APIs, timing-dependent state), the dominant variance is outside all three layers; the framework misattributes external noise to whichever layer happens to be isolated when the external source happens to be quiet.
  • Model-version drift underneath: silent endpoint updates and A/B routing introduce a fourth layer (model identity) the paper does not address. If you suspect drift, compare a frozen baseline against the live endpoint before attributing variance to any of the three named layers.

The dominant misattribution failure mode is treating a batch-invariance drift as a sampling-temperature problem and lowering the temperature until the variance is too small to notice — leaving the bug in place and burning capability headroom. The second is treating an orchestration-state cascade as flaky-test variance and re-running until you get a green build — which never converges because the state surface keeps growing.

Key Takeaways

  • Run-to-run agent variability is not one signal — it is at least three layers (token sampling, infrastructure, orchestration state) that require different mitigations.
  • A single trajectory cannot distinguish the layers; isolate one at a time by holding the others fixed.
  • Temperature 0 does not give you reproducibility on hosted endpoints because the forward pass is not batch-invariant — variability at T=0 is an extrinsic signal, not a sampling one.
  • Decomposition pays off only with a multi-step orchestration loop, enough stack control to act, and mitigation cost justified by the decision. Otherwise, use aggregate metrics and accept the spread.
  • The two costly misattributions are lowering temperature to mask a batch-invariance bug and re-running tests to mask an orchestration-state cascade.
Feedback