
Voting / Ensemble Pattern

Run the same task N times in parallel, then aggregate results through voting to trade compute for confidence.

Also known as

Self-Consistency, Majority Voting, Multi-Model Consensus. For the complementary pattern that merges strengths rather than voting, see Fan-Out Synthesis. For specialized multi-lens review, see Committee Review.

Structure

graph TD
    A[Task] --> B[Run 1]
    A --> C[Run 2]
    A --> D[Run N]
    B --> E[Aggregator]
    C --> E
    D --> E
    E -->|Consensus reached| F[Accept]
    E -->|No consensus| G[Escalate / Re-run]

Unlike fan-out synthesis (which assembles the best parts from diverse outputs) or committee review (which applies different lenses), voting runs identical tasks and picks the answer the runs agree on.
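
The flow in the diagram can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `vote`, `min_agreement`, and the `scripted_runner` stub are hypothetical names introduced here, and the stub stands in for a real model call.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable


async def vote(task: str,
               run: Callable[[str], Awaitable[str]],
               n: int = 3,
               min_agreement: float = 0.6) -> dict:
    """Run the same task n times concurrently; accept only on consensus."""
    answers = await asyncio.gather(*(run(task) for _ in range(n)))
    winner, count = Counter(answers).most_common(1)[0]
    if count / n >= min_agreement:
        return {"action": "ACCEPT", "answer": winner, "votes": count}
    # No answer reached the agreement threshold: hand off to a human or re-run.
    return {"action": "ESCALATE", "answers": list(answers)}


def scripted_runner(replies):
    """Hypothetical stand-in for a model call: returns canned replies in order."""
    it = iter(replies)

    async def run(task: str) -> str:
        return next(it)

    return run


if __name__ == "__main__":
    runner = scripted_runner(["SAFE", "SAFE", "UNSAFE"])
    print(asyncio.run(vote("Is this diff safe?", runner, n=3)))
```

The agreement threshold is the knob that decides between the Accept and Escalate branches; the 0.6 default is an arbitrary choice for the sketch.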

Three Fan-Out Tactics

| Tactic | Setup | Diversity source |
| --- | --- | --- |
| Self-consistency sampling | Same model, same prompt, high temperature | Stochastic variation across reasoning paths |
| Prompt ensembles | Same model, varied prompts | Different framings surface different reasoning |
| Multi-model consensus | Different models, same prompt | Independent training data and failure modes |

Multi-model consensus provides the strongest diversity: calling one model N times can repeat the same systematic mistake N times, while different models are more likely to fail in different ways.
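
One way to picture the three tactics is as three ways of building the list of runs. The sketch below is illustrative only: `RunSpec`, the builder functions, and the default temperatures are assumptions made here, and any model names passed in are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSpec:
    """One planned call: which model, which prompt, which sampling temperature."""
    model: str
    prompt: str
    temperature: float


def self_consistency(model: str, prompt: str, n: int, temperature: float = 0.8):
    # Same model and prompt; diversity comes only from sampling randomness.
    return [RunSpec(model, prompt, temperature) for _ in range(n)]


def prompt_ensemble(model: str, prompts: list, temperature: float = 0.0):
    # Same model; diversity comes from differently framed prompts.
    return [RunSpec(model, p, temperature) for p in prompts]


def multi_model(models: list, prompt: str, temperature: float = 0.0):
    # Same prompt; diversity comes from different models.
    return [RunSpec(m, prompt, temperature) for m in models]
```

Each builder returns a plain list, so the same fan-out and aggregation code can consume any of the three.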

When Voting Helps

Voting works best on tasks with discrete, verifiable outputs where the correct answer exists but a single run might miss it:

  • Classification — is this input malicious, compliant, or out-of-scope?
  • Security flagging — does this diff introduce a vulnerability? (the adversarial multi-model use case)
  • Content moderation — does this output violate policy?
  • Code correctness checks — does this function handle the edge case?

Voting adds little value for creative synthesis, open-ended generation, or real-time responses where latency matters more than marginal accuracy.

Choosing N

The foundational self-consistency paper (Wang et al. 2023) reported a 17.9-point absolute accuracy gain on GSM8K from majority-voting over sampled reasoning paths. But more is not always better.

| N | Effect |
| --- | --- |
| 1 | Baseline; no voting benefit |
| 3 | Strong gains for most classification and verification tasks |
| 5 | Marginal improvement over 3; good ceiling for most use cases |
| 7+ | Diminishing or inverted returns; more calls can hurt on hard queries |

Kore.ai's scaling law research confirms that performance initially increases then decreases with N — more calls help on easy queries but hurt on hard ones. The optimal count is task-dependent; determine it empirically.
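
A simplified model shows one way the easy/hard split can arise. Assume a binary task, an odd N, and runs that are independent and each correct with probability p (all of these are assumptions of the sketch, not claims about any particular model). The majority is then correct with probability equal to the sum over k > N/2 of C(N, k) p^k (1 - p)^(N - k):

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """P(strict majority of n independent runs is correct) on a binary task.

    Assumes every run is correct with the same probability p and n is odd.
    """
    if n % 2 == 0:
        raise ValueError("use an odd n so a strict majority always exists")
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


if __name__ == "__main__":
    for p in (0.7, 0.4):  # an "easy" query and a "hard" one
        row = ", ".join(f"N={n}: {majority_accuracy(p, n):.3f}" for n in (1, 3, 5, 7))
        print(f"p={p}: {row}")
```

Under these assumptions, voting amplifies whatever the single-run tendency is: when p is above 0.5, accuracy rises with N, and when p is below 0.5, more runs make the wrong answer more likely to win. A workload that mixes both kinds of query can therefore peak at a moderate N.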

Aggregation Strategies

Simple majority voting treats all runs equally but leaves accuracy on the table.

| Strategy | Mechanism | Trade-off |
| --- | --- | --- |
| Majority vote | Most common answer wins | Simple; ignores model quality differences |
| Weighted vote | Runs scored by model capability or historical accuracy | Better accuracy; requires calibration data |
| Confidence-weighted | Weight by model's reported confidence score | ~46% compute reduction at equivalent accuracy (Taubenfeld et al. 2025) |
| Unanimous | All runs must agree; else escalate | High precision, low recall; good for safety-critical decisions |
| Semantic similarity | Cluster answers by meaning, pick densest cluster | Handles paraphrased equivalents |

Advanced methods like Optimal Weight and Inverse Surprisingly Popular algorithms consistently outperform standard majority voting by accounting for model heterogeneity and answer correlations (Ai et al. 2025).
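
The simpler strategies from the table can be sketched as small functions. This is an illustrative sketch with hypothetical function names. The weights could be historical accuracies or reported confidences, depending on the strategy. Semantic-similarity clustering is left out because it needs an embedding model, and the Optimal Weight and Inverse Surprisingly Popular algorithms are not reproduced here.

```python
from collections import defaultdict
from typing import Optional


def weighted_vote(votes: list) -> str:
    """votes is a list of (answer, weight) pairs; the largest total weight wins.

    Ties go to the answer that appeared first.
    """
    totals = defaultdict(float)
    for answer, weight in votes:
        totals[answer] += weight
    return max(totals, key=totals.get)


def majority_vote(answers: list) -> str:
    # Plain majority is weighted voting with every run weighted equally.
    return weighted_vote([(a, 1.0) for a in answers])


def unanimous(answers: list) -> Optional[str]:
    # Returns the shared answer, or None to signal "escalate".
    return answers[0] if answers and len(set(answers)) == 1 else None
```

For example, `weighted_vote([("SAFE", 0.9), ("UNSAFE", 0.5), ("UNSAFE", 0.3)])` picks `"SAFE"` because its total weight (0.9) exceeds the combined weight of the two `"UNSAFE"` votes (0.8), whereas `majority_vote` on the same three answers picks `"UNSAFE"`.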

Cost Trade-Off

N runs cost N× the tokens. Confidence-weighted voting can cut this nearly in half by stopping early when confidence is high: start with N=3, and if all three runs agree confidently, skip the rest. Scale to 5 only if the accuracy gain justifies it.

For routine tasks with strong single-run baselines, voting is wasteful. Reserve it for decisions where a false positive or false negative carries real cost.
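
The "start at 3, extend to 5 only when needed" policy might look like the sketch below. It assumes each run returns an `(answer, confidence)` pair; `adaptive_vote`, `min_conf`, and the 0.8 threshold are hypothetical choices made for illustration.

```python
import asyncio
from collections import Counter
from typing import Awaitable, Callable, Tuple

Run = Callable[[str], Awaitable[Tuple[str, float]]]


async def adaptive_vote(task: str, run: Run, start_n: int = 3,
                        max_n: int = 5, min_conf: float = 0.8) -> dict:
    """Start with start_n runs; pay for the rest only if they disagree or are unsure."""
    results = list(await asyncio.gather(*(run(task) for _ in range(start_n))))
    unanimous = len({answer for answer, _ in results}) == 1
    confident = all(conf >= min_conf for _, conf in results)
    extra_n = max(0, max_n - start_n)
    if extra_n and not (unanimous and confident):
        results += await asyncio.gather(*(run(task) for _ in range(extra_n)))
    winner, count = Counter(answer for answer, _ in results).most_common(1)[0]
    return {"answer": winner, "votes": count, "calls": len(results)}
```

The returned `calls` field makes the saving visible: easy inputs finish after three calls, and only contested or low-confidence ones consume the full budget.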

Why It Works

LLMs are stochastic: the same prompt samples from a distribution of reasoning paths. Wrong answers scatter — each error follows its own spurious chain of thought — while correct answers cluster, because independent paths converge on the same consistent logic. Majority voting selects the answer most paths agree on, drowning out idiosyncratic errors (Wang et al. 2023).

Multi-model consensus strengthens this further. Models built on distinct training data and architectures tend to have different failure modes, so an error that is systematic for one model is less likely to be repeated by another, and the correct answer remains the densest cluster even as ensemble size grows.

This entire argument rests on errors being independent, and that assumption is fragile. When the runs share training lineage (the same base model, or smaller models distilled from a common teacher), their mistakes correlate. Correlated wrong answers cluster just as tightly as correct ones, so the majority can confidently converge on the same error. Distillation makes nominally "different" models behave alike; tracking pairwise genealogical similarity between agents (how much lineage each pair shares) reveals when the ensemble's diversity is an illusion and the voting gain has collapsed.
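
A toy calculation makes the collapse concrete. Assume three voters on a binary task, each correct with probability p. With probability `rho` they all copy one shared answer (a crude stand-in for shared lineage), and otherwise they vote independently. The model and the names `rho` and `accuracy_with_shared_lineage` are assumptions of this sketch, not a measured property of real ensembles.

```python
def majority_of_three(p: float) -> float:
    """Accuracy of a 3-way majority when the voters are fully independent."""
    return p**3 + 3 * p**2 * (1 - p)


def accuracy_with_shared_lineage(p: float, rho: float) -> float:
    """Toy mixture: with probability rho all three voters give one shared answer."""
    return rho * p + (1 - rho) * majority_of_three(p)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(f"rho={rho}: {accuracy_with_shared_lineage(0.8, rho):.3f}")
```

At p = 0.8, fully independent voters reach 0.896, half-correlated voters reach 0.848, and fully correlated voters fall back to the single-run 0.8. In this model the three votes then buy nothing but still report 3/3 agreement.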

Example

Security review of a pull request using 3-model consensus:

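import asyncio
import json

# Assumed helpers, not shown: call_claude, call_gpt4 and call_gemini are async
# functions that send a prompt to one provider and return its raw text reply.

PROMPT = "Review this diff for security vulnerabilities. Return JSON: {\"verdict\": \"SAFE\" | \"UNSAFE\", \"findings\": [...]}\n\n"

def merge_findings(results):
    # Illustrative helper: pool findings from the UNSAFE votes, skipping duplicates.
    merged = []
    for r in results:
        if r["verdict"] != "UNSAFE":
            continue
        for finding in r.get("findings", []):
            if finding not in merged:
                merged.append(finding)
    return merged

async def review_with_model(name, call_fn, diff):
    resp = await call_fn(PROMPT + diff)
    # json.loads raises if the reply is not valid JSON; production code should handle that.
    return {"model": name, **json.loads(resp)}

async def vote_on_diff(diff: str):
    # Fan out to all three reviewers concurrently.
    results = await asyncio.gather(
        review_with_model("claude", call_claude, diff),
        review_with_model("gpt4", call_gpt4, diff),
        review_with_model("gemini", call_gemini, diff),
    )
    unsafe = sum(1 for r in results if r["verdict"] == "UNSAFE")
    if unsafe >= 2:
        # Majority says UNSAFE: block and report the pooled findings.
        return {"action": "BLOCK", "findings": merge_findings(results)}
    if unsafe == 1:
        # Minority dissent: merge, but surface the lone UNSAFE review.
        return {"action": "MERGE", "dissent": [r for r in results if r["verdict"] == "UNSAFE"]}
    return {"action": "MERGE", "findings": []}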

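To the extent the three models fail independently, a vulnerability one model misses is likely to be caught by another. A single UNSAFE vote does not block the merge; it is returned as dissent so a reviewer can inspect it.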

Key Takeaways

  • Voting trades compute for confidence — same task, multiple runs, aggregated verdict
  • Multi-model diversity beats same-model repetition for genuine independence
  • 3-5 runs cover most use cases; beyond that, returns diminish or invert
  • Confidence-weighted aggregation cuts compute by ~46% vs naive majority voting (Taubenfeld et al. 2025)
  • Reserve voting for discrete, verifiable tasks (classification, security, compliance) — not open-ended generation
  • Distinct from fan-out synthesis (which merges complementary strengths) and committee review (which applies specialized lenses)