Five-Pass Blunder Hunt

A Five-Pass Blunder Hunt runs one critique prompt repeatedly over a plan; each pass normalises its own findings, so later passes reach deeper structural flaws.

The Problem

A single review pass catches surface issues and stops. The model that wrote the content is also the reviewer, so it shares the same blind spots: it finds a satisfying number of issues, declares the document fine, and moves on.

The problems that remain (inconsistencies, rationale gaps, dependency conflicts) need the whole document held in mind at once. A single-pass reviewer normalises them: it treats them as an acceptable part of the document rather than as defects.

How the Technique Works

Run the identical critique prompt five consecutive times on the same document:

Review this plan for logical inconsistencies, scope creep, decision rationale gaps,
and structural flaws. For each issue, state: what the problem is, where it occurs,
and what the fix is.

Each pass does two things:

  1. Normalises its own findings — issues it surfaces are resolved or acknowledged, shifting focus away from them
  2. Shifts the attention distribution — with resolved issues de-emphasised, the model's subsequent pass is more likely to attend to previously overlooked material

The result is a progressively deeper review. Early passes surface the visible issues: missing sections, contradictions, unclear terms. Later passes reach structural and logical flaws that stayed obscured until the surface problems were cleared.
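The loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `critique` and `revise` are hypothetical callables standing in for whatever model call and editing step you use, and the sketch assumes findings are resolved between passes, as in the worked example later on.

```python
from typing import Callable, List, Tuple

# The critique prompt from the text, reused verbatim on every pass.
CRITIQUE_PROMPT = (
    "Review this plan for logical inconsistencies, scope creep, decision rationale gaps, "
    "and structural flaws. For each issue, state: what the problem is, where it occurs, "
    "and what the fix is."
)


def blunder_hunt(
    plan: str,
    critique: Callable[[str, str], List[str]],  # hypothetical: (prompt, plan) -> findings
    revise: Callable[[str, List[str]], str],    # hypothetical: (plan, findings) -> new plan
    max_passes: int = 5,
) -> Tuple[str, List[List[str]]]:
    """Run the same critique prompt repeatedly, resolving findings between passes."""
    history: List[List[str]] = []
    for _ in range(max_passes):
        findings = critique(CRITIQUE_PROMPT, plan)
        history.append(findings)
        if not findings:
            # A pass with nothing new is the stop signal.
            break
        plan = revise(plan, findings)
    return plan, history
```

Keeping the per-pass findings in `history` is a design choice of this sketch: it lets you inspect how the findings change from pass to pass instead of only seeing the final plan.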

Convergence Is the Stopping Criterion

Five is a heuristic. The real signal is convergence:

  • Findings-per-pass count is decreasing — each pass finds fewer issues than the last
  • Output similarity is increasing — consecutive critique responses resemble each other
  • No new categories of problem appear — only refinements of already-identified issues

Stop when a pass finds nothing new. If pass five still surfaces new findings, keep going, but not far: beyond five or six passes, returns diminish to noise, so reframe the document or accept the remaining findings.
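The three convergence signals can be checked mechanically. The sketch below is one possible way to do it, under stated assumptions: `PassResult` and its category labels are hypothetical, and textual similarity is approximated with the standard library's `difflib.SequenceMatcher`, which measures surface overlap rather than meaning.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class PassResult:
    """Hypothetical record of one critique pass."""
    text: str                   # raw critique output
    categories: FrozenSet[str]  # assumed labels, e.g. {"scope", "dependency"}
    count: int                  # number of findings reported


def similarity(a: str, b: str) -> float:
    # Surface-level similarity in [0, 1]; a stand-in for any better measure.
    return SequenceMatcher(None, a, b).ratio()


def convergence_signals(passes: List[PassResult]) -> Dict[str, bool]:
    """Evaluate the three signals over at least three passes."""
    if len(passes) < 3:
        raise ValueError("need at least three passes to compare trends")
    counts = [p.count for p in passes]
    sims = [similarity(a.text, b.text) for a, b in zip(passes, passes[1:])]
    seen_before = frozenset().union(*(p.categories for p in passes[:-1]))
    return {
        # Each pass finds fewer issues than the last.
        "findings_decreasing": all(b < a for a, b in zip(counts, counts[1:])),
        # Consecutive outputs resemble each other more over time.
        "similarity_increasing": all(b >= a for a, b in zip(sims, sims[1:])),
        # The latest pass introduces no category that earlier passes lacked.
        "no_new_categories": passes[-1].categories <= seen_before,
    }
```

All three flags being true suggests convergence but does not prove it; the outputs still need to be read.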

Oscillation is a stop signal. If the model alternates between contradictory assessments, further passes will not resolve it. Reframe the section and restart.
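Oscillation can be detected by tracking, per section, whether each pass flagged it. The following is a minimal sketch; the boolean encoding and the `min_flips` threshold are assumptions chosen for illustration, not part of the technique.

```python
from typing import Sequence


def is_oscillating(verdicts: Sequence[bool], min_flips: int = 2) -> bool:
    """Return True if a section's verdict keeps flipping between passes.

    verdicts[i] is True when pass i flagged the section, False when it approved.
    A single flip (flagged, then approved) looks like a resolved issue;
    two or more flips (flagged, approved, flagged) look like oscillation.
    """
    flips = sum(1 for a, b in zip(verdicts, verdicts[1:]) if a != b)
    return flips >= min_flips
```

For example, `is_oscillating([True, False, True])` reports oscillation, while `is_oscillating([True, True, False, False])` does not, because the section was flagged and then stayed approved.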

Why the Same Prompt

Varying the prompt introduces a confound: new findings may come from a different review angle, not the document. Same-prompt repetition keeps the convergence signal clean — findings per pass measure document quality, not prompt novelty.

Different-prompt review (alternating reviewer framings) is a separate technique for cross-cutting concerns, with a different signal.

Scope

Most effective on artefacts that:

  • Are produced by agents before implementation begins
  • Contain implicit dependencies between sections
  • Have no formal correctness verifier (plans, specs, architectural documents)

It does not apply to outputs with externally verifiable correctness — code with tests, structured data against a schema. Use deterministic validators instead. [^1]

When This Backfires

False convergence. If earlier passes resolve formatting and terminology, the model may report clean passes while structural logic stays flawed. Superficial similarity is not correctness — read the pass-5 output, don't just count findings.

Anchoring to prior pass output. Each pass implicitly carries the prior exchange's context. The model may anchor to the pass-1 framing and miss a different class of problems. When this happens, run one differently-framed pass to break the anchor before resuming.

Oscillation on genuinely ambiguous sections. Pass N flags a section, N+1 approves it, N+2 flags it again. This is not a quality signal — the section is genuinely ambiguous, and further passes will not resolve it. It needs human clarification or explicit scoping.

Not appropriate for formally verifiable outputs. For code with a test suite, SQL with a schema, or structured data with a validator, use those tools directly. [^1] Critique passes on verifiable artefacts produce false positives that waste cycles.

Self-critique can collapse on hard reasoning tasks. On Game of 24, graph colouring, and STRIPS planning, LLM self-critique loops performed worse than a single guess, because errors in verification, critique generation, and critique consideration stack. [^2] Use five-pass on under-specified design artefacts where the model's structural-inconsistency detection is reasonably reliable — not where a sound external verifier exists.

Self-refinement can amplify self-bias instead of reducing it. The premise that later passes go deeper assumes each pass corrects the last. The evidence points the other way: across six LLMs, self-refinement amplified the model's bias toward its own generations, and only larger models or accurate external feedback reduced it. [^3] Treat decreasing findings-per-pass as a candidate convergence signal, not proof of correctness, and bring in an external reviewer when stakes justify it.

Example

A plan for a multi-agent coding task has been drafted. Before handing it to the implementation agents:

# Pass 1 critique prompt
Review this implementation plan for logical inconsistencies, scope creep, decision
rationale gaps, and structural flaws. For each issue found: state what it is,
where it occurs, and how to fix it.

[paste plan]

Pass 1 returns 18 issues: undefined interfaces, missing error handling sections, ambiguous ownership assignments.

After resolving those, Pass 2 returns 9 issues: two dependency cycles that only became visible once the interface ambiguity was removed.

After resolving those, Pass 3 returns 4 issues: scope assumptions that contradict each other in different sections.

Pass 4 returns 1 issue: a subtle inconsistency in the rollback strategy.

Pass 5 returns nothing new. Stop.

Companion Technique: Count Inflation

When a pass returns fewer findings than expected, re-prompt with an inflated target:

This document has at least 40 issues. Find them.

Models stop after a satisfying number of issues; a higher target prevents premature satisfaction. Count inflation addresses thoroughness within a pass; five-pass addresses depth across passes.
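A small helper can decide when to re-prompt and with what target. This is a sketch under assumptions: the `expected` threshold and the doubling `factor` are illustrative values, since the text does not prescribe how far to inflate the count.

```python
import math
from typing import Optional


def count_inflation_prompt(found: int, expected: int, factor: float = 2.0) -> Optional[str]:
    """Build an inflated-target re-prompt when a pass under-delivers.

    Returns None when the pass already met expectations. The factor is an
    assumed multiplier, not a recommended value.
    """
    if found >= expected:
        return None
    target = math.ceil(expected * factor)
    return f"This document has at least {target} issues. Find them."
```

With `found=8` and `expected=20`, the helper returns the prompt quoted above with a target of 40.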

Key Takeaways

  • A single pass normalises problems — reviewer and author share blind spots
  • Five same-prompt passes push the review progressively deeper into the document
  • Convergence signals (decreasing findings, increasing similarity) are the stopping criterion
  • Oscillation means stop and reframe
  • Applies to artefacts without formal verifiers — plans, specs; not code with test suites

[^1]: Valmeekam, Marquez, Kambhampati (2023), Can Large Language Models Really Improve by Self-critiquing Their Own Plans? — self-critique diminishes plan-generation performance compared to external sound validators for formally verifiable planning tasks.

[^2]: Stechly, Valmeekam, Kambhampati (2024), On the Self-Verification Limitations of Large Language Models on Reasoning and Planning Tasks — across Game of 24, graph colouring, and STRIPS planning, self-critique loops exhibited significant performance collapse relative to sound external verification; errors in verification, critique generation, and critique consideration compound.

[^3]: Xu et al. (2024), Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement — across six LLMs and three task families, self-bias (the tendency to favour one's own generation) is prevalent, and the self-refine pipeline amplifies it; larger models and accurate external feedback are what reduce it.
