Independent Test Generation in Multi-Agent Code Systems

Separate code generation and test generation into independent agents so the test writer never sees the generated code. When a single agent writes both, test accuracy drops from 87.8% to 61.0% — the test writer inherits the code writer's blind spots.

Also known as

Blind Test Generation, Code-Test Separation Pattern. For the general evaluator-generator loop, see Evaluator-Optimizer Pattern. For human-written tests as agent spec, see TDD Agent Development. For role specialization in parallel agents, see Specialized Agent Roles.

The problem: shared-context bias

When a single agent generates code and then writes tests for it, the tests confirm the code's logic rather than challenge it — following the same reasoning path and missing the same edge cases.

AgentCoder (Huang et al., 2023) quantified this effect: moving test generation into an independent agent raised test accuracy from 61.0% to 87.8% on the HumanEval benchmark.

Three-agent architecture

The pattern uses three agents with no shared context between code and test paths:

graph TD
    R[Requirements] --> P[Programmer Agent]
    R --> T[Test Designer Agent]
    P --> E[Test Executor Agent]
    T --> E
    E -->|PASS| O[Accept Code]
    E -->|FAIL + errors| P

| Agent | Input | Output | Key constraint |
|---|---|---|---|
| Programmer | Requirements + error feedback | Code implementation | Chain-of-thought: clarify → algorithm → pseudocode → implement |
| Test Designer | Requirements only | Test cases (basic + edge + stress) | Never sees the generated code |
| Test Executor | Code + tests | Pass/fail + error messages | Deterministic execution, routes failures back to Programmer |

The test designer works from the specification, not the implementation. This stops the test writer from accommodating implementation quirks.
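
One way to make the input contract explicit is to give each agent its own context type, so the test designer's context has no field that could carry code or error output. This is a minimal sketch, not AgentCoder's implementation; the class and function names (`ProgrammerContext`, `TestDesignerContext`, `build_contexts`) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProgrammerContext:
    """What the programmer agent may see: the spec plus execution errors."""
    requirements: str
    error_feedback: tuple[str, ...] = ()


@dataclass(frozen=True)
class TestDesignerContext:
    """What the test designer may see: the spec and nothing else."""
    requirements: str


def build_contexts(requirements: str, errors: list[str] | None = None):
    """Build both isolated contexts from the shared requirements (hypothetical helper)."""
    programmer = ProgrammerContext(requirements, tuple(errors or ()))
    designer = TestDesignerContext(requirements)
    return programmer, designer


programmer_ctx, designer_ctx = build_contexts(
    "Merge two sorted integer lists into a single sorted list.",
    errors=["AssertionError: expected [1, 2], got [2, 1]"],
)
# The designer context exposes only the requirements.
print(sorted(vars(designer_ctx)))  # ['requirements']
```

Encoding the constraint in the type means a later change cannot pass code to the test designer without first adding a field to its context.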

Fewer specialized agents beat more generalist agents

| Framework | Agents | HumanEval pass@1 (GPT-4) | Token overhead |
|---|---|---|---|
| AgentCoder | 3 | 96.3% | 56.9K |
| MetaGPT | 5+ | 85.9% | 138.2K |
| ChatDev | 4+ | 84.1% | 183.7K |
| AgentVerse | 4+ | 89.0% | 149.2K |

Three tightly scoped agents with clear contracts outperform larger teams with diffuse responsibilities while using 59% to 69% fewer tokens than the other frameworks in the table.
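
The token savings can be recomputed directly from the table. The short script below does that arithmetic; the dictionary and function names are illustrative, and the figures are copied from the table rather than measured independently.

```python
# Token overhead per task in thousands, copied from the table above.
tokens_k = {
    "AgentCoder": 56.9,
    "MetaGPT": 138.2,
    "ChatDev": 183.7,
    "AgentVerse": 149.2,
}


def savings_vs(baseline: str, candidate: str = "AgentCoder") -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    return round(100 * (1 - tokens_k[candidate] / tokens_k[baseline]), 1)


for name in ("MetaGPT", "AgentVerse", "ChatDev"):
    print(name, savings_vs(name))
# MetaGPT 58.8
# AgentVerse 61.9
# ChatDev 69.0
```

The roughly 59% figure corresponds to the closest competitor on token use (MetaGPT); against the other two frameworks the saving is larger.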

Ablation: each agent pulls its weight

Removing any component degrades the system (GPT-3.5 on HumanEval):

| Configuration | pass@1 | Delta vs. programmer only (points) |
|---|---|---|
| Programmer only | 61.0% | |
| + Test Designer | 64.0% | +3.0 |
| + Test Executor | 64.6% | +3.6 |
| Full system (all three) | 79.9% | +18.9 |

The non-linear jump when all three work together shows what produces the gains: closing the loop with execution and error routing, not role separation alone.

Iteration budget

AgentCoder evaluated up to five refinement rounds on HumanEval and MBPP; accuracy rises fastest in the first two iterations and flattens afterward. A 3–5 round cap is a reasonable starting point — beyond that, continued failures indicate a spec or approach problem rather than something iteration will fix. See also agent self-review loops.
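
A capped refinement loop might look like the sketch below. It assumes two caller-supplied callables, `write_code` and `run_tests`, standing in for the programmer and test executor agents; those names, the `refine` function, and the stub agents are hypothetical, and the default of five rounds simply takes the upper end of the range suggested above.

```python
from __future__ import annotations

from typing import Callable, Optional


def refine(
    requirements: str,
    write_code: Callable[[str, list[str]], str],
    run_tests: Callable[[str], list[str]],
    max_rounds: int = 5,
) -> tuple[Optional[str], int]:
    """Return (accepted_code, rounds_used); code is None if the cap is hit."""
    errors: list[str] = []
    for round_no in range(1, max_rounds + 1):
        code = write_code(requirements, errors)  # programmer sees spec + errors only
        errors = run_tests(code)                 # executor returns error messages
        if not errors:
            return code, round_no
    # Cap reached: escalate to a human instead of iterating further.
    return None, max_rounds


# Stub agents for illustration only: the "programmer" succeeds once it has seen an error.
def stub_programmer(requirements: str, errors: list[str]) -> str:
    return "fixed" if errors else "buggy"


def stub_executor(code: str) -> list[str]:
    return [] if code == "fixed" else ["AssertionError: wrong order"]


print(refine("spec", stub_programmer, stub_executor))  # ('fixed', 2)
```

Returning `None` at the cap, rather than the last attempt, makes the "spec or approach problem" case visible to the caller.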

When this backfires

  • Test designer inherits spec errors: both agents receive the same requirements document, so ambiguities, underspecifications, or outright errors reach both. The pattern removes code-context bias but cannot make up for a flawed or incomplete specification.
  • Generated tests can be wrong themselves: independent generation does not guarantee test correctness. BACE (arXiv:2603.28653) documents how "incorrect code frequently passes faulty or trivial tests, while valid solutions are often degraded to satisfy incorrect assertions", so treating agent-generated tests as ground truth is fragile. When correctness guarantees matter, anchor the loop with public reference tests or a human-reviewed test suite.
  • Benchmark gap: AgentCoder measured its results on single-function HumanEval tasks. Multi-file codebases add cross-module dependencies and integration constraints that a spec-only test designer cannot fully anticipate, so test accuracy improvements may be smaller in practice.
  • Token overhead is real: running three agents uses about 57K tokens per task, roughly double the cost of a single-agent approach. For high-volume, low-complexity tasks such as boilerplate generation, the accuracy gain may not justify the overhead.
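
The anchoring idea from the second bullet can be sketched as a small acceptance gate: a candidate must pass a human-reviewed anchor suite, and a candidate that passes the anchors but fails generated tests is flagged for review instead of being "fixed" to satisfy a possibly wrong assertion. The representation of test cases as `(args, expected)` pairs and the names `failures` and `verdict` are assumptions made for illustration.

```python
from typing import Any, Callable, List, Tuple

Case = Tuple[tuple, Any]  # (positional args, expected result)


def failures(func: Callable[..., Any], cases: List[Case]) -> List[tuple]:
    """Return the argument tuples for which func disagrees with the expected value."""
    return [args for args, expected in cases if func(*args) != expected]


def verdict(func: Callable[..., Any], generated: List[Case], anchors: List[Case]) -> str:
    """Treat the human-reviewed anchors, not the generated tests, as ground truth."""
    if failures(func, anchors):
        return "reject: fails human-reviewed anchors"
    if failures(func, generated):
        return "review: passes anchors, generated tests are suspect"
    return "accept"


# Illustrative data: the second generated case carries a faulty expected value.
anchors = [(([1, 3], [2]), [1, 2, 3])]
generated = [(([], []), []), (([2], [1]), [2, 1])]


def candidate(a, b):
    return sorted(a + b)


print(verdict(candidate, generated, anchors))
# review: passes anchors, generated tests are suspect
```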

Applying the pattern

  • Multi-agent frameworks: give each agent its own system prompt. The test designer's prompt excludes all code context, which is an application of specialized agent roles. The programmer receives only execution errors, not test source.
  • CI/CD pipelines: run code and test generation as separate agent invocations with isolated contexts. Route failures back with error context only.
  • Single-agent tools: approximate the pattern by running test generation in a separate session with fresh context, using only requirements as input.
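
For the execution step, a minimal sketch of an isolated test run is shown below. It assumes the tests are plain `assert` statements rather than a pytest suite, and the file names (`solution.py`, `run_tests.py`) and the `run_isolated` helper are hypothetical. Only the final line of the error output is returned, because a full traceback would quote the failing test line back to the programmer.

```python
from __future__ import annotations

import subprocess
import sys
import tempfile
from pathlib import Path


def run_isolated(code: str, tests: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run `tests` against `code` in a fresh interpreter; return (passed, error line)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(code)
        Path(tmp, "run_tests.py").write_text("from solution import *\n" + tests)
        proc = subprocess.run(
            [sys.executable, "run_tests.py"],
            cwd=tmp,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    lines = proc.stderr.strip().splitlines()
    # Keep only the exception line so no test source leaks into the feedback.
    return proc.returncode == 0, (lines[-1] if lines else "")


buggy = "def add(a, b):\n    return a - b\n"
tests = "assert add(1, 2) == 3, 'add(1, 2) should be 3'\n"
passed, error = run_isolated(buggy, tests)
print(passed, error)
```

How much error detail to pass back is a design choice: a single exception line protects test isolation, while a fuller message helps the programmer converge in fewer rounds.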

Example

A team building a Python utility library applies the three-agent pattern to generate and validate a merge_sorted_lists function.

Programmer agent system prompt:

You are a Python programmer. Given a function specification,
produce a correct implementation. If you receive test failure
output, fix the code based on the error messages only.
Do not request or reference any test code.

Test designer agent system prompt:

You are a test engineer. Given a function specification,
produce pytest test cases covering: basic behavior, edge cases
(empty lists, duplicates, single-element), and stress cases
(10k elements). You will never see the implementation.
Write tests based solely on the specification.

Specification (shared input):

merge_sorted_lists(a: list[int], b: list[int]) -> list[int]
Merge two sorted integer lists into a single sorted list.
Time complexity: O(n + m).

The test designer generates tests from the spec alone — including edge cases like merge_sorted_lists([], []) and merge_sorted_lists([1,1,1], [1,1]) that a programmer-coupled test writer typically omits. The test executor runs both artifacts, routes any FAILED output back to the programmer with error messages only, and the loop repeats until all tests pass or the iteration cap is reached.
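
For concreteness, the two artifacts might look like the sketch below: a two-pointer implementation that meets the O(n + m) requirement, followed by spec-derived tests in pytest style. Both are illustrative outputs written for this example, not actual agent transcripts, and the test function names are assumptions.

```python
from __future__ import annotations


def merge_sorted_lists(a: list[int], b: list[int]) -> list[int]:
    """Merge two sorted lists with a two-pointer scan in O(n + m)."""
    merged: list[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of these tails is non-empty.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


# Tests derived from the specification alone (pytest discovers test_* functions).
def test_basic():
    assert merge_sorted_lists([1, 4], [2, 3]) == [1, 2, 3, 4]


def test_both_empty():
    assert merge_sorted_lists([], []) == []


def test_one_empty():
    assert merge_sorted_lists([], [5]) == [5]


def test_duplicates():
    assert merge_sorted_lists([1, 1, 1], [1, 1]) == [1, 1, 1, 1, 1]


def test_stress_10k():
    evens = list(range(0, 20000, 2))  # 10k elements
    odds = list(range(1, 20000, 2))   # 10k elements
    assert merge_sorted_lists(evens, odds) == list(range(20000))
```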
