Assertion-Free Test Theater in Agent-Authored Patches

Agent-authored tests often execute code without asserting on its output, so quality gates anchored on "tests are present" overstate verification strength.

The Pattern

An assertion-free test file calls the changed function, sometimes wraps the call in a try/except, and never checks the result. The patch ships a test_* file beside the production change; CI counts the file; the merge gate is satisfied. No behavioural verification happened. In the largest empirical study to date, 80.2% of agent-authored test patches contained weak or no explicit oracle signals — across 86,156 test-file patches from 33,596 pull requests in 2,807 GitHub repositories, covering OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code.

The defect is not the test file. The defect is the merge gate that reads "test file present" as "behaviour verified." Once that gate exists, an agent optimising against it correctly emits the cheapest output that passes — a smoke test, a call-only stub, or a try/except that swallows everything (Banik et al. 2026).

Three Tiers of Oracle Signal

Banik et al. classify test patches against an eight-category taxonomy that collapses to three tiers (arxiv:2606.18168):

| Tier | What the test does | Behavioural signal |
| --- | --- | --- |
| No oracle | Smoke / call-only patterns, exception-swallowing try/except, assertion-free flows | None beyond "does not crash" |
| Weak oracle | Generic assertTrue(x), assertNotNull(x), truthiness checks | Crash plus broad truthiness; boundary and value checks absent |
| Strong oracle | Specific value, boundary, or state checks tied to expected behaviour | Direct behavioural verification |
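The three tiers can be shown side by side with Python's `unittest`. This is a sketch under assumptions: `clamp` is a hypothetical function under test, and `assertIsNotNone` stands in for the `assertNotNull` style of check named in the table.

```python
import unittest


def clamp(value: int, low: int, high: int) -> int:
    """Hypothetical function under test: restrict value to the range [low, high]."""
    return max(low, min(value, high))


class ClampOracleTiers(unittest.TestCase):
    def test_no_oracle(self):
        # No oracle: fails only if clamp() raises.
        clamp(15, 0, 10)

    def test_weak_oracle(self):
        # Weak oracle: any non-None result passes, including a wrong one.
        self.assertIsNotNone(clamp(15, 0, 10))

    def test_strong_oracle(self):
        # Strong oracle: specific expected values above, below, and at a boundary.
        self.assertEqual(clamp(15, 0, 10), 10)
        self.assertEqual(clamp(-5, 0, 10), 0)
        self.assertEqual(clamp(10, 0, 10), 10)
```

If `clamp` were changed to return `value` unmodified, the first two tests would still pass and only `test_strong_oracle` would fail.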

After controlling for agent type, PR size, repo popularity, task type, and language, strong-oracle patches merge at higher rates (odds ratio 1.28, p < 0.001) (Banik et al. 2026). Raw pre-adjustment numbers can run the other way — which is exactly why presence-based dashboards make the gap invisible. The fix is oracle-aware grading, not file counting.

Why It Happens

A test without an oracle exercises a code path; the only outcomes it can observe are crashes. Behavioural correctness — does the function return the right value, with the right type, under the right preconditions — is unobservable to such a file. When the merge gate's predicate is "tests touch the changed code? ✓", a 30-line test_* that calls the new function and catches every exception passes the gate without producing any verification signal.

This is the oracle problem reappearing in the agent-PR pipeline: oracle quality is load-bearing, presence is not. Adjacent SWE-agent research confirms it from the other direction — when oracle information signals are extracted explicitly and "perfectly obtained," agent task-resolution rates rise sharply. The agent is optimising correctly against the gate it was given; the gate has misidentified what it measures. The test-homogenization trap lives one tier higher (assertions present but biased); this anti-pattern lives below it (assertions absent entirely).

When This Backfires

Oracle-aware gates should not be applied blindly. Four conditions make presence-based or smoke-only verification defensible:

  • Pure import / integration ping tests. "Does the package load? Does the worker reach the queue?" is the test by design — call-only is correct.
  • Property-based and fuzz tests. The assertion lives in the property runner, not as an assert line in the test body — AST-level oracle classification misses this; exempt the file type first.
  • Early-stage or spike codebases. Under fast API churn, specific assertions rot within days. A smoke test that survives churn carries more long-run signal than a brittle value check.
  • Teams whose merge gate is deep human review. Reviewers catch assertion-free tests when they read the diff; the gate is not the CI predicate.

The trap is sharpest when (a) the merge gate is automated, (b) the file's claimed type is behavioural / unit, and (c) the agent wrote both the production change and the test.

Mitigations

  • Grade oracles, not files. Count assertions per touched function or apply the eight-category oracle taxonomy at AST level. Block patches whose oracle distribution skews to "none" for behavioural-type files.
  • Treat assertion count as a partial signal. Counts catch the floor but miss biased assertions — pair with the test-homogenization-trap mitigations (mutation-guided generation, human-authored edge cases, mixed methodologies).
  • Apply differential testing where a reference exists. When a known-good implementation is available, differential testing — comparing outputs of the change against the reference for the same inputs — substitutes for explicit assertions, an oracle category cataloged in the canonical oracle-problem survey.
  • Close the loop at review. Surface oracle-strength in the PR check UI so reviewers see "4 new test files; 0 contain value assertions" before merging — failure-driven iteration applied to the gate itself.

Key Takeaways

  • File-presence quality gates overstate verification by roughly five-fold: with 80.2% of agent-authored test patches carrying weak or no oracle signal, at most about one in five (19.8%) carries a strong one (Banik et al. 2026).
  • The strong-oracle merge advantage (OR 1.28) only surfaces after controlling for confounders — presence-based dashboards systematically hide the gap.
  • The defect is the gate's predicate, not the agent's output; an agent optimising against "tests exist" correctly emits the cheapest passing output.
  • Smoke-only and import-ping tests are legitimate; the trap is grading them as behavioural verification, not writing them.
  • Replace presence checks with oracle-aware grading — AST-level assertion counts, taxonomy classification, or differential testing where a reference is available.

Related

  • The Test Homogenization Trap — same-model code+test shares blind spots; this anti-pattern is the no-assertion failure mode one tier below.
  • Failure-Driven Iteration — the broader warning about agent-authored tests encoding the agent's misunderstanding.
  • Premature Completion — declaring done on a weak signal; assertion-free tests are exactly that signal.
  • Trust Without Verify — polished output ≠ verified output; the same logic applied to test files instead of prose.
  • Happy Path Bias — the upstream tendency: skipping error paths in code; this page is the downstream form in test code.