
The Synthetic Ground Truth Fallacy

AI-generated artifacts reflect the model's statistical priors, not ground truth. Treating them as equivalent to human-verified artifacts introduces feedback loops that compound errors.

The Fallacy

Teams use AI to generate tests, evals, documentation, or training examples and treat the outputs as interchangeable with human-verified artifacts. The reasoning: AI is fast and the outputs look correct, so accepting them as ground truth is a productivity gain.

But the outputs look correct because the model generates plausible outputs, not verified ones. They measure what the model finds likely, not what is true.

Why It Fails

Feedback Loops Degrade Over Generations

When AI-generated artifacts feed back into training, evaluation, or quality gates, distributional shift compounds with each cycle. "The Curse of Recursion" documented this with OPT-125m: by generation 9, outputs had degraded from coherent English to fragmented, repetitive text, because the model's distribution had drifted far from the original. [Source: The Curse of Recursion: Training on Generated Data Makes Models Forget]
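
The compounding effect can be illustrated with a toy analogue. The sketch below is not the OPT-125m experiment: it assumes a hypothetical token corpus and treats "retraining" as nothing more than resampling from the previous generation's output. Under those assumptions, a token that drops out can never return, so the rare tail can only shrink.

```python
import random

def resample_generations(corpus, generations, seed=0):
    """Repeatedly 'retrain' by sampling a new corpus from the previous one.

    Returns the number of distinct tokens present after each generation,
    starting with the original corpus.
    """
    rng = random.Random(seed)
    current = list(corpus)
    distinct_counts = [len(set(current))]
    for _ in range(generations):
        # Each generation sees only what the previous generation produced.
        current = rng.choices(current, k=len(current))
        distinct_counts.append(len(set(current)))
    return distinct_counts

# Hypothetical corpus: two common tokens plus a long tail of rare ones.
corpus = ["the"] * 50 + ["model"] * 30 + [f"rare_{i}" for i in range(20)]
counts = resample_generations(corpus, generations=9)
print(counts)  # distinct-token count per generation; it never increases
```

The exact counts depend on the seed, but the sequence is non-increasing by construction, which is the point: without fresh real data, nothing restores what a generation loses.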

Smaller-scale versions of this loop appear in daily coding agent workflows:

  • AI-generated test suites pass when the implementation matches the model's expectations, not when it is correct
  • AI-generated evals score plausibility, not actual quality
  • Fine-tuning datasets from AI completions amplify existing biases without introducing corrective signal
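
The first bullet can be made concrete with a small sketch. It assumes a deliberately buggy `is_leap_year` function and uses a hypothetical `snapshot_tests` helper, which derives expected values from the implementation itself, as a stand-in for tests written by reading the code rather than the specification.

```python
def is_leap_year(year):
    # Deliberately buggy: ignores the century rules of the Gregorian calendar.
    return year % 4 == 0

def snapshot_tests(func, inputs):
    """Build 'expected' values by calling the implementation itself."""
    return {x: func(x) for x in inputs}

def failing_inputs(func, expected):
    """Return the inputs for which func disagrees with the expected value."""
    return [x for x, want in expected.items() if func(x) != want]

inputs = [1900, 2000, 2023, 2024]

# Expectations derived from the code under test: they cannot disagree with it.
generated = snapshot_tests(is_leap_year, inputs)

# Expectations verified independently against the calendar rules.
verified = {1900: False, 2000: True, 2023: False, 2024: True}

print(failing_inputs(is_leap_year, generated))  # []
print(failing_inputs(is_leap_year, verified))   # [1900]
```

The generated suite passes in full while the verified suite exposes the bug, because only the second one carries information from outside the implementation.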

Eval Scores Are Not Self-Validating

Anthropic's multi-agent research documentation notes that "people testing agents find edge cases that evals miss" and recommends you "[c]alibrate against humans: Frequently compare LLM judge outputs against expert human judgment." [Source: Multi-Agent Research System]
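
One way to make that calibration measurable is to track chance-corrected agreement between the judge and human reviewers on a shared sample. The sketch below computes Cohen's kappa with the standard library. The labels are hypothetical, and the choice of kappa is an assumption for illustration, not something the cited source prescribes.

```python
from collections import Counter

def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two raters over the same items."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(judge_freq[c] * human_freq[c] for c in judge_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on the same eight agent transcripts.
judge = ["pass", "pass", "pass", "fail", "pass", "pass", "fail", "pass"]
human = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

raw_agreement = sum(j == h for j, h in zip(judge, human)) / len(judge)
print(raw_agreement)               # 0.75
print(cohens_kappa(judge, human))  # 0.5
```

Raw agreement of 0.75 looks comfortable, yet kappa of 0.5 shows that much of it is what two raters would reach by chance at these pass rates. Any threshold for trusting the judge is a team decision and should be set against real review data.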

Broken graders compound the problem. In the CORE-Bench case, rigid string-match grading penalized correct answers ("96.12" failed against the expected "96.124991"), and scores jumped from 42% to 95% after the graders were fixed. This is what treating unvalidated evaluation infrastructure as ground truth produces: a pass rate that reflects a broken grader, not agent capability. [Source: Demystifying Evals for AI Agents]
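
The string-match failure is easy to reproduce. The sketch below contrasts an exact-match grader with a numeric one that allows a relative tolerance. The function names and the `rel_tol` default are assumptions chosen for illustration, not the fix actually applied to CORE-Bench.

```python
import math

def strict_grade(answer: str, expected: str) -> bool:
    """Exact string comparison: any rounding difference counts as a failure."""
    return answer.strip() == expected.strip()

def tolerant_grade(answer: str, expected: str, rel_tol: float = 1e-3) -> bool:
    """Compare numerically within a relative tolerance when both values parse."""
    try:
        return math.isclose(float(answer), float(expected), rel_tol=rel_tol)
    except ValueError:
        # Non-numeric answers fall back to exact comparison.
        return strict_grade(answer, expected)

print(strict_grade("96.12", "96.124991"))    # False
print(tolerant_grade("96.12", "96.124991"))  # True
print(tolerant_grade("95.0", "96.124991"))   # False
```

The right tolerance depends on the task, so it should be set by someone who knows what precision the expected answer actually supports.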

Environmental Feedback Is the Correct Ground Truth

Anthropic's agent design guidance recommends that agents gain "ground truth from the environment at each step (such as tool call results or code execution)" — not from AI-generated assessments. [Source: Building Effective Agents]

The agentic handbook anchors reflection loops to "objective signals: tests, lints, schema validation, compilation, eval rubric" and notes explicitly that "self-critique without objective checks is also brittle — models can rationalize." [Source: The Agentic AI Handbook]
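
A minimal sketch of an objective check follows. It assumes the candidate is a short, trusted Python snippet and uses a hypothetical `run_candidate` helper that executes it in a fresh interpreter; the verdict comes from the exit code, not from a model's opinion of the code. It is not a sandbox, so untrusted code would need real isolation.

```python
import subprocess
import sys

def run_candidate(source: str, timeout: float = 10.0):
    """Execute candidate code in a fresh interpreter.

    Returns (succeeded, stderr_text). Not sandboxed: only use with trusted code.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    return result.returncode == 0, result.stderr.strip()

# Hypothetical candidates: one correct assertion and one incorrect.
good = "assert sorted([3, 1, 2]) == [1, 2, 3]"
bad = "assert sorted([3, 1, 2]) == [3, 2, 1]"

ok_good, _ = run_candidate(good)
ok_bad, err_bad = run_candidate(bad)
print(ok_good, ok_bad)  # True False
```

The captured stderr can be fed back to the agent as the next step's input, which keeps the reflection loop anchored to what the environment reported.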

The Scope of the Problem

Artifact | Synthetic Ground Truth Risk
Test suite | Tests pass when code matches model priors, not when code is correct
Eval rubric | Scores plausibility; misses edge cases humans catch
Documentation | Reflects what the model predicts the docs say, not what the system does
Training data | Amplifies distributional biases; degrades model over generations
PR description | Summarizes what the model finds salient, not the actual review criteria

Example

A team building a coding agent uses Claude to generate an eval suite covering 50 representative tasks. The evals look thorough. The team ships based on a 90% pass rate.

Six months later, users report failures on tasks that the evals don't cover. The post-mortem reveals that the 50 eval tasks reflected the model's idea of "representative coding problems": skewed toward patterns common in its training data and under-representing the team's actual workload.

The evals measured what the model found plausible. They never measured what the users needed.

The fix: seed evals from real production failures and user-reported bugs, then calibrate LLM judge scores against human expert review before using them as a quality gate.

When This Backfires

The rule against synthetic ground truth can itself be over-applied. Synthetic and AI-generated artifacts are legitimate inputs when they serve as starting points for incremental verification rather than as ground truth.

  • Bootstrapping test coverage: AI-generated test stubs seeded from real code paths are a win — the risk is trusting pass/fail rates before humans verify the stubs reflect correct behavior
  • Data augmentation: Synthetic examples improve coverage of rare cases when added to a dataset with real-world grounding — the fallacy fires only when synthetic data replaces real data
  • Eval templating: LLM-generated rubrics reduce scaffolding work; calibrating them against human judgment converts them from synthetic ground truth into validated artifacts
  • Short feedback loops: An agent checking its output against a compiler or test runner uses environmental ground truth — the fallacy is AI-on-AI assessment, not AI-plus-deterministic-signal loops

The pattern to avoid is circular: AI generates artifact → AI judges artifact → scores accepted without external grounding. Any external validation signal — human review, test execution, real user behavior — breaks the circularity.
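
The circularity test can be expressed as a simple provenance check. In this sketch the `ScoredArtifact` record, the signal names, and `is_grounded` are all hypothetical: the idea is only that a score is accepted when at least one validation signal comes from outside the model.

```python
from dataclasses import dataclass, field

# Signals that originate outside the model (names are illustrative).
EXTERNAL_SIGNALS = {"human_review", "test_execution", "user_behavior"}

@dataclass
class ScoredArtifact:
    name: str
    score: float
    validated_by: set = field(default_factory=set)

def is_grounded(artifact: ScoredArtifact) -> bool:
    """True when at least one external signal backs the artifact's score."""
    return bool(artifact.validated_by & EXTERNAL_SIGNALS)

circular = ScoredArtifact("generated eval suite", 0.90, {"llm_judge"})
anchored = ScoredArtifact("generated eval suite", 0.90, {"llm_judge", "human_review"})

print(is_grounded(circular))  # False
print(is_grounded(anchored))  # True
```

A gate like this does not judge quality. It only refuses to treat a score as meaningful until something other than a model has checked it.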

Key Takeaways

  • AI-generated artifacts measure model priors, not correctness — using them as ground truth introduces feedback loops
  • LLM-as-judge eval scores require regular calibration against human judgment to remain valid
  • Broken graders produce misleading scores that look authoritative — validate evaluation infrastructure before trusting its output
  • Ground truth comes from the environment: test execution, user behavior, deterministic validators — not model self-assessment
  • The fallacy generates multiple downstream anti-patterns: reward hacking, documentation drift, training data contamination