Incident-to-Eval Synthesis: Production Failures as Evals

Every production LLM incident is a candidate regression eval: extract the failure mode, define expected behavior, and add it to a suite that gates deploys.

Also known as

Failure-to-Eval Pipeline, Production Regression Evals. This technique feeds into Eval-Driven Development and complements Golden Query Pairs by providing a systematic source of new eval cases.

Why Incidents Are Your Best Eval Source

Manually authored evals reflect what developers think will go wrong. Production incidents show what actually goes wrong: real users hit edge cases that no developer anticipated.

Developers anchor on happy paths and a known failure taxonomy, while production traffic explores the full input distribution, including rare phrasing, adversarial queries, and unexpected domain combinations. Each incident proves that its failure class is real and reproducible, which is the minimum bar for a useful eval case.

The Pipeline

flowchart LR
    A[Production<br>Incident] --> B[Extract<br>Failure Mode]
    B --> C[Define Expected<br>Behavior]
    C --> D[Create Eval<br>Case]
    D --> E[Add to<br>Regression Suite]
    E --> F[Gate CI/CD<br>Deploys]
    F -->|Monitor| A

Each stage produces a specific output:

Stage	Input	Output
Extract failure mode	Incident report, logs, traces	Minimal reproducible input that triggers the failure
Define expected behavior	Domain expert judgment	Concrete expected output or acceptance criteria
Create eval case	Input + expected output	Executable test with a grader (assertion, LLM-as-judge, or both)
Add to suite	Eval case + severity label	Entry in regression dataset with P0/P1/P2 priority
Gate deploys	Suite run results	P0 failures block release; P1 failures warn and require an override; P2 failures are logged

Error Analysis: From Traces to Failure Taxonomy

Identifying the failure mode is often harder than writing the eval itself. A structured methodology for doing so:

Gather traces -- collect 100+ production traces covering failures and near-misses
Open coding -- experts journal issues without predefined categories, focusing on the first upstream failure in each trace
Axial coding -- group journal entries into a failure taxonomy with frequency counts
Iterate -- repeat until new traces stop producing new categories (theoretical saturation)

The taxonomy reveals which failure modes are most common, severe, and amenable to automated detection.
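As a rough illustration of the axial-coding and saturation steps above, the sketch below tallies failure categories and checks whether the latest batch of traces introduced any new category. The category labels, the `(trace_id, category)` structure, and the helper names `build_taxonomy` and `is_saturated` are hypothetical; in practice the labels come from expert review, not from code.

```python
from collections import Counter

def build_taxonomy(labeled_traces):
    """Count how often each failure category appears.

    labeled_traces: list of (trace_id, category) pairs produced by
    manual open and axial coding (hypothetical structure).
    """
    return Counter(category for _, category in labeled_traces)

def is_saturated(batches):
    """Return True if the most recent batch added no new categories.

    batches: list of labeled-trace lists, in review order.
    """
    if len(batches) < 2:
        return False  # need at least one earlier batch to compare against
    seen = {cat for batch in batches[:-1] for _, cat in batch}
    latest = {cat for _, cat in batches[-1]}
    return latest <= seen

batch_1 = [("t1", "hallucinated_policy"), ("t2", "wrong_format"),
           ("t3", "hallucinated_policy")]
batch_2 = [("t4", "wrong_format"), ("t5", "hallucinated_policy")]

taxonomy = build_taxonomy(batch_1 + batch_2)
print(taxonomy.most_common())            # [('hallucinated_policy', 3), ('wrong_format', 2)]
print(is_saturated([batch_1, batch_2]))  # True
```

Checking only the latest batch is a deliberately simple stopping rule; a stricter variant could require several consecutive batches with no new categories.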

The axial-coding step can be partly tool-assisted: Braintrust's Topics auto-clusters production traces into failure-mode themes, operationalizing the pattern-discovery step rather than relying solely on manual journaling. [Source: Braintrust -- Automate pattern discovery with Topics]

[Source: Hamel Husain -- Your AI Product Needs Evals, LLM Evals FAQ]

Not Every Incident Becomes an Eval

Evals have a maintenance cost. Apply a cost-benefit filter:

Failure Type	Eval Strategy	Rationale
Deterministic format errors (wrong JSON, missing fields)	Assertion / regex check	Cheap to write, cheap to run, catches exact recurrence
Semantic failures (wrong answer, hallucinated facts)	LLM-as-judge eval	More expensive but necessary for subjective correctness
One-off data issues (corrupt input, transient API failure)	Skip -- fix upstream	Eval would test infrastructure, not the LLM feature
Security/safety violations	Mandatory P0 eval	Always worth the cost regardless of frequency

[Source: Hamel Husain -- LLM Evals FAQ]
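The filter in the table above can be encoded as a small lookup so that triage decisions are applied consistently. This is a minimal sketch: the `FailureType` enum, the strategy strings, and the `triage` helper are hypothetical names chosen to mirror the table rows.

```python
from enum import Enum

class FailureType(Enum):
    DETERMINISTIC_FORMAT = "deterministic_format"
    SEMANTIC = "semantic"
    ONE_OFF_DATA = "one_off_data"
    SECURITY_SAFETY = "security_safety"

# Mirrors the table above; the strategy labels are illustrative.
EVAL_STRATEGY = {
    FailureType.DETERMINISTIC_FORMAT: "assertion",
    FailureType.SEMANTIC: "llm_judge",
    FailureType.ONE_OFF_DATA: None,  # skip: fix upstream instead
    FailureType.SECURITY_SAFETY: "mandatory_p0",
}

def triage(failure_type):
    """Return (should_create_eval, strategy) for a classified incident."""
    strategy = EVAL_STRATEGY[failure_type]
    return strategy is not None, strategy

print(triage(FailureType.ONE_OFF_DATA))     # (False, None)
print(triage(FailureType.SECURITY_SAFETY))  # (True, 'mandatory_p0')
```

Classifying the incident into a `FailureType` remains a human judgment; the lookup only records the consequence of that judgment.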

Tiered Blocking in CI/CD

Assign severity when adding the eval case:

P0 -- blocks release. Safety violations, data leaks, complete task failures.
P1 -- warns in CI, requires explicit override. Quality regressions, accuracy drops.
P2 -- logged and tracked. Minor formatting issues, style deviations.

Promptfoo supports configurable pass-rate thresholds across GitHub Actions, GitLab CI, and Jenkins. Use a hard threshold for P0 (100%) and a softer threshold for the full suite (e.g., 95%).
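The two-threshold rule can be sketched in plain Python, independent of any particular tool. The `gate` function and the result-dict shape below are assumptions for illustration and are not Promptfoo's API; the 95% default is the example figure from the text.

```python
def gate(results, suite_threshold=0.95):
    """Decide whether a release may proceed.

    results: list of dicts with "severity" ("P0"/"P1"/"P2") and "pass" (bool).
    Returns (ok, reasons). Every P0 case must pass, and the full suite
    must reach suite_threshold.
    """
    reasons = []
    p0 = [r for r in results if r["severity"] == "P0"]
    if any(not r["pass"] for r in p0):
        reasons.append("P0 failure")
    # An empty suite is treated as passing; adjust if that is too lenient.
    pass_rate = sum(r["pass"] for r in results) / len(results) if results else 1.0
    if pass_rate < suite_threshold:
        reasons.append(
            f"suite pass rate {pass_rate:.0%} below {suite_threshold:.0%}"
        )
    return not reasons, reasons

# 99 passing P1 cases and one failing P0 case: 99% overall, still blocked.
results = [{"severity": "P1", "pass": True}] * 99
results.append({"severity": "P0", "pass": False})
print(gate(results))  # (False, ['P0 failure'])
```

Keeping the P0 check separate from the aggregate pass rate is the point of the design: a single safety regression cannot hide behind an otherwise healthy suite.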

Example

A minimal incident-to-eval workflow: an LLM-powered customer service agent hallucinates a refund policy that does not exist.

Step 1: Extract the failure mode

# incident_report.yaml
incident_id: INC-2024-0847
failure_mode: hallucinated_policy
input: "Can I return a laptop after 90 days?"
actual_output: "Yes, our 120-day extended return policy covers laptops."
root_cause: No such 120-day policy exists. Model confabulated.

Step 2: Define expected behavior and create eval case

INCIDENT_EVALS = [
    {
        "id": "INC-2024-0847",
        "input": "Can I return a laptop after 90 days?",
        "expected": "Our return policy is 30 days for electronics. "
                    "A laptop purchased 90 days ago is not eligible for return.",
        "grader": "llm_judge",
        "severity": "P0",
        "tags": ["hallucination", "policy"],
    },
]

Step 3: Run in CI with a grader

import json
import sys

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a customer service agent's response.

Customer question: {input}
Expected answer: {expected}
Agent's actual answer: {actual}

Does the agent's answer contain any fabricated policies, made-up deadlines,
or factual claims not supported by the expected answer?

Reply with JSON only: {{"pass": true/false, "explanation": "..."}}
Set "pass" to true only if the answer contains none of these."""

def run_incident_evals(agent_fn, evals):
    """Grade each incident case with the LLM judge; exit non-zero on any P0 failure."""
    results = []
    for case in evals:
        actual = agent_fn(case["input"])
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                input=case["input"], expected=case["expected"], actual=actual
            )}],
        )
        try:
            verdict = json.loads(response.content[0].text)
        except json.JSONDecodeError:
            # Treat an unparseable verdict as a failure instead of crashing CI.
            verdict = {"pass": False, "explanation": "judge returned invalid JSON"}
        results.append({
            "id": case["id"], "severity": case["severity"], **verdict
        })

    # Only P0 failures block here; all results are returned for the caller to report.
    p0_failures = [r for r in results if not r["pass"] and r["severity"] == "P0"]
    if p0_failures:
        print(f"BLOCKING: {len(p0_failures)} P0 failure(s)")
        for f in p0_failures:
            print(f"  {f['id']}: {f['explanation']}")
        sys.exit(1)
    return results

Each incident adds an entry to INCIDENT_EVALS. Cases are never removed, only updated when expected behavior changes.
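For deterministic failures such as malformed JSON or missing fields, an LLM judge is unnecessary and a cheap assertion is enough. The sketch below assumes a hypothetical incident in which the agent returned JSON without a required field; the field names and the `grade_json_fields` helper are illustrative. It returns the same `pass`/`explanation` shape as the judge verdict so that both grader types can feed one results list.

```python
import json

def grade_json_fields(actual, required_fields):
    """Pass only if `actual` parses as a JSON object with every required field."""
    try:
        data = json.loads(actual)
    except json.JSONDecodeError as err:
        return {"pass": False, "explanation": f"invalid JSON: {err.msg}"}
    if not isinstance(data, dict):
        return {"pass": False, "explanation": "top-level value is not an object"}
    missing = [f for f in required_fields if f not in data]
    if missing:
        return {"pass": False, "explanation": f"missing fields: {missing}"}
    return {"pass": True, "explanation": "all required fields present"}

print(grade_json_fields('{"order_id": "A1"}', ["order_id", "status"]))
# {'pass': False, 'explanation': "missing fields: ['status']"}
```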

Growing the Dataset

Practitioner-reported maturity tiers:

Minimum viable: 50-100 cases covering critical failure modes
Production-ready: 200-500 cases with broad coverage
Mature: 1000+ cases with tiered severity and CI gating

Every postmortem should ask: "What eval would have caught this?"

[Source: Maxim AI -- Building a Golden Dataset]

When This Backfires

Eval drift: Expected behavior in each case is hardcoded at incident time. When the product's intended behavior changes (new policy, updated model, shifting requirements), old eval cases silently become wrong — they now test the previous correct behavior. Without a review cadence, the suite drifts and passing CI stops being meaningful.
Grader decay for LLM-as-judge: LLM judges require periodic calibration against human ratings. If the judge model is updated or the prompt drifts, scoring shifts without any test case changing — a passing suite may no longer reflect actual quality.
Volume without triage: High-traffic systems generate hundreds of incidents with overlapping failure modes. Without deduplication and priority labeling, the suite balloons with redundant cases that slow CI without improving coverage.
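One way to watch for grader decay is to periodically compare judge verdicts against a small human-labeled sample. The sketch below computes a simple agreement rate; the `judge_agreement` helper, the case ids, and the 0.9 threshold are assumptions for illustration, and a real calibration might prefer a chance-corrected statistic.

```python
def judge_agreement(judge_verdicts, human_verdicts):
    """Fraction of cases where the LLM judge matches the human label.

    Both arguments map case id -> bool (True means pass).
    Only cases labeled by both sides are compared.
    """
    shared = judge_verdicts.keys() & human_verdicts.keys()
    if not shared:
        raise ValueError("no overlapping cases to compare")
    matches = sum(judge_verdicts[c] == human_verdicts[c] for c in shared)
    return matches / len(shared)

# Hypothetical calibration sample.
judge = {"INC-1": True, "INC-2": False, "INC-3": True, "INC-4": True}
human = {"INC-1": True, "INC-2": False, "INC-3": False, "INC-4": True}

rate = judge_agreement(judge, human)
MIN_AGREEMENT = 0.9  # assumed threshold; choose one that fits your risk tolerance
if rate < MIN_AGREEMENT:
    print(f"Recalibrate judge: agreement {rate:.0%}")  # Recalibrate judge: agreement 75%
```

Rerunning this check whenever the judge model or judge prompt changes makes scoring shifts visible before they affect the CI gate.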

Key Takeaways

Production incidents are the highest-signal eval source
Use error analysis (open and axial coding) to extract failure modes from traces
Cheap assertions for deterministic failures; LLM-as-judge for semantic ones
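P0 failures block deploys; P1 failures warn and require an override; P2 failures are logged
Every closed incident should be triaged as a candidate eval case, since not all of them warrant one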