Incident-to-Eval Synthesis: Production Failures as Evals
Every production LLM incident is a candidate regression eval: extract the failure mode, define expected behavior, and add it to a suite that gates deploys.
Also known as
Failure-to-Eval Pipeline, Production Regression Evals. This technique feeds into Eval-Driven Development and complements Golden Query Pairs by providing a systematic source of new eval cases.
Why Incidents Are Your Best Eval Source
Manually authored evals reflect what developers think will go wrong. Production incidents reveal what actually goes wrong — real users find edge cases no developer anticipates.
Developers anchor on happy paths and a known failure taxonomy, while production traffic explores the full input distribution: rare phrasing, adversarial queries, and domain combinations that nobody on the team thought to test. Each incident proves the failure class is real and reproducible, which is the minimum bar for a useful eval case.
The Pipeline

```mermaid
flowchart LR
    A[Production<br>Incident] --> B[Extract<br>Failure Mode]
    B --> C[Define Expected<br>Behavior]
    C --> D[Create Eval<br>Case]
    D --> E[Add to<br>Regression Suite]
    E --> F[Gate CI/CD<br>Deploys]
    F -->|Monitor| A
```
Each stage produces a specific output:
| Stage | Input | Output |
|---|---|---|
| Extract failure mode | Incident report, logs, traces | Minimal reproducible input that triggers the failure |
| Define expected behavior | Domain expert judgment | Concrete expected output or acceptance criteria |
| Create eval case | Input + expected output | Executable test with a grader (assertion, LLM-as-judge, or both) |
| Add to suite | Eval case + severity label | Entry in regression dataset with P0/P1/P2 priority |
| Gate deploys | Suite run results | P0 failures block release; P1/P2 warn |
Error Analysis: From Traces to Failure Taxonomy
Identifying the failure mode is harder than writing the eval. A structured error-analysis methodology helps:
- Gather traces -- collect 100+ production traces covering failures and near-misses
- Open coding -- experts journal issues without predefined categories, focusing on the first upstream failure in each trace
- Axial coding -- group journal entries into a failure taxonomy with frequency counts
- Iterate -- repeat until new traces stop producing new categories (theoretical saturation)
The taxonomy reveals which failure modes are most common, severe, and amenable to automated detection.
The axial-coding step can be partly tool-assisted: Braintrust's Topics auto-clusters production traces into failure-mode themes, operationalizing the pattern-discovery step rather than relying solely on manual journaling. [Source: Braintrust -- Automate pattern discovery with Topics]
[Source: Hamel Husain -- Your AI Product Needs Evals, LLM Evals FAQ]
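The bookkeeping behind axial coding and the saturation check can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the journal entries, trace IDs, and category labels are hypothetical, and the category assignment itself is still assumed to come from expert judgment.

```python
from collections import Counter

# Hypothetical open-coding journal: one note per trace, recording the first
# upstream failure an expert observed. Trace IDs and notes are illustrative.
journal = [
    {"trace_id": "t-001", "note": "invented a return window"},
    {"trace_id": "t-002", "note": "response was not valid JSON"},
    {"trace_id": "t-003", "note": "cited a policy that does not exist"},
    {"trace_id": "t-004", "note": "missing required field in JSON output"},
    {"trace_id": "t-005", "note": "invented a warranty term"},
]

# Axial coding: experts assign each journaled trace to a category
# (hypothetical labels).
axial_codes = {
    "t-001": "hallucinated_policy",
    "t-002": "format_error",
    "t-003": "hallucinated_policy",
    "t-004": "format_error",
    "t-005": "hallucinated_policy",
}


def build_taxonomy(journal, axial_codes):
    """Count how many journaled traces fall into each failure category."""
    return Counter(axial_codes[entry["trace_id"]] for entry in journal)


def new_categories(previous, current):
    """Categories seen in this round but not before; an empty set suggests saturation."""
    return set(current) - set(previous)


taxonomy = build_taxonomy(journal, axial_codes)
for category, count in taxonomy.most_common():
    print(f"{category}: {count}")
```

Running `new_categories` on the taxonomy from the previous round and the current one gives a simple stopping signal: when it returns an empty set for a fresh batch of traces, further rounds are unlikely to change the taxonomy.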
Not Every Incident Becomes an Eval
Evals have a maintenance cost. Apply a cost-benefit filter:
| Failure Type | Eval Strategy | Rationale |
|---|---|---|
| Deterministic format errors (wrong JSON, missing fields) | Assertion / regex check | Cheap to write, cheap to run, catches exact recurrence |
| Semantic failures (wrong answer, hallucinated facts) | LLM-as-judge eval | More expensive but necessary for subjective correctness |
| One-off data issues (corrupt input, transient API failure) | Skip -- fix upstream | Eval would test infrastructure, not the LLM feature |
| Security/safety violations | Mandatory P0 eval | Always worth the cost regardless of frequency |
[Source: Hamel Husain -- LLM Evals FAQ]
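The filter above can be expressed as a small lookup, paired with the kind of cheap assertion grader the first row describes. This is a sketch under assumptions: the failure-type keys, strategy names, and the `needs_triage` fallback are hypothetical labels chosen for illustration.

```python
import json

# Hypothetical mapping from failure type to eval strategy, mirroring the table.
STRATEGY_BY_FAILURE_TYPE = {
    "format_error": "assertion",
    "semantic_failure": "llm_judge",
    "one_off_data_issue": "skip",
    "safety_violation": "mandatory_p0",
}


def choose_strategy(failure_type):
    """Return the eval strategy for a failure type; unknown types go to a human."""
    return STRATEGY_BY_FAILURE_TYPE.get(failure_type, "needs_triage")


def assert_json_with_fields(output, required_fields):
    """Deterministic grader: output must be a JSON object with all required fields."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(f in parsed for f in required_fields)
```

An assertion grader like this costs nothing per run, which is why deterministic format failures are worth capturing even when they are rare.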
Tiered Blocking in CI/CD
Assign severity when adding the eval case:
- P0 -- blocks release. Safety violations, data leaks, complete task failures.
- P1 -- warns in CI, requires explicit override. Quality regressions, accuracy drops.
- P2 -- logged and tracked. Minor formatting issues, style deviations.
Promptfoo supports configurable pass-rate thresholds across GitHub Actions, GitLab CI, and Jenkins. Use a hard threshold for P0 (100%) and a softer threshold for the full suite (e.g., 95%).
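Independent of any particular tool, the two-threshold rule can be sketched as a plain function over graded results. The `gate` function and its return values are hypothetical, and treating a suite-level shortfall as a warning rather than a block is an assumption; a team could equally choose to block on it.

```python
def gate(results, suite_threshold=0.95):
    """Decide the CI outcome from graded results.

    Each result is a dict with "severity" ("P0", "P1", or "P2") and "pass" (bool).
    Returns "block", "warn", or "ok".
    """
    # Hard threshold: every P0 case must pass.
    if any(r["severity"] == "P0" and not r["pass"] for r in results):
        return "block"

    # Softer threshold across the full suite (assumption: shortfall only warns).
    pass_rate = sum(r["pass"] for r in results) / len(results) if results else 1.0
    if pass_rate < suite_threshold:
        return "warn"
    return "ok"
```

A CI wrapper would then map `"block"` to a non-zero exit code and surface `"warn"` in the job summary so that an override is an explicit decision.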
Example
A minimal incident-to-eval workflow. An LLM-powered customer service agent hallucinates a refund policy that does not exist.
Step 1: Extract the failure mode
```yaml
# incident_report.yaml
incident_id: INC-2024-0847
failure_mode: hallucinated_policy
input: "Can I return a laptop after 90 days?"
actual_output: "Yes, our 120-day extended return policy covers laptops."
root_cause: No such 120-day policy exists. Model confabulated.
```
Step 2: Define expected behavior and create eval case
```python
INCIDENT_EVALS = [
    {
        "id": "INC-2024-0847",
        "input": "Can I return a laptop after 90 days?",
        "expected": "Our return policy is 30 days for electronics. "
                    "A laptop purchased 90 days ago is not eligible for return.",
        "grader": "llm_judge",
        "severity": "P0",
        "tags": ["hallucination", "policy"],
    },
]
```
Step 3: Run in CI with a grader
```python
import json
import sys

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

JUDGE_PROMPT = """You are evaluating a customer service agent's response.

Customer question: {input}
Expected answer: {expected}
Agent's actual answer: {actual}

Does the agent's answer contain any fabricated policies, made-up deadlines,
or factual claims not supported by the expected answer?
Set "pass" to false if it does, and to true otherwise.

Reply with JSON: {{"pass": true/false, "explanation": "..."}}"""


def run_incident_evals(agent_fn, evals):
    """Grade each incident eval with the LLM judge; exit non-zero on any P0 failure."""
    results = []
    for case in evals:
        actual = agent_fn(case["input"])
        response = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                input=case["input"], expected=case["expected"], actual=actual
            )}],
        )
        # Assumes the judge replies with bare JSON, as the prompt requests.
        verdict = json.loads(response.content[0].text)
        results.append({
            "id": case["id"], "severity": case["severity"], **verdict
        })

    # Only P0 failures block the release.
    p0_failures = [r for r in results if not r["pass"] and r["severity"] == "P0"]
    if p0_failures:
        print(f"BLOCKING: {len(p0_failures)} P0 failure(s)")
        for f in p0_failures:
            print(f"  {f['id']}: {f['explanation']}")
        sys.exit(1)
    return results
```
Each incident adds an entry to INCIDENT_EVALS. Cases are never removed, only updated when expected behavior changes.
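The grader above parses the judge's reply with a bare `json.loads`, which raises if the model wraps its JSON in extra prose. One possible hardening, sketched here with a hypothetical `parse_verdict` helper, is to extract the first JSON object from the reply and treat anything unparseable as a failure, so that a malformed reply cannot silently pass a P0 case.

```python
import json
import re


def parse_verdict(raw_text):
    """Extract a {"pass": bool, "explanation": str} verdict from judge output.

    Hypothetical helper: any reply that cannot be parsed counts as a failure.
    """
    # Grab everything from the first "{" to the last "}" in the reply.
    match = re.search(r"\{.*\}", raw_text, re.DOTALL)
    if match:
        try:
            verdict = json.loads(match.group(0))
        except json.JSONDecodeError:
            verdict = None
        # Require a real boolean so that values like "yes" do not count as a pass.
        if isinstance(verdict, dict) and isinstance(verdict.get("pass"), bool):
            return {
                "pass": verdict["pass"],
                "explanation": str(verdict.get("explanation", "")),
            }
    return {"pass": False, "explanation": f"Unparseable judge output: {raw_text[:80]}"}
```

Failing closed is a design choice: it may produce an occasional spurious block, but for P0 cases that is usually cheaper than a missed regression.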
Growing the Dataset
Practitioner-reported maturity tiers:
- Minimum viable: 50-100 cases covering critical failure modes
- Production-ready: 200-500 cases with broad coverage
- Mature: 1000+ cases with tiered severity and CI gating
Every postmortem should ask: "What eval would have caught this?"
[Source: Maxim AI -- Building a Golden Dataset]
When This Backfires
- Eval drift: Expected behavior in each case is hardcoded at incident time. When the product's intended behavior changes (new policy, updated model, shifting requirements), old eval cases silently become wrong — they now test the previous correct behavior. Without a review cadence, the suite drifts and passing CI stops being meaningful.
- Grader decay for LLM-as-judge: LLM judges require periodic calibration against human ratings. If the judge model is updated or the prompt drifts, scoring shifts without any test case changing — a passing suite may no longer reflect actual quality.
- Volume without triage: High-traffic systems generate hundreds of incidents with overlapping failure modes. Without deduplication and priority labeling, the suite balloons with redundant cases that slow CI without improving coverage.
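Each of these failure modes has a cheap mechanical check that could run alongside the suite. The helpers below are a sketch under assumptions: the `last_reviewed` and `failure_mode` fields, the 180-day review window, and the exact-match deduplication key are all hypothetical choices, not part of the eval case format shown earlier.

```python
from datetime import date, timedelta


def stale_cases(cases, today, max_age_days=180):
    """Eval drift: IDs of cases whose expected behavior has not been reviewed recently."""
    cutoff = today - timedelta(days=max_age_days)
    return [c["id"] for c in cases if c["last_reviewed"] < cutoff]


def judge_agreement(judge_verdicts, human_verdicts):
    """Grader decay: fraction of shared cases where the LLM judge matches the human rating."""
    shared = judge_verdicts.keys() & human_verdicts.keys()
    if not shared:
        return None  # nothing to compare
    return sum(judge_verdicts[k] == human_verdicts[k] for k in shared) / len(shared)


def dedupe(cases):
    """Volume without triage: keep the first case per (failure_mode, normalized input)."""
    seen, unique = set(), []
    for c in cases:
        # Normalize case and whitespace so trivially different inputs collapse.
        key = (c["failure_mode"], " ".join(c["input"].lower().split()))
        if key not in seen:
            seen.add(key)
            unique.append(c)
    return unique
```

Exact-match deduplication only catches trivial repeats; paraphrased inputs that trigger the same failure mode would still need human triage or a similarity measure.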
Key Takeaways
- Production incidents are the highest-signal eval source
- Use error analysis (open and axial coding) to extract failure modes from traces
- Cheap assertions for deterministic failures; LLM-as-judge for semantic ones
- P0 failures block deploys; P1/P2 warn
- Every closed incident should produce a new eval case