Multi-Turn Conversation Evaluation
Multi-turn conversation evaluation pairs per-turn scoring with a trace-level resolution check to catch failures that compound across turns.
Two Layers, Different Failures
A conversational agent can score 100% on per-turn brand alignment while resolving 40% of customer issues — every reply is polite and policy-compliant, but the conversation never reaches a fix. Per-turn scoring sees only local quality; it cannot tell whether the agent dropped a fact from turn 2 or kept the customer in a polite loop without progress. (Braintrust, 2026)
Trace-level scoring evaluates the conversation as one unit against a goal-completion criterion (typically binary: "was the issue resolved?"). It catches the across-turn failures per-turn scoring is blind to but produces a coarse verdict that hides which turn caused the drift. Both scores are needed, and the most diagnostic case is when they disagree.
| Per-turn | Trace-level | What it means |
|---|---|---|
| High | High | Both layers happy — good baseline |
| Low | High | Rough responses but issue resolved — fix tone, not flow |
| High | Low | "Sounds good, fixes nothing" — system prompt lacks resolution steps |
| Low | Low | Bot is broken — debug from the trace |
The high-per-turn + low-trace cell is the cheapest diagnostic in the table — per-turn metrics alone would call this conversation a success.
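The table can be read as a small decision function. The sketch below is illustrative only: the `ConversationScores` type, the `diagnose` helper, and the 0.8 cut-off between "high" and "low" are assumptions introduced here, not part of any scoring framework.

```python
from dataclasses import dataclass

# Hypothetical cut-off separating "high" from "low"; tune it per scorer.
HIGH_THRESHOLD = 0.8


@dataclass
class ConversationScores:
    per_turn: list[float]  # one score in [0, 1] per agent turn
    trace_level: float     # goal-completion score in [0, 1]


def diagnose(scores: ConversationScores, threshold: float = HIGH_THRESHOLD) -> str:
    """Map a (per-turn, trace-level) pair onto the four cells of the table."""
    if not scores.per_turn:
        raise ValueError("conversation has no scored turns")
    per_turn_avg = sum(scores.per_turn) / len(scores.per_turn)
    per_turn_high = per_turn_avg >= threshold
    trace_high = scores.trace_level >= threshold

    if per_turn_high and trace_high:
        return "good baseline"
    if trace_high:
        return "fix tone, not flow"
    if per_turn_high:
        return "sounds good, fixes nothing"
    return "broken: debug from the trace"
```

A conversation with perfect per-turn scores and a failed resolution check lands in the "sounds good, fixes nothing" cell, which is the case a per-turn dashboard alone would report as healthy.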
Three Failure Classes Single-Turn Scoring Misses
Conversations carry state across turns. Three failure classes only emerge in that state-carrying setting:
- Context loss — the agent forgets a fact the user provided earlier (order number, account email). The per-turn response is locally coherent against its immediate window; the next turn re-asks for the dropped fact.
- Intent drift — the conversation gradually leaves the original goal. Stateful drift detection (RNN over per-turn embeddings, DeepContext) reaches F1 0.84 on multi-turn adversarial intent drift versus 0.67 for stateless filters like Llama-Prompt-Guard-2 and Granite-Guardian — the gap is exactly the multi-turn signal stateless eval cannot see. (DeepContext, 2026)
- Circular exchange — every turn is locally fine but the conversation makes no progress. The trace-level resolution check is the first metric that fails on a polite loop.
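As a rough illustration of why circular exchange is invisible turn by turn, the sketch below flags a conversation only by comparing agent replies with each other. It is a crude lexical heuristic under stated assumptions (whitespace tokenisation, a 0.8 Jaccard threshold, hypothetical function names). It does not replace the trace-level resolution check, and it will miss loops that are paraphrased rather than repeated.

```python
def token_set(text: str) -> set[str]:
    """Lower-cased whitespace tokens; deliberately naive."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def looks_circular(agent_turns: list[str],
                   similarity: float = 0.8,
                   min_repeats: int = 2) -> bool:
    """Flag a conversation when several agent replies nearly repeat an earlier one.

    `similarity` and `min_repeats` are illustrative defaults, not tuned values.
    """
    repeats = 0
    seen: list[set[str]] = []
    for reply in agent_turns:
        tokens = token_set(reply)
        # Count this reply as a repeat if it overlaps heavily with any earlier reply.
        if any(jaccard(tokens, earlier) >= similarity for earlier in seen):
            repeats += 1
        seen.append(tokens)
    return repeats >= min_repeats
```

Each reply in a loop can pass a per-turn tone or policy check on its own; the signal only appears once replies are compared across the whole trace.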
Drift-Bench formalises a related taxonomy for input faults that only surface across clarification turns: implicit intent, missing parameters, false presuppositions, and ambiguous expressions. Agents that handle each turn well in isolation show substantial performance drops when these faults force iterative disambiguation. (Drift-Bench, 2026)
What to Wire
Three pieces have to be in place before trace-level scoring works:
- Span grouping — every turn in the same conversation logs as a child span under a shared trace ID. Without grouping, each turn is an unrelated root span; trace-level scoring has nothing to score because the framework cannot tell which turns belong together. (Braintrust, 2026)
- A per-turn scorer — graded against criteria that apply turn-by-turn (helpfulness, tone, policy compliance, format). LLM-as-a-judge against three or four binary sub-criteria is the standard pattern.
- A trace-level scorer — graded against goal completion for the whole conversation. Resolution is typically binary: did the agent satisfy the user's stated goal by the end of the trace.
Online scoring rules then run both scorers asynchronously against new traces in production. Per-turn rules are span-scoped; trace-level rules are trace-scoped. Sampling rate is the cost lever — 100% at low volume, lower as traffic scales.
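The three pieces and the sampling lever can be sketched in a few lines. This is a minimal, synchronous illustration, whereas production scoring rules run asynchronously. The span fields (`trace_id`, `turn_index`), the function names, and the scorer callables are all hypothetical and do not correspond to any specific tracing or evaluation API.

```python
import hashlib
from collections import defaultdict
from typing import Callable, Iterable


def group_by_trace(spans: Iterable[dict]) -> dict[str, list[dict]]:
    """Collect turn spans under their shared trace ID, ordered by turn index."""
    traces: dict[str, list[dict]] = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)
    for turns in traces.values():
        turns.sort(key=lambda s: s["turn_index"])
    return dict(traces)


def sampled(trace_id: str, rate: float) -> bool:
    """Deterministically select a fixed fraction of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Compare as integers so rate=1.0 always selects and rate=0.0 never does.
    return int.from_bytes(digest[:8], "big") < rate * 2**64


def score_traces(spans: Iterable[dict],
                 per_turn_scorer: Callable[[dict], float],
                 trace_scorer: Callable[[list[dict]], float],
                 rate: float = 1.0) -> dict[str, dict]:
    """Run both scorers over the sampled subset of grouped traces."""
    results = {}
    for trace_id, turns in group_by_trace(spans).items():
        if not sampled(trace_id, rate):
            continue
        results[trace_id] = {
            "per_turn": [per_turn_scorer(t) for t in turns],  # span-scoped
            "trace": trace_scorer(turns),                     # trace-scoped
        }
    return results
```

Hashing the trace ID, rather than drawing a random number per call, keeps the sampling decision stable across retries, so a given conversation is either always scored or never scored at a given rate.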
Why It Works
Failures emerging from accumulated state must be scored at the level where that state is observable. Per-turn scoring projects the conversation into independent slices and evaluates each against its local context — exactly the projection in which context loss, intent drift, and circular exchange are invisible. Trace-level scoring evaluates the conversation as a single unit against a goal-completion criterion, which is the smallest unit that exposes those three failure classes. The state-locality argument generalises beyond chatbots: the 2025 multi-turn agent survey identifies five evaluation dimensions — task completion, response quality, user experience, memory and context retention, and planning and tool integration — and only the first two are cleanly observable from a per-turn projection. (Multi-Turn Agent Survey, 2025)
When This Backfires
Wiring trace-level scoring is not free. Narrow scope when:
- The agent is one-shot, not conversational. A PR-fix bot or CI agent that takes a task and returns a patch has no across-turn failure surface. pass@k metrics and trajectory decomposition already cover the relevant axes; adding trace-level conversation scoring measures a dimension that does not exist in the workload.
- The judge is below the precision floor. AgentRewardBench evaluated 12 LLM judges across 1,302 web-agent trajectories, and no judge cleared 70% precision (GPT-4o 69.8%, Claude 3.7 Sonnet 68.8%, against 89.3% human inter-annotator agreement). At that ceiling, roughly three in ten "resolved" verdicts from the judge are false positives, and judge errors cluster around grounding mismatch, misleading reasoning, missed details, and misunderstood actions. Pin to deterministic resolution checks (was a refund issued, was a ticket closed) before defaulting to an LLM judge. (AgentRewardBench, 2025)
- Logs are not grouped. If production logs every LLM call as its own root span, retrofitting trace-level scoring requires either re-instrumentation or reconstruction from timestamps. The failure mode is silent — scoring runs against fragments and produces numbers that look real.
- Volume × per-trace cost exceeds the violation cost. Every trace-level LLM-judge call is its own inference; at tens of thousands of conversations per day, the bill can exceed the value of the violations caught. Sampling reduces the cost but moves the metric from continuous to spot-check.
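Two of the conditions above reduce to simple checks. The sketch below shows one way to prefer a deterministic resolution signal over an LLM judge, plus the volume-times-cost arithmetic. The event names, function names, and all figures are hypothetical placeholders, not measured values.

```python
from typing import Callable, Optional

# Hypothetical event names emitted by backend tools during a conversation.
RESOLUTION_EVENTS = {"refund_issued", "ticket_closed"}


def resolved(events: list[str],
             llm_judge: Optional[Callable[[], bool]] = None) -> Optional[bool]:
    """Prefer a deterministic signal; consult the LLM judge only as a fallback."""
    if RESOLUTION_EVENTS & set(events):
        return True
    if llm_judge is not None:
        return llm_judge()
    return None  # no deterministic signal and no judge: leave the trace unscored


def daily_judge_cost(conversations_per_day: int,
                     cost_per_trace: float,
                     sampling_rate: float) -> float:
    """Expected daily spend on trace-level judge calls."""
    return conversations_per_day * cost_per_trace * sampling_rate
```

With an assumed 20,000 conversations per day and an assumed 0.01 currency units per judge call, scoring every trace costs 200 per day and sampling 10% costs 20. That figure is the one to compare against the estimated cost of the violations the judge would catch.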
The related but distinct Trajectory-Opaque Evaluation Gap covers the conjugate case for safety: outcome-only grading misses safety violations even when the conversation is single-turn. Pair the two — per-turn + trace-level for quality and resolution, outcome + trajectory auditing for safety and robustness.
Example
A customer-support chatbot resolves a return-and-refund request over four turns. Two scoring layers are wired against the trace:
Per-turn scorer (brand alignment, LLM-judge against helpfulness + tone + policy):
Turn 1: 50% (acknowledged frustration, no next steps)
Turn 2: 50% (asked for order number, did not state return policy)
Turn 3: 50% (offered exchange, did not address the discount-code request)
Turn 4: 100% (clear conditions for qualifying)
Average: 62.5%
Trace-level scorer (conversation quality, LLM-judge against binary resolution):
Conversation quality: 100%
Reason: agent obtained order number, acknowledged the discount-code request,
confirmed return status, and stated qualifying conditions. Issue
resolved.
The divergence is diagnostic. Per-turn at 62.5% would normally trigger a tone-and-policy review; trace at 100% says the issue was fixed despite rough edges. The action is different in each direction: the per-turn scorer points engineering at the response-template prompt, the trace-level scorer at goal-completion logic. Acting on either number alone produces the wrong fix. (Braintrust, 2026)
Key Takeaways
- Per-turn scoring evaluates each response in its local context; trace-level scoring evaluates the conversation against a goal-completion criterion. Both are required because they catch orthogonal failures.
- The high-per-turn + low-trace cell ("sounds good, fixes nothing") is the cheapest diagnostic — it points at the resolution gap that per-turn metrics alone would call a success.
- Three failure classes only emerge across turns: context loss, intent drift, and circular exchange. Single-turn scoring is structurally blind to all three.
- Span grouping is the prerequisite. Without a shared trace ID across turns, the framework cannot tell which turns belong to one conversation.
- Trace-level LLM judges have a measured precision ceiling near 70%; deterministic resolution checks (was a refund issued) are more reliable when they are expressible.