
Traces Need Feedback to Power Learning

A trace shows what an agent did; feedback shows whether it was right. Couple them and the trace store becomes a learning corpus.

The Gap a Trace Alone Cannot Close

Tracing-as-debugging works for one bug at a time. It does not scale into a learning loop, because the trace alone does not say whether the trajectory was good. As Harrison Chase puts it: "Traces alone do not create that loop. You also need feedback: signals that tell you whether the agent's behavior was useful, accepted, rejected, inefficient, risky, or wrong" (LangChain, May 5 2026).

The same trace can describe a 40-step success or a 40-step failure. Without a verdict you cannot filter failures worth turning into evals, compare good and bad trajectories on one task, drive incident-to-eval synthesis from production volume, or detect drift across the three improvement layers — model weights, harness scaffolding, retrieved context.

The fix is structural: every trace gets a verdict attached to the run, not stored in a parallel analytics system whose join keys never line up with the trace ID.
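To make the point concrete, here is a minimal sketch of what a verdict attached to the run makes possible. The `Run` schema, its field names, and the "positive"/"negative" labels are assumptions for illustration, not a format taken from the article or from any tracing product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Run:
    """One agent run: trace metadata plus the verdicts attached to it (hypothetical schema)."""
    trace_id: str
    task: str
    steps: int
    verdicts: list[str] = field(default_factory=list)  # e.g. "positive" / "negative"


def failures(runs: list[Run]) -> list[Run]:
    """Runs with at least one negative verdict: candidates for new evals."""
    return [r for r in runs if "negative" in r.verdicts]


def contrast_pairs(runs: list[Run]) -> dict[str, tuple[list[str], list[str]]]:
    """Group trace IDs per task into (good, bad) so trajectories can be compared."""
    pairs: dict[str, tuple[list[str], list[str]]] = defaultdict(lambda: ([], []))
    for r in runs:
        if not r.verdicts:
            continue  # unlabelled runs cannot be placed on either side
        good, bad = pairs[r.task]
        (bad if "negative" in r.verdicts else good).append(r.trace_id)
    return dict(pairs)


runs = [
    Run("t1", "refactor", 40, ["positive"]),  # 40-step success
    Run("t2", "refactor", 40, ["negative"]),  # 40-step failure
    Run("t3", "refactor", 12),                # no verdict: invisible to both queries
]
print([r.trace_id for r in failures(runs)])
print(contrast_pairs(runs))
```

The two 40-step runs are indistinguishable by shape alone; only the attached verdict lets the filter and the per-task comparison work, and the unlabelled run drops out of both.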

The Four Sources of Feedback

The article names four feedback sources. Each has a different cost, latency, and noise profile, and a production system usually wires several together (LangChain).

Source | Example | Strength | Failure mode
--- | --- | --- | ---
Direct user | Thumbs up/down, star rating, written correction | Cleanest verdict | Sparse: most users never rate
Indirect user | Lines of code accepted, diffs reverted, ticket reopened, answer copied, same question re-asked | High volume | Misattribution: a reverted diff might not be the agent's fault
LLM-as-judge | Online evaluator scoring helpfulness or policy compliance | Runs at scale | Bias and ungrounding: judges drift from human verdicts when never recalibrated
Deterministic rule | Regex, schema check, citation validator | Cheap, exact, no model call | Only catches what you knew to look for

The article's deterministic example: Claude Code's leaked userPromptKeywords.ts regex scans prompts for frustration words like "wtf", "horrible", "awful" and emits the hit as a feedback signal (PCWorld, Blake Crosley analysis). When a cheap rule captures the signal, no model call is needed to label the trace.
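Regex is only one kind of deterministic rule. As a further illustration, the sketch below shows what a citation validator might look like. The `[doc:<id>]` citation syntax, the label names, and the `citation_verdict` helper are all hypothetical, chosen only to show the shape of a rule that labels a trace without a model call.

```python
import re

# Hypothetical citation syntax: the agent cites sources as [doc:<id>].
CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")


def citation_verdict(answer: str, retrieved_ids: set[str]) -> dict:
    """Label an answer by checking every cited ID against the retrieved context."""
    cited = CITATION.findall(answer)
    unknown = sorted(set(cited) - retrieved_ids)
    if not cited:
        label = "no_citations"
    elif unknown:
        label = "invalid_citation"  # at least one cited ID was never retrieved
    else:
        label = "pass"
    return {"rule": "citation_validator", "label": label, "unknown_ids": unknown}


verdict = citation_verdict(
    "Retries are capped at three [doc:a1], per the runbook [doc:zz9].",
    retrieved_ids={"a1", "b2"},
)
print(verdict)
```

Like the frustration regex, this rule is exact and cheap, and it shares the same limitation: it only catches the failure it was written to look for.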

What the Platform Has to Do

Chase reduces the platform contract to three behaviours: store traces (trajectory, tool calls, metadata, timing, errors), store feedback attached to the run/trace/thread, and generate feedback (rules, online evaluators, sampling, annotation queues) (LangChain).

The middle requirement is load-bearing. Feedback that lives in a different system than the trace breaks the join — you can describe how often users gave thumbs-down, but you cannot pull the trajectories that earned them for replay, eval seeding, or ablation.
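The broken join can be sketched in a few lines. The row shapes, the `trace_store` dictionary, and the `trajectories_for` helper are assumptions for illustration; real stores will differ, but the failure looks the same whenever a feedback row carries no usable trace ID.

```python
def trajectories_for(label, feedback_rows, trace_store):
    """Return (joined trajectories, number of matching feedback rows that could not be joined)."""
    joined, orphaned = [], 0
    for row in feedback_rows:
        if row["label"] != label:
            continue
        trajectory = trace_store.get(row.get("trace_id"))
        if trajectory is None:
            orphaned += 1  # verdict exists, but the run it judged is unreachable
        else:
            joined.append(trajectory)
    return joined, orphaned


# Hypothetical trace store keyed by trace ID.
trace_store = {
    "t1": ["plan", "search", "answer"],
    "t2": ["plan", "answer"],
}
# Hypothetical feedback rows; the second came from an analytics tool
# that recorded a session ID instead of the trace ID.
feedback_rows = [
    {"trace_id": "t1", "label": "thumbs_down"},
    {"session": "s9", "label": "thumbs_down"},
    {"trace_id": "t2", "label": "thumbs_up"},
]

joined, orphaned = trajectories_for("thumbs_down", feedback_rows, trace_store)
print(joined, orphaned)
```

Both thumbs-down rows count toward the dashboard rate, but only one of them yields a trajectory that can be replayed or turned into an eval.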

Braintrust makes the same case from the eval side: traces and eval data belong on one surface because unifying them closes the iteration loop, rather than splitting feedback and evals into a separate analytics tool (Braintrust — Why your traces and evals belong in the same place).

Tool-Agnostic Channel: OTel gen_ai.evaluation.result

OpenTelemetry has codified the channel. The GenAI semantic conventions define a gen_ai.evaluation.result event for attaching evaluator output to a run, parallel to the inference span (OpenTelemetry GenAI events spec). Emit one event per verdict source — human thumbs-down, judge score, regex hit — each carrying the trace ID; downstream queries join on it.

This is the tool-agnostic equivalent of LangSmith's per-run feedback API or Phoenix's annotation primitives, ingestible by any backend that speaks the spec.
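A rough sketch of "one event per verdict source" follows. The evaluator names (`user_thumbs`, `helpfulness_judge`, `user_frustration_regex`) and the helper functions are hypothetical. The name, label, and explanation attribute keys match the ones used in this article's example; the numeric `gen_ai.evaluation.score.value` key is assumed from the same conventions and is worth checking against the spec version in use. A stand-in span keeps the sketch runnable without an OTel SDK.

```python
from typing import Any, Optional


def evaluation_attributes(
    name: str,
    label: Optional[str] = None,
    score: Optional[float] = None,
    explanation: Optional[str] = None,
) -> dict[str, Any]:
    """Build the attribute dict for one gen_ai.evaluation.result event."""
    attrs: dict[str, Any] = {"gen_ai.evaluation.name": name}
    if label is not None:
        attrs["gen_ai.evaluation.score.label"] = label
    if score is not None:
        attrs["gen_ai.evaluation.score.value"] = score  # assumed attribute key
    if explanation is not None:
        attrs["gen_ai.evaluation.explanation"] = explanation
    return attrs


def emit_verdicts(span: Any, verdicts: list[dict[str, Any]]) -> None:
    """Add one event per verdict; `span` is anything with an OTel-style add_event()."""
    for attrs in verdicts:
        span.add_event("gen_ai.evaluation.result", attributes=attrs)


class RecordingSpan:
    """Stand-in for a real span so the sketch runs without an OTel SDK."""

    def __init__(self) -> None:
        self.events: list[tuple[str, dict[str, Any]]] = []

    def add_event(self, name: str, attributes: dict[str, Any]) -> None:
        self.events.append((name, attributes))


span = RecordingSpan()
emit_verdicts(span, [
    evaluation_attributes("user_thumbs", label="negative"),
    evaluation_attributes("helpfulness_judge", score=0.35,
                          explanation="answer ignored the question"),
    evaluation_attributes("user_frustration_regex", label="negative"),
])
print(len(span.events))
```

Because all three events land on the same span, they share its trace ID, so a downstream query can compare the human, judge, and rule verdicts for a single run.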

When This Backfires

  • Ungrounded LLM-as-judge — Judges never recalibrated against human verdicts encode their own biases — verbosity, position, self-preference — into the eval corpus. Frontier judges exceeded 50% error rates on bias benchmarks (Justice or Prejudice, arxiv 2410.02736). Treat judge output as a triage signal, not a ground-truth label.
  • Indirect-signal misattribution — A reverted diff might be a stylistic preference; a reopened ticket might be a follow-up question. Treating either as a binary failure label without causal validation poisons the corpus with spurious failure labels on runs that were actually fine.
  • Trace volume outpacing labelling capacity — High-traffic agents accumulate traces faster than humans or judges can label them; without sampling rules the trace store becomes a graveyard.
  • Feedback decoupled from the trace — Thumbs-up/down in product analytics while traces sit in an APM tool means the trace ID never appears in the analytics fact table. The loop never closes.
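One way to keep a judge grounded is to measure how often it agrees with humans on traces that carry both verdicts. The sketch below is a minimal version of that check; the sample data, the `judge_agreement` helper, and the 0.8 threshold are illustrative assumptions, not values from the article or the cited paper.

```python
def judge_agreement(samples: list[tuple[str, str, str]]) -> tuple[float, list[str]]:
    """samples: (trace_id, human_label, judge_label).

    Returns the agreement rate and the trace IDs where judge and human disagree.
    """
    if not samples:
        raise ValueError("need at least one doubly-labelled trace")
    disagreements = [tid for tid, human, judge in samples if human != judge]
    rate = 1 - len(disagreements) / len(samples)
    return rate, disagreements


# Hypothetical traces that received both a human verdict and a judge verdict.
samples = [
    ("t1", "negative", "negative"),
    ("t2", "positive", "positive"),
    ("t3", "negative", "positive"),  # judge was more lenient than the human
    ("t4", "positive", "positive"),
]

MIN_AGREEMENT = 0.8  # hypothetical threshold; tune per task and label set

rate, disagreements = judge_agreement(samples)
trust_judge = rate >= MIN_AGREEMENT
print(rate, disagreements, trust_judge)
```

The disagreeing trace IDs are the useful output: they are the runs to send to an annotation queue, and the place to look for the judge's systematic bias.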

Example

Capturing a regex-driven frustration signal as an OTel evaluation event, attached to the same span as the agent's response:

import re

from opentelemetry import trace

FRUSTRATION = re.compile(
    r"\b(wtf|horrible|awful|this sucks)\b",
    re.IGNORECASE,
)

def emit_frustration_feedback(span: trace.Span, user_message: str) -> None:
    """Attach a deterministic verdict to the current run span."""
    if not FRUSTRATION.search(user_message):
        return
    span.add_event(
        name="gen_ai.evaluation.result",
        attributes={
            "gen_ai.evaluation.name": "user_frustration_regex",
            "gen_ai.evaluation.score.label": "negative",
            "gen_ai.evaluation.explanation": "matched frustration regex",
        },
    )

The event sits on the same trace as the agent run, joined by trace ID. Backends that speak the OTel GenAI spec — LangSmith, Phoenix, Datadog LLM Observability, any OTLP-compatible store — surface it next to the trajectory rather than in a separate analytics tool (Datadog OTel GenAI support).

Key Takeaways

  • Traces tell you what happened; feedback tells you what it meant — only the pair powers learning.
  • Wire four feedback sources by cost: deterministic rules first, indirect user signals next, LLM-as-judge for scale, direct user verdicts where available.
  • Store feedback on the run, not in a parallel analytics tool — the join key is the trace ID.
  • Use the OTel gen_ai.evaluation.result event as the tool-agnostic channel; any backend that implements the spec can ingest it.
  • Recalibrate LLM-as-judge against human verdicts on a schedule — ungrounded judges drift.