Subagent OTel Trace Correlation via agent_id Attribute¶
Propagate a stable agent identifier on outgoing HTTP headers and every OTEL span, keeping multi-agent traces queryable by agent identity, independent of span lineage.
The Correlation Problem¶
Across 200 spans and 12 subagents, two questions span nesting alone does not answer cheaply: which spans belong to a given subagent, and which call chain caused this 429. Span hierarchy answers the first via tree traversal, but only inside a single Claude Code session. Once work crosses a boundary the instrumentation does not cover (shell-out, webhook, queue), lineage is gone.
The fix is a flat, propagated identifier that survives where the trace context ends.
The Propagation Contract¶
Claude Code 2.1.139 (changelog, May 11, 2026) instruments subagents on two surfaces simultaneously:
| Surface | Mechanism | Carried value |
|---|---|---|
| Outgoing HTTP | Request headers x-claude-code-agent-id, x-claude-code-parent-agent-id |
Subagent + parent identity |
| OTEL spans | claude_code.llm_request span attributes agent_id, parent_agent_id |
Same pair, queryable from the trace store |
From the monitoring docs: agent_id identifies the subagent that issued the request (absent on the main session); parent_agent_id identifies the agent that spawned it (absent for the main session and for agents spawned directly from it).
The pair is load-bearing. agent_id supports "all work by this subagent"; parent_agent_id reconstructs the dispatch hierarchy from a flat query — "which subagent spawned this 429-emitting child" — without walking the span tree.
graph TD
M[Main session<br/>agent_id absent] -->|spawns| A[Subagent A<br/>agent_id=A]
M -->|spawns| B[Subagent B<br/>agent_id=B]
A -->|spawns| C[Subagent C<br/>agent_id=C<br/>parent_agent_id=A]
B -->|spawns| D[Subagent D<br/>agent_id=D<br/>parent_agent_id=B]
C -->|HTTP request| API[API<br/>x-claude-code-agent-id: C<br/>x-claude-code-parent-agent-id: A]
Why Two Surfaces¶
Span lineage answers "what is the call structure inside this turn" — it needs the parent span live when the child starts. The propagated attribute answers "what work was caused by this agent identity, regardless of dispatch path" — it needs only that the identifier be copied onto every emission. OpenTelemetry separates trace context (traceparent) from cross-cutting context (span attributes, Baggage) for the same reason: neither alone suffices.
Queries the Pattern Enables¶
With agent_id on every span and API event, the trace store becomes a per-agent analytics surface:
| Question | Query shape |
|---|---|
| Token cost per subagent | sum(input_tokens + output_tokens) group by agent_id |
| p99 latency per subagent | quantile(0.99, duration_ms) group by agent_id |
| Error rate per subagent | count(status='ERROR') / count(*) group by agent_id |
| Call chain to a 429 | follow parent_agent_id from the failing span upward |
| Cost of an orchestrator's tree | sum(cost) where agent_id IN (recursive descendants of root) |
Without the attribute, the same queries require walking parent links per-trace — feasible at one trace, expensive at scale.
Complementary Attributes¶
The contract sits within the attributes Claude Code emits on claude_code.llm_request spans (attribute table):
| Attribute | What it identifies |
|---|---|
agent_id / parent_agent_id |
Per-instance identity (high cardinality — span attribute, not metric label) |
agent.name |
Subagent type — bounded set; safe as a metric label. User-defined names replaced with "custom" |
query_source |
"main", "subagent", or "auxiliary" |
skill.name, plugin.name |
Skill or plugin active for the request |
agent.name is the metric-safe partner: dashboards aggregate by it to avoid cardinality explosions; trace queries drill into a specific incident via agent_id.
Where the Propagation Breaks¶
The contract holds only over surfaces Claude Code controls — off-protocol egress paths escape it:
- Shell-out via Bash tool:
curl -X POST https://api.example.comproduces aclaude_code.toolspan, but the request carries nox-claude-code-agent-idheader — the service sees an anonymous request. - Subprocess work:
TRACEPARENTis auto-set for W3C inheritance, but nothing copiesagent_idinto subprocess env. Subprocess spans inherit the trace, not the agent identity. - Fire-and-forget queues: enqueueing discards the header; the work runs untagged.
Mitigation: lift the call into an MCP tool, or wrap shell-outs with an explicit x-claude-code-agent-id header.
When This Backfires¶
- Treating
agent_idas a metric label: per-instance identifiers create unbounded time series. The Agent Observability OTel page documents the same discipline forprompt.id. Useagent.namefor metric slicing. - Reusing identifier values across sessions: if
agent_idcollides across unrelated runs, per-agent queries silently mix sessions. Identifiers must be globally unique within the trace store's retention window. - Single-agent harnesses: every span has
agent_idabsent. The attribute pays for itself only when fan-out exists. - Privacy assumption mismatch:
agent_idis opaque, butagent.namecarries the subagent type verbatim for built-in and official-marketplace agents. Custom names are redacted to"custom"— yet skill or plugin names on the same span may still leak intent (attribute redaction rules).
Example¶
A team running Claude Code with OTLP export to Tempo and Prometheus debugs a 429 storm during a fan-out.
Before — without per-agent attribution, only the parent's tool_decision event shows the dispatch; per-subagent cost requires manual span-tree traversal per trace, infeasible across thousands of traces:
claude_code.interaction (root)
└── claude_code.tool (Task)
├── claude_code.llm_request # which subagent? unknown without tree walk
├── claude_code.llm_request # 429 here. caused by which dispatch path?
└── claude_code.llm_request
After — with agent_id / parent_agent_id on every span, the trace store answers both questions with a flat query:
claude_code.llm_request agent_id=fanout-3 parent_agent_id=orch status=ERROR status_code=429
claude_code.llm_request agent_id=fanout-3 parent_agent_id=orch input_tokens=12450
claude_code.llm_request agent_id=fanout-4 parent_agent_id=orch input_tokens=11820
The 429 belongs to fanout-3, spawned by orch. A cost-by-agent.name panel in Grafana attributes the burst to the research-topic subagent type. The fix — lower N for research-topic specifically — lands without instrumenting application code.
Key Takeaways¶
- The contract is dual-surface: HTTP header on outgoing requests, span attribute on local emissions. Each covers a boundary the other does not.
agent_idplusparent_agent_idis the load-bearing pair — flat per-instance identity plus a parent back-pointer reconstructs the hierarchy without span-tree traversal.- Use
agent_idas a span/event attribute, never a metric label. Useagent.namefor bounded-cardinality metric slicing. - The pattern breaks at off-protocol boundaries — shell-out to
curl, fire-and-forget queues, raw subprocess spawn. Close the gap by lifting those into instrumented tool calls. - The pattern generalises beyond Claude Code: any harness that propagates a stable agent identifier through HTTP headers and span attributes inherits the same query surface.