Execution Lineage: DAG of Artifacts vs Agent Loops
Execution lineage models revisable AI work as a DAG of artifacts with explicit dependencies and identity-based replay, so unrelated edits never perturb the output.
The Maintained-State Quality Gap
An agent loop that interleaves reasoning, tool use, and iterative refinement can produce a polished final answer while leaving the underlying state inconsistent. Rosen and Rosen call this the gap between final answer quality and maintained-state quality — both can be measured, and they don't move together (arXiv:2605.06365).
The mechanism is implicit conversational state. When the agent revises a multi-artifact work product (a memo with sources, summaries, and conclusions; a PR with research notes, plan, and code), the loop has no structural way to say which artifacts must change, which must remain identical, and how a change should propagate. The model regenerates plausible outputs each pass and contamination leaks in from unrelated context.
The Three Structural Primitives
Execution lineage replaces the loop with a directed acyclic graph of artifact-producing computations and adds three properties (arXiv:2605.06365):
- Explicit dependencies — each node declares the artifacts it consumes; nothing is read implicitly from a transcript.
- Stable intermediate boundaries — intermediate artifacts (summaries, plans, draft sections) are first-class outputs with stable identity, not throwaway scratch.
- Identity-based replay — when an input changes, only descendants of the changed node re-run; everything else is reused by identity.
The mechanism is the same one that gives build systems like Make and Bazel, and asset-based orchestrators like Dagster, incremental and reproducible runs. The paper's contribution is applying it to LLM-produced artifacts and measuring the gap empirically.
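The three primitives can be illustrated with a minimal sketch. This is not the paper's implementation: the `Lineage` class, the `fingerprint` helper, and the lambda stand-ins for LLM calls are all hypothetical names introduced here, and the cache is keyed on a hash of each node's name plus the content of its declared inputs.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def fingerprint(*parts: str) -> str:
    """Content identity: a stable hash over a node name and its input artifacts."""
    return hashlib.sha256(json.dumps(parts).encode("utf-8")).hexdigest()


@dataclass
class Lineage:
    """Hypothetical lineage store; every name here is illustrative."""
    sources: Dict[str, str] = field(default_factory=dict)   # raw inputs by name
    deps: Dict[str, List[str]] = field(default_factory=dict)  # explicit dependencies
    fns: Dict[str, Callable[[List[str]], str]] = field(default_factory=dict)
    cache: Dict[str, str] = field(default_factory=dict)     # fingerprint -> artifact
    runs: List[str] = field(default_factory=list)           # nodes that actually executed

    def add(self, name: str, deps: List[str], fn: Callable[[List[str]], str]) -> None:
        self.deps[name] = deps
        self.fns[name] = fn

    def build(self, name: str) -> str:
        if name in self.sources:
            return self.sources[name]
        # Inputs come only from declared dependencies, never from a transcript.
        inputs = [self.build(d) for d in self.deps[name]]
        key = fingerprint(name, *inputs)
        if key not in self.cache:
            # Identity changed (or first run): execute and record the artifact.
            self.runs.append(name)
            self.cache[key] = self.fns[name](inputs)
        return self.cache[key]


lin = Lineage(sources={"src_a": "alpha", "src_b": "beta"})
lin.add("sum_a", ["src_a"], lambda xs: f"summary({xs[0]})")  # stand-in for an LLM call
lin.add("sum_b", ["src_b"], lambda xs: f"summary({xs[0]})")
lin.add("memo", ["sum_a", "sum_b"], lambda xs: " | ".join(xs))

first = lin.build("memo")
lin.runs.clear()
lin.sources["src_b"] = "beta v2"   # revise one input
second = lin.build("memo")
print(lin.runs)  # ['sum_b', 'memo']
```

Because `sum_a` is looked up by the identity of its inputs, it is reused rather than regenerated, which is the property a conversational loop cannot guarantee. A real system would also need a durable cache and a policy for non-deterministic model output.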
graph TD
S1[Source A] --> N1[Summarize A]
S2[Source B] --> N2[Summarize B]
S3[Source C] --> N3[Summarize C]
N1 --> P[Plan]
N2 --> P
N3 --> P
P --> D[Draft memo]
N1 --> D
N2 --> D
N3 --> D
When Source B changes, only Summarize B, Plan, and Draft memo re-run; Summarize A and Summarize C are reused. When an unrelated branch is added, none of the existing nodes re-run.
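The replay scope described above is plain graph reachability. The sketch below encodes the diagram's edges by hand and walks them breadth-first; the `EDGES` table and the `invalidated` function are illustrative names, not part of any published tooling.

```python
from collections import deque
from typing import Dict, List, Set

# Edges transcribed from the diagram: producer -> consumers.
EDGES: Dict[str, List[str]] = {
    "Source A": ["Summarize A"],
    "Source B": ["Summarize B"],
    "Source C": ["Summarize C"],
    "Summarize A": ["Plan", "Draft memo"],
    "Summarize B": ["Plan", "Draft memo"],
    "Summarize C": ["Plan", "Draft memo"],
    "Plan": ["Draft memo"],
    "Draft memo": [],
}


def invalidated(changed: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Return every node reachable from `changed`, excluding `changed` itself."""
    seen: Set[str] = set()
    queue = deque(edges.get(changed, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(edges.get(node, []))
    return seen


dirty = invalidated("Source B", EDGES)
reused = set(EDGES) - dirty - {"Source B"}
print(sorted(dirty))   # ['Draft memo', 'Plan', 'Summarize B']
print(sorted(reused))  # ['Source A', 'Source C', 'Summarize A', 'Summarize C']
```

Adding a new, unconnected branch introduces nodes with no path into the existing ones, so `invalidated` for any node on that branch returns nothing from the original graph.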
What the Experiments Showed
Rosen and Rosen ran two controlled policy-memo update tasks against loop-centric baselines (arXiv:2605.06365):
- Unrelated-branch update — DAG replay preserved the final memo exactly across all runs, with zero churn and zero contamination from the unrelated branch. Loop baselines regenerated the memo and frequently imported unrelated context — the context-poisoning failure mode.
- Intermediate-artifact edit — all systems reflected the new constraint in the final memo, but only DAG replay achieved upstream preservation, downstream propagation, unaffected-artifact preservation, and cross-artifact consistency.
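One way to check preservation and propagation on your own runs is to compare artifact snapshots before and after a revision. The sketch below is an assumption about how such a check could look, not the metric used in the paper; `audit_revision` and the snapshot values are hypothetical.

```python
from typing import Dict, List, Set


def audit_revision(
    before: Dict[str, str],
    after: Dict[str, str],
    expected_dirty: Set[str],
) -> Dict[str, List[str]]:
    """Compare two artifact snapshots (name -> content or hash).

    churn: artifacts that changed although they sit outside the expected scope.
    stale: artifacts inside the expected scope that did not change. This can be
           a false alarm if a re-run legitimately produced identical content.
    """
    names = before.keys() | after.keys()
    changed = {n for n in names if before.get(n) != after.get(n)}
    return {
        "churn": sorted(changed - expected_dirty),
        "stale": sorted(expected_dirty - changed),
    }


before = {"summary_a": "h1", "summary_b": "h2", "plan": "h3", "memo": "h4"}
expected = {"summary_b", "plan", "memo"}   # descendants of the edited input

# Hypothetical outcomes: a scoped replay versus a loop that regenerated everything.
dag_after = {"summary_a": "h1", "summary_b": "h2x", "plan": "h3x", "memo": "h4x"}
loop_after = {"summary_a": "h1x", "summary_b": "h2x", "plan": "h3x", "memo": "h4x"}

print(audit_revision(before, dag_after, expected))   # {'churn': [], 'stale': []}
print(audit_revision(before, loop_after, expected))  # {'churn': ['summary_a'], 'stale': []}
```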
The authors are explicit that loop baselines remain competitive on bounded one-shot synthesis where every source fits in context. The DAG earns its keep when work is revised across time.
When Loops Beat the DAG
The pattern is conditional, not universal. A loop is the right shape when:
- The work is one-shot — produce a PR, ship it, no further revisions.
- Fan-out is genuinely dynamic — the set of downstream sub-tasks depends on runtime decisions the agent hasn't made yet. A static DAG either over-restricts the agent or needs a meta-layer that re-introduces loop state (Static-to-Dynamic Workflow Survey).
- Tools have non-idempotent side effects — `send-email`, `create-pr`, `charge-card`. DAG replay assumes a node can re-run cleanly given identical inputs; effectful nodes need an idempotency layer outside the model.
- Intermediate boundaries don't factor — when artifacts share deep mutable state, "stable boundaries" don't exist and the DAG degenerates to a single mega-node.
Classical DAG schedulers also don't handle non-deterministic LLM output, reasoning-failure-as-primary-error-mode, or non-idempotent retries without explicit additions (Kinde, Orchestrating Multi-Step Agents).
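The idempotency layer for effectful nodes can be sketched as a wrapper that derives a key from the payload and records the first result. Everything here is illustrative: `IdempotentEffect` and the fake `send_email` are hypothetical, and the in-memory ledger stands in for the durable, atomically updated store a real deployment would need.

```python
import hashlib
import json
from typing import Callable, Dict, List


class IdempotentEffect:
    """Run a side effect at most once per distinct payload (in-memory sketch)."""

    def __init__(self, effect: Callable[[dict], str]) -> None:
        self._effect = effect
        self._ledger: Dict[str, str] = {}   # idempotency key -> recorded result

    def __call__(self, payload: dict) -> str:
        # sort_keys makes the key independent of dict insertion order.
        raw = json.dumps(payload, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(raw).hexdigest()
        if key not in self._ledger:
            self._ledger[key] = self._effect(payload)
        return self._ledger[key]


sent: List[str] = []


def send_email(payload: dict) -> str:
    """Fake effect that only records the recipient."""
    sent.append(payload["to"])
    return f"sent-{len(sent)}"


safe_send = IdempotentEffect(send_email)
r1 = safe_send({"to": "a@example.com", "body": "hi"})
r2 = safe_send({"body": "hi", "to": "a@example.com"})   # replay with the same payload
print(sent)      # ['a@example.com']
print(r1 == r2)  # True
```

Replaying the node returns the recorded result instead of sending a second message. Whether two payloads count as "the same" effect is a design decision; hashing the full payload is only the simplest choice.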
Relation to Existing Patterns
The DAG-of-artifacts model composes with — and is distinct from — three adjacent patterns already documented:
- Cognitive Reasoning vs Execution splits what to do from how to do it via typed tool boundaries. Execution lineage operates one layer up: it structures the artifacts that flow between calls.
- Event Sourcing for Agents (ESAA) uses an append-only event log for replay-verifiable execution. The log gives temporal replay; execution lineage gives dependency-scoped replay — only descendants of changed inputs re-run.
- Durable Interactive Artifacts treats agent outputs as persistent re-openable workspace objects. Execution lineage adds the dependency edges between them.
Productized analogues are starting to ship: Cloudflare Artifacts (May 2026 beta) gives agent outputs git-like versioning with parent lineage (InfoQ coverage); Union.ai wires artifact lineage as the medium of exchange between workflows.
Example
A multi-file PR being revised after review feedback is the canonical use case. The naive loop and a dependency-scoped replay handle the same review comment differently:
Before — agent loop regenerates everything:
1. Read review comment "rename `User` to `Account` in module B"
2. Re-read all files, regenerate plan, regenerate diffs
3. Module A and C touched despite no requested change
4. Test fixtures reshuffled because the model re-decided shape
After — DAG replay scoped by dependency:
1. Mutate input: rename in module B's spec node
2. Re-run: module-B implementation, module-B tests, integration tests
3. Untouched: module A, module C, fixtures, lockfile
4. Final PR diff is exactly the rename plus its closure
The loop produces a polished PR that may pass review on second look. The DAG replay produces a PR whose diff is confined, by construction, to the dependency closure of the requested change.
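The "After" steps can be turned into a replay plan with Python's standard-library `graphlib`. The dependency table below is a hypothetical layout for the three-module PR, with fixtures and the lockfile left out for brevity; only `TopologicalSorter` and `static_order` are real library API.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical PR graph: node -> set of nodes it depends on.
DEPS: Dict[str, Set[str]] = {
    "spec_a": set(),
    "spec_b": set(),
    "spec_c": set(),
    "impl_a": {"spec_a"},
    "impl_b": {"spec_b"},
    "impl_c": {"spec_c"},
    "tests_b": {"impl_b"},
    "integration_tests": {"impl_a", "impl_b", "impl_c"},
}


def replay_plan(changed: str, deps: Dict[str, Set[str]]) -> List[str]:
    """List the nodes to re-run, in an order that respects dependencies."""
    dirty = {changed}
    plan: List[str] = []
    # static_order yields each node only after all of its dependencies.
    for node in TopologicalSorter(deps).static_order():
        if node != changed and deps.get(node, set()) & dirty:
            dirty.add(node)
            plan.append(node)
    return plan


plan = replay_plan("spec_b", DEPS)
untouched = set(DEPS) - set(plan) - {"spec_b"}
print(sorted(plan))       # ['impl_b', 'integration_tests', 'tests_b']
print(sorted(untouched))  # ['impl_a', 'impl_c', 'spec_a', 'spec_c']
```

The relative order of `tests_b` and `integration_tests` is not fixed by the graph, so only the constraint that `impl_b` runs before both should be relied on.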
Key Takeaways
- Final-answer quality and maintained-state quality are distinct measurements; a polished output can mask state inconsistency that compounds over revisions.
- Adopt the DAG-of-artifacts model when work is revised across time and intermediate artifacts factor cleanly; what it contributes is the dependency edges between durable interactive artifacts. Stick with a loop when the task is one-shot or fan-out is genuinely runtime-decided.
- The mechanism is dependency-explicit caching with content identity — the same primitive that makes Make and Dagster reproducible, applied to LLM-produced artifacts.
- Side-effectful nodes need an idempotency layer outside the DAG model; replay alone doesn't make `send-email` safe.