Agent Development Lifecycle for Agent Products

A four-phase loop — build, test, deploy, monitor — for teams whose unit of work is the agent, with verdict-labelled traces feeding the next cycle.

A Lifecycle for the Agent, Not the Feature

The Agent Development Lifecycle (ADLC) is a four-phase loop — build, test, deploy, monitor — for teams whose product is the agent, with verdict-labelled production traces feeding the next build cycle (LangChain blog).

It inverts two SDLC framings already on this project. The 7 Phases of AI Development is a feature-level workflow for using an agent to ship code; SDLC-Phase Skill Taxonomy organises a skill library so an agent acting on a codebase activates the right skills. Both treat the agent as the implement; ADLC treats it as the product.

The ordering is deliberate — test before deploy, monitor after deploy, feed learnings into the next build. Each phase produces an artifact the next consumes: scope doc, eval verdict, deploy artifact, verdict-labelled trace corpus.

graph LR
    B[Build] --> T[Test]
    T --> D[Deploy]
    D --> M[Monitor]
    M -->|verdict-labelled traces| B
    M -->|regression cases| T

The Four Phases

Build

Define scope, choose architecture, wire the harness. LangChain extends the phase beyond code, citing no-code and low-code surfaces that let non-engineers participate (LangChain blog). Produces: a runnable agent and a scope doc the test phase can score against.

Test

Score the agent against an eval suite before it touches production. Eval-Driven Development covers the discipline: define success criteria first, then build toward them. Reverse this and teams embed the live agent's bugs into the definition of correct. Produces: a pass/fail verdict and a gated deploy artifact.
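A minimal sketch of the test-phase gate, assuming the agent can be called as a plain function from prompt to answer and that success is an exact match. The names `EvalCase`, `EvalVerdict`, and `run_eval_gate` are hypothetical, introduced only for illustration; a real suite would likely use richer scorers.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    """One success criterion, written before the agent exists (hypothetical shape)."""
    case_id: str
    prompt: str
    expected: str


@dataclass(frozen=True)
class EvalVerdict:
    """The artifact the deploy phase consumes: pass/fail plus supporting detail."""
    passed: bool
    accuracy: float
    failed_case_ids: tuple[str, ...]


def run_eval_gate(
    agent: Callable[[str], str],
    cases: Sequence[EvalCase],
    threshold: float,
) -> EvalVerdict:
    """Score the agent on every case and gate on overall accuracy."""
    if not cases:
        raise ValueError("eval suite is empty; nothing to gate on")
    # Exact-match scoring is an assumption; swap in any scorer that returns a bool.
    failed = tuple(c.case_id for c in cases if agent(c.prompt) != c.expected)
    accuracy = 1 - len(failed) / len(cases)
    return EvalVerdict(accuracy >= threshold, accuracy, failed)
```

Because the cases and the threshold are inputs rather than something derived from the agent's current behaviour, the gate cannot quietly absorb a live bug into the definition of correct.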

Deploy

Ship the agent in a controlled way. Canary rollouts, traffic shadowing, and rollback paths apply directly — Canary Rollout for Agent Policy covers the mechanics, and deploy-time permission scoping is the other half (Permission Framework Over Model Trust). Produces: a running deployment plus the observability hooks the monitor phase consumes.
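One way to sketch the canary split with a rollback path, assuming traffic is keyed by a session identifier and that a hash bucket is an acceptable way to pick the slice. The function names, the `salt` default, and the policy labels are hypothetical and not taken from any specific deploy tool.

```python
import hashlib


def in_canary(session_id: str, percent: float, salt: str = "policy-v2") -> bool:
    """Return True when this session falls inside the canary slice.

    Hashing keeps the assignment stable, so a session does not flip
    between policies from one request to the next.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{session_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100


def choose_policy(session_id: str, percent: float, rollback: bool = False) -> str:
    """Pick the policy for a session; the rollback flag overrides the canary."""
    if rollback:
        return "stable"
    return "canary" if in_canary(session_id, percent) else "stable"
```

Changing the salt reshuffles which sessions land in the canary, which can be useful when a new policy version should not reuse the previous canary population.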

Monitor

Trace every run, label every trace with a verdict, alert on drift. Agent dashboards track usage, feedback, latency, cost, tool calls, evaluator scores, and recurring failure patterns (LangChain blog).

The verdict step is load-bearing. Traces Need Feedback to Power Learning covers the four feedback sources — deterministic rule, LLM-as-judge, indirect user signal, direct user verdict — and the OTel gen_ai.evaluation.result channel for attaching them. Without that coupling, monitor produces trajectories nobody can act on. Produces: a verdict-labelled trace corpus and a regression-case stream for the next test cycle.
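The coupling between traces and verdicts can be sketched with plain data structures. This is an illustrative model only: it does not emit OTel events, and `FeedbackSource`, `Verdict`, `Trace`, and the two helper functions are hypothetical names rather than part of any tracing library.

```python
from dataclasses import dataclass, field
from enum import Enum


class FeedbackSource(Enum):
    """The four feedback sources named in the text."""
    DETERMINISTIC_RULE = "deterministic_rule"
    LLM_AS_JUDGE = "llm_as_judge"
    INDIRECT_USER_SIGNAL = "indirect_user_signal"
    DIRECT_USER_VERDICT = "direct_user_verdict"


@dataclass(frozen=True)
class Verdict:
    source: FeedbackSource
    passed: bool
    note: str = ""


@dataclass
class Trace:
    trace_id: str
    verdicts: list[Verdict] = field(default_factory=list)

    def label(self, verdict: Verdict) -> None:
        """Attach a verdict to this trace."""
        self.verdicts.append(verdict)

    @property
    def failed(self) -> bool:
        """A trace counts as failed when any attached verdict failed."""
        return any(not v.passed for v in self.verdicts)


def unlabelled(traces: list[Trace]) -> list[str]:
    """Trace ids with no verdict: trajectories nobody can act on yet."""
    return [t.trace_id for t in traces if not t.verdicts]


def regression_candidates(traces: list[Trace]) -> list[str]:
    """Trace ids with a failing verdict, feeding the next test cycle."""
    return [t.trace_id for t in traces if t.failed]
```

Treating any single failing verdict as a failure is a design choice in this sketch; a team might instead weight sources differently, for example trusting a direct user verdict over an LLM judge.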

Closing the Loop

Continuous Agent Improvement operationalises the Monitor → Build back-edge as an observation-to-update loop for agent configurations.

The underlying mechanism: agents fail on distributions, not on cases. Bug-fix-and-redeploy optimises one failing trace; a four-phase lifecycle with verdict-labelled traces optimises the failure-rate trend across a population. The phases are the minimum cut points where verdict-carrying signal can flow back.
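To make the distributional point concrete, here is a small sketch that turns verdict-labelled runs into a failure-rate trend. It assumes each run carries a sortable window label (such as an ISO week string) and a failed flag; both function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable


def failure_rate_by_window(runs: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Aggregate (window label, failed?) pairs into a failure rate per window."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for window, failed in runs:
        totals[window] += 1
        failures[window] += int(failed)
    return {w: failures[w] / totals[w] for w in sorted(totals)}


def trend(rates: dict[str, float]) -> float:
    """Latest window's rate minus the earliest; negative means improving."""
    ordered = [rates[w] for w in sorted(rates)]
    return ordered[-1] - ordered[0] if len(ordered) >= 2 else 0.0
```

A single fixed trace moves one pair from failed to passed; the trend only shifts when the fix generalises across the population of runs in later windows.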

When ADLC Adds Value

The lifecycle pays off when the cost of a regression exceeds the cost of four-phase ceremony. Regression cost rises with:

  • Multi-tenant or multi-user products where one regression affects many sessions
  • Long-horizon agents whose failure modes only surface across populations of runs
  • Teams with at least one prior regression that cost real time

When It Does Not

Failure conditions where ceremony costs more than it returns:

  • Single-agent solo team, pre-PMF: rebuild–redeploy–glance-at-logs dominates until a regression actually hurts. The four phases describe a destination, not a starting state.
  • Stateless one-shot agents: deterministic tool surfaces benefit more from classical web-service SRE than an agent-specific lifecycle.
  • Batch or cron-driven agents with no user surface: three of four feedback sources are unavailable, so monitor collapses to deterministic-rule scoring.
  • Multi-tenant agents with strict privacy constraints: trace-to-eval feedback (the Eval-Driven Development input) can violate compliance if raw inputs are persisted, and avoiding that persistence carries significant infra cost before the loop closes.

Ship the rebuild loop first; let the four phases differentiate as failure modes surface.

Tool Mapping Is Not the Pattern

LangChain names its own stack: LangGraph for build, LangSmith for test and monitor, LangSmith Deployment for deploy (LangChain blog, Medium). Other vendors converge on the same loop shape — Domino's "Agentic AI Development Lifecycle" (NAND Research) and EPAM's "Agentic Development Lifecycle" (EPAM). The vendor stack is one instantiation; any team can wire the same lifecycle from OTel traces, an eval runner, and a deploy pipeline.

One caveat on scope: several 2026 framings treat security and governance as a first-class lifecycle concern, not a deploy-time control — prompt-injection red-teaming, governed agent catalogs, and mandatory release gates appear as intrinsic phases (Cycode, Codebridge, IBM). The loop here folds that into deploy via Permission Framework Over Model Trust; regulated or multi-tenant teams should treat governance as a cross-cutting gate on every phase, not one checkpoint.

Example

A two-person team ships a support-triage agent and wants the loop without a vendor platform:

  • Build: define scope (classify and route inbound tickets, never auto-reply), pick a single-agent harness, wire OTel tracing. Artifact: a runnable agent plus a one-page scope doc.
  • Test: 40 labelled tickets become the eval suite. CI runs the agent against them and gates merge on ≥ 90% routing accuracy — written before the agent exists, so live bugs cannot redefine "correct." Artifact: a pass/fail verdict.
  • Deploy: a canary routes 5% of live tickets through the new policy with a one-command rollback; permission scoping blocks any write path beyond the ticketing API. Artifact: a running deployment emitting traces.
  • Monitor: every run is traced and labelled with a deterministic rule (did routing match the human's later reassignment?) plus a direct thumbs-up/down from the human support agent on duty. A weekly job converts each mis-route into a regression case (Monitor → Test) and surfaces recurring failure clusters for the next scope revision (Monitor → Build).

No LangGraph or LangSmith required — OTel, a pytest eval runner, and a feature-flagged deploy reproduce the same back-edges.
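The weekly Monitor → Test job from the example could look roughly like the sketch below. It assumes each traced ticket records the queue the triage agent chose and the queue a human finally left it in; `RoutedTicket` and both functions are hypothetical names, and the 0.90 default mirrors the gate in the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RoutedTicket:
    ticket_id: str
    text: str
    agent_queue: str  # where the triage agent routed the ticket
    final_queue: str  # where a human eventually left it


def misroutes_to_regression_cases(tickets: Sequence[RoutedTicket]) -> list[dict[str, str]]:
    """Turn each mis-route into an eval case keyed on the human's final queue."""
    return [
        {"case_id": t.ticket_id, "input": t.text, "expected_queue": t.final_queue}
        for t in tickets
        if t.agent_queue != t.final_queue
    ]


def passes_routing_gate(tickets: Sequence[RoutedTicket], threshold: float = 0.90) -> bool:
    """Apply the same accuracy gate the CI eval run would use."""
    if not tickets:
        raise ValueError("no tickets to score")
    correct = sum(t.agent_queue == t.final_queue for t in tickets)
    return correct / len(tickets) >= threshold
```

Under pytest, a test function could load the accumulated cases and assert `passes_routing_gate(...)`, so the regression stream and the merge gate share one definition of correct.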

Key Takeaways

  • ADLC is a meta-lifecycle for the agent product itself — distinct from a feature-level SDLC or a skill-library SDLC; same loop shape, different unit of work.
  • The four phases — build, test, deploy, monitor — produce explicit hand-off artifacts: scope doc, eval verdict, deploy artifact, verdict-labelled traces.
  • The Monitor → Test back-edge is operationalised by an incident-to-eval pipeline; the Monitor → Build back-edge by a continuous-improvement loop.
  • The mechanism is distributional: verdict-labelled traces let teams optimise failure-rate trends, not one-off failing cases.
  • The lifecycle is not free — small teams pre-PMF, stateless one-shot agents, batch jobs with no user surface, and privacy-constrained agents should ship the collapsed rebuild loop first.