Feature List Files¶

Maintain a JSON feature list with per-feature status and acceptance criteria; agents work it sequentially and cannot mark a feature passing until its criteria pass.

The Problem with Agent Self-Report¶

Agents left to self-report progress are optimistic. Anthropic's harness post documents a specific failure mode: "after some features had already been built, a later agent instance would look around, see that progress had been made, and declare the job done." Without an external contract, an agent marks a feature complete on whether the implementation looks plausible — not whether it passes acceptance criteria, so partially implemented work gets labeled done and the error compounds across sessions.

A machine-readable feature list with explicit pass/fail status replaces self-report with a verifiable contract — the foundation of reliable long-running agent work.

Structure¶

Each feature entry includes:

{
  "id": "feature-42",
  "description": "User can reset password via email link",
  "status": "failing",
  "acceptance_criteria": [
    "POST /auth/reset-request returns 200 with valid email",
    "Reset link expires after 1 hour",
    "Password changed successfully via link"
  ]
}

All features start with status: "failing". The agent cannot change status to passing unless the acceptance criteria are met.

Agent Contract¶

The system prompt makes the feature list the system of record. This aligns with LangChain's harness engineering findings: "the more that agents know about their environment, constraints, and evaluation criteria, the better they can autonomously self-direct their work."

Work through features in priority order
Select the highest-priority failing feature
Implement it, run verification, and check each criterion
Update status to passing only when all criteria pass
Commit and move to the next feature

The explicit instruction from Anthropic's harness post is blunt: "it is unacceptable to remove or edit tests because this could lead to missing or buggy functionality." This rule must appear verbatim in the agent's instructions.

Pass-Gate Policy¶

"Agent did not self-promote" is not enough — the harness needs explicit commit semantics for the failing → passing transition. The walkinglabs harness-engineering lecture on feature lists specifies four conditions, all required:

The expected workflow has been exercised end-to-end
The evidence of success is recorded
No blocking error is present in the tested path
The implementation does not leave the app in a broken or ambiguous state

Treat the gate as a database trigger, not an application-level check: a separate verifier runs the conditions on each transition attempt and rejects the write if any fail. Conditions 1–2 defeat the tests-pass-but-feature-doesn't-work failure mode Anthropic documents; condition 4 forces the agent to account for half-applied migrations, partial config edits, and inconsistent background-job state.

Feature List as State Machine¶

The feature list defines four states and one allowed transition path:

not_started — scheduler selects from here
active — work in progress; verifier re-runs the four conditions
blocked — a prerequisite is unmet; resolved by working a dependency first
passing — verifier accepted the transition

Per the walkinglabs lecture, active → passing is irreversible and only the verifier can write it. The list outranks conversation history: if the transcript says a feature works but the list says failing, the list wins. The agent re-reads it every turn — recorded state, not remembered state. Four harness roles operate on the same list: scheduler, verifier, handoff reporter, and progress tracker — none of them the agent itself.

Validation Strategy¶

Unit tests alone are insufficient for many feature validations. Per the Anthropic harness post: "Claude tended to make code changes, and even do testing with unit tests or curl commands against a development server, but would fail to recognize that the feature didn't work end-to-end." Browser automation (Playwright, Puppeteer) validates user-facing features the way a human user would.

Each feature entry's acceptance criteria should specify which validation method applies: automated test suite, browser automation, or both.

The Feature List as Cross-Session State¶

The feature list is the primary source of truth for multi-session progress. At the start of each session, the agent reads:

The feature list — what is passing, what is failing, what is next
git log — what was committed in prior sessions
Progress notes — any blockers or context from previous sessions

This triad replaces context window memory. The feature list does not forget, does not optimistically round up, and does not lose information when context is compacted.

Scale¶

Anthropic's harness post documents using 200+ features defined upfront, all initial statuses set to failing. At this scale the feature list becomes a project management artifact as well as an agent contract — it shows scope, progress, and remaining work in one format both agents and humans read.

When This Backfires¶

Feature list files add overhead that outweighs the benefit in several situations:

Exploratory or greenfield work: when requirements are unknown upfront, writing acceptance criteria before any implementation forces premature specification. The list becomes a fiction that diverges from what actually gets built.
Short single-session tasks: for a task completing in one session with no state continuity needed, a feature list is unnecessary scaffolding — the same threshold at which a frozen spec file stops paying off. The agent's own working memory is sufficient.
Criteria that can be gamed: acceptance criteria defined as command outputs can be satisfied by hardcoding responses. A poorly designed criterion (task list shows a checkmark) can be met without the underlying feature working correctly — the list gives false confidence rather than genuine verification.
Feature list staleness: in fast-moving projects requirements shift mid-build. Without a human reviewing and updating the list between sessions, the agent dutifully implements outdated entries while actual priorities diverge.
Scale without decomposition: at 200+ features a flat list with no dependency ordering forces the agent to attempt entries whose prerequisites have not been built. Priority order alone does not capture dependency graphs.

Example¶

A CLI task-management app uses a feature list to drive a multi-session agent build. The file features.json is committed to the repo root:

[
  {
    "id": "feat-1",
    "description": "Create a task with title and due date",
    "status": "failing",
    "acceptance_criteria": [
      "Running `task add 'Buy milk' --due 2025-04-01` creates a task in tasks.db",
      "Running `task list` shows the new task with its due date"
    ]
  },
  {
    "id": "feat-2",
    "description": "Mark a task as complete",
    "status": "failing",
    "acceptance_criteria": [
      "Running `task done 1` sets task 1 status to complete in tasks.db",
      "Running `task list` shows task 1 with a checkmark"
    ]
  },
  {
    "id": "feat-3",
    "description": "Filter tasks by status",
    "status": "failing",
    "acceptance_criteria": [
      "Running `task list --status=pending` shows only incomplete tasks",
      "Running `task list --status=done` shows only completed tasks"
    ]
  }
]

The system prompt references the file directly:

You are building a CLI task manager. Your contract is features.json in the repo root.

1. Read features.json at the start of every session.
2. Select the highest-priority feature with status "failing".
3. Implement it, then run the acceptance criteria as shell commands.
4. Only set status to "passing" if every criterion succeeds.
5. Commit the code and the updated features.json together.
6. It is unacceptable to remove or edit tests because this could lead to missing or buggy functionality.

After the agent completes feat-1, the file updates in place — feat-1 moves to "status": "passing" and the agent proceeds to feat-2. A new session reads the same file and picks up exactly where the previous session stopped.

Key Takeaways¶

All features start as failing; agents cannot self-promote status without passing acceptance criteria
The prohibition on editing or removing tests must be explicit in the system prompt
Use browser automation alongside unit tests for user-facing feature validation
The feature list is the cross-session source of truth — not git log or agent self-report
Define features upfront with end-to-end descriptions before any implementation begins