
Structured Agentic Software Engineering

Structured agentic software engineering (SASE) closes the gap between agent speed and human trust with durable artifacts, not faster models.

The speed-vs-trust gap

Agents are fast but unreliable where code meets review.

| Metric | Value | Source |
| --- | --- | --- |
| Median agent PR turnaround | 13.2 minutes | arXiv:2509.06216 |
| Plausible fixes that introduce regressions | 29.6% | arXiv:2509.06216 |
| SWE-bench solve rate drop after manual audit | 12.47% to 3.97% | arXiv:2509.06216 |
| Agent PRs that face long delays or remain unreviewed | >68% | arXiv:2509.06216 |

The bottleneck is verification, not generation. An independent replication found roughly half of SWE-bench-passing PRs would be rejected by real maintainers, most often for functional failure (METR, 2026).
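To make the size of the audit effect concrete, the arithmetic below restates the solve-rate figures from the table as an absolute and a relative drop. It uses only the two percentages quoted above; the variable names are illustrative.

```python
# Solve-rate figures quoted in the table above (arXiv:2509.06216).
solve_rate_before = 12.47  # percent, before manual audit
solve_rate_after = 3.97    # percent, after manual audit

absolute_drop = solve_rate_before - solve_rate_after
relative_drop = absolute_drop / solve_rate_before

print(f"absolute drop: {absolute_drop:.2f} percentage points")  # 8.50
print(f"relative drop: {relative_drop:.1%}")                    # 68.2%
```

Roughly two thirds of the nominally solved tasks did not survive the audit, which is why the section treats verification as the limiting step.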

SE maturity levels

A maturity model analogous to the SAE levels of driving automation:

graph LR
    L0["SE 0<br/>Manual"] --> L1["SE 1.0<br/>Tool-Assisted"]
    L1 --> L2["SE 2.0<br/>AI-Augmented"]
    L2 --> L3["SE 3.0<br/>Goal-Agentic"]
    L3 --> L4["SE 4.0<br/>Domain-Autonomous"]
    L4 --> L5["SE 5.0<br/>General-Autonomous"]
    style L3 stroke:#f90,stroke-width:3px

SE 3.0 (Goal-Agentic) is the current frontier: the agent receives a goal, decomposes it, executes with tools, and iterates under human oversight. SE 4.0 and 5.0 remain research targets. This parallels the AI development maturity model.
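The ladder can be encoded as an ordered enumeration so that tooling can compare levels. This is a minimal sketch: the class name `SELevel`, the constant `CURRENT_FRONTIER`, and the helper `is_research_target` are hypothetical names chosen for illustration, not part of the framework.

```python
from enum import IntEnum


class SELevel(IntEnum):
    """Maturity levels from the diagram above, ordered from 0 to 5."""
    MANUAL = 0
    TOOL_ASSISTED = 1
    AI_AUGMENTED = 2
    GOAL_AGENTIC = 3
    DOMAIN_AUTONOMOUS = 4
    GENERAL_AUTONOMOUS = 5


# The text names SE 3.0 as the current frontier.
CURRENT_FRONTIER = SELevel.GOAL_AGENTIC


def is_research_target(level: SELevel) -> bool:
    """Levels above the current frontier are still research targets."""
    return level > CURRENT_FRONTIER


print([level.name for level in SELevel if is_research_target(level)])
```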

Two environments

SASE splits developer and agent workspaces:

Agent Command Environment (ACE): the human command center for triaging Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs), setting goals, and reviewing evidence.

Agent Execution Environment (AEE): the agent workbench, with abstract syntax tree (AST) level tools, semantic search, and Model Context Protocol (MCP) servers, scoped by permissions. It extends agent-first software design.

The split mirrors the cognitive-execution separation: ACE decides, AEE executes.

Structured artifacts

The core contribution: replace ephemeral chat with durable artifacts.

BriefingScript

A mission specification covering intent, success criteria, constraints, and a solution blueprint. It is what spec-driven development calls the frozen spec, elevated to a formal artifact (arXiv:2509.06216).

MentorScript

Team norms in machine-readable form — the structured counterpart to AGENTS.md and CLAUDE.md files (see instruction file ecosystem and AGENTS.md standard).
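One way to picture "machine-readable" norms is as records that a tool can filter by file path. The sketch below is an assumption about what such a format could look like: the `Norm` fields, the glob-based scoping, and the two sample rules are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Norm:
    norm_id: str      # hypothetical identifier for the rule
    applies_to: str   # glob pattern selecting the files the rule covers
    rule: str         # the norm itself, stated for the agent


def norms_for(path: str, norms: list[Norm]) -> list[Norm]:
    """Select the norms whose glob pattern matches the given file path."""
    return [norm for norm in norms if fnmatch(path, norm.applies_to)]


# Illustrative rules only.
team_norms = [
    Norm("N1", "src/session/*", "Do not change the public API surface"),
    Norm("N2", "docs/*", "Keep examples runnable"),
]

applicable = norms_for("src/session/cleanup.rs", team_norms)
print([norm.norm_id for norm in applicable])
```

Unlike a freeform instruction file, each rule here carries its own scope, so an agent only sees the norms relevant to the files it touches.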

Merge-Readiness Pack (MRP)

An evidence bundle on each PR: test results, coverage, static analysis, rationale, and audit trail. Formalizes verification-centric development — review the evidence, not the diff — and extends tiered code review with progressive disclosure.
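The evidence bundle can be sketched as a record with a completeness check. The field names, the `missing_evidence` method, and the 80% coverage threshold below are assumptions made for illustration; the source lists only the kinds of evidence an MRP contains.

```python
from dataclasses import dataclass, field


@dataclass
class MergeReadinessPack:
    pr_id: str
    tests_passed: int
    tests_failed: int
    coverage_percent: float
    static_analysis_findings: list[str]
    rationale: str
    audit_trail: list[str] = field(default_factory=list)

    def missing_evidence(self, min_coverage: float = 80.0) -> list[str]:
        """List gaps in the bundle; an empty list means nothing is missing."""
        gaps = []
        if self.tests_passed + self.tests_failed == 0:
            gaps.append("no test results attached")
        if self.tests_failed > 0:
            gaps.append(f"{self.tests_failed} failing test(s)")
        if self.coverage_percent < min_coverage:
            gaps.append(f"coverage {self.coverage_percent}% is below {min_coverage}%")
        if self.static_analysis_findings:
            gaps.append(f"{len(self.static_analysis_findings)} static analysis finding(s)")
        if not self.rationale.strip():
            gaps.append("rationale is empty")
        if not self.audit_trail:
            gaps.append("audit trail is empty")
        return gaps


pack = MergeReadinessPack(
    pr_id="example-1",
    tests_passed=42,
    tests_failed=0,
    coverage_percent=85.0,
    static_analysis_findings=[],
    rationale="Adds a cleanup sweep for stale locks",
    audit_trail=["agent run log attached"],
)
print(pack.missing_evidence())
```

An empty gap list says the bundle is complete, not that the change is correct; the reviewer still has to judge the evidence itself.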

Consultation Request Pack (CRP)

Structured agent-to-human escalation: the agent packages context, options, and a recommendation; the human replies with a Version Controlled Resolution (VCR) that persists for future sessions. Operationalizes human-in-the-loop with a concrete artifact.
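The escalate-once, reuse-later behaviour can be sketched as a lookup against a log of earlier resolutions. Everything named below is hypothetical: the `topic` key, the class and function names, and the in-memory dictionary, which stands in for whatever version-controlled store a real setup would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConsultationRequest:
    topic: str                 # stable key identifying the question (assumed field)
    context: str
    options: tuple[str, ...]
    recommendation: str


@dataclass(frozen=True)
class Resolution:
    topic: str
    decision: str
    decided_by: str


class ResolutionLog:
    """In-memory stand-in for a version-controlled store of resolutions."""

    def __init__(self) -> None:
        self._by_topic: dict[str, Resolution] = {}

    def record(self, resolution: Resolution) -> None:
        self._by_topic[resolution.topic] = resolution

    def lookup(self, topic: str) -> Optional[Resolution]:
        return self._by_topic.get(topic)


def needs_escalation(request: ConsultationRequest, log: ResolutionLog) -> bool:
    """True when no earlier resolution covers the request's topic."""
    return log.lookup(request.topic) is None


log = ResolutionLog()
request = ConsultationRequest(
    topic="cleanup-strategy",
    context="Two ways to release stale locks",
    options=("timeout-based sweep", "heartbeat polling"),
    recommendation="timeout-based sweep",
)
first_time = needs_escalation(request, log)    # no resolution on record yet
log.record(Resolution("cleanup-strategy", "timeout-based sweep", "reviewer"))
second_time = needs_escalation(request, log)   # the decision is now reusable
print(first_time, second_time)
```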

LoopScript

Repeatable workflow and SOP definitions — analogous to CI/CD pipeline specs but for agent workflows.
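A repeatable workflow of this kind can be pictured as an ordered list of named steps that stops at the first failure, much like a pipeline stage list. The sketch below is an assumption about the shape, not the LoopScript format: `run_loop`, the step names, and the `state` keys are all hypothetical.

```python
from typing import Callable

# A step is a name plus an action that reads and updates shared state
# and reports success.
Step = tuple[str, Callable[[dict], bool]]


def run_loop(steps: list[Step], state: dict) -> list[str]:
    """Run steps in order, stop at the first failure, return completed names."""
    completed = []
    for name, action in steps:
        if not action(state):
            break
        completed.append(name)
    return completed


def reproduce(state: dict) -> bool:
    state["reproduced"] = True
    return True


def patch(state: dict) -> bool:
    state["patched"] = state.get("reproduced", False)
    return state["patched"]


def verify(state: dict) -> bool:
    return state.get("patched", False) and state.get("tests_green", False)


workflow = [("reproduce", reproduce), ("patch", patch), ("verify", verify)]
print(run_loop(workflow, {"tests_green": False}))
```

Returning the completed step names gives the human a record of where the workflow stopped, which fits the framework's preference for durable evidence over chat history.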

Practical implications

Specification is the new implementation. A precise BriefingScript reduces agent rework; see the frozen spec file.

Review evidence, not diffs. MRPs shift review from reading every line to verifying the evidence chain, which addresses the review bottleneck (arXiv:2509.06216).

Instruction files need structure. MentorScript argues that freeform files (AGENTS.md, CLAUDE.md, .cursorrules) should evolve toward machine-readable formats.

Why it works

Reviewers cannot trust output they cannot audit, and agents cannot improve without durable feedback. BriefingScripts and LoopScripts give agents machine-readable contracts that cut the ambiguity driving rework. MRPs give reviewers an auditable evidence chain instead of a line-by-line diff. CRPs with Version Controlled Resolutions stop the same escalation from recurring by persisting decisions as referenceable context (arXiv:2509.06216).

When this backfires

SASE adds process overhead that can exceed its benefits:

  • Small teams or early-stage projects: authoring BriefingScripts and MRPs rarely pays off when reviewers hold full context.
  • Ill-defined requirements: structured artifacts assume stable goals. When requirements shift mid-task, the BriefingScript becomes a constraint and agents over-optimize for the original spec — the rigidity risk spec-driven development carries.
  • Low-trust agent pipelines: MRPs are evidence bundles, not correctness proofs. At a 29.6% regression rate on "plausible" fixes, a polished evidence package can manufacture false confidence.
  • Tooling immaturity: ACE/AEE separation needs agents that consume structured artifacts reliably. Model adherence to structured formats varies.

Key Takeaways

  • The speed-vs-trust gap — not model capability — is the defining constraint of SE 3.0
  • Structured artifacts (BriefingScript, MRP, CRP, MentorScript, LoopScript) replace ephemeral chat with durable, reviewable contracts
  • The ACE/AEE environment split mirrors the cognitive-execution separation at the workspace level
  • Most SASE proposals have informal equivalents in practice (frozen specs, AGENTS.md, evidence-based review) — the contribution is naming and structuring them

Example

A BriefingScript for a bug-fix task, structured as the agent's input contract:

briefing:
  intent: "Fix race condition in session cleanup that causes orphaned locks"
  success_criteria:
    - "All sessions release locks within 30s of disconnect"
    - "No orphaned lock warnings in 24h soak test"
    - "Existing session tests pass without modification"
  context:
    repo: "acme/session-service"
    files:
      - "src/session/cleanup.rs"
      - "src/session/lock_manager.rs"
    related_issues: ["#1042", "#987"]
  constraints:
    - "Do not change the public API surface"
    - "Prefer timeout-based cleanup over heartbeat polling"
  blueprint:
    approach: "Add a cleanup sweep on a 30s interval that force-releases locks older than the session TTL"
    risk: "Sweep interval must not conflict with the existing GC timer in lock_manager.rs"

The corresponding MRP attached to the agent's PR would include test results, static analysis output, and the rationale linking each change back to the success criteria — giving the reviewer an evidence chain instead of a raw diff.
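The link between rationale and success criteria can be checked mechanically. The sketch below reuses the three success criteria from the BriefingScript above; the evidence entries and the helper `uncovered_criteria` are hypothetical, invented only to show the shape of the check.

```python
# Success criteria copied from the BriefingScript example above.
success_criteria = [
    "All sessions release locks within 30s of disconnect",
    "No orphaned lock warnings in 24h soak test",
    "Existing session tests pass without modification",
]

# Hypothetical MRP rationale: evidence items grouped by the criterion they support.
evidence_by_criterion = {
    "All sessions release locks within 30s of disconnect": ["new lock-release timing test"],
    "Existing session tests pass without modification": ["unchanged test suite run"],
}


def uncovered_criteria(criteria: list[str], evidence: dict[str, list[str]]) -> list[str]:
    """Return the criteria that have no supporting evidence attached."""
    return [criterion for criterion in criteria if not evidence.get(criterion)]


print(uncovered_criteria(success_criteria, evidence_by_criterion))
```

Here the soak-test criterion has no evidence, so a reviewer would know exactly which claim is still unsupported before opening the diff.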
