Structured Agentic Software Engineering¶
Structured agentic software engineering (SASE) closes the gap between agent speed and human trust with durable artifacts, not faster models.
The speed-vs-trust gap¶
Agents are fast but unreliable where code meets review.
| Metric | Value | Source |
|---|---|---|
| Median agent PR turnaround | 13.2 minutes | arXiv:2509.06216 |
| Plausible fixes that introduce regressions | 29.6% | arXiv:2509.06216 |
| SWE-bench solve rate drop after manual audit | 12.47% to 3.97% | arXiv:2509.06216 |
| Agent PRs that face long delays or remain unreviewed | >68% | arXiv:2509.06216 |
The bottleneck is verification, not generation. An independent replication found roughly half of SWE-bench-passing PRs would be rejected by real maintainers, most often for functional failure (METR, 2026).
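To put the audit figure in relative terms, the drop from 12.47% to 3.97% can be worked through directly. The sketch below uses only the two values from the table; the function name `relative_drop` is illustrative and not taken from the cited paper.

```python
def relative_drop(before: float, after: float) -> float:
    """Return the fraction of the original value that was lost."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Values from the table above: solve rate before and after manual audit.
reported, audited = 12.47, 3.97

absolute = reported - audited                 # percentage points lost
relative = relative_drop(reported, audited)   # share of reported solves lost

print(f"absolute drop: {absolute:.2f} percentage points")
print(f"relative drop: {relative:.1%}")
```

Read this way, roughly two thirds of the originally reported solves did not survive the manual audit.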
SE maturity levels¶
SASE uses a maturity model analogous to the SAE levels of driving automation:
```mermaid
graph LR
    L0["SE 0<br/>Manual"] --> L1["SE 1.0<br/>Tool-Assisted"]
    L1 --> L2["SE 2.0<br/>AI-Augmented"]
    L2 --> L3["SE 3.0<br/>Goal-Agentic"]
    L3 --> L4["SE 4.0<br/>Domain-Autonomous"]
    L4 --> L5["SE 5.0<br/>General-Autonomous"]
    style L3 stroke:#f90,stroke-width:3px
```
SE 3.0 (Goal-Agentic) is the current frontier: the agent receives a goal, decomposes it, executes with tools, and iterates under human oversight. SE 4.0 and 5.0 remain research targets. This parallels the AI development maturity model.
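If the levels need to be referenced in tooling, an ordered enumeration is one simple encoding. This is a minimal sketch: the level names come from the diagram, while `SELevel`, `CURRENT_FRONTIER`, and `is_research_target` are hypothetical names chosen for illustration.

```python
from enum import IntEnum


class SELevel(IntEnum):
    """SE maturity levels, in the order the diagram lists them."""
    MANUAL = 0
    TOOL_ASSISTED = 1
    AI_AUGMENTED = 2
    GOAL_AGENTIC = 3
    DOMAIN_AUTONOMOUS = 4
    GENERAL_AUTONOMOUS = 5


# The text names SE 3.0 (Goal-Agentic) as the current frontier.
CURRENT_FRONTIER = SELevel.GOAL_AGENTIC


def is_research_target(level: SELevel) -> bool:
    """Levels beyond the current frontier are still research targets."""
    return level > CURRENT_FRONTIER


research_targets = [lvl.name for lvl in SELevel if is_research_target(lvl)]
print(research_targets)
```

Using `IntEnum` keeps the levels comparable, so "beyond the frontier" is a plain ordering check.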
Two environments¶
SASE splits developer and agent workspaces:
- Agent Command Environment (ACE): the human command center for triaging Merge-Readiness Packs (MRPs) and Consultation Request Packs (CRPs), setting goals, and reviewing evidence.
- Agent Execution Environment (AEE): the agent workbench, with AST-level tools, semantic search, and MCP servers, all scoped by permissions. It extends agent-first software design.
The split mirrors the cognitive-execution separation: ACE decides, AEE executes.
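One way to picture "scoped by permissions" is a grant that the execution environment checks before each tool call or file write. The sketch below is an assumption about how such scoping could look, not a description of any SASE implementation; `ToolGrant`, the tool names, and the path prefixes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolGrant:
    """Hypothetical permission record for one agent session in the AEE."""
    allowed_tools: frozenset[str]
    writable_prefixes: tuple[str, ...] = ()


def may_call(grant: ToolGrant, tool: str) -> bool:
    """A tool is callable only if it was explicitly granted."""
    return tool in grant.allowed_tools


def may_write(grant: ToolGrant, path: str) -> bool:
    """Naive prefix check; a real system would normalize paths first."""
    return any(path.startswith(prefix) for prefix in grant.writable_prefixes)


# Illustrative grant, set from the ACE side and enforced on the AEE side.
grant = ToolGrant(
    allowed_tools=frozenset({"semantic_search", "ast_rename"}),
    writable_prefixes=("src/session/",),
)

print(may_call(grant, "semantic_search"))          # granted tool
print(may_call(grant, "shell"))                    # tool not in the grant
print(may_write(grant, "src/session/cleanup.rs"))  # inside the writable scope
```

The grant is immutable (`frozen=True`) to reflect the split: the ACE decides the scope, and the AEE only checks against it.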
Structured artifacts¶
The core contribution: replace ephemeral chat with durable artifacts.
BriefingScript¶
A mission specification covering intent, success criteria, constraints, and a solution blueprint. It is what spec-driven development calls the frozen spec, elevated to a formal artifact (arXiv:2509.06216).
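The four parts named above map naturally onto a small typed record. The following is a sketch under the assumption that a briefing should not be handed to an agent until intent, success criteria, and blueprint are present; the class shape and the `problems` method are illustrative, not a schema defined by the paper.

```python
from dataclasses import dataclass, field


@dataclass
class BriefingScript:
    """Illustrative container for the four parts of a BriefingScript."""
    intent: str
    success_criteria: list[str]
    constraints: list[str] = field(default_factory=list)
    blueprint: str = ""

    def problems(self) -> list[str]:
        """Return reasons the briefing is not ready to hand to an agent."""
        issues = []
        if not self.intent.strip():
            issues.append("intent is empty")
        if not self.success_criteria:
            issues.append("no success criteria")
        if not self.blueprint.strip():
            issues.append("no solution blueprint")
        return issues


draft = BriefingScript(
    intent="Fix race condition in session cleanup",
    success_criteria=[],
)
print(draft.problems())
```

Constraints are optional in this sketch because a task may legitimately have none; that choice is an assumption.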
MentorScript¶
Team norms in machine-readable form — the structured counterpart to AGENTS.md and CLAUDE.md files (see instruction file ecosystem and AGENTS.md standard).
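"Machine-readable" here could mean that each norm carries enough structure for a tool to select the rules relevant to a given file. A minimal sketch of that idea follows; the `Norm` fields and the two example rules are hypothetical and are not drawn from any real MentorScript format.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Norm:
    """One team norm with a glob describing where it applies (hypothetical)."""
    rule_id: str
    applies_to: str
    guidance: str


# Example norms, invented for illustration only.
NORMS = [
    Norm("rs-no-unwrap", "*.rs", "Avoid unwrap() outside tests"),
    Norm("docs-changelog", "*.md", "Update the changelog for user-facing changes"),
]


def norms_for(path: str, norms: list[Norm]) -> list[Norm]:
    """Select the norms whose glob matches the given file path."""
    return [n for n in norms if fnmatchcase(path, n.applies_to)]


for norm in norms_for("src/session/cleanup.rs", NORMS):
    print(norm.rule_id, "-", norm.guidance)
```

Compared with a freeform instruction file, the structured form lets an agent load only the norms that apply to the files it is touching.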
Merge-Readiness Pack (MRP)¶
An evidence bundle on each PR: test results, coverage, static analysis, rationale, and audit trail. Formalizes verification-centric development — review the evidence, not the diff — and extends tiered code review with progressive disclosure.
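The five evidence types listed above suggest a simple completeness check before a PR is queued for review. This sketch assumes each piece of evidence is represented by a string (for example a link or summary); the class and method names are illustrative.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MergeReadinessPack:
    """Illustrative evidence bundle with the five parts named in the text."""
    test_results: Optional[str] = None
    coverage: Optional[str] = None
    static_analysis: Optional[str] = None
    rationale: Optional[str] = None
    audit_trail: Optional[str] = None

    def missing_evidence(self) -> list[str]:
        """Names of evidence fields that are absent or empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Placeholder values; a real pack would reference CI output and logs.
pack = MergeReadinessPack(
    test_results="placeholder: link to test run",
    rationale="placeholder: why each change was made",
)
print(pack.missing_evidence())
```

A pack with nothing missing is only complete, not correct: the check confirms the evidence exists, and a reviewer still has to judge it.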
Consultation Request Pack (CRP)¶
Structured agent-to-human escalation: the agent packages context, options, and a recommendation; the human replies with a Version Controlled Resolution (VCR) that persists for future sessions. Operationalizes human-in-the-loop with a concrete artifact.
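The escalation loop can be sketched as a lookup before asking: if a resolution for the same topic already exists, the agent reuses it instead of escalating again. The in-memory `ResolutionLog` below stands in for a version-controlled store, and keying resolutions by a `topic` string is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConsultationRequest:
    """What the agent packages: context, options, and a recommendation."""
    topic: str
    context: str
    options: tuple[str, ...]
    recommendation: str


@dataclass(frozen=True)
class Resolution:
    """The human's reply, kept for future sessions."""
    topic: str
    decision: str
    decided_by: str


class ResolutionLog:
    """In-memory stand-in for a version-controlled store of resolutions."""

    def __init__(self) -> None:
        self._by_topic: dict[str, Resolution] = {}

    def record(self, resolution: Resolution) -> None:
        self._by_topic[resolution.topic] = resolution

    def lookup(self, topic: str) -> Optional[Resolution]:
        return self._by_topic.get(topic)


def reuse_prior_resolution(request: ConsultationRequest,
                           log: ResolutionLog) -> Optional[Resolution]:
    """Return an earlier decision if one exists; None means ask the human."""
    return log.lookup(request.topic)


log = ResolutionLog()
request = ConsultationRequest(
    topic="lock-cleanup-strategy",
    context="Two viable ways to release orphaned locks",
    options=("timeout-based cleanup", "heartbeat polling"),
    recommendation="timeout-based cleanup",
)

first = reuse_prior_resolution(request, log)    # nothing recorded yet
log.record(Resolution(request.topic, "timeout-based cleanup", "maintainer"))
second = reuse_prior_resolution(request, log)   # decision is now reusable
```

In practice the log would live in the repository so that decisions are reviewed and versioned like code.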
LoopScript¶
Repeatable workflow and SOP definitions — analogous to CI/CD pipeline specs but for agent workflows.
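By analogy with a CI pipeline, a workflow definition can be modeled as ordered steps that halt at the first failure. The sketch below is one possible reading of that analogy; `Step`, `run_loop`, and the step names are hypothetical, and the lambdas stand in for real agent actions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    """One named workflow step; `run` returns True on success."""
    name: str
    run: Callable[[], bool]


def run_loop(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order; after the first failure, mark the rest skipped."""
    report = []
    failed = False
    for step in steps:
        if failed:
            report.append((step.name, "skipped"))
            continue
        ok = step.run()
        report.append((step.name, "passed" if ok else "failed"))
        failed = not ok
    return report


# Stand-in steps: the second one fails, so the third never runs.
steps = [
    Step("reproduce_bug", lambda: True),
    Step("apply_patch", lambda: False),
    Step("open_pr", lambda: True),
]
report = run_loop(steps)
print(report)
```

The returned report is the kind of record that could feed the audit trail of an evidence bundle.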
Practical implications¶
- Specification is the new implementation. A higher-quality BriefingScript reduces agent rework; see the frozen spec file.
- Review evidence, not diffs. MRPs shift review from reading every line to verifying the evidence chain, which addresses the review bottleneck (arXiv:2509.06216).
- Instruction files need structure. MentorScript argues that freeform files (AGENTS.md, CLAUDE.md, .cursorrules) should evolve toward machine-readable formats.
Why it works¶
Reviewers cannot trust output they cannot audit, and agents cannot improve without durable feedback. BriefingScripts and LoopScripts give agents machine-readable contracts that cut the ambiguity driving rework. MRPs give reviewers an auditable evidence chain instead of a line-by-line diff. CRPs with Version Controlled Resolutions stop the same escalation from recurring by persisting decisions as referenceable context (arXiv:2509.06216).
When this backfires¶
SASE adds process overhead that can exceed its benefits:
- Small teams or early-stage projects: authoring BriefingScripts and MRPs rarely pays off when reviewers hold full context.
- Ill-defined requirements: structured artifacts assume stable goals. When requirements shift mid-task, the BriefingScript becomes a constraint and agents over-optimize for the original spec — the rigidity risk spec-driven development carries.
- Low-trust agent pipelines: MRPs are evidence bundles, not correctness proofs. At a 29.6% regression rate on "plausible" fixes, a polished evidence package can manufacture false confidence.
- Tooling immaturity: ACE/AEE separation needs agents that consume structured artifacts reliably. Model adherence to structured formats varies.
Key Takeaways¶
- The speed-vs-trust gap — not model capability — is the defining constraint of SE 3.0
- Structured artifacts (BriefingScript, MRP, CRP, MentorScript, LoopScript) replace ephemeral chat with durable, reviewable contracts
- The ACE/AEE environment split mirrors the cognitive-execution separation at the workspace level
- Most SASE proposals have informal equivalents in practice (frozen specs, AGENTS.md, evidence-based review) — the contribution is naming and structuring them
Example¶
A BriefingScript for a bug-fix task, structured as the agent's input contract:
```yaml
briefing:
  intent: "Fix race condition in session cleanup that causes orphaned locks"
  success_criteria:
    - "All sessions release locks within 30s of disconnect"
    - "No orphaned lock warnings in 24h soak test"
    - "Existing session tests pass without modification"
  context:
    repo: "acme/session-service"
    files:
      - "src/session/cleanup.rs"
      - "src/session/lock_manager.rs"
    related_issues: ["#1042", "#987"]
  constraints:
    - "Do not change the public API surface"
    - "Prefer timeout-based cleanup over heartbeat polling"
  blueprint:
    approach: "Add a cleanup sweep on a 30s interval that force-releases locks older than the session TTL"
    risk: "Sweep interval must not conflict with the existing GC timer in lock_manager.rs"
```
The corresponding MRP attached to the agent's PR would include test results, static analysis output, and the rationale linking each change back to the success criteria — giving the reviewer an evidence chain instead of a raw diff.
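Linking each change back to the success criteria can be checked mechanically. The sketch below assumes the rationale is stored as a mapping from a change description to the indices of the criteria it serves; that representation and the two rationale entries are hypothetical, while the criteria are copied from the BriefingScript above.

```python
SUCCESS_CRITERIA = [
    "All sessions release locks within 30s of disconnect",
    "No orphaned lock warnings in 24h soak test",
    "Existing session tests pass without modification",
]

# Hypothetical rationale: change description -> indices of criteria it serves.
rationale = {
    "Add 30s cleanup sweep in src/session/cleanup.rs": [0],
    "Leave existing session tests untouched": [2],
}


def untraced_criteria(criteria: list[str],
                      rationale: dict[str, list[int]]) -> list[str]:
    """Return the criteria that no change in the rationale links back to."""
    covered = {i for indices in rationale.values() for i in indices}
    return [c for i, c in enumerate(criteria) if i not in covered]


gaps = untraced_criteria(SUCCESS_CRITERIA, rationale)
print(gaps)
```

Here the soak-test criterion has no linked change, which is exactly the kind of gap a reviewer would want surfaced before reading the diff.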
Related¶
- Cognitive Reasoning vs Execution Separation — ACE/AEE maps to the two-layer architecture
- Spec-Driven Development — BriefingScript aligns with the frozen spec
- Verification-Centric Development — MRPs extend evidence-based verification
- Tiered Code Review — progressive disclosure of review evidence
- Human-in-the-Loop — CRPs formalize structured escalation
- Instruction File Ecosystem — MentorScript formalizes instruction files
- AI Development Maturity Model — team-adoption maturity paralleling SE levels
- Agentless vs Autonomous — counterpoint: when simpler workflows outperform structured agentic approaches