Risk-Based Task Sizing for Agent Verification Depth

Scale verification effort to match task risk — trivial changes get quick checks, high-risk changes get multi-model adversarial review and human approval gates.

The Problem

Most agent workflows apply uniform verification: every change runs the same checks regardless of whether it touches a comment or an auth module. Low-risk changes waste cycles; high-risk changes pass with insufficient scrutiny because the bar is set for average tasks.

File Risk Classification

The Anvil agent classifies files into three risk tiers based on what they control:

Tier	Scope	Examples
Low	Additive, no behavioral change	New tests, documentation, config comments
Medium	Existing behavior modified	Business logic, function signatures, database queries, UI state
High	Security or data integrity surface	Auth, crypto, payments, data deletion, schema migrations, public API

Classification is static per file — determined by what the file controls, not the current change. A one-line edit to an authentication module stays high-risk because the blast radius of a mistake there is large.
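
A minimal sketch of static per-file classification, assuming tiers are assigned by path pattern. The pattern lists and the `classify_file` helper are hypothetical illustrations, not the Anvil agent's actual rules; a real project would maintain its own map.

```python
from enum import IntEnum
from fnmatch import fnmatchcase


class RiskTier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical patterns, chosen only for illustration.
LOW_RISK_PATTERNS = ["tests/*", "docs/*", "*.md"]
HIGH_RISK_PATTERNS = ["*auth*", "*crypto*", "*payments*", "*migrations*", "*deploy*"]


def classify_file(path: str) -> RiskTier:
    """Return a file's tier from its path alone, never from the size of the diff."""
    # Tests and docs are checked first so that tests/test_deploy.py is not
    # pulled into the High tier by the "*deploy*" pattern.
    if any(fnmatchcase(path, pattern) for pattern in LOW_RISK_PATTERNS):
        return RiskTier.LOW
    if any(fnmatchcase(path, pattern) for pattern in HIGH_RISK_PATTERNS):
        return RiskTier.HIGH
    # Anything unmatched is treated as existing behavior being modified.
    return RiskTier.MEDIUM
```

Because the function never sees the diff, a one-line edit and a rewrite of the same file receive the same tier, which is the property described above.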

Task Sizing

Task size combines scope and file risk:

Size	Verification	Review
Small	IDE diagnostics + syntax check only	None
Medium	Full verification cascade + structured ledger	1 reviewer
Large	Full cascade + operational readiness checks	3 cross-model reviewers + human gate

The Anvil agent applies one escalation rule: high-risk files auto-escalate to Large regardless of scope. A typo fix in a payments module triggers the full pipeline because the file's risk tier overrides apparent simplicity.

The default heuristic is "if unsure, treat as Medium" — err toward more verification, not less.
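
The escalation rule and the Medium default could be sketched as follows. The source does not spell out how scope and file risk combine below the High tier, so the `FLOOR` mapping (each file tier sets a minimum size) is an assumption made for this sketch, as are the names `TaskSize` and `size_task`.

```python
from enum import IntEnum
from typing import Iterable, Optional


class RiskTier(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


class TaskSize(IntEnum):
    SMALL = 1
    MEDIUM = 2
    LARGE = 3


# Assumption: each file tier sets a minimum task size.
FLOOR = {
    RiskTier.LOW: TaskSize.SMALL,
    RiskTier.MEDIUM: TaskSize.MEDIUM,
    RiskTier.HIGH: TaskSize.LARGE,
}


def size_task(file_tiers: Iterable[RiskTier], scope: Optional[TaskSize] = None) -> TaskSize:
    """Combine a scope estimate with file risk; scope=None means "unsure"."""
    tiers = list(file_tiers)
    # Escalation rule: any high-risk file forces Large regardless of scope.
    if RiskTier.HIGH in tiers:
        return TaskSize.LARGE
    # "If unsure, treat as Medium."
    if scope is None:
        scope = TaskSize.MEDIUM
    floor = max((FLOOR[t] for t in tiers), default=TaskSize.MEDIUM)
    return max(scope, floor)
```

Under these assumptions, `size_task([RiskTier.HIGH], scope=TaskSize.SMALL)` returns `TaskSize.LARGE`, mirroring the typo-fix-in-payments case.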

Verification Cascade

Verification is tiered with fallback layers:

1. IDE diagnostics: always run on changed files and their importers
2. Syntax/parse check: the file must parse without errors
3. Build/compile: run if build tooling exists
4. Type checker: run on changed files
5. Linter: run on changed files only
6. Test suite: full suite or relevant subset
7. Import/load test: verify the module loads without crashing (fallback when tiers 3-6 produce no runtime signal)
8. Smoke execution: a throwaway script exercising the changed code path (fallback when no other runtime verification exists)

The Anvil agent requires at least one tier 7-8 check when tiers 1-6 yield only static signals. Empty runtime verification is never acceptable.
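
One way to enforce the "at least one runtime signal" rule is to have each tier report whether it actually executed the changed code, then gate on that flag. The `TierResult` record and both helper functions are hypothetical names for this sketch; which tiers count as runtime signal is left to the caller rather than hardcoded.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TierResult:
    tier: int             # 1-8, matching the cascade order
    name: str
    ran: bool             # False if the tier was skipped (e.g. no build tooling)
    passed: bool
    runtime_signal: bool  # True only if the tier executed the changed code


def needs_runtime_fallback(results: Iterable[TierResult]) -> bool:
    """True when no tier that ran produced runtime signal, so tier 7 or 8 is required."""
    return not any(r.ran and r.runtime_signal for r in results)


def cascade_is_acceptable(results: Iterable[TierResult]) -> bool:
    """All tiers that ran must pass, and runtime verification must not be empty."""
    executed: List[TierResult] = [r for r in results if r.ran]
    return all(r.passed for r in executed) and not needs_runtime_fallback(executed)
```

A cascade whose only results are passing IDE diagnostics and a syntax check is rejected by `cascade_is_acceptable` until an import/load test or smoke execution is added.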

Structured Verification Ledger

Every verification step is recorded as structured data — an INSERT, not prose. The evidence bundle shown to the developer is a SELECT, not a self-reported summary. This prevents hallucinated verification: if the INSERT did not happen, the check did not happen. See Verification Ledger for the full pattern.

The ledger captures baseline and post-change state, enabling regression detection by comparing the two phases programmatically.
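
The INSERT/SELECT idea can be illustrated with Python's built-in `sqlite3` module. The table name, column layout, and helper functions here are assumptions for the sketch, not the schema of the Verification Ledger pattern itself.

```python
import sqlite3
from typing import List


def open_ledger(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) ledger table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS verification_ledger (
            id         INTEGER PRIMARY KEY,
            task_id    TEXT NOT NULL,
            phase      TEXT NOT NULL CHECK (phase IN ('baseline', 'post_change')),
            check_name TEXT NOT NULL,
            passed     INTEGER NOT NULL,
            detail     TEXT
        )
        """
    )
    return conn


def record(conn: sqlite3.Connection, task_id: str, phase: str,
           check_name: str, passed: bool, detail: str = "") -> None:
    """Each verification step is an INSERT; no row means the check did not happen."""
    conn.execute(
        "INSERT INTO verification_ledger (task_id, phase, check_name, passed, detail) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, phase, check_name, int(passed), detail),
    )
    conn.commit()


def regressions(conn: sqlite3.Connection, task_id: str) -> List[str]:
    """Checks that passed at baseline but fail after the change."""
    rows = conn.execute(
        """
        SELECT b.check_name
        FROM verification_ledger AS b
        JOIN verification_ledger AS p
          ON p.task_id = b.task_id AND p.check_name = b.check_name
        WHERE b.task_id = ?
          AND b.phase = 'baseline' AND p.phase = 'post_change'
          AND b.passed = 1 AND p.passed = 0
        ORDER BY b.check_name
        """,
        (task_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The evidence shown to the developer would then come from queries such as `regressions(conn, task_id)` rather than from the agent's own summary of what it ran.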

When This Backfires

Risk-tier systems inherit the weaknesses of risk-based testing: subjective assignment, classification drift, and under-testing of low-risk areas. Specific conditions where the pattern underperforms uniform verification:

Tier drift after refactors. Static per-file tiers assume file purpose is stable. A test helper that accretes production code over six months may still be tagged Low. Teams routinely stop updating risk matrices once maintenance cost exceeds perceived benefit (TestRail, "Pros and Cons of Risk-Based Testing").
Subjective classification. Two engineers can reasonably disagree whether a billing calculator is "business logic" (Medium) or "data integrity surface" (High). Without a rubric enforced in review, tier assignments drift and create a false sense of rigor (Technology.org, "Benefits and disadvantages of risk-based testing").
High-risk-file fatigue. Auto-escalating every touch of an auth file discourages defensible small improvements — typo fixes, comment updates, dead-code removal. Teams route around the policy by avoiding the file.
Low-tier blind spots. Concentrating effort on High-tier files under-weights defects that arise from interactions between Low-tier modules, the same cross-cutting interaction risk that risk-based shipping flags. A documentation change that invalidates a runbook can cause an incident the tiered cascade never catches.

If the risk map is not reviewed regularly, or the team lacks a shared rubric, uniform verification may be more honest than a stale tier map masquerading as risk awareness.

Key Takeaways

Classify files by what they control (data integrity, security surface), not by change size
High-risk files auto-escalate verification regardless of apparent task simplicity
Default to Medium when uncertain — over-verification is cheaper than missed defects
Tier verification with fallbacks so every change gets at least one runtime signal
Record verification as structured data, not prose — queryable evidence over self-reported claims

Example

A coding agent receives a task: "Add a --dry-run flag to the deploy CLI command." The agent identifies the changed files and classifies each:

File	Risk Tier	Reason
`cli/deploy.py`	High	Controls production deployment — data integrity surface
`cli/flags.py`	Medium	Modifies existing CLI argument parsing
`tests/test_deploy.py`	Low	Additive test — no behavioral change

The highest file risk is High, so the task auto-escalates to Large regardless of the change's apparent simplicity. The agent runs the full verification cascade:

✓ Tier 1 — IDE diagnostics: 0 errors in deploy.py, flags.py, test_deploy.py
✓ Tier 2 — Syntax check: all files parse
✓ Tier 3 — Build: `pip install -e .` succeeds
✓ Tier 4 — Type checker: `mypy cli/ tests/` passes
✓ Tier 5 — Linter: `ruff check cli/ tests/` clean
✓ Tier 6 — Tests: `pytest tests/test_deploy.py` — 14 passed, 0 failed

Tiers 3-6 produced runtime signal (the tier 6 test run executed the changed code), so the fallback tiers 7-8 are skipped. The agent records each result as a structured ledger entry and routes the change to three cross-model reviewers plus a human approval gate, which is the full Large-task pipeline.

If the same task touched only `tests/test_deploy.py` (Low tier), it would stay Small: IDE diagnostics and a syntax check, with no review required.