Classifier-Subagent Run Mode for Per-Call Permission Routing¶

A run mode routes each shell, MCP, and fetch call through allowlist, sandbox, then classifier subagent steered by project instructions.

It applies when the harness exposes a per-call routing seam, the workload is high-volume routine work, admin-enforced configuration is in place, and irreversible actions are pinned to a deterministic gate elsewhere. Outside those conditions, a static allowlist plus sandbox plus mandatory human approval is the lower-failure-surface posture.

What the pattern routes¶

Cursor, Anthropic, and OpenAI ship the same shape: every side-effecting tool call (shell, MCP, fetch) passes a three-stage router before execution.

Tier	What runs	What decides
1 — Allowlist	Curated commands and tools	Deterministic string-match against `permissions.json` / `settings.json`
2 — Sandbox	Calls the harness can isolate (filesystem + network limits)	Deterministic capability check against the sandbox profile
3 — Classifier subagent	Everything else	LLM verdict steered by project-specific instructions

Cursor's 3.6 release (2026-05-29) ships this as the default "Auto-review" Run Mode: "Allowlisted calls run immediately. Other calls run in the sandbox when possible" — the remainder route to a classifier (Cursor changelog; Cursor terminal docs). Claude Code's auto mode wires the same three tiers plus a PermissionDenied hook for retry-on-block (Anthropic engineering). OpenAI Codex's approvals_reviewer = "auto_review" redirects boundary-crossing decisions to a separate reviewer agent (Codex auto-review docs).

graph TD
    A[Tool call proposed] --> B{Allowlist match?}
    B -->|Yes| C[Execute]
    B -->|No| D{Sandboxable?}
    D -->|Yes| E[Execute in sandbox]
    D -->|No| F[Classifier subagent]
    F -->|Allow| C
    F -->|Suggest alternative| G[Return rationale to doing agent]
    F -->|Escalate| H[Human approval prompt]
    C --> I[Audit log]
    E --> I
    G --> A
    H --> I

The three-way decision space¶

Cursor's 3.6 classifier is the first shipped implementation returning one of three verdicts rather than a binary allow/block:

Allow — the call proceeds outside the sandbox.
Suggest alternative — the harness rejects the call and returns a rationale to the doing agent, which retries differently. Codex's internal-traffic snapshot reports the agent recovered safely on 4 of 7 reviewer denials this way (OpenAI alignment: auto-review).
Escalate — the call surfaces to the user as a synchronous approval prompt.

The third path is load-bearing. Binary classifiers collapse retryable mistakes into hard stops, reintroducing the rubber-stamp failure mode that human-in-the-loop confirmation gates — and the classifier — exist to defeat. The three-way space treats the classifier as a router between agent and human, not a final arbiter.

The classifier can issue verdicts the doing agent cannot because it sees a different context. OpenAI: "The main agent is optimized to complete the user's task. This creates pressure to treat an approval boundary as just another obstacle to overcome. Auto-review has a narrower job" (alignment.openai.com/auto-review/). Anthropic calls the same property "reasoning-blind by design" — Claude Code strips the agent's own messages and tool outputs from the classifier's context "so the agent can't talk the classifier into making a bad call" (Anthropic engineering).

Custom-instructions steering¶

The classifier is portable across very different repos only because its decision policy is project-configurable. Cursor exposes this through permissions.json's autoRun block:

{
  "autoRun": {
    "allow_instructions": [
      "Read-only inspections of the staging Kubernetes cluster via kubectl get",
      "Database migrations against the dev branch in db/migrations/"
    ],
    "block_instructions": [
      "Any call that writes to terraform/ outside the modules/ subtree",
      "Network egress to domains not listed in sandbox.json"
    ]
  }
}

Both fields are arrays of free-form sentences; per-user and per-repo entries concatenate (Cursor permissions.json reference). Codex exposes the same mechanism through [auto_review] policy = """...""" in config.toml, with guardian_policy_config for enterprise managed settings (Codex auto-review docs). Claude Code uses block rules in managed settings with admin-enforced precedence (Anthropic engineering).

The classifier's training distribution covers generic risk signals, but a Terraform-heavy infra repo and a React app have non-overlapping definitions of "boundary-crossing." Project-specific instructions adapt the classifier to each without retraining or forking the harness.

Why it works¶

The pattern works because it factors the per-call decision into a deterministic outer layer (allowlist plus sandbox) and a probabilistic inner layer (classifier subagent) — and because the classifier sees a different context than the doing agent. Anthropic's auto-mode classifier is a two-stage pipeline: a single-token fast filter tuned for recall, then chain-of-thought reasoning only on flagged calls, so per-call cost stays bounded under high-volume traffic (Anthropic engineering).

The empirical effect is the rubber-stamp problem inverted. Anthropic reports Claude Code users approved 93% of permission prompts before auto mode shipped (Anthropic engineering). Codex's internal-traffic snapshot reports the manual mode it replaces would have interrupted users 720 times per 10,000 actions; in auto-review the reviewer approved 713, denied 7, and the agent recovered safely on 4 of those 7 — sessions stop for approval "roughly 200x less often" (alignment.openai.com/auto-review/). The user sees only the residual surface the three tiers cannot absorb.

OpenAI names the boundary clearly: auto-review is "a reviewer swap, not a permission grant. It does not expand writable_roots, enable network access, or weaken protected paths" (Codex auto-review docs). The classifier decides who reviews the call; the sandbox and managed-settings boundaries decide what is reachable at all.

When this backfires¶

In-project ops state. When production-affecting files (Terraform state, sealed-secrets, k8s manifests) live inside the project root, the Tier-2 sandbox lets the agent edit them without ever reaching the classifier — an independent stress test measured 36.8% of state-changing actions bypassing classification this way (Ji et al., 2026: Measuring the Permission Gate). Move the state outside the root or pin the paths with deterministic deny rules.
Out-of-distribution workload. Anthropic's 17% false-negative rate is on a hand-curated calibration set. Ji et al. measured 81.0% FNR (95% CI 73.8%–87.4%) on 128 ambiguous DevOps prompts and 92.9% FNR on artifact-cleanup tasks (arxiv:2604.04978). Vendor numbers are workload-conditional, not universal.
Untrusted configuration surface. If an attacker can write .cursor/permissions.json, .vscode/settings.json, or local rule files, the scheme collapses to whichever layer the attacker reaches first (Backslash: The Denylist Delusion). Admin-enforced managed settings are not optional.
Same model class for agent and classifier. A same-class evaluator is a biased reviewer of its own work: LLM evaluators both recognize and favor their own generations, with self-recognition correlating linearly with self-preference bias (Panickssery et al., 2024: LLM Evaluators Recognize and Favor Their Own Generations). The shared training distribution also means a jailbreak that defeats the agent tends to defeat the classifier. Use a different model class, or at minimum reasoning-blind the classifier.
Sandbox masks the allowlist. Cursor users report that under "Auto-Run in Sandbox" the user-curated command allowlist is silently bypassed (Cursor forum). Verify precedence empirically before trusting the documented hierarchy.
Irreversible actions on the classifier path. Production deploys, money movement, IAM rotation, and credential exposure do not belong on a probabilistic auto-approval gate — they belong on deterministic deny rules with mandatory human approval. Cursor's own docs state the classifier "is non-deterministic and can make mistakes in both directions" (Cursor terminal Run Mode docs).
The trifecta does not close. The classifier narrows but does not close the egress decision. An agent with private-data access, untrusted content (a fetched page or MCP-returned document), and any allowlisted shell can still be steered into exfiltration the classifier permits because the call shape looks routine. The fix is removing one leg of the trifecta, not a smarter filter (Lethal Trifecta Threat Model; Simon Willison: the lethal trifecta).

Cross-tool comparison¶

The three shipped implementations differ in surface but share the three-tier shape:

	Cursor 3.6 Auto-review	Claude Code auto mode	Codex auto-review
Default run mode	Auto-review (3.6+)	Opt-in `--mode auto`	`approvals_reviewer = "auto_review"` in `config.toml`
Classifier verdict	Allow / suggest alternative / escalate	Allow / block (retry via `PermissionDenied` hook)	Allow / deny (agent receives rationale and may retry)
Reasoning isolation	Classifier prompt receives user input plus `autoRun` instructions	"Reasoning-blind by design" — agent reasoning stripped	"Compact transcript plus the exact approval request... Hidden assistant reasoning is not included"
Steering surface	`permissions.json` `autoRun.allow_instructions` / `block_instructions`	Block rules in managed settings; `hard_deny` for org-wide	`[auto_review] policy = """..."""` in `config.toml`; `guardian_policy_config` for managed
Configuration precedence	Per-user concatenates with per-repo	Admin-enforced managed settings override workspace	Individual policies take precedence over managed defaults
Empirical posture	"Non-deterministic and can make mistakes in both directions" (docs)	0.4% FPR / 17% FNR on calibration set (Anthropic)	7 denials per 10,000 actions; 4 of 7 recovered (OpenAI)

The setting names differ; the three-tier router and steering-by-custom-instructions are the load-bearing invariants.

Example¶

A platform team enables classifier-subagent run mode across a Cursor 3.6 fleet maintaining a multi-service repo. The permissions.json ships in the repo:

{
  "terminalAllowlist": [
    "pnpm test",
    "pnpm lint",
    "git status",
    "git diff *"
  ],
  "autoRun": {
    "allow_instructions": [
      "Read-only kubectl get against the staging cluster",
      "Database migrations against ephemeral preview-env databases"
    ],
    "block_instructions": [
      "Any write to terraform/state/ or terraform/production/",
      "Any kubectl command targeting the prod-* or prd-* contexts",
      "Any helm install or helm upgrade against production",
      "Any pnpm publish or npm publish"
    ]
  }
}

The allowlist absorbs the per-keystroke test and git calls. The sandbox absorbs short scripts the agent writes during a task. The classifier reads the autoRun instructions plus the user's current prompt and routes a Terraform refactor: it permits a terraform plan against the staging workspace, suggests-alternative when the agent proposes terraform apply against the production workspace, and escalates a helm upgrade to the user — the deny instruction was strong enough to flag but not deterministic enough to hard-block.

The classifier-subagent run mode does not protect against an attacker who can write permissions.json itself. That's why the file ships through CODEOWNERS-protected paths and managed settings, not as a workspace-editable preference.

Key Takeaways¶

Classifier-subagent run mode is a three-tier router, not a single gate: allowlist absorbs deterministic calls, sandbox absorbs containable ones, classifier subagent routes the residual.
Cursor 3.6's three-way verdict (allow / suggest alternative / escalate) is the load-bearing innovation — binary classifiers reintroduce rubber-stamp pressure on the third path.
Custom-instructions steering at the classifier layer (permissions.json autoRun, Codex [auto_review] policy, Claude Code managed settings) is what lets the same harness ship safely across very different repos.
The classifier sees a different context than the doing agent (reasoning-blind by design) — that asymmetry is what defeats the doing agent's pressure to argue around boundaries.
Workload-conditional published numbers (calibration 17% FNR vs adversarial 81% FNR) and the 36.8% Tier-2 bypass measured by Ji et al. mean the classifier is a probabilistic floor, not a security guarantee.
The pattern does not close the lethal trifecta — it narrows the egress decision but a side-effecting allowlisted shell with private-data context and any untrusted fetch input is still an exfiltration path. Remove a trifecta leg first.

Classifier-Gated Auto-Permission for Cloud-IDE Coding Agents — The cross-vendor pattern this page operationalizes as a per-call router, with the four-design-choice framework
Inference-Time Tool-Call Reviewer — The complementary slot: reviews call correctness, not escalation status, with Helpfulness-Harmfulness metrics
Lethal Trifecta Threat Model — Why narrowing the egress decision is not the same as closing the trifecta
Human-in-the-Loop Confirmation Gates — The mandatory-approval tier for irreversible actions the classifier should not decide
Deferred Permission Pattern — How headless sessions pause for out-of-band human approval when the classifier escalates