Inline Safety Harness with Cascade Verification (FinHarness)
Wrap each agent turn with prospective per-call monitors and route verification between a cheap and an advanced judge by per-step risk.
An inline safety harness evaluates each prospective tool call before it executes and adaptively chooses how much verification to spend on it, rather than filtering only at the conversation boundary or auditing after the agent terminates. FinHarness (Jia et al., 2026) is the finance-domain instance: a Query Monitor that fuses single-turn intent with cross-turn drift, a Tool Monitor that scores each prospective call, and a Cascade module that routes verification between a lightweight and an advanced-tier judge.
The pattern pays off only under specific conditions, so they come first. Outside this envelope, a simpler stack wins.
When This Applies
The harness earns its operational cost only when these conditions hold together:
- Irreversible mid-trajectory actions. Boundary filters miss tool calls that fire mid-turn; post-hoc judges audit only after termination, too late to intervene and at a cost that scales linearly with trace length (Jia et al., 2026). Prospective per-call evaluation is the only placement that can block an irreversible call before it commits.
- High tool-call volume per session. Cascade routing amortises over many calls. With a low call count, the routing overhead exceeds the advanced-judge cost it saves.
- A two-tier judge budget worth optimising. Routed FinHarness used 4.7x fewer advanced-judge calls than an always-advanced ablation while cutting attack success rate from 38.3% to 15.0% on the FinVault benchmark, with benign approval largely preserved (41.1% to 39.3%) (Jia et al., 2026). The saving matters only when the advanced judge is expensive enough to ration.
- Bounded, trustworthy context. The re-injection step (below) assumes the agent can reliably act on safety evidence delivered through its own context — a condition that degrades as context grows or is poisoned.
How It Works
graph TD
Q["Query Monitor<br>intent + cross-turn drift"] --> TM["Tool Monitor<br>score prospective call"]
TM --> C{"Cascade<br>per-step risk"}
C -->|low risk| L["Lightweight judge"]
C -->|high risk| A["Advanced judge"]
L --> R["Re-inject fired risk<br>as ex-ante evidence"]
A --> R
R --> D{"Agent decides"}
D -->|refuse / re-plan| Q
D -->|approve| E["Tool executes"]
style E fill:#1a7f37,color:#fff
The three components (two monitors and the cascade) run inline on every turn (Jia et al., 2026):
- Query Monitor — fuses the current turn's intent with accumulated cross-turn drift, catching goals that shift gradually across a multi-step workflow.
- Tool Monitor — scores each prospective tool call against policy before it executes.
- Cascade module — routes verification by per-step risk: low-risk steps get the cheap judge, high-risk steps escalate to the advanced one. This is where the cost saving comes from.
The distinctive step: fired risk factors are re-injected into the agent input as ex-ante evidence, so the agent can refuse, re-plan, or approve on its own (Jia et al., 2026). The harness does not just block — it returns the verdict to the agent as a governance signal.
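The flow described above (score the prospective call, route it to a judge tier, then return any fired risk factors to the agent) can be sketched as follows. This is a minimal illustration under stated assumptions, not the FinHarness implementation: the threshold, the tool names, the toy scoring rule, and both judge functions are hypothetical stand-ins for what would be model calls in practice.

```python
from dataclasses import dataclass, field
from typing import List

# Every name, threshold, and rule here is a hypothetical illustration,
# not taken from FinHarness.
ESCALATION_THRESHOLD = 0.5
IRREVERSIBLE_TOOLS = {"wire_transfer", "close_account"}


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class Verdict:
    tier: str  # "light" or "advanced"
    risk_factors: List[str] = field(default_factory=list)


def tool_monitor_score(call: ToolCall, drift: float) -> float:
    """Toy per-step risk: a base score for the tool plus cross-turn drift."""
    base = 0.7 if call.name in IRREVERSIBLE_TOOLS else 0.1
    return min(1.0, base + drift)


def light_judge(call: ToolCall) -> List[str]:
    """Stand-in for the cheap judge: one shallow check."""
    return ["policy override requested"] if call.args.get("override") else []


def advanced_judge(call: ToolCall) -> List[str]:
    """Stand-in for the expensive judge: a deeper (here, still toy) check."""
    factors = light_judge(call)
    if call.args.get("amount", 0) > 10_000:
        factors.append("amount exceeds 10,000 review limit")
    return factors


def cascade(call: ToolCall, drift: float) -> Verdict:
    """Route the call to one judge tier based on its per-step risk."""
    risk = tool_monitor_score(call, drift)
    if risk >= ESCALATION_THRESHOLD:
        return Verdict("advanced", advanced_judge(call))
    return Verdict("light", light_judge(call))


def reinject(agent_input: str, call: ToolCall, verdict: Verdict) -> str:
    """Append fired risk factors to the agent input before the call runs."""
    if not verdict.risk_factors:
        return agent_input
    evidence = "; ".join(verdict.risk_factors)
    return f"{agent_input}\n[safety evidence for pending call {call.name}: {evidence}]"
```

In this sketch the harness never blocks on its own. `reinject` only changes what the agent sees, which is why the design depends on the agent acting reliably on that evidence.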
Why It Works
Cascade routing reduces aggregate verification cost without surrendering coverage because risk is unevenly distributed across steps: most calls are routine and a cheap judge clears them, while the expensive judge is reserved for the few high-risk transitions that justify it (Jia et al., 2026). The same economics drive general cost-optimal cascade routing, where calibrated per-call uncertainty decides when to escalate (UCCI, 2026). Prospective inline monitoring closes the gap that boundary filters (blind to mid-trajectory calls) and post-hoc judges (too late, and linear in trace length) both leave open.
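The cost argument can be made concrete with a small back-of-the-envelope model. The sketch below assumes each call goes to exactly one judge tier, as in the diagram, and that every call pays a fixed routing cost. The cost units are hypothetical, and treating the reported 4.7x reduction as an escalation rate of roughly 1/4.7 is an assumption that holds only if the always-advanced ablation judges every call once.

```python
def cascade_cost(
    n_calls: int,
    escalation_rate: float,
    light_cost: float,
    advanced_cost: float,
    routing_cost: float = 0.0,
) -> float:
    """Total verification cost when each call goes to exactly one judge tier.

    All costs are in arbitrary, hypothetical units per call.
    """
    escalated = n_calls * escalation_rate
    cleared_cheaply = n_calls - escalated
    return (
        n_calls * routing_cost
        + cleared_cheaply * light_cost
        + escalated * advanced_cost
    )


def always_advanced_cost(n_calls: int, advanced_cost: float) -> float:
    """Baseline: every call is sent to the advanced judge."""
    return n_calls * advanced_cost


if __name__ == "__main__":
    rate = 1 / 4.7  # assumed escalation rate, see the note above
    for advanced in (1.0, 5.0, 20.0):  # hypothetical advanced-judge prices
        routed = cascade_cost(100, rate, light_cost=1.0,
                              advanced_cost=advanced, routing_cost=0.5)
        baseline = always_advanced_cost(100, advanced)
        print(f"advanced={advanced}: routed={routed:.1f} baseline={baseline:.1f}")
```

When the advanced judge costs about the same as the lightweight one, the routing overhead makes the cascade the more expensive option. The saving appears only as the price gap between the tiers widens.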
When This Backfires
The re-injection design makes a contested bet: that the agent's context is trustworthy enough to act on. Three conditions break it.
- Long or poisoned context. Refusal behaviour in long-context agents is unstable — refusal rates shift unpredictably as context grows (one model rose from ~5% to ~40%, another fell from ~80% to ~10% at 200K tokens), with task performance dropping over 50% by 100K tokens (Hadeliya et al., 2026). Re-injecting risk evidence into the same context relies on exactly the behaviour that degrades. An agent under active indirect injection is worse still: feedback-guided payloads achieve full success on a majority of tested targets, including production coding agents (IterInject, 2026). Where the control plane can sit outside the model context, that is the stronger guarantee — CaMeL makes the opposite architectural bet for this reason.
- Low call volume or tight latency. Below roughly five tool calls per session, the cascade overhead exceeds the advanced-judge cost it saves. Under a tight latency budget (sub-200ms p99), tiered judging adds inference hops that break the SLA — a single fast classifier or a behavioral firewall fits better.
- Reversible tools. Prospective pre-execution evaluation is overkill when actions can be undone; saga-style compensation after the fact is cheaper and easier to test.
For most deployments below a high-stakes-regulated bar, a simpler stack — deterministic per-call policy, a single capable judge at high-risk transitions, and compensation for reversible tools — covers the same surface at lower cost.
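As a rough illustration of that simpler stack, the sketch below combines a deterministic per-call policy, a single judge consulted only at high-risk transitions, and compensation for reversible tools. The tool names, the denylist, the compensation table, and the `judge` callback are all hypothetical; a real judge would be a model call rather than a lambda.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tool catalogue, for illustration only.
DENYLIST = {"delete_ledger"}            # never allowed
HIGH_RISK = {"wire_transfer"}           # needs a judge decision
COMPENSATIONS: Dict[str, str] = {"place_hold": "release_hold"}  # undo actions

Judge = Callable[[str, dict], bool]


def decide(tool: str, args: dict, judge: Judge) -> str:
    """Deterministic policy first; the judge runs only for high-risk tools."""
    if tool in DENYLIST:
        return "block"
    if tool in HIGH_RISK:
        return "allow" if judge(tool, args) else "block"
    return "allow"


def run_with_compensation(calls: List[Tuple[str, dict]], judge: Judge) -> List[str]:
    """Walk the planned calls; on a block, return the undo actions to run.

    Returns an empty list when every call is allowed. Tool execution itself
    is omitted, so this only tracks which reversible steps would need undoing.
    """
    executed: List[str] = []
    for tool, args in calls:
        if decide(tool, args, judge) == "block":
            # Saga-style: compensate already-executed steps in reverse order.
            return [COMPENSATIONS[t] for t in reversed(executed) if t in COMPENSATIONS]
        executed.append(tool)
    return []
```

A usage example under the same assumptions: with `judge = lambda tool, args: args.get("amount", 0) <= 1000`, a plan of `place_hold` followed by a 5,000 `wire_transfer` is stopped at the transfer and yields `release_hold` as the compensating action.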
Key Takeaways
- Inline prospective monitoring blocks irreversible mid-trajectory calls that boundary filters miss and post-hoc judges catch too late (Jia et al., 2026).
- Cascade routing between a cheap and an advanced judge cut advanced-judge calls 4.7x while reducing attack success rate from 38.3% to 15.0% — the saving matters only when the advanced judge is expensive enough to ration.
- Re-injecting fired risk into the agent's context for self-governance assumes a trustworthy context; long-context refusal instability and indirect injection make that conditional.
- Apply the full harness only to high-stakes, high-call-volume, irreversible-action workflows; simpler defenses dominate elsewhere.
Related
- Lifecycle-Integrated Security Architecture for Agent Harnesses — the closest neighbour: four-phase lifecycle defense with cross-layer feedback, without the cost-adaptive cascade routing
- Mid-Trajectory Guardrail Selection for Multi-Step Tool Calls — how to pick the judge model the cascade routes between
- Behavioral Firewall for Tool-Call Trajectories — deterministic alternative for stable-catalog, latency-sensitive workflows
- Control/Data-Flow Separation for Prompt Injection Defense (CaMeL) — the opposite bet: keep the control plane outside the model context
- Defense-in-Depth Agent Safety — the layering principle the harness extends rather than replaces