Unbounded Consumption: Bounding Agent Resource Use Against DoS and Denial-of-Wallet
Agent harnesses bind DoS and denial-of-wallet to one control surface — per-call, per-task, concurrency, velocity, and budget bounds — that no single layer covers alone.
The Threat
OWASP LLM10:2025 Unbounded Consumption names four sub-classes the same harness can produce (OWASP LLM10:2025 mirror):
| Sub-class | Mechanism | Owner |
|---|---|---|
| Variable-length input | Oversized input drives CPU/memory load until the service degrades | Availability |
| Denial of wallet | Attacker drives token consumption on a pay-per-use account; service stays up, bill drains | Finance |
| Resource amplification | Crafted input triggers the model's most expensive paths (long output, tool chains) | Both |
| Model replication | API access used to mint synthetic training data for a derivative model | Product/legal |
The first three share a structural feature: an LLM call's cost is variable and attacker-influenceable (input length, output length, tool-chain depth), priced linearly. The same retry loop that drains the wallet can exhaust a rate-shared backend — bounding it is a security control, not a finance preference.
Sysdig's LLMjacking research documented costs of up to $46,000/day against AWS Bedrock at peak (Claude 2.x; up to 3x for Opus), with 85,000 Bedrock requests, 61,000 of them in a single 3-hour window (Sysdig). A stolen Google Gemini API key ran up $82,000 in charges in 48 hours in March 2026 (Truefoundry, 2026). In both cases the application looked healthy: DoS detection (latency, error rate) registered nothing.
The Five Bounds
No single layer covers the full cost dimension. Each bound closes a failure mode the others miss:
| Bound | What it caps | What it misses alone |
|---|---|---|
| Per-call token cap (`max_tokens`) | One model call's output size | Multi-call tool chains; expensive inputs |
| Per-task iteration cap | Agent loop depth (e.g. LangChain `max_iterations=15`) | Cost variance per iteration; cheap-loop-but-expensive-call combinations |
| Fan-out concurrency cap | Parallel sub-agent or batch breadth | Sequential expense; long-running serial chains |
| Cost-velocity breaker | Rolling-average dollars/min per principal | Pre-existing baseline; first-time-expensive workloads |
| Per-day dollar budget | Absolute spend ceiling per (user, repo, model) | Within-day burst windows (3-hour Bedrock attack finishes before daily alarms) |
LangChain's AgentExecutor ships `max_iterations=15` and supports `max_execution_time` (seconds) (LangChain docs), but the iteration cap is blind to per-step cost. A fast agent can burn 10 iterations in 8 seconds, and iteration caps neither track token spend nor distinguish a cheap iteration from an expensive one, and they cannot enforce a daily dollar budget (Truefoundry, 2026). The five bounds are complementary by design.
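To make the blind spot concrete, the sketch below meters dollars alongside iterations and wall-clock time in one per-task guard. It is a minimal illustration under stated assumptions, not LangChain's API: `TaskGuard`, `BoundExceeded`, `charge`, the per-task dollar ceiling, and the per-1K-token prices are all hypothetical.

```python
import time
from dataclasses import dataclass, field


class BoundExceeded(RuntimeError):
    """Raised when a per-task bound is hit (hypothetical exception type)."""


@dataclass
class TaskGuard:
    # All limits are illustrative placeholders, not recommendations.
    max_iterations: int = 15
    max_seconds: float = 300.0
    max_dollars: float = 2.00
    iterations: int = 0
    dollars: float = 0.0
    started: float = field(default_factory=time.monotonic)

    def charge(self, input_tokens: int, output_tokens: int,
               usd_per_1k_in: float, usd_per_1k_out: float) -> None:
        """Record one agent step and raise if any per-task bound is exceeded."""
        self.iterations += 1
        self.dollars += (input_tokens / 1000) * usd_per_1k_in
        self.dollars += (output_tokens / 1000) * usd_per_1k_out
        if self.iterations > self.max_iterations:
            raise BoundExceeded("iteration cap")
        if time.monotonic() - self.started > self.max_seconds:
            raise BoundExceeded("time cap")
        if self.dollars > self.max_dollars:
            raise BoundExceeded("task dollar cap")
```

With an assumed ceiling of $0.50 and assumed prices of $0.003 and $0.015 per 1K input and output tokens, a task that sends 100,000 input tokens per step would trip the dollar cap on its third step, long before it reaches 15 iterations.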
Bounds Routing
```mermaid
graph TD
    A[Agent call] --> B[Per-call token cap]
    B --> C[Per-task iteration cap]
    C --> D[Fan-out concurrency cap]
    D --> E[Cost-velocity breaker]
    E --> F[Per-day dollar budget]
    F --> G[Execute]
    B -->|exceed| X[Reject early]
    C -->|exceed| X
    D -->|queue| Q[Backpressure]
    E -->|trip| Y[Throttle / pause]
    F -->|cap| Z[Block until window resets]
```
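The routing order can also be read as a chain of deterministic checks. The following is a sketch only: `CallState`, `Limits`, `Decision`, and `route` are hypothetical names, and the default limits are placeholder values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    REJECT = "reject early"
    QUEUE = "backpressure"
    PAUSE = "throttle / pause"
    BLOCK = "block until window resets"


@dataclass
class CallState:
    # Hypothetical snapshot of one pending agent call.
    requested_tokens: int
    iteration: int
    in_flight: int
    velocity_ratio: float  # current dollars/min divided by the rolling average
    spent_today: float     # dollars already spent in the daily window


@dataclass
class Limits:
    max_tokens: int = 8192
    max_iterations: int = 15
    max_concurrency: int = 4
    velocity_multiplier: float = 8.0
    daily_budget: float = 50.0


def route(state: CallState, limits: Limits) -> Decision:
    """Apply the five bounds in routing order; the first failed check decides."""
    if state.requested_tokens > limits.max_tokens:
        return Decision.REJECT
    if state.iteration > limits.max_iterations:
        return Decision.REJECT
    if state.in_flight >= limits.max_concurrency:
        return Decision.QUEUE
    if state.velocity_ratio > limits.velocity_multiplier:
        return Decision.PAUSE
    if state.spent_today >= limits.daily_budget:
        return Decision.BLOCK
    return Decision.EXECUTE
```

Ordering the cheapest, most local checks first means an oversized call is rejected before it consumes a concurrency slot or touches the budget ledger.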
Why It Works
LLM calls have variable, attacker-influenceable cost, priced linearly. Requests-per-second does not bind dollars-per-second when one request costs $0.001 and the next $0.50 (Pignati, 2026). The five-bound surface works because each bound expresses a different unit of cost — tokens, iterations, parallelism, velocity, dollars — and the union covers what no single unit captures. OWASP LLM10 makes the routing explicit: the same bounds serve availability and finance owners without duplicating enforcement (OWASP LLM10:2025; Truefoundry, 2026).
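The gap between request rate and spend rate is simple arithmetic. The sketch below uses the two per-request costs quoted above; the rate limit of 10 requests per second is an assumed figure for illustration only.

```python
REQUESTS_PER_SECOND = 10  # assumed rate limit, for illustration only


def dollars_per_hour(rps: float, usd_per_request: float) -> float:
    """Spend per hour if every request is admitted at the given rate."""
    return rps * 3600 * usd_per_request


cheap = dollars_per_hour(REQUESTS_PER_SECOND, 0.001)
costly = dollars_per_hour(REQUESTS_PER_SECOND, 0.50)
print(f"cheap: ${cheap:,.2f}/h  costly: ${costly:,.2f}/h")
```

Under the same request-rate limit, the cheap traffic costs about $36 an hour and the expensive traffic about $18,000 an hour, which is why a limit expressed only in requests cannot stand in for one expressed in dollars.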
When This Backfires
The bounds add real cost (config surface, false-positive risk, debugging difficulty). Five conditions invert the trade-off:
- Single-shot or batch-of-one agents: a CLI one-shot summariser has no loop to bound and no fan-out to throttle; `max_iterations=15` is unused machinery. The bounds amortise across repeated invocations.
- Trusted internal-only deployments: when callers are first-party services behind authn, the denial-of-wallet vector collapses; infra-level rate limits already cover availability. Avoid duplicating controls.
- Fixed thresholds without cost-velocity telemetry: "100 calls/min" misses the Continual Inconspicuous DoW pattern (low and slow over hours), which is "difficult to distinguish from legitimate traffic patterns" (arxiv:2508.19284). It also over-triggers on legitimate bursty workflows: a document-summarisation task involving file retrieval, chunking, three LLM calls, and storage will trip a tight bucket; "one rogue script blocks all the user's legitimate work, including the work they need to debug the rogue script" (Pignati, 2026). Tuple-keyed limits on `(user, repo, model)` plus rolling-average velocity beat fixed absolutes.
- Tool-chain amplification outside the model's token counter: per-call `max_tokens` does not see chains. arxiv:2601.10955 demonstrates 658x cost amplification and trajectories exceeding 60,000 tokens against a model with a 4K per-call cap, by manipulating tool responses to coerce verbose multi-turn chains. The per-task and cost-velocity bounds are the chain-level controls; per-call caps alone are blind.
- Bounds enforced by brittle classifiers: when an LLM-based safeguard sits in the bounding path, the safeguard itself becomes a DoS vector. A 30-character adversarial suffix universally blocks over 97% of legitimate requests on Llama Guard 3 (arxiv:2410.02916). Deterministic counters (tokens, iterations, dollars) belong in the enforcement path; semantic checks belong in detection only.
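The tuple-keyed, rolling-average shape recommended in the list above can be sketched with plain counters. This is an illustrative sketch, not a reference implementation: `VelocityBreaker`, `record`, `roll_window`, and the `floor_usd` parameter (an assumed minimum threshold for keys that have no baseline yet) are hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

Key = Tuple[str, str, str]  # (user, repo, model)


class VelocityBreaker:
    """Trips when spend in the current window exceeds a multiple of the
    rolling average of earlier windows for the same key."""

    def __init__(self, multiplier: float = 8.0, history: int = 12,
                 floor_usd: float = 0.10):
        self.multiplier = multiplier
        self.floor_usd = floor_usd  # assumed floor for keys without history
        self.current: Dict[Key, float] = defaultdict(float)
        self.past: Dict[Key, Deque[float]] = defaultdict(
            lambda: deque(maxlen=history))

    def record(self, key: Key, usd: float) -> bool:
        """Add spend to the current window; return True if the breaker trips."""
        self.current[key] += usd
        past = self.past[key]
        baseline = sum(past) / len(past) if past else 0.0
        threshold = max(baseline * self.multiplier, self.floor_usd)
        return self.current[key] > threshold

    def roll_window(self) -> None:
        """Close the current window for every key (call once per window)."""
        for key in set(self.current) | set(self.past):
            self.past[key].append(self.current.pop(key, 0.0))
```

Because state is keyed on the full tuple, a spike in one repository trips only that repository's breaker and leaves the same user's other work running. Only deterministic counters sit in the enforcement path, consistent with the last point in the list.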
Example
A multi-tenant agent platform that runs Claude-Code-style sub-agents per repository wires the five bounds as follows (illustrative composition drawn from Truefoundry's three-layer gateway, 2026):
```yaml
# Per (user, repo, model), not per user, so one runaway repo
# does not block the user's other work
limits:
  per_call_max_tokens: 8192
  per_task_max_iterations: 15
  per_task_max_seconds: 300
  fan_out_concurrency: 4
  cost_velocity:
    window_minutes: 5
    multiplier_over_rolling_avg: 8
    action: pause
  per_day_dollar_budget:
    claude_sonnet: 50.00
    claude_opus: 200.00
    on_exhaust: block_until_window
```
Each bound maps to a named failure case: the per-call cap catches a runaway prompt, the iteration cap catches a tool-call loop, the fan-out cap limits a parallel-spawn injection, the velocity breaker catches an unprecedented cost spike, and the dollar budget is the daily backstop. Removing any one leaves a documented amplification path open.
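As one possible enforcement shape for the daily backstop, the sketch below keeps a per-day ledger keyed on `(user, repo, model)` and refuses any spend that would cross the model's ceiling. `DailyBudget` and `try_spend` are hypothetical names, and the in-memory dictionary is an assumption made for brevity; a real gateway would need shared, durable state.

```python
from datetime import date
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (user, repo, model)


class DailyBudget:
    """In-memory daily ledger; entries for past days are never evicted here."""

    def __init__(self, budgets_usd: Dict[str, float]):
        self.budgets_usd = budgets_usd  # per-model daily ceiling in dollars
        self.spent: Dict[Tuple[date, Key], float] = {}

    def try_spend(self, key: Key, usd: float,
                  today: Optional[date] = None) -> bool:
        """Reserve `usd` for `key`; return False if it would exceed the cap."""
        today = today or date.today()
        model = key[2]
        ledger_key = (today, key)
        spent = self.spent.get(ledger_key, 0.0)
        if spent + usd > self.budgets_usd[model]:
            return False  # block until the next daily window
        self.spent[ledger_key] = spent + usd
        return True
```

Keying the ledger on the date means the block lifts when the next day's window opens, with no explicit reset step.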
Key Takeaways
- OWASP LLM10:2025 makes DoS and denial-of-wallet a same-surface, two-owner concern — the same bounds serve both threat models.
- No single bound covers the cost dimension; per-call, per-task, fan-out, cost-velocity, and per-day budget are complementary by design.
- Real incidents reach $46K/day and $82K/48hr ranges before any per-application detection fires; the 3-hour attack window finishes before daily billing alarms.
- Tool-chain amplification (658x in arxiv:2601.10955) routes around per-call token caps; chain-level bounds (iteration, velocity) are the structural control.
- Fixed RPS limits with single-bucket keying break legitimate workflows and miss low-and-slow DoW; tuple-keyed on `(user, repo, model)` with rolling-average velocity is the working shape.
Related
- Agent Circuit Breaker — tool-level recovery state machine; complements the loop-level and budget-level bounds on this page
- Security Budget as Token Economics — pre-release audit sizing under the same cost-economics frame
- Loop Detection — observability signal that feeds the per-task iteration cap
- Blast Radius Containment: Least Privilege for AI Agents — complementary control axis; bounds cap consumption while least-privilege caps reach