Runbooks as Agent Instructions
Runbooks written for humans fail for agents through implicit context, ambiguous decision points, and assumed knowledge — and each failure mode needs a different fix.
Runbooks-as-agent-instructions are operational procedures rewritten so an agent can execute them end-to-end: every implicit action becomes an explicit tool call, every ambiguous condition becomes a measurable threshold, and every assumed context is declared in the runbook itself. The rewriting is driven by a three-question audit, not a template. A concrete adoption target — all operational runbooks followable by the agent within a fixed time window — surfaces the core problem: most runbooks are written as memory aids for experienced operators, not as executable instructions.
Why Human Runbooks Fail for Agents
Human runbooks fail for agents in three distinct ways:
| Failure mode | Example | Agent's problem |
|---|---|---|
| Implicit action | "Check the dashboard" | No tool to call, no success criterion |
| Ambiguous condition | "If load looks high..." | Cannot evaluate a vague threshold |
| Assumed context | "Restart the usual way" | No access to tribal knowledge |
Each failure mode requires a different fix. An audit step before rewriting identifies which failure applies to each step.
The Audit Workflow
Before rewriting anything, audit each runbook step against three questions:
- Can the agent invoke this? If the step requires clicking a UI, calling a named API, or running a shell command, the agent needs the exact invocation — endpoint, flags, expected output format.
- Can the agent evaluate this condition? Decision points ("if this looks wrong") must become explicit conditionals with a measurable signal and a threshold.
- Does the step depend on knowledge the agent doesn't have? Service topology, escalation contacts, system quirks — these must be declared explicitly or loaded via a references directory.
Steps that fail question 1 need tool-call replacements. Steps that fail question 2 need explicit conditionals. Steps that fail question 3 need supporting context injected. Anthropic's guidance on building effective agents notes that tool definitions require the same deliberate engineering attention as system prompts — "example usage, edge cases, input format requirements, and clear boundaries from other tools" (Anthropic: Building Effective Agents).
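The mapping from audit answers to fixes can be sketched as a small shell helper. This is an illustration only: the function name `classify_step`, the yes/no argument convention, and the fix labels are assumptions made here, not part of any existing tool.

```shell
# Hypothetical helper: map the audit answers for one runbook step to the fixes it needs.
# Arguments are yes/no answers, in order:
#   $1  can the agent invoke the step?
#   $2  can the agent evaluate the condition?
#   $3  does the agent already have the context the step depends on?
classify_step() {
  local fixes=""
  [ "$1" = "no" ] && fixes="${fixes:+$fixes,}tool-call"
  [ "$2" = "no" ] && fixes="${fixes:+$fixes,}explicit-conditional"
  [ "$3" = "no" ] && fixes="${fixes:+$fixes,}declared-context"
  # Print the comma-separated fixes, or "ok" when the step passes all three questions.
  echo "${fixes:-ok}"
}
```

A step such as "Check the dashboard" would be recorded as `classify_step no yes yes`, which prints `tool-call`.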
Before and After: Step Transformations
Implicit action → explicit tool call
Before:
Check the API error rate in Datadog
After:
datadog-query.sh service=api metric=error_rate window=5m
# Expected: value < 0.01 (1%)
# If above threshold: proceed to step 4
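The "Expected" comment above only helps if the agent can test it mechanically. A minimal sketch of such a check follows, assuming the query script prints a bare decimal number; `below_threshold` is a hypothetical helper name, and the output format of `datadog-query.sh` is an assumption.

```shell
# below_threshold VALUE THRESHOLD
# Exits 0 when VALUE is strictly below THRESHOLD, 1 otherwise.
# awk does the comparison because shell test operators only handle integers.
below_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v+0 < t+0) }'
}

# Possible use in a runbook step (assumes datadog-query.sh prints e.g. 0.004):
#   rate=$(datadog-query.sh service=api metric=error_rate window=5m)
#   if below_threshold "$rate" 0.01; then echo "healthy"; else echo "proceed to step 4"; fi
```

Returning an exit code rather than printing text lets the runbook branch with an ordinary `if`, which gives the agent one unambiguous signal to evaluate.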
Ambiguous condition → explicit conditional
Before:
If load looks high, scale up the service
After:
If CPU utilization > 80% for 3 consecutive minutes:
kubectl scale deployment/api --replicas=$(( $(kubectl get deployment/api -o jsonpath='{.spec.replicas}') + 2 ))
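The "for 3 consecutive minutes" part of the condition is the piece most often left implicit. One way to make it checkable is sketched below, assuming the agent has already collected one CPU sample per minute, oldest first; `sustained_above` is a hypothetical helper, not an existing command.

```shell
# sustained_above THRESHOLD COUNT SAMPLE...
# Succeeds only when the most recent COUNT samples all exceed THRESHOLD.
sustained_above() {
  local threshold="$1" count="$2"
  shift 2
  # Too few data points: do not act yet.
  [ "$#" -ge "$count" ] || return 1
  # Discard older samples so only the last COUNT remain.
  shift $(( $# - count ))
  local sample
  for sample in "$@"; do
    awk -v s="$sample" -v t="$threshold" 'BEGIN { exit !(s+0 > t+0) }' || return 1
  done
  return 0
}
```

With per-minute samples `70 85 90 95`, the call `sustained_above 80 3 70 85 90 95` succeeds, while `85 90 75` does not, because the latest reading has dropped back under the threshold.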
Assumed context → declared reference
Before:
Escalate to the on-call if needed
After:
If unresolved after 15 minutes:
Page primary on-call via: pagerduty-alert.sh team=platform severity=high
Include: incident start time, steps attempted, current metric values
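The 15-minute rule can be expressed as a pure time comparison so the agent does not have to estimate elapsed time. The sketch below assumes timestamps are Unix epoch seconds; `should_escalate` is a hypothetical helper name.

```shell
# should_escalate START_EPOCH NOW_EPOCH [LIMIT_SECONDS]
# Succeeds when the incident has been open for at least LIMIT_SECONDS
# (default 900 seconds, i.e. 15 minutes).
should_escalate() {
  local start="$1" now="$2" limit="${3:-900}"
  [ $(( now - start )) -ge "$limit" ]
}

# Possible use (pagerduty-alert.sh is the script named in the step above):
#   should_escalate "$incident_start" "$(date +%s)" \
#     && pagerduty-alert.sh team=platform severity=high
```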
Packaging as a Skill
The correct container for an agent-executable runbook is a SKILL.md file with disable-model-invocation: true. With this setting the agent cannot start the runbook on its own; it runs only when explicitly invoked. The human on-call triggers the runbook, and the agent does not decide to run it autonomously.
.claude/skills/
runbooks/
database-failover.md # SKILL.md for DB failover
api-high-error-rate.md # SKILL.md for API errors
auth-service-degraded.md # SKILL.md for auth issues
scripts/
datadog-query.sh
pagerduty-alert.sh
kubectl-scale.sh
A routing runbook at the top level directs the agent to the relevant sub-skill based on incident type — this is the progressive disclosure pattern applied to incident response:
# Skill: Incident Response
disable-model-invocation: true
## Available runbooks
- database-failover: DB replica lag, connection pool exhaustion, failover
- api-high-error-rate: 5xx spikes, latency degradation, circuit breaker trips
- auth-service-degraded: login failures, token validation errors, SSO issues
Read only the runbook that matches the current incident type.
The scripts/ directory holds the executable shell commands referenced in runbook steps, replacing "run the query" with an actual script the agent can invoke.
Routing Architecture
graph TD
A[incident-response skill] -->|identifies type| B{Incident type}
B -->|DB lag| C[database-failover.md]
B -->|API errors| D[api-high-error-rate.md]
B -->|Auth failures| E[auth-service-degraded.md]
C --> F[scripts/db-health.sh]
C --> G[scripts/failover-trigger.sh]
D --> H[scripts/datadog-query.sh]
D --> I[scripts/kubectl-scale.sh]
The routing runbook loads only ~100 tokens at session start. The specific runbook body (~2000–5000 tokens) loads only when the relevant incident type is identified. Supporting scripts load only when invoked.
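The routing decision itself is a lookup from incident type to runbook file. A minimal sketch follows; the type labels (`db-lag`, `api-errors`, `auth-failures`) and the function name `route_incident` are assumptions chosen for illustration, while the runbook paths follow the directory listing above.

```shell
# route_incident TYPE
# Prints the runbook path for a known incident type.
# Unknown types return non-zero so the caller stops and asks a human instead of guessing.
route_incident() {
  case "$1" in
    db-lag)        echo "runbooks/database-failover.md" ;;
    api-errors)    echo "runbooks/api-high-error-rate.md" ;;
    auth-failures) echo "runbooks/auth-service-degraded.md" ;;
    *)             echo "no runbook for incident type: $1" >&2; return 1 ;;
  esac
}
```

Failing loudly on an unmatched type matters here: a silent default would send the agent into a runbook written for a different incident.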
Multi-Step State Tracking
For incidents spanning multiple sessions or requiring human handoffs, the runbook should include a progress file pattern. The agent writes a structured state file after completing each step:
## Progress tracking
After each step, append to /tmp/incident-{timestamp}.md:
- Step number and name
- Command executed
- Output summary
- Next step
This file is the handoff artifact if the session ends or a different operator continues.
This is equivalent to the feature list and progress file pattern described in harness engineering for long-running agents (Anthropic: Effective Harnesses). See Goal Monitoring and Progress Tracking for the full pattern.
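The append step can be wrapped in one function so every entry has the same shape. This is a sketch under assumptions: `record_step` is a hypothetical helper, and the exact entry layout is illustrative rather than prescribed by the pattern.

```shell
# record_step FILE STEP_NUMBER STEP_NAME COMMAND OUTPUT_SUMMARY NEXT_STEP
# Appends one structured entry to the progress file; never overwrites earlier entries.
record_step() {
  local file="$1"
  {
    printf '%s\n' "## Step $2: $3"
    printf '%s\n' "- Command: $4"
    printf '%s\n' "- Output: $5"
    printf '%s\n' "- Next: $6"
    printf '\n'
  } >> "$file"
}

# Possible use:
#   record_step "/tmp/incident-$ts.md" 1 "Check error rate" \
#     "datadog-query.sh service=api metric=error_rate window=5m" "0.03, above threshold" "step 4"
```

Appending rather than rewriting keeps the file usable as a timeline: whoever picks up the incident reads the last entry to find the next step.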
Adoption Driver: Measurable Goals
A binary adoption target — "all runbooks followable by the agent" — works because it is auditable. A runbook either passes or fails the agent-followable test. Vague goals ("improve our runbooks") produce inconsistent effort. A binary test with a deadline produces a complete audit.
Operationally: assign one engineer to run each runbook against an agent in a test environment. Any step the agent fails to execute or evaluate becomes a tracked rewrite item. This surfaces the actual failure distribution across the runbook library before any rewriting begins.
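The failure distribution can be tallied from a plain log of failed steps. The sketch below assumes a hypothetical three-column format, one failed step per line: runbook name, step number, and failure mode.

```shell
# failure_distribution
# Reads "RUNBOOK STEP FAILURE_MODE" lines on stdin and prints a count
# per failure mode, sorted by mode name.
failure_distribution() {
  awk '{ count[$3]++ } END { for (mode in count) print mode, count[mode] }' | sort
}

# Possible use:
#   failure_distribution < audit-failures.txt
```

A tally dominated by one failure mode suggests where to start, for example by writing the missing scripts first if most failures are implicit actions.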
When This Backfires
Agent-executable runbooks work when each step has a deterministic, tool-invokable form. The pattern breaks down in three conditions:
- Human-judgment steps that cannot be made explicit. Some decisions depend on live context that no metric captures — a degraded-but-not-alerting system that an experienced operator would deprioritize. Converting these to thresholds produces either false positives or missed escalations. These steps are better handled with a human-in-the-loop gate than a scripted conditional.
- Cross-system state coordination. Runbooks that span multiple teams, change-freeze windows, or external vendor actions assume the agent can verify state it cannot observe. If the agent cannot confirm the dependency is met, it proceeds on false assumptions.
- High blast-radius actions. Failover triggers, database writes, and traffic reroutes carry irreversible consequences. The disable-model-invocation: true packaging mitigates this by requiring human initiation, but it does not prevent the agent from executing a step sequence in the wrong incident context if routing is misconfigured.
The audit-before-rewriting step is the safeguard: steps that cannot be made unambiguous should be flagged as human checkpoints, not converted.
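A human checkpoint can be written into a runbook as an explicit gate that blocks until an operator confirms. A minimal sketch, assuming the operator answers on standard input; `require_approval` is a hypothetical helper name.

```shell
# require_approval DESCRIPTION
# Prompts on stderr and succeeds only when the operator types exactly "yes".
# Any other answer, or no input at all, is treated as a refusal.
require_approval() {
  local answer
  printf 'HUMAN CHECKPOINT: %s\nType "yes" to continue: ' "$1" >&2
  read -r answer || return 1
  [ "$answer" = "yes" ]
}

# Possible use before a high blast-radius step:
#   require_approval "trigger database failover" && scripts/failover-trigger.sh
```

Defaulting to refusal means a closed terminal or an empty reply stops the sequence instead of letting it proceed.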
Key Takeaways
- Audit before rewriting: identify whether each failing step has an implicit action, an ambiguous condition, or assumed context — each requires a different fix
- Replace implicit actions with exact tool calls including expected output format and success criteria
- Replace ambiguous conditions with measurable thresholds and explicit branching logic
- Package rewritten runbooks as SKILL.md files with disable-model-invocation: true, so that the human triggers and the agent executes
- Use a routing skill to direct the agent to the relevant runbook sub-skill without loading all runbooks into context
- A binary pass/fail test ("can the agent follow this runbook end-to-end?") is a more effective adoption driver than qualitative improvement goals
Related
- Progressive Disclosure for Agent Definitions
- Separation of Knowledge and Execution
- Human-in-the-Loop Placement
- Agent Skills: Cross-Tool Task Knowledge Standard
- Circuit Breakers for Agent Loops
- Incident Log Investigation Skill
- Trajectory Logging via Progress Files and Git History
- Encoding Tacit Knowledge into Agent Improvement Loops