Skip to content

Runbooks as Agent Instructions

Runbooks written for humans fail for agents through implicit context, ambiguous decision points, and assumed knowledge — and each failure mode needs a different fix.

Runbooks-as-agent-instructions are operational procedures rewritten so an agent can execute them end-to-end: every implicit action becomes an explicit tool call, every ambiguous condition becomes a measurable threshold, and every assumed context is declared in the runbook itself. The rewriting is driven by a three-question audit, not a template. A concrete adoption target — all operational runbooks followable by the agent within a fixed time window — surfaces the core problem: most runbooks are written as memory aids for experienced operators, not as executable instructions.

Why Human Runbooks Fail for Agents

Human runbooks fail for agents in three distinct ways:

Failure mode Example Agent's problem
Implicit action "Check the dashboard" No tool to call, no success criterion
Ambiguous condition "If load looks high..." Cannot evaluate a vague threshold
Assumed context "Restart the usual way" No access to tribal knowledge

Each failure mode requires a different fix. An audit step before rewriting identifies which failure applies to each step.

The Audit Workflow

Before rewriting anything, audit each runbook step against three questions:

  1. Can the agent invoke this? If the step requires clicking a UI, calling a named API, or running a shell command, the agent needs the exact invocation — endpoint, flags, expected output format.
  2. Can the agent evaluate this condition? Decision points ("if this looks wrong") must become explicit conditionals with a measurable signal and a threshold.
  3. Does the step depend on knowledge the agent doesn't have? Service topology, escalation contacts, system quirks — these must be declared explicitly or loaded via a references directory.

Steps that fail question 1 need tool-call replacements. Steps that fail question 2 need explicit conditionals. Steps that fail question 3 need supporting context injected. Anthropic's guidance on building effective agents notes that tool definitions require the same deliberate engineering attention as system prompts — "example usage, edge cases, input format requirements, and clear boundaries from other tools" (Anthropic: Building Effective Agents).

Before and After: Step Transformations

Implicit action → explicit tool call

Before:

Check the API error rate in Datadog

After:

datadog-query.sh service=api metric=error_rate window=5m
# Expected: value < 0.01 (1%)
# If above threshold: proceed to step 4

Ambiguous condition → explicit conditional

Before:

If load looks high, scale up the service

After:

If CPU utilization > 80% for 3 consecutive minutes:
  kubectl scale deployment/api --replicas=$(current_replicas + 2)

Assumed context → declared reference

Before:

Escalate to the on-call if needed

After:

If unresolved after 15 minutes:
  Page primary on-call via: pagerduty-alert.sh team=platform severity=high
  Include: incident start time, steps attempted, current metric values

Packaging as a Skill

The correct container for an agent-executable runbook is a SKILL.md file with disable-model-invocation: true. This setting means the agent knows the runbook exists but only executes it when explicitly invoked — the human on-call triggers the runbook, the agent does not decide to run it autonomously.

.claude/skills/
  runbooks/
    database-failover.md          # SKILL.md for DB failover
    api-high-error-rate.md        # SKILL.md for API errors
    auth-service-degraded.md      # SKILL.md for auth issues
    scripts/
      datadog-query.sh
      pagerduty-alert.sh
      kubectl-scale.sh

A routing runbook at the top level directs the agent to the relevant sub-skill based on incident type — this is the progressive disclosure pattern applied to incident response:

# Skill: Incident Response

disable-model-invocation: true

## Available runbooks
- database-failover: DB replica lag, connection pool exhaustion, failover
- api-high-error-rate: 5xx spikes, latency degradation, circuit breaker trips
- auth-service-degraded: login failures, token validation errors, SSO issues

Read only the runbook that matches the current incident type.

The scripts/ directory holds the executable shell commands referenced in runbook steps, replacing "run the query" with an actual script the agent can invoke.

Routing Architecture

graph TD
    A[incident-response skill] -->|identifies type| B{Incident type}
    B -->|DB lag| C[database-failover.md]
    B -->|API errors| D[api-high-error-rate.md]
    B -->|Auth failures| E[auth-service-degraded.md]
    C --> F[scripts/db-health.sh]
    C --> G[scripts/failover-trigger.sh]
    D --> H[scripts/datadog-query.sh]
    D --> I[scripts/scale-deployment.sh]

The routing runbook loads only ~100 tokens at session start. The specific runbook body (~2000–5000 tokens) loads only when the relevant incident type is identified. Supporting scripts load only when invoked.

Multi-Step State Tracking

For incidents spanning multiple sessions or requiring human handoffs, the runbook should include a progress file pattern. The agent writes a structured state file after completing each step:

## Progress tracking
After each step, append to /tmp/incident-{timestamp}.md:
  - Step number and name
  - Command executed
  - Output summary
  - Next step

This file is the handoff artifact if the session ends or a different operator continues.

This is equivalent to the feature list and progress file pattern described in harness engineering for long-running agents (Anthropic: Effective Harnesses). See Goal Monitoring and Progress Tracking for the full pattern.

Adoption Driver: Measurable Goals

A binary adoption target — "all runbooks followable by the agent" — works because it is auditable. A runbook either passes or fails the agent-followable test. Vague goals ("improve our runbooks") produce inconsistent effort. A binary test with a deadline produces a complete audit.

Operationally: assign 1 engineer to run each runbook against an agent in a test environment. Any step the agent fails to execute or evaluate becomes a tracked rewrite item. This surfaces the actual failure distribution across the runbook library before any rewriting begins.

When This Backfires

Agent-executable runbooks work when each step has a deterministic, tool-invokable form. The pattern breaks down in three conditions:

  • Human-judgment steps that cannot be made explicit. Some decisions depend on live context that no metric captures — a degraded-but-not-alerting system that an experienced operator would deprioritize. Converting these to thresholds produces either false positives or missed escalations. These steps are better handled with a human-in-the-loop gate than a scripted conditional.
  • Cross-system state coordination. Runbooks that span multiple teams, change-freeze windows, or external vendor actions assume the agent can verify state it cannot observe. If the agent cannot confirm the dependency is met, it proceeds on false assumptions.
  • High blast-radius actions. Failover triggers, database writes, and traffic reroutes carry irreversible consequences. The disable-model-invocation: true packaging mitigates this by requiring human initiation, but it does not prevent the agent from executing a step sequence in the wrong incident context if routing is misconfigured.

The audit-before-rewriting step is the safeguard: steps that cannot be made unambiguous should be flagged as human checkpoints, not converted.

Key Takeaways

  • Audit before rewriting: identify whether each failing step has an implicit action, an ambiguous condition, or assumed context — each requires a different fix
  • Replace implicit actions with exact tool calls including expected output format and success criteria
  • Replace ambiguous conditions with measurable thresholds and explicit branching logic
  • Package rewritten runbooks as SKILL.md files with disable-model-invocation: true — the human triggers, the agent executes
  • Use a routing skill to direct the agent to the relevant runbook sub-skill without loading all runbooks into context
  • A binary pass/fail test ("can the agent follow this runbook end-to-end?") is a more effective adoption driver than qualitative improvement goals
Feedback