Runbooks as Agent Instructions

Runbooks written for humans fail for agents through implicit context, ambiguous decision points, and assumed knowledge — and each failure mode needs a different fix.

Runbooks-as-agent-instructions are operational procedures rewritten so an agent can execute them end-to-end: every implicit action becomes an explicit tool call, every ambiguous condition becomes a measurable threshold, and every assumed context is declared in the runbook itself. The rewriting is driven by a three-question audit, not a template. A concrete adoption target — all operational runbooks followable by the agent within a fixed time window — surfaces the core problem: most runbooks are written as memory aids for experienced operators, not as executable instructions.

Why Human Runbooks Fail for Agents

Human runbooks fail for agents in three distinct ways:

Failure mode	Example	Agent's problem
Implicit action	"Check the dashboard"	No tool to call, no success criterion
Ambiguous condition	"If load looks high..."	Cannot evaluate a vague threshold
Assumed context	"Restart the usual way"	No access to tribal knowledge

Each failure mode requires a different fix. An audit step before rewriting identifies which failure applies to each step.

The Audit Workflow

Before rewriting anything, audit each runbook step against three questions:

1. Can the agent invoke this? If the step requires clicking a UI, calling a named API, or running a shell command, the agent needs the exact invocation: endpoint, flags, and expected output format.
2. Can the agent evaluate this condition? Decision points ("if this looks wrong") must become explicit conditionals with a measurable signal and a threshold.
3. Does the step depend on knowledge the agent doesn't have? Service topology, escalation contacts, and system quirks must be declared explicitly or loaded via a references directory.

Steps that fail question 1 need tool-call replacements. Steps that fail question 2 need explicit conditionals. Steps that fail question 3 need supporting context injected. Anthropic's guidance on building effective agents notes that tool definitions require the same deliberate engineering attention as system prompts — "example usage, edge cases, input format requirements, and clear boundaries from other tools" (Anthropic: Building Effective Agents).
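The mapping from audit answers to fixes can be sketched as a small shell function. This is a minimal illustration, not part of any existing tooling: the function name `audit_step` and the fix labels (`tool-call`, `conditional`, `context`) are assumptions made up for this example.

```shell
# audit_step: map the three audit answers (yes/no) to the fixes a step needs.
# Usage: audit_step <can_invoke> <can_evaluate> <has_context>
# Hypothetical helper; the labels are illustrative, not a standard.
audit_step() {
  fixes=""
  [ "$1" = "yes" ] || fixes="$fixes tool-call"
  [ "$2" = "yes" ] || fixes="$fixes conditional"
  [ "$3" = "yes" ] || fixes="$fixes context"
  # Strip the leading space; an empty result means the step passes the audit.
  fixes="${fixes# }"
  echo "${fixes:-pass}"
}

# "Check the dashboard": not invokable as written, the other answers are yes.
audit_step no yes yes
```

A step can fail more than one question, so the function reports every fix that applies rather than stopping at the first.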

Before and After: Step Transformations

Implicit action → explicit tool call

Before:

Check the API error rate in Datadog

After:

datadog-query.sh service=api metric=error_rate window=5m
# Expected: value < 0.01 (1%)
# If value >= 0.01: proceed to step 4
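The success criterion above can itself be made executable. The sketch below assumes the query script prints a single decimal value; `below_threshold` is a hypothetical helper, and the value `0.004` stands in for real script output.

```shell
# below_threshold: succeed (exit 0) when value < threshold.
# awk does the comparison because POSIX shell arithmetic is integer-only.
below_threshold() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

# Hypothetical usage: 0.004 is a made-up stand-in for the query result.
error_rate="0.004"
if below_threshold "$error_rate" 0.01; then
  echo "ok: error rate within threshold"
else
  echo "breach: proceed to step 4"
fi
```

Returning an exit status rather than printing text lets the agent branch on the result without parsing prose.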

Ambiguous condition → explicit conditional

Before:

If load looks high, scale up the service

After:

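If CPU utilization > 80% for 3 consecutive minutes:
  # Read the current replica count, then scale by two using arithmetic expansion
  current_replicas=$(kubectl get deployment/api -o jsonpath='{.spec.replicas}')
  kubectl scale deployment/api --replicas=$((current_replicas + 2))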
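The "for 3 consecutive minutes" part of the condition also needs an explicit test. A minimal sketch, assuming the caller passes the most recent one-minute CPU samples as integer percentages; `sustained_above` is a hypothetical helper and the sample values are invented.

```shell
# sustained_above: succeed when at least <count> samples were supplied
# and every one of them exceeds the threshold.
# Usage: sustained_above <threshold> <count> <sample>...
sustained_above() {
  threshold="$1"; required="$2"; shift 2
  [ "$#" -ge "$required" ] || return 1
  for sample in "$@"; do
    [ "$sample" -gt "$threshold" ] || return 1
  done
  return 0
}

# Hypothetical usage with three made-up one-minute samples.
if sustained_above 80 3 85 91 88; then
  echo "scale up"
else
  echo "hold"
fi
```

Requiring a minimum sample count means missing data fails the check instead of triggering a scale-up.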

Assumed context → declared reference

Before:

Escalate to the on-call if needed

After:

If unresolved after 15 minutes:
  Page primary on-call via: pagerduty-alert.sh team=platform severity=high
  Include: incident start time, steps attempted, current metric values
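The 15-minute condition can be evaluated from two timestamps. The sketch below is an assumption about how a runbook script might track elapsed time: `should_escalate` is a hypothetical helper, and `date +%s` (epoch seconds) is widely supported but not strictly required by POSIX.

```shell
# should_escalate: succeed once at least <limit_minutes> have elapsed.
# Usage: should_escalate <start_epoch> <now_epoch> <limit_minutes>
should_escalate() {
  elapsed_minutes=$(( ($2 - $1) / 60 ))
  [ "$elapsed_minutes" -ge "$3" ]
}

# Hypothetical usage: the runbook records the start time at its first step.
incident_start=$(date +%s)
if should_escalate "$incident_start" "$(date +%s)" 15; then
  echo "escalate: page primary on-call"
else
  echo "continue: within the 15-minute window"
fi
```

Passing both timestamps as arguments keeps the check deterministic and easy to replay during a post-incident review.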

Packaging as a Skill

The container for an agent-executable runbook is a SKILL.md file with disable-model-invocation: true. With this setting the agent does not start the runbook on its own initiative: the human on-call invokes it explicitly, and the agent then executes the steps.

.claude/skills/
  runbooks/
    database-failover.md          # SKILL.md for DB failover
    api-high-error-rate.md        # SKILL.md for API errors
    auth-service-degraded.md      # SKILL.md for auth issues
    scripts/
      datadog-query.sh
      pagerduty-alert.sh
      kubectl-scale.sh

A routing runbook at the top level directs the agent to the relevant sub-skill based on incident type — this is the progressive disclosure pattern applied to incident response:

# Skill: Incident Response

disable-model-invocation: true

## Available runbooks
- database-failover: DB replica lag, connection pool exhaustion, failover
- api-high-error-rate: 5xx spikes, latency degradation, circuit breaker trips
- auth-service-degraded: login failures, token validation errors, SSO issues

Read only the runbook that matches the current incident type.

The scripts/ directory holds the executable shell commands referenced in runbook steps, replacing "run the query" with an actual script the agent can invoke.

Routing Architecture

graph TD
    A[incident-response skill] -->|identifies type| B{Incident type}
    B -->|DB lag| C[database-failover.md]
    B -->|API errors| D[api-high-error-rate.md]
    B -->|Auth failures| E[auth-service-degraded.md]
    C --> F[scripts/db-health.sh]
    C --> G[scripts/failover-trigger.sh]
    D --> H[scripts/datadog-query.sh]
    D --> I[scripts/kubectl-scale.sh]

The routing runbook loads only ~100 tokens at session start. The specific runbook body (~2000–5000 tokens) loads only when the relevant incident type is identified. Supporting scripts load only when invoked.
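The routing decision in the diagram can be sketched as a lookup that refuses to guess. The incident-type labels (`db-lag`, `api-errors`, `auth-failures`) and the function name `route_incident` are assumptions for illustration; the runbook paths follow the directory listing above.

```shell
# route_incident: print the runbook path for an incident type,
# or fail so that unknown types go to a human instead of a guess.
route_incident() {
  case "$1" in
    db-lag)        echo "runbooks/database-failover.md" ;;
    api-errors)    echo "runbooks/api-high-error-rate.md" ;;
    auth-failures) echo "runbooks/auth-service-degraded.md" ;;
    *)             echo "no runbook for incident type: $1" >&2; return 1 ;;
  esac
}

# Hypothetical usage.
route_incident api-errors
```

The explicit failure branch matters because a misrouted incident would run the wrong step sequence; an unmatched type should stop the agent rather than fall through to a default runbook.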

Multi-Step State Tracking

For incidents spanning multiple sessions or requiring human handoffs, the runbook should include a progress file pattern. The agent writes a structured state file after completing each step:

## Progress tracking
After each step, append to /tmp/incident-{timestamp}.md:
  - Step number and name
  - Command executed
  - Output summary
  - Next step

This file is the handoff artifact if the session ends or a different operator continues.

This is equivalent to the feature list and progress file pattern described in harness engineering for long-running agents (Anthropic: Effective Harnesses). See Goal Monitoring and Progress Tracking for the full pattern.
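A progress-file writer along these lines could live in the scripts/ directory. This is a sketch under stated assumptions: `record_step` and the entry layout are hypothetical, and the step names and output summary in the usage lines are invented.

```shell
# record_step: append one completed step to the progress file.
# Usage: record_step <file> <step> <command> <output_summary> <next_step>
record_step() {
  {
    printf '## Step: %s\n' "$2"
    printf '%s\n' "- Command executed: $3"
    printf '%s\n' "- Output summary: $4"
    printf '%s\n\n' "- Next step: $5"
  } >> "$1"
}

# Hypothetical usage: the values below are made up for illustration.
progress_file="${TMPDIR:-/tmp}/incident-$(date +%Y%m%dT%H%M%S).md"
record_step "$progress_file" "1 Check error rate" \
  "datadog-query.sh service=api metric=error_rate window=5m" \
  "error rate below threshold" "2 Verify replica count"
```

Appending rather than rewriting means a session that ends mid-incident still leaves every completed step on disk.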

Adoption Driver: Measurable Goals

A binary adoption target — "all runbooks followable by the agent" — works because it is auditable. A runbook either passes or fails the agent-followable test. Vague goals ("improve our runbooks") produce inconsistent effort. A binary test with a deadline produces a complete audit.

Operationally: assign one engineer to run each runbook against an agent in a test environment. Any step the agent fails to execute or evaluate becomes a tracked rewrite item, tagged with the failure mode that caused it. This surfaces the actual failure distribution across the runbook library before any rewriting begins.
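The failure distribution can be tallied from a plain list of failing steps. The input format (runbook, step number, failure mode) and the function name `tally_failures` are assumptions for this sketch, and the rows are invented sample data, not audit results.

```shell
# tally_failures: count failing steps per failure mode.
# Input lines on stdin: "<runbook> <step-number> <failure-mode>"
tally_failures() {
  awk '{ count[$3]++ } END { for (mode in count) print mode, count[mode] }' | sort
}

# Hypothetical usage with made-up audit rows.
tally_failures <<'EOF'
database-failover 2 implicit-action
database-failover 5 assumed-context
api-high-error-rate 1 implicit-action
EOF
```

The per-mode counts indicate which kind of fix (tool calls, conditionals, or context) will dominate the rewrite effort.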

When This Backfires

Agent-executable runbooks work when each step has a deterministic, tool-invokable form. The pattern breaks down in three conditions:

Human-judgment steps that cannot be made explicit. Some decisions depend on live context that no metric captures — a degraded-but-not-alerting system that an experienced operator would deprioritize. Converting these to thresholds produces either false positives or missed escalations. These steps are better handled with a human-in-the-loop gate than a scripted conditional.
Cross-system state coordination. Runbooks that span multiple teams, change-freeze windows, or external vendor actions assume the agent can verify state it cannot observe. If the agent cannot confirm the dependency is met, it proceeds on false assumptions.
High blast-radius actions. Failover triggers, database writes, and traffic reroutes carry irreversible consequences. The disable-model-invocation: true packaging mitigates this by requiring human initiation, but it does not prevent the agent from executing a step sequence in the wrong incident context if routing is misconfigured.

The audit-before-rewriting step is the safeguard: steps that cannot be made unambiguous should be flagged as human checkpoints, not converted.
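A human checkpoint can be expressed as a gate that blocks until someone approves. This is one possible shape, assuming the operator can answer on standard input; `human_gate` and the confirmation word `approve` are hypothetical choices.

```shell
# human_gate: require a typed confirmation before a flagged step runs.
# Reads one line from stdin; only the exact word "approve" succeeds.
human_gate() {
  printf 'CHECKPOINT: %s\nType "approve" to continue: ' "$1" >&2
  read -r answer || return 1
  [ "$answer" = "approve" ]
}

# Usage inside a runbook step script (hypothetical):
#   human_gate "Trigger database failover" || exit 1
```

Anything other than the exact confirmation, including empty or missing input, fails the gate, so the default outcome is that the high blast-radius step does not run.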

Key Takeaways

Audit before rewriting: identify whether each failing step has an implicit action, an ambiguous condition, or assumed context — each requires a different fix
Replace implicit actions with exact tool calls including expected output format and success criteria
Replace ambiguous conditions with measurable thresholds and explicit branching logic
Package rewritten runbooks as SKILL.md files with disable-model-invocation: true — the human triggers, the agent executes
Use a routing skill to direct the agent to the relevant runbook sub-skill without loading all runbooks into context
A binary pass/fail test ("can the agent follow this runbook end-to-end?") is a more effective adoption driver than qualitative improvement goals
