OpenAPI Documentation Smells for Agent-Ready APIs

A structurally valid OpenAPI spec is not an agent-ready API. In one audit of 16 production APIs, enriching the spec text alone moved agent task success from roughly 70% to roughly 90%.

The semantic-readiness gap

An OpenAPI document that passes openapi-validator can still fail an agent. Lima, Pinheiro, and Menezes audited 16 production APIs (~600 endpoints) being prepared for Model Context Protocol exposure and found that "structural validity within microservice environments does not guarantee semantic readiness for agent-based consumption"; enriching descriptions alone moved task success from ~70% to ~90%.

Adjacent work agrees. AutoMCP reached 99.9% tool-call success, up from a 76.5% baseline, only after an average of 19 lines of spec edits per API. Stainless documents two further gaps: Notion requires a Notion-Version header that its spec does not declare, and some APIs declare auth on only 5 of 24 endpoints when all 24 need it.
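The "5 of 24" auth gap can be found mechanically. The following is a minimal sketch, assuming the spec has already been parsed into a Python dict; the function name and the sample spec are hypothetical. It relies only on the OpenAPI rule that an operation-level `security` field overrides the top-level default.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def operations_missing_security(spec: dict) -> list[str]:
    """Return 'METHOD /path' for every operation with no effective security requirement."""
    global_security = spec.get("security")  # top-level default, may be absent
    missing = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters' or 'summary'
            # An operation-level 'security' field replaces the top-level default.
            effective = operation.get("security", global_security)
            if not effective:  # None (never declared) or [] (explicitly removed)
                missing.append(f"{method.upper()} {path}")
    return missing


# Hypothetical spec: one operation declares auth, two do not.
spec = {
    "paths": {
        "/pages": {
            "get": {"security": [{"bearerAuth": []}]},
            "post": {},
        },
        "/pages/{id}": {"delete": {}},
    }
}
print(operations_missing_security(spec))
```

Some operations are intentionally public, so each hit is a prompt for review, not a confirmed defect.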

The smell taxonomy

Four documentation smells describe how the prose around an endpoint fails an agent:

| Smell | What it looks like |
|---|---|
| LAZY | Short summaries, vague descriptions, undocumented parameters, generic response messages |
| BLOATED | Verbose prose that adds tokens without adding decision-relevant information |
| TANGLED | Business logic, security, and error handling mixed into the same description fragment |
| FRAGMENTED | Essential information dispersed across disconnected sections with no linkage |

Five REST smells describe how the endpoint design itself misleads an agent:

| Smell | What it looks like |
|---|---|
| PATH | Action-oriented URIs (/doTransfer) that hide the underlying resource |
| METHOD | POST used for reads, GET used for state changes, mismatched semantics |
| INPUT | Weakly specified parameters, missing format constraints, no semantic clarification |
| RESPONSE | Inconsistent schemas, undocumented status codes, missing error shapes |
| SECURITY | Missing or unclear authentication and authorization definitions |

Prevalence in the Hermes corpus (the audited APIs above) was high across the board: 100% of endpoints had RESPONSE smells, 90% LAZY, 88% INPUT, 68% SECURITY, and 53% PATH or METHOD. This is what documentation written for humans looks like when an agent is the reader.

Why each category matters

OpenAPI descriptions are usually written for developers who carry implicit context (codebase, Slack, ticket history) that agents do not see. Each smell names a specific gap between what a human reader can infer and what a machine reader needs stated explicitly.

Scenario-first triage

A 2,450-finding report is unactionable. The Hermes study itself pivoted to selective adaptation: estimated effort dropped from 385 to 42 engineering hours — an 89% reduction — by fixing only endpoints needed for defined automation scenarios.

MCP practice agrees. GitHub Copilot and Block cut tool counts by 60-93% before agents became reliable; Speakeasy's guidance is to "autogenerate the groundwork from OpenAPI, then curate".

graph TD
    A[Define agent scenarios] --> B[Identify required endpoints]
    B --> C[Audit only those endpoints]
    C --> D{Smells found?}
    D -->|Yes| E[Remediate spec text]
    D -->|No| F[Generate MCP tools]
    E --> F

Audit only the endpoints your scenarios need; the rest can wait.
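The scoping step can be sketched in a few lines. This is a minimal illustration, assuming scenarios are written down as lists of operationIds; the scenario names, operationIds, and the `select_audit_scope` helper are all hypothetical.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def select_audit_scope(
    spec: dict, scenarios: dict[str, list[str]]
) -> tuple[list[str], set[str]]:
    """Return the operations some scenario needs, plus operationIds the spec lacks."""
    needed = {op_id for op_ids in scenarios.values() for op_id in op_ids}
    in_scope: list[str] = []
    found: set[str] = set()
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as 'parameters'
            op_id = operation.get("operationId")
            if op_id in needed:
                in_scope.append(f"{method.upper()} {path}")
                found.add(op_id)
    return in_scope, needed - found


# Hypothetical spec with four operations; the scenarios touch only two of them.
spec = {
    "paths": {
        "/orders": {"get": {"operationId": "listOrders"}},
        "/orders/{id}": {
            "get": {"operationId": "getOrder"},
            "delete": {"operationId": "cancelOrder"},
        },
        "/refunds": {"post": {"operationId": "createRefund"}},
    }
}
scenarios = {"refund_order": ["getOrder", "createRefund", "notifyCustomer"]}

in_scope, unknown = select_audit_scope(spec, scenarios)
print(in_scope)  # operations to audit
print(unknown)   # operationIds a scenario expects but the spec does not define
```

Returning the unknown operationIds separately surfaces scenarios that depend on endpoints the spec does not yet describe.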

Mechanizing the audit

Hermes dispatches nine specialized smell-detector agents from one orchestrator, each analyzing the same endpoint from one category's perspective — a textbook orchestrator-worker fan-out. Model selection matters less than expected: gpt-oss:120b reached 0.85 Jaccard similarity with expert annotations. Frontier pricing is not required.
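The fan-out and the agreement metric can be sketched as follows. This is not the Hermes implementation: the detectors here are trivial stand-in functions where a real system would call a model, and comparing per-endpoint label sets is an assumption about how the Jaccard score is computed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Detector = Callable[[dict], bool]  # returns True when its smell is present


def detect_smells(operation: dict, detectors: dict[str, Detector]) -> set[str]:
    """Send one operation to every detector in parallel; collect the smells that fired."""
    with ThreadPoolExecutor(max_workers=max(1, len(detectors))) as pool:
        futures = {name: pool.submit(fn, operation) for name, fn in detectors.items()}
        return {name for name, future in futures.items() if future.result()}


def jaccard(predicted: set[str], expected: set[str]) -> float:
    """Size of the intersection over size of the union; 1.0 when both sets are empty."""
    if not predicted and not expected:
        return 1.0
    return len(predicted & expected) / len(predicted | expected)


# Stand-in detectors (hypothetical heuristics in place of LLM calls).
detectors: dict[str, Detector] = {
    "LAZY": lambda op: len(op.get("summary", "").split()) < 3,
    "RESPONSE": lambda op: len(op.get("responses", {})) < 2,
    "SECURITY": lambda op: not op.get("security"),
}

operation = {
    "summary": "Transfer",
    "responses": {"200": {"description": "OK"}},
    "security": [{"bearerAuth": []}],
}
detected = detect_smells(operation, detectors)
expert = {"LAZY", "RESPONSE", "INPUT"}  # hypothetical expert annotation
print(sorted(detected), round(jaccard(detected, expert), 2))
```

Each detector sees the same operation and answers for one category only, which keeps the workers independent and lets the orchestrator merge results with a plain set.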

Static linters (Spectral, Redocly CLI) catch PATH, METHOD, and structural INPUT/RESPONSE issues at design time. Reach for LLM-based detection on the prose-shaped smells — LAZY, BLOATED, TANGLED, FRAGMENTED — where static rules cannot judge information density.
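To illustrate the kind of rule a static check can express, here is a rough sketch of PATH and METHOD heuristics in plain Python. It does not use the Spectral or Redocly rule formats; the verb lists and function names are assumptions, and the checks are deliberately crude.

```python
import re

# A verb prefix followed by an uppercase letter, underscore, or hyphen: doTransfer, get-user.
ACTION_PREFIX = re.compile(r"^(do|get|create|update|delete|fetch|make|run|execute)(?=[A-Z_-])")

MUTATING = ("create", "update", "delete", "remove")
READING = ("get", "list", "fetch", "search")


def path_smells(path: str) -> list[str]:
    """Return the path segments that look like actions instead of resources."""
    return [
        segment
        for segment in path.strip("/").split("/")
        if not segment.startswith("{") and ACTION_PREFIX.match(segment)
    ]


def method_smell(method: str, operation_id: str) -> bool:
    """Flag a GET whose operationId implies a write, or a POST whose operationId implies a read."""
    op = operation_id.lower()
    if method.lower() == "get":
        return op.startswith(MUTATING)
    if method.lower() == "post":
        return op.startswith(READING)
    return False


print(path_smells("/doTransfer"))
print(path_smells("/accounts/{accountId}/transfers"))
print(method_smell("get", "deleteUser"))
```

A prefix heuristic like this misses verb-shaped segments without a prefix, such as /users/{id}/transfer, which is one reason to treat static output as a first pass.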

When this backfires

Description enrichment is not always where the effort pays off. The taxonomy under-delivers when the endpoint set has not been scoped or when the API design itself is the problem.

Audit the prose when the endpoint set is already scoped and the resource model is sound. When the API design itself is the problem, fix the design first.

Example

An endpoint with PATH, LAZY, INPUT, and RESPONSE smells, modeled on the kind of spec the Hermes study evaluated:

Before — agent-hostile:

/users/{id}/transfer:
  post:
    summary: Transfer
    operationId: doTransfer
    parameters:
      - name: id
        in: path
        schema:
          type: string
    requestBody:
      content:
        application/json:
          schema:
            type: object
            properties:
              amount:
                type: number
              to:
                type: string
    responses:
      '200':
        description: OK

After — agent-ready:

/accounts/{accountId}/transfers:
  post:
    summary: Create a transfer from one account to another
    description: |
      Use when the caller has the source account ID and wants to move funds to a
      destination account in the same currency. Returns the created transfer
      including its server-assigned ID and settlement timestamp. Returns 409 if
      a transfer with the same Idempotency-Key already exists.
    operationId: createTransfer
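    parameters:
      - name: accountId
        in: path
        description: UUID v4 of the source account.
        required: true
        schema:
          type: string
          format: uuid
      - name: Idempotency-Key
        in: header
        description: Key that identifies this transfer request. A repeated key returns 409.
        schema:
          type: string
    requestBody:
      required: true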
      content:
        application/json:
          schema:
            type: object
            required: [amount, destinationAccountId]
            properties:
              amount:
                type: integer
                description: Amount in minor units (e.g. cents). Must be > 0.
                minimum: 1
              destinationAccountId:
                type: string
                format: uuid
                description: UUID v4 of the destination account.
    responses:
      '200':
        description: Transfer created.
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/Transfer'
      '409':
        description: Duplicate Idempotency-Key.

The PATH smell (/users/{id}/transfer with doTransfer) collapsed into a resource-shaped URI with createTransfer. The LAZY summary expanded into a description that names when to call, what comes back, and what the failure case looks like. INPUT gained format constraints and required-field declarations. RESPONSE gained a documented 409.
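The INPUT part of that diff can be checked mechanically. The sketch below is a hypothetical helper, not part of any named tool: it takes an operation as a Python dict, does not resolve `$ref`, and looks only for missing descriptions and missing required-field declarations. The two dicts are abridged copies of the before and after operations.

```python
def input_smells(operation: dict) -> list[str]:
    """List undocumented parameters and body fields, and bodies with no required fields."""
    findings = []
    for param in operation.get("parameters", []):
        if not param.get("description"):
            findings.append(f"parameter '{param['name']}' has no description")
    for media in operation.get("requestBody", {}).get("content", {}).values():
        schema = media.get("schema", {})
        properties = schema.get("properties", {})
        if properties and not schema.get("required"):
            findings.append("request body declares no required fields")
        for name, prop in properties.items():
            if not prop.get("description"):
                findings.append(f"body field '{name}' has no description")
    return findings


before = {
    "parameters": [{"name": "id", "in": "path", "schema": {"type": "string"}}],
    "requestBody": {"content": {"application/json": {"schema": {
        "type": "object",
        "properties": {"amount": {"type": "number"}, "to": {"type": "string"}},
    }}}},
}
after = {
    "parameters": [{
        "name": "accountId", "in": "path", "required": True,
        "description": "UUID v4 of the source account.",
        "schema": {"type": "string", "format": "uuid"},
    }],
    "requestBody": {"content": {"application/json": {"schema": {
        "type": "object",
        "required": ["amount", "destinationAccountId"],
        "properties": {
            "amount": {"type": "integer", "minimum": 1,
                       "description": "Amount in minor units (e.g. cents). Must be > 0."},
            "destinationAccountId": {"type": "string", "format": "uuid",
                                     "description": "UUID v4 of the destination account."},
        },
    }}}},
}

for finding in input_smells(before):
    print(finding)
print(input_smells(after))
```

The before operation yields four findings and the after operation yields none, matching the changes described above.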

Key Takeaways

  • An OpenAPI spec that validates structurally can still fail agents: every one of the roughly 600 audited endpoints had at least one smell.
  • The nine-category taxonomy (four documentation, five REST) gives a concrete checklist that maps to specific agent failure modes.
  • Audit only the endpoints your defined agent scenarios need. Blanket audits produce reports nobody acts on; selective audits cut estimated remediation effort by 89% (385 to 42 engineering hours in the Hermes study).
  • Use static linters for structural smells and LLM-based detection for prose-shaped smells. An open-weight model (gpt-oss:120b) reached 0.85 Jaccard similarity with expert annotations, so frontier pricing is not required.