Skip to content

AI Crawler Policy: robots.txt for the Three-Tier Crawler Landscape

AI crawlers split into retrieval bots (allow for citations), training scrapers (disallow), and non-compliant bots (WAF block) — each requiring a distinct robots.txt strategy.

The Three-Tier Taxonomy

AI crawlers are not monolithic. Each major provider now operates separate bots for distinct purposes, each with its own user-agent string:

Tier Purpose User-agents robots.txt behaviour
Tier 1 — User-facing retrieval Powers real-time citations in AI chat and search ChatGPT-User*, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot†, Perplexity-User Allow — drives referral traffic and AI citations
Tier 2 — Training scrapers Ingests content for model training datasets GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent Disallow — no citation benefit; opts out of training data
Tier 3 — Non-compliant bots Crawlers documented to ignore robots.txt Bytespider (ByteDance) CDN/WAF block — robots.txt is ineffective

The tier distinction matters: blocking training crawlers without also blocking retrieval bots keeps your content eligible for AI search citations while opting out of model training datasets.

* As of OpenAI's December 2025 policy update, ChatGPT-User no longer respects robots.txt; disallow rules are ignored (coverage).

† Cloudflare documented Perplexity rotating user-agents and ASNs to bypass robots.txt (August 2025 report). Use WAF for hard blocks.

Decision Matrix

Goal Action
Appear in AI search answers (ChatGPT, Claude, Perplexity) Allow Tier 1
Prevent content entering training datasets Disallow Tier 2
Stop ByteDance/Bytespider from crawling WAF custom rule
Opt out of everything Disallow all AI user-agents + WAF

The emerging practitioner consensus for documentation sites: allow Tier 1, disallow Tier 2.

Reference Configuration

This site's robots.txt implements the three-tier policy:

# ── Default: allow all standard crawlers ──────────────────────────────────────
User-agent: *
Allow: /

# ── Tier 1: User-facing retrieval bots (ALLOW) ────────────────────────────────

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ── Tier 2: Training scrapers (DISALLOW) ──────────────────────────────────────

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# ── Tier 3: CDN-level block (robots.txt ineffective) ──────────────────────────
# Bytespider — configure WAF custom rule: User-Agent contains "Bytespider" → Block

Sitemap: https://agentpatterns.ai/sitemap.xml

Compliance Caveats

robots.txt is advisory, not enforceable. Key nuances:

  • Major providers comply: OpenAI (GPTBot, OAI-SearchBot), Anthropic (ClaudeBot, Claude-SearchBot, Claude-User), and Google (Google-Extended) respect robots.txt directives.
  • ChatGPT-User exempt (Dec 2025): OpenAI's updated crawler documentation reclassified ChatGPT-User as a user-initiated agent and removed its robots.txt compliance requirement. Disallow rules for ChatGPT-User are now ignored; interactive ChatGPT browsing can only be blocked at the CDN/WAF layer.
  • Perplexity stealth crawling documented: Cloudflare reported in August 2025 that Perplexity rotates user-agents and ASNs to evade blocks and has been observed ignoring robots.txt. Treat PerplexityBot and Perplexity-User allow-listing as directional only; use WAF rules for any hard block.
  • Bytespider ignores it: ByteDance's Bytespider is documented to not respect robots.txt — block at CDN/WAF level. See Cloudflare WAF custom rules for setup.
  • No legal enforcement: robots.txt does not prevent crawling. It signals intent. Legal protection requires ToS, CFAA claims, or contractual agreements.
  • EU AI Act alignment: The EU regulatory framework encourages GPAI providers to document and respect publisher opt-out signals — robots.txt disallow for training crawlers is the de facto mechanism. Verify specific commitments against the published Code of Practice text as obligations evolve.

Provider User-Agent Reference

Provider Training Search index User retrieval
OpenAI GPTBot OAI-SearchBot ChatGPT-User*
Anthropic ClaudeBot Claude-SearchBot Claude-User
Google Google-Extended (standard Googlebot) Google-CloudVertexBot
Perplexity (PerplexityBot serves both) PerplexityBot Perplexity-User
Meta Meta-ExternalAgent Meta-ExternalFetcher

*ChatGPT-User — no longer bound by robots.txt as of OpenAI's December 2025 policy update; block at CDN/WAF if required.

Why Allow Tier 1

Blocking all AI crawlers has a compounding cost:

  • Retrieval bots power citation-eligible AI answers — being absent means competitors fill that space
  • AI-referred sessions grew substantially year-over-year through 2025; blocking Tier 1 opts out of this traffic source entirely
  • Cloudflare data shows the crawl-to-referral ratio for OpenAI is ~1,700:1 and Anthropic ~73,000:1 — training crawlers give no referral return; retrieval bots give direct search traffic

Key Takeaways

  • The three-tier taxonomy (retrieval / training / non-compliant) maps directly to three distinct robots.txt strategies: allow / disallow / CDN block
  • Blocking training crawlers does not block retrieval bots — they use separate user-agent strings
  • robots.txt compliance is voluntary; most major providers respect it, but ChatGPT-User was exempted in December 2025 and Perplexity has been documented evading blocks — use CDN/WAF rules when hard enforcement is required
  • The default strategy for documentation sites: allow Tier 1, disallow Tier 2, WAF-block Bytespider
Feedback