AI Crawler Policy: robots.txt for the Three-Tier Crawler Landscape

AI crawlers split into three tiers: retrieval bots (allow, for citations), training scrapers (disallow), and non-compliant bots (block at the WAF). Each tier needs a different control, and robots.txt only covers the first two.

The Three-Tier Taxonomy

AI crawlers are not monolithic. Each major provider now operates separate bots for distinct purposes, each with its own user-agent string:

Tier	Purpose	User-agents	Recommended action
Tier 1 — User-facing retrieval	Powers real-time citations in AI chat and search	`ChatGPT-User`*, `OAI-SearchBot`, `Claude-User`, `Claude-SearchBot`, `PerplexityBot`†, `Perplexity-User`†	Allow — drives referral traffic and AI citations
Tier 2 — Training scrapers	Ingests content for model training datasets	`GPTBot`, `ClaudeBot`, `Google-Extended`, `Meta-ExternalAgent`	Disallow — no citation benefit; opts out of training data
Tier 3 — Non-compliant bots	Crawlers documented to ignore robots.txt	`Bytespider` (ByteDance)	CDN/WAF block — robots.txt is ineffective

The tier distinction matters because it lets you opt out of model training without giving up AI search citations: disallow the training crawlers and leave the retrieval bots allowed.

* As of OpenAI's December 2025 policy update, ChatGPT-User no longer respects robots.txt; disallow rules are ignored (coverage).

† Cloudflare documented Perplexity rotating user-agents and ASNs to bypass robots.txt (August 2025 report). Use WAF for hard blocks.
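
The taxonomy above can be expressed as a lookup from user-agent token to tier, which is handy when tagging entries in an access log. This is a minimal sketch: the tier assignments come from the table, while the `TIERS` name, the `classify` function, and the substring-matching approach are illustrative assumptions. Note that `Google-Extended` is a robots.txt control token rather than a string Google sends in request headers, so header matching is unlikely to ever see it.

```python
from typing import Optional

# Tier assignments copied from the taxonomy table. The dict layout and the
# function below are illustrative, not part of any provider's API.
TIERS: dict[str, int] = {
    "ChatGPT-User": 1,
    "OAI-SearchBot": 1,
    "Claude-User": 1,
    "Claude-SearchBot": 1,
    "PerplexityBot": 1,
    "Perplexity-User": 1,
    "GPTBot": 2,
    "ClaudeBot": 2,
    "Google-Extended": 2,
    "Meta-ExternalAgent": 2,
    "Bytespider": 3,
}


def classify(user_agent_header: str) -> Optional[int]:
    """Return the tier of the first known token found in the header, else None."""
    header = user_agent_header.lower()
    for token, tier in TIERS.items():
        if token.lower() in header:
            return tier
    return None
```

Matching on the header alone only identifies bots that announce themselves. A crawler that rotates or spoofs its user-agent will come back as `None`.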

Decision Matrix

Goal	Action
Appear in AI search answers (ChatGPT, Claude, Perplexity)	Allow Tier 1
Prevent content entering training datasets	Disallow Tier 2
Stop ByteDance/Bytespider from crawling	WAF custom rule
Opt out of everything	Disallow all AI user-agents + WAF

The emerging practitioner consensus for documentation sites: allow Tier 1, disallow Tier 2.
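
Each row of the matrix that robots.txt can serve reduces to two lists: agents to allow and agents to disallow. The sketch below renders a robots.txt from those lists. The function name `render_robots_txt` and the variable names are hypothetical, and the WAF rows of the matrix are out of scope because they cannot be expressed in robots.txt.

```python
# User-agent lists taken from the tier table; names are illustrative.
TIER_1 = [
    "ChatGPT-User", "OAI-SearchBot", "Claude-User",
    "Claude-SearchBot", "PerplexityBot", "Perplexity-User",
]
TIER_2 = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent"]


def render_robots_txt(allow: list[str], disallow: list[str]) -> str:
    """Build one group per agent: Allow for `allow`, Disallow for `disallow`."""
    groups = ["User-agent: *\nAllow: /"]  # default group for all other crawlers
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allow]
    groups += [f"User-agent: {agent}\nDisallow: /" for agent in disallow]
    return "\n\n".join(groups) + "\n"


# "Appear in AI search answers" plus "prevent content entering training datasets"
citations_without_training = render_robots_txt(allow=TIER_1, disallow=TIER_2)

# "Opt out of everything" (the robots.txt half; a WAF rule is still needed)
opt_out_of_everything = render_robots_txt(allow=[], disallow=TIER_1 + TIER_2)
```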

Reference Configuration

This site's robots.txt implements the three-tier policy:

# ── Default: allow all standard crawlers ──────────────────────────────────────
User-agent: *
Allow: /

# ── Tier 1: User-facing retrieval bots (ALLOW) ────────────────────────────────

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# ── Tier 2: Training scrapers (DISALLOW) ──────────────────────────────────────

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

# ── Tier 3: CDN-level block (robots.txt ineffective) ──────────────────────────
# Bytespider — configure WAF custom rule: User-Agent contains "Bytespider" → Block

Sitemap: https://agentpatterns.ai/sitemap.xml
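
One way to sanity-check a configuration like this before deploying it is to parse it with Python's standard-library `urllib.robotparser` and ask what a compliant crawler would conclude. The sketch below uses an abbreviated copy of the file, and the `allowed` helper is an illustrative name. It only models crawlers that honour robots.txt, so it says nothing about what a non-compliant bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated copy of the reference configuration (one or two agents per tier).
ROBOTS_TXT = """\
User-agent: *
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, url: str = "https://agentpatterns.ai/") -> bool:
    """Would a robots.txt-compliant crawler with this agent name fetch the URL?"""
    return parser.can_fetch(agent, url)


for agent in ("OAI-SearchBot", "GPTBot", "ClaudeBot", "Bytespider"):
    print(f"{agent}: {'allowed' if allowed(agent) else 'disallowed'}")
```

`Bytespider` has no group of its own, so it falls through to the `User-agent: *` group and comes back as allowed. That is consistent with the Tier 3 comment in the file: the block for that crawler has to live at the CDN/WAF layer.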

Compliance Caveats

robots.txt is advisory, not enforceable. Key nuances:

Major providers comply: OpenAI (GPTBot, OAI-SearchBot), Anthropic (ClaudeBot, Claude-SearchBot, Claude-User), and Google (Google-Extended) respect robots.txt directives.
ChatGPT-User exempt (Dec 2025): OpenAI's updated crawler documentation reclassified ChatGPT-User as a user-initiated agent and removed its robots.txt compliance requirement. Disallow rules for ChatGPT-User are now ignored; interactive ChatGPT browsing can only be blocked at the CDN/WAF layer.
Perplexity stealth crawling documented: Cloudflare reported in August 2025 that Perplexity rotates user-agents and ASNs to evade blocks and has been observed ignoring robots.txt. Treat PerplexityBot and Perplexity-User allow-listing as directional only; use WAF rules for any hard block.
Bytespider ignores it: ByteDance's Bytespider is documented as not respecting robots.txt, so block it at the CDN/WAF level. See Cloudflare WAF custom rules for setup.
No legal enforcement: robots.txt does not prevent crawling. It signals intent. Legal protection requires ToS, CFAA claims, or contractual agreements.
EU AI Act alignment: The EU regulatory framework encourages GPAI providers to document and respect publisher opt-out signals — robots.txt disallow for training crawlers is the de facto mechanism. Verify specific commitments against the published Code of Practice text as obligations evolve.
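
Where no CDN or WAF sits in front of the site, the same "User-Agent contains a token, then block" logic can be approximated in the application itself. The sketch below is a plain WSGI middleware and is not a Cloudflare rule. `BlockUserAgents`, `BLOCKED_TOKENS`, `demo_app`, and `status_for` are hypothetical names. Because it matches on the header, it will not catch a crawler that rotates or disguises its user-agent.

```python
BLOCKED_TOKENS = ("bytespider",)  # assumption: extend with agents you must hard-block


class BlockUserAgents:
    """WSGI middleware that answers 403 when the User-Agent contains a blocked token."""

    def __init__(self, app, tokens=BLOCKED_TOKENS):
        self.app = app
        self.tokens = tuple(token.lower() for token in tokens)

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in self.tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Placeholder application standing in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def status_for(user_agent: str) -> str:
    """Run one fake request through the middleware and return the status line."""
    seen = []
    app = BlockUserAgents(demo_app)
    app({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]
```

Blocking at the edge remains preferable when it is available, since the request never reaches the origin server.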

Provider User-Agent Reference

Provider	Training	Search index	User retrieval
OpenAI	`GPTBot`	`OAI-SearchBot`	`ChatGPT-User`*
Anthropic	`ClaudeBot`	`Claude-SearchBot`	`Claude-User`
Google	`Google-Extended`	(standard Googlebot)	`Google-CloudVertexBot`
Perplexity	(PerplexityBot serves both)	`PerplexityBot`	`Perplexity-User`
Meta	`Meta-ExternalAgent`	`Meta-ExternalFetcher`	—

*ChatGPT-User — no longer bound by robots.txt as of OpenAI's December 2025 policy update; block at CDN/WAF if required.

Why Allow Tier 1

Blocking all AI crawlers has a compounding cost:

Retrieval bots power citation-eligible AI answers — being absent means competitors fill that space
AI-referred sessions grew substantially year-over-year through 2025; blocking Tier 1 opts out of this traffic source entirely
Cloudflare data puts the crawl-to-referral ratio at ~1,700:1 for OpenAI and ~73,000:1 for Anthropic, meaning thousands of pages are crawled for every visitor referred. Training crawlers return no referrals; retrieval bots are the ones that send traffic back.
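
To make the ratios concrete, they can be inverted into expected referrals per million pages crawled. This is simple arithmetic on the two figures quoted above, not additional data, and the function name is illustrative.

```python
def referrals_per_million_crawls(crawl_to_referral_ratio: float) -> float:
    """Convert an N:1 crawl-to-referral ratio into referrals per 1,000,000 crawls."""
    return 1_000_000 / crawl_to_referral_ratio


openai_referrals = referrals_per_million_crawls(1_700)       # roughly 588
anthropic_referrals = referrals_per_million_crawls(73_000)   # roughly 14
```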

Key Takeaways

The three-tier taxonomy (retrieval / training / non-compliant) maps directly to three distinct controls: robots.txt allow, robots.txt disallow, and a CDN/WAF block
Blocking training crawlers does not block retrieval bots — they use separate user-agent strings
robots.txt compliance is voluntary; most major providers respect it, but ChatGPT-User was exempted in December 2025 and Perplexity has been documented evading blocks — use CDN/WAF rules when hard enforcement is required
The default strategy for documentation sites: allow Tier 1, disallow Tier 2, WAF-block Bytespider