AI Crawler Policy: robots.txt for the Three-Tier Crawler Landscape
AI crawlers split into three tiers: retrieval bots (allow, for citations), training scrapers (disallow), and non-compliant bots (block at the WAF, since robots.txt does not bind them). Each tier needs its own handling strategy.
The Three-Tier Taxonomy
AI crawlers are not monolithic. Each major provider now operates separate bots for distinct purposes, each with its own user-agent string:
| Tier | Purpose | User-agents | robots.txt behaviour |
|---|---|---|---|
| Tier 1: User-facing retrieval | Powers real-time citations in AI chat and search | ChatGPT-User*, OAI-SearchBot, Claude-User, Claude-SearchBot, PerplexityBot†, Perplexity-User† | Allow: drives referral traffic and AI citations |
| Tier 2: Training scrapers | Ingests content for model training datasets | GPTBot, ClaudeBot, Google-Extended, Meta-ExternalAgent | Disallow: no citation benefit; opts out of training data |
| Tier 3: Non-compliant bots | Crawlers documented to ignore robots.txt | Bytespider (ByteDance) | CDN/WAF block: robots.txt is ineffective |
The tier distinction matters: blocking training crawlers without also blocking retrieval bots keeps your content eligible for AI search citations while opting out of model training datasets.
* As of OpenAI's December 2025 policy update, ChatGPT-User no longer respects robots.txt; disallow rules are ignored (coverage).
† Cloudflare documented Perplexity rotating user-agents and ASNs to bypass robots.txt (August 2025 report). Use WAF for hard blocks.
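As a rough illustration of the taxonomy, the tiers can be expressed as a lookup from user-agent token to tier. This is a minimal sketch: the tier membership is copied from the table above, while the `TIERS` layout and the `classify_user_agent` function name are assumptions made for the example, and the sample header strings are illustrative rather than the exact strings each provider sends.

```python
# Tier membership copied from the table above. The dictionary layout and the
# function name are illustrative assumptions, not part of any provider's API.
TIERS = {
    "retrieval": ["ChatGPT-User", "OAI-SearchBot", "Claude-User",
                  "Claude-SearchBot", "PerplexityBot", "Perplexity-User"],
    "training": ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent"],
    "non_compliant": ["Bytespider"],
}


def classify_user_agent(user_agent: str) -> str:
    """Return the tier whose token appears in the User-Agent header, else 'other'."""
    ua = user_agent.lower()
    for tier, tokens in TIERS.items():
        if any(token.lower() in ua for token in tokens):
            return tier
    return "other"


# Illustrative header values, not verbatim provider strings.
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.1)"))  # training
print(classify_user_agent("Claude-SearchBot"))                      # retrieval
print(classify_user_agent("Bytespider"))                            # non_compliant
```

Because the footnotes note that some crawlers rotate or spoof user-agents, a lookup like this is only useful for reporting on self-identified traffic, not for enforcement.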
Decision Matrix
| Goal | Action |
|---|---|
| Appear in AI search answers (ChatGPT, Claude, Perplexity) | Allow Tier 1 |
| Prevent content entering training datasets | Disallow Tier 2 |
| Stop ByteDance/Bytespider from crawling | WAF custom rule |
| Opt out of everything | Disallow all AI user-agents + WAF |
The emerging practitioner consensus for documentation sites: allow Tier 1, disallow Tier 2.
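The matrix can be turned into a small generator that renders robots.txt groups from two yes/no goals. This is a sketch under stated assumptions: `build_robots_txt` and its parameter names are hypothetical, the agent lists come from the tier table, and the Bytespider row is deliberately left out because that goal needs a WAF rule rather than a robots.txt line.

```python
# Agent lists taken from the tier table. Function and parameter names are
# hypothetical helpers for this example.
RETRIEVAL = ["ChatGPT-User", "OAI-SearchBot", "Claude-User",
             "Claude-SearchBot", "PerplexityBot", "Perplexity-User"]
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "Meta-ExternalAgent"]


def build_robots_txt(allow_retrieval: bool = True, allow_training: bool = False) -> str:
    """Render one robots.txt group per AI user-agent from two yes/no goals."""
    lines = ["User-agent: *", "Allow: /", ""]
    for agents, allowed in ((RETRIEVAL, allow_retrieval), (TRAINING, allow_training)):
        rule = "Allow: /" if allowed else "Disallow: /"
        for agent in agents:
            lines += [f"User-agent: {agent}", rule, ""]
    return "\n".join(lines)


# Defaults follow the "allow Tier 1, disallow Tier 2" consensus described above.
print(build_robots_txt())
```

Calling `build_robots_txt(allow_retrieval=False, allow_training=False)` corresponds to the "opt out of everything" row, minus the WAF part.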
Reference Configuration
This site's robots.txt implements the three-tier policy:
# ── Default: allow all standard crawlers ──────────────────────────────────────
User-agent: *
Allow: /
# ── Tier 1: User-facing retrieval bots (ALLOW) ────────────────────────────────
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /
# ── Tier 2: Training scrapers (DISALLOW) ──────────────────────────────────────
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
# ── Tier 3: CDN-level block (robots.txt ineffective) ──────────────────────────
# Bytespider — configure WAF custom rule: User-Agent contains "Bytespider" → Block
Sitemap: https://agentpatterns.ai/sitemap.xml
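One way to sanity-check a policy like this is to feed it to Python's standard-library `urllib.robotparser` and ask what each agent may fetch. The sketch below uses a shortened excerpt of the configuration; the `allowed` helper and the `/docs/` path are assumptions for the example. Python's parser applies its own matching rules, so treat the result as a plausibility check rather than a guarantee of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the reference configuration (a few agents are enough here).
ROBOTS_TXT = """\
User-agent: *
Allow: /
User-agent: Claude-SearchBot
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str = "/docs/") -> bool:
    """Hypothetical helper: True if the parsed policy lets `agent` fetch `path`."""
    return parser.can_fetch(agent, path)


for agent in ("Claude-SearchBot", "GPTBot", "ClaudeBot", "Bytespider"):
    print(agent, "allow" if allowed(agent) else "disallow")
```

Bytespider has no group of its own, so it falls through to the `*` group and comes back as allowed. That matches the comment in the configuration: the Tier 3 block has to live at the CDN/WAF layer.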
Compliance Caveats
robots.txt is advisory, not enforceable. Key nuances:
- Major providers comply: OpenAI (GPTBot, OAI-SearchBot), Anthropic (ClaudeBot, Claude-SearchBot, Claude-User), and Google (Google-Extended) respect robots.txt directives.
- ChatGPT-User exempt (Dec 2025): OpenAI's updated crawler documentation reclassified `ChatGPT-User` as a user-initiated agent and removed its robots.txt compliance requirement. Disallow rules for `ChatGPT-User` are now ignored; interactive ChatGPT browsing can only be blocked at the CDN/WAF layer.
- Perplexity stealth crawling documented: Cloudflare reported in August 2025 that Perplexity rotates user-agents and ASNs to evade blocks and has been observed ignoring robots.txt. Treat `PerplexityBot` and `Perplexity-User` allow-listing as directional only; use WAF rules for any hard block.
- Bytespider ignores it: ByteDance's Bytespider is documented to not respect robots.txt, so block it at the CDN/WAF level. See Cloudflare WAF custom rules for setup.
- No legal enforcement: robots.txt does not prevent crawling. It signals intent. Legal protection requires ToS, CFAA claims, or contractual agreements.
- EU AI Act alignment: The EU regulatory framework encourages GPAI providers to document and respect publisher opt-out signals, and a `robots.txt` disallow for training crawlers is the de facto mechanism. Verify specific commitments against the published Code of Practice text as obligations evolve.
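To make the difference between an advisory signal and an enforced block concrete, here is a minimal sketch of user-agent blocking at the application layer, written as a plain WSGI middleware. It is not Cloudflare WAF rule syntax and is only a stand-in for the CDN/WAF rule recommended above; `BLOCKED_TOKENS`, `block_crawlers`, `demo_app`, and `status_for` are all hypothetical names. It also shares the weakness noted for Perplexity: a crawler that changes its user-agent string passes straight through.

```python
BLOCKED_TOKENS = ("bytespider",)  # hypothetical blocklist, lowercase tokens


def block_crawlers(app):
    """Wrap a WSGI app and answer 403 when the User-Agent contains a blocked token."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in application that always answers 200."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped app and return the status line."""
    seen = []
    wrapped = block_crawlers(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; Bytespider)"))  # 403 Forbidden
print(status_for("Mozilla/5.0 Firefox/120.0"))             # 200 OK
```

Unlike a robots.txt rule, this refuses the request regardless of whether the client chooses to cooperate, which is the property the caveats above rely on a WAF for.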
Provider User-Agent Reference
| Provider | Training | Search index | User retrieval |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User* |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended | (standard Googlebot) | Google-CloudVertexBot |
| Perplexity | (PerplexityBot serves both) | PerplexityBot | Perplexity-User |
| Meta | Meta-ExternalAgent | Meta-ExternalFetcher | (none) |
*ChatGPT-User — no longer bound by robots.txt as of OpenAI's December 2025 policy update; block at CDN/WAF if required.
Why Allow Tier 1
Blocking all AI crawlers has a compounding cost:
- Retrieval bots power citation-eligible AI answers — being absent means competitors fill that space
- AI-referred sessions grew substantially year-over-year through 2025; blocking Tier 1 opts out of this traffic source entirely
- Cloudflare data shows the crawl-to-referral ratio for OpenAI is ~1,700:1 and Anthropic ~73,000:1 — training crawlers give no referral return; retrieval bots give direct search traffic
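To put the two ratios on a common scale, they can be converted into referrals per million crawl requests. This is simple arithmetic on the approximate figures quoted above; the function name is an assumption for the example, and the results inherit whatever rounding is in the source ratios.

```python
# Crawl-to-referral ratios as quoted above (approximate, per provider).
CRAWLS_PER_REFERRAL = {"OpenAI": 1_700, "Anthropic": 73_000}


def referrals_per_million_crawls(ratio: int) -> float:
    """Convert an N:1 crawl-to-referral ratio into referrals per 1,000,000 crawls."""
    return 1_000_000 / ratio


for provider, ratio in CRAWLS_PER_REFERRAL.items():
    print(f"{provider}: ~{referrals_per_million_crawls(ratio):.0f} referrals per million crawls")
# OpenAI: ~588 referrals per million crawls
# Anthropic: ~14 referrals per million crawls
```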
Key Takeaways
- The three-tier taxonomy (retrieval / training / non-compliant) maps directly to three distinct handling strategies: allow / disallow / CDN block
- Blocking training crawlers does not block retrieval bots, because they use separate user-agent strings
- robots.txt compliance is voluntary; most major providers respect it, but `ChatGPT-User` was exempted in December 2025 and Perplexity has been documented evading blocks, so use CDN/WAF rules when hard enforcement is required
- The default strategy for documentation sites: allow Tier 1, disallow Tier 2, WAF-block Bytespider