
robots.txt for AI Crawlers: The 2026 Guide to GPTBot, ClaudeBot, and PerplexityBot

The definitive guide to configuring robots.txt for AI crawlers in 2026. Every known AI bot user-agent string, copy-paste configurations, and the strategic case for allowing vs. blocking.

CrawlReady Team · March 27, 2026 · 4 min read

Your robots.txt controls who can see your site. Most are set wrong.

The robots.txt file is the first thing any well-behaved crawler checks before visiting your site. It's a simple text file that tells bots what they can and can't access.

In the age of AI search, your robots.txt configuration has become a strategic decision: do you want ChatGPT, Perplexity, Claude, and Gemini to know your product exists? It's a foundational part of answer engine optimization (AEO) — getting your content into AI-generated answers.

For most businesses, the answer is yes. But many sites accidentally block AI crawlers — and don't realize they're opting out of the fastest-growing discovery channel.


Every AI crawler you need to know about

Here are the AI crawler user-agent strings we track as of March 2026:

AI Search & Assistant Crawlers

| Bot | User-Agent String | Operated By | Purpose |
| --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | Collects training data for OpenAI models |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing in ChatGPT conversations |
| ClaudeBot | ClaudeBot | Anthropic | Trains Claude models |
| PerplexityBot | PerplexityBot | Perplexity | Powers Perplexity search answers |
| Cohere-ai | cohere-ai | Cohere | Trains Cohere language models |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta | Powers Meta AI features |
| Bytespider | Bytespider | ByteDance/TikTok | AI training and content indexing |
| CCBot | CCBot | Common Crawl | Open dataset used by many AI companies |
| Google-Extended | Google-Extended | Google | Controls use of content for Gemini training and AI features (a robots.txt token; the crawling itself is done by Googlebot) |
| Amazonbot | Amazonbot | Amazon | Powers Alexa and Amazon AI features |

Traditional Search Engine Crawlers

| Bot | User-Agent String | Operated By |
| --- | --- | --- |
| Googlebot | Googlebot | Google |
| Bingbot | bingbot | Microsoft |
| DuckDuckBot | DuckDuckBot | DuckDuckGo |
| YandexBot | YandexBot | Yandex |
| Baiduspider | Baiduspider | Baidu |

Social Media Bots

| Bot | User-Agent String | Platform |
| --- | --- | --- |
| FacebookExternalHit | facebookexternalhit | Facebook/Instagram |
| Twitterbot | Twitterbot | X (Twitter) |
| LinkedInBot | LinkedInBot | LinkedIn |
| WhatsApp | WhatsApp | WhatsApp |
| Slackbot | Slackbot | Slack |
| Discordbot | Discordbot | Discord |

Copy-paste robots.txt configurations

Configuration 1: Allow all crawlers

User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml

This allows all crawlers — search engines, AI bots, and social bots. This is the best default for most businesses that want maximum visibility.

Configuration 2: Explicitly allow AI crawlers

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://your-site.com/sitemap.xml

Explicitly listing AI bots makes your intent clear. Under the Robots Exclusion Protocol (RFC 9309), a crawler follows the group that names it and falls back to the wildcard group only when no named group matches.
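
The fallback from a named group to the wildcard group can be sketched in a few lines of Python. This is a simplified illustration of group selection in the spirit of RFC 9309, not any vendor's actual implementation; the function name `select_group` and the sample group data (including the `/drafts/` path) are made up for the example, and real crawlers may match tokens more loosely than the exact comparison used here.

```python
def select_group(groups, crawler_token):
    """Return the rules of the group a crawler should obey.

    `groups` maps a User-agent value from robots.txt to its list of rules.
    Matching ignores case; the "*" group is used only as a fallback.
    """
    token = crawler_token.lower()
    for agent, rules in groups.items():
        if agent != "*" and agent.lower() == token:
            return rules
    return groups.get("*", [])


# Hypothetical parsed robots.txt: a wildcard group plus one named group.
groups = {
    "*": ["Disallow: /drafts/"],
    "GPTBot": ["Allow: /"],
}

print(select_group(groups, "GPTBot"))     # the named group wins
print(select_group(groups, "ClaudeBot"))  # no named group, so "*" applies
```

The practical consequence: once a bot has its own group, nothing in the wildcard group applies to it anymore.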

Configuration 3: Allow search + AI, protect private areas

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

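User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

Sitemap: https://your-site.com/sitemap.xml

Allows public pages while asking compliant crawlers to stay out of authenticated and private routes. A crawler that matches a named group ignores the wildcard group, so each named group repeats the full Disallow list. robots.txt is advisory rather than access control, so these routes still need real authentication.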
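
Why does `Allow: /` not cancel out `Disallow: /admin/`? RFC 9309 says the longest matching rule wins, with Allow winning a tie. Below is a minimal sketch of that evaluation, assuming plain prefix rules only (the `*` and `$` wildcards are deliberately left out); the function name `is_allowed` and the test paths are illustrative.

```python
def is_allowed(rules, path):
    """Apply longest-match rule evaluation to a URL path.

    `rules` is a list of (directive, path_prefix) pairs from one group.
    The longest matching prefix decides; Allow wins when lengths are equal.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


rules = [("Allow", "/"), ("Disallow", "/admin/"), ("Disallow", "/api/")]

for path in ["/pricing", "/admin/users", "/api/v1/keys", "/administrator"]:
    print(path, "->", "allowed" if is_allowed(rules, path) else "blocked")
```

Note that `/administrator` stays allowed because the rule prefix is `/admin/` with a trailing slash. Some parsers apply rules in file order instead of by length (Python's `urllib.robotparser` is one), and those would stop at `Allow: /`, so test against the crawlers you care about.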

Configuration 4: Allow search engines, block AI training

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Still allow real-time AI browsing and search
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://your-site.com/sitemap.xml

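This blocks AI model training crawlers while allowing real-time AI search bots. Note: the distinction between "training" and "search" isn't always clear-cut. OpenAI documents GPTBot (training) and ChatGPT-User (user-initiated browsing) as separately controllable, but other vendors may use one crawler for both purposes, so check each vendor's documentation before relying on this split.

Configuration 5: Block all AI crawlers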

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml

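This blocks the AI crawlers named in the file while allowing traditional search engines and everything else. Bots that are not named, such as Amazonbot from the table above, still fall under the wildcard group, so add a group for each one you want blocked. Some content publishers use this to protect their content from being used in AI training without compensation.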
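
Because a block list like this has to be kept up to date as new bots appear, it can help to generate the file from a list. The sketch below is one hypothetical way to do that; `build_robots_txt` and `AI_AGENTS` are names invented for this example, and the agent list is only a sample from the table earlier in this guide.

```python
def build_robots_txt(blocked_agents, sitemap_url):
    """Build a robots.txt that blocks each listed agent and allows everyone else."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Sample agents from the table earlier in this guide; edit to taste.
AI_AGENTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "Amazonbot"]

print(build_robots_txt(AI_AGENTS, "https://your-site.com/sitemap.xml"))
```

Adding or removing a bot is then a one-line change to the list.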


The strategic case for allowing AI crawlers

Why you should allow them

  1. AI crawler activity is growing fast. GPTBot is up 305% year over year and PerplexityBot is up 157,490%. Blocking these bots means opting out of a rapidly growing discovery channel.

  2. AI citations drive real traffic. Perplexity, ChatGPT, and Gemini link back to sources. AI referral traffic is up 52% year-over-year for ChatGPT alone. (For the mechanics behind which sources get cited, see how AI search engines choose citations.)

  3. AI visibility compounds. Once AI systems know about your product, they cite you in future answers. Early visibility creates a durable advantage.

  4. Your competitors are doing it. If you block AI crawlers and your competitors don't, AI systems will recommend them instead of you.

  5. The content is already public. If your content is accessible via a web browser, blocking AI crawlers is a policy statement more than a security measure. Determined systems can access public content through other means.

When blocking makes sense

  • Paid content behind paywalls. If your business model depends on content access being gated, blocking AI training crawlers is reasonable.
  • Large content publishers. Media companies with millions of articles may want compensation for AI training use of their content.
  • Sensitive information. If your site contains information that shouldn't be freely reproduced in AI answers, blocking AI crawlers is a reasonable precaution.

For most SaaS companies, startups, and solopreneurs — allow AI crawlers. The visibility benefit far outweighs any downside.


How to check your current robots.txt

Test 1: View it directly

curl https://your-site.com/robots.txt

Test 2: Check for AI bot blocks

curl -s https://your-site.com/robots.txt | grep -i -E -A1 "gptbot|claudebot|perplexitybot|cohere|ccbot"

If you see Disallow: / after any AI bot user-agent, that bot is blocked.
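
The same check can be scripted with Python's standard-library `urllib.robotparser`. This is a rough sketch rather than a definitive audit: the `AI_BOTS` list and the `blocked_bots` helper are assumptions for this example, it only tests whether each bot may fetch the homepage, and this parser applies rules in file order, which can differ from how a given crawler resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "cohere-ai", "CCBot"]


def blocked_bots(robots_txt, url="https://your-site.com/"):
    """Return the bots from AI_BOTS that may not fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, url)]


sample = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

print(blocked_bots(sample))  # ['GPTBot']
```

To test a live site instead of a string, `RobotFileParser` can also download the file itself via `set_url(...)` followed by `read()`.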

Test 3: Check for accidental blocks

Some configurations accidentally block bots:

# This blocks EVERYTHING including all bots:
User-agent: *
Disallow: /

This is more common than you'd think, especially on staging sites that get promoted to production.


Common mistakes

Mistake 1: No robots.txt at all

If your site has no robots.txt, most crawlers will assume everything is allowed — which is fine. But having an explicit robots.txt with a sitemap reference helps crawlers discover your content more efficiently.

Mistake 2: Blocking everything by default

User-agent: *
Disallow: /

This blocks all crawlers. Sometimes left over from development or staging environments.

Mistake 3: CMS/hosting platform defaults

Some platforms add AI bot blocks by default. WordPress security plugins, Cloudflare settings, and server configurations can all inject robots.txt rules you didn't explicitly set. Check regularly.

Mistake 4: Inconsistent casing

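The robots.txt standard (RFC 9309) says user-agent names should be matched case-insensitively, but not every crawler follows the standard strictly. Use the exact casing from the bot documentation to be safe: GPTBot, not gptbot or GPTBOT. Paths are a different story: they are case-sensitive, so Disallow: /Admin/ does not cover /admin/.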
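
RFC 9309 asks crawlers to match user-agent names without regard to case, while path rules remain case-sensitive. One way to see both behaviors is Python's standard-library parser. Treat this as a demonstration of one implementation, not a guarantee about any particular crawler; the `/Private/` path is a made-up example.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# User-agent matching ignores case in this parser...
print(parser.can_fetch("gptbot", "/Private/report"))  # False
print(parser.can_fetch("GPTBOT", "/Private/report"))  # False
# ...but the path rule is case-sensitive, so /private/ is a different path.
print(parser.can_fetch("GPTBot", "/private/report"))  # True
```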

Mistake 5: Forgetting the sitemap

Always include your sitemap URL:

Sitemap: https://your-site.com/sitemap.xml

This helps crawlers discover all your pages, which is especially important for SPAs where internal links might not be visible without JavaScript.
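
To confirm the Sitemap line is actually being picked up, a small check with Python's standard library can help. This sketch assumes Python 3.8 or later (where `RobotFileParser.site_maps()` is available); the helper name `sitemap_urls` is invented for the example.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt):
    """Return every Sitemap URL declared in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


with_sitemap = """\
User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml
"""

print(sitemap_urls(with_sitemap))                 # ['https://your-site.com/sitemap.xml']
print(sitemap_urls("User-agent: *\nAllow: /\n"))  # []
```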


robots.txt is necessary but not sufficient

Here's the critical point: allowing AI crawlers in robots.txt doesn't mean they can read your content.

If your site is a JavaScript SPA, AI crawlers are allowed to visit — but when they do, they see an empty HTML shell. The door is open, but the house is empty.

This is the "split visibility problem" we see constantly in CrawlReady audits: robots.txt allows all bots, but the content visibility gap is 90%+ because everything is rendered by JavaScript.

You need both:

  1. robots.txt that allows AI crawlers (this guide)
  2. HTML that contains your actual content (pre-rendering or SSR)
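
A rough way to see the "empty house" for yourself is to extract the text from the raw HTML response, before any JavaScript runs. The sketch below uses Python's built-in `html.parser`. The class and function names and the two sample pages (including "Acme" and `renderApp`) are made up for illustration, and it approximates what a non-rendering crawler sees rather than reproducing any specific bot.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text present in raw HTML, skipping script and style content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


# Hypothetical pages: a JavaScript SPA shell and a server-rendered page.
spa_shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
ssr_page = "<html><body><h1>Acme Pricing</h1><p>Plans start at $10.</p></body></html>"

print(repr(visible_text(spa_shell)))  # ''
print(repr(visible_text(ssr_page)))   # 'Acme Pricing Plans start at $10.'
```

If the raw HTML of your key pages yields little or no text, allowing crawlers in robots.txt will not be enough on its own.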

Next steps

  1. Check your robots.txt using the commands above
  2. Run a CrawlReady audit to see if your content is actually visible to crawlers, regardless of robots.txt
  3. Read our guide on what crawlers actually see for the full picture
  4. Check our 5 signs of AI visibility problems for a quick diagnostic

AI crawler user-agent strings are current as of March 2026. New AI bots appear regularly. Check CrawlReady's documentation for the latest bot detection list.

Run a free audit and see exactly what Google, ChatGPT, Perplexity, and 20+ crawlers see on your site. Results in 15 seconds.
