Your robots.txt controls who can see your site. Most are set wrong.
The robots.txt file is the first thing any well-behaved crawler checks before visiting your site. It's a simple text file that tells bots what they can and can't access.
In the age of AI search, your robots.txt configuration has become a strategic decision: do you want ChatGPT, Perplexity, Claude, and Gemini to know your product exists? It's a foundational part of answer engine optimization (AEO) — getting your content into AI-generated answers.
For most businesses, the answer is yes. But many sites accidentally block AI crawlers — and don't realize they're opting out of the fastest-growing discovery channel.
Every AI crawler you need to know about
Here are the main AI crawler user-agent strings as of March 2026:
AI Search & Assistant Crawlers
| Bot | User-Agent String | Operated By | Purpose |
|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Collects training data for OpenAI's models |
| ChatGPT-User | ChatGPT-User | OpenAI | Real-time browsing in ChatGPT conversations |
| ClaudeBot | ClaudeBot | Anthropic | Trains Claude models |
| PerplexityBot | PerplexityBot | Perplexity | Powers Perplexity search answers |
| Cohere-ai | cohere-ai | Cohere | Trains Cohere language models |
| Meta-ExternalAgent | Meta-ExternalAgent | Meta | Powers Meta AI features |
| Bytespider | Bytespider | ByteDance/TikTok | AI training and content indexing |
| CCBot | CCBot | Common Crawl | Open dataset used by many AI companies |
| Google-Extended | Google-Extended | Google | Training data for Gemini and AI features |
| Amazonbot | Amazonbot | Amazon | Powers Alexa and Amazon AI features |
Traditional Search Engine Crawlers
| Bot | User-Agent String | Operated By |
|---|---|---|
| Googlebot | Googlebot | Google |
| Bingbot | bingbot | Microsoft |
| DuckDuckBot | DuckDuckBot | DuckDuckGo |
| YandexBot | YandexBot | Yandex |
| Baiduspider | Baiduspider | Baidu |
Social Media Bots
| Bot | User-Agent String | Platform |
|---|---|---|
| FacebookExternalHit | facebookexternalhit | Facebook/Instagram |
| Twitterbot | Twitterbot | X (Twitter) |
| LinkedInBot | LinkedInBot | LinkedIn |
| WhatsApp | WhatsApp | WhatsApp |
| Slackbot | Slackbot | Slack |
| Discordbot | Discordbot | Discord |
Copy-paste robots.txt configurations
Configuration 1: Allow everything (recommended for most sites)
User-agent: *
Allow: /
Sitemap: https://your-site.com/sitemap.xml
This allows all crawlers — search engines, AI bots, and social bots. This is the best default for most businesses that want maximum visibility.
Configuration 2: Explicitly allow AI crawlers
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: cohere-ai
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
Sitemap: https://your-site.com/sitemap.xml
Explicitly listing AI bots makes your intent clear. Crawlers that follow the Robots Exclusion Protocol use the group that names them and fall back to the wildcard group only when no named group matches.
Configuration 3: Allow search + AI, protect private areas
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/
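User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/
User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/
User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/
Sitemap: https://your-site.com/sitemap.xml
This allows public pages while asking crawlers to stay out of authenticated and private routes. Each named group repeats the full Disallow list because a crawler that matches a named group ignores the rules under User-agent: *. robots.txt is advisory, so keep real access control (authentication) on those routes as well.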
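One detail worth checking in any file that mixes a wildcard group with named groups: a crawler uses only the most specific group that matches it, so rules under `User-agent: *` are not inherited. Here is a minimal sketch of that behavior using Python's standard-library parser. The robots.txt text, the `can_fetch` helper, and the `SomeOtherBot` name are made up for illustration. The sample leaves out `Allow: /` on purpose, because Python's parser may apply rules in file order rather than by longest match, which real crawlers are expected to use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the wildcard group blocks two areas,
# but the GPTBot group only repeats one of them.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /dashboard/

User-agent: GPTBot
Disallow: /admin/
"""


def can_fetch(robots_txt: str, agent: str, path: str) -> bool:
    """Parse robots.txt text and ask whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent in ("GPTBot", "SomeOtherBot"):
        for path in ("/admin/", "/dashboard/"):
            print(agent, path, can_fetch(ROBOTS_TXT, agent, path))
```

In this sketch GPTBot is kept out of `/admin/` but can still reach `/dashboard/`, because its own group never mentions that path. A bot with no named group falls back to the wildcard rules and is blocked from both.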
Configuration 4: Allow search engines, block AI training
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Still allow real-time AI browsing and search
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
Sitemap: https://your-site.com/sitemap.xml
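This blocks AI model training crawlers while allowing real-time AI search bots. Note: the distinction between "training" and "search" isn't always clear-cut, and operators change what each crawler does, so check each vendor's crawler documentation before relying on this split. OpenAI, for example, documents GPTBot as its training crawler and lists separate agents (OAI-SearchBot and ChatGPT-User) for search and user-initiated browsing.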
Configuration 5: Block all AI crawlers (not recommended for most sites)
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: *
Allow: /
Sitemap: https://your-site.com/sitemap.xml
This blocks all known AI crawlers while allowing traditional search engines. Some content publishers use this to protect their content from being used in AI training without compensation.
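If you maintain a block list like this, generating the file from a list of bot names can avoid copy-paste slips. The following is a rough sketch, not a required approach. `render_robots` and `AI_BOTS` are hypothetical names, and the bot list simply mirrors the configuration above.

```python
# Bot names copied from the configuration above; adjust to your own policy.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot", "cohere-ai",
    "CCBot", "Google-Extended", "Meta-ExternalAgent", "Bytespider",
]


def render_robots(blocked_agents: list[str], sitemap_url: str) -> str:
    """Build robots.txt text that blocks the given agents and allows the rest."""
    lines: list[str] = []
    for agent in blocked_agents:
        # One group per blocked bot, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Everyone else falls through to the wildcard group.
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_robots(AI_BOTS, "https://your-site.com/sitemap.xml"))
```

Keeping the bot list in one place also makes it easier to review when new crawlers appear.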
The strategic case for allowing AI crawlers
Why you should allow them
- AI search is growing exponentially. GPTBot +305% YoY. PerplexityBot +157,490% YoY. Blocking these bots means opting out of a rapidly growing discovery channel.
- AI citations drive real traffic. Perplexity, ChatGPT, and Gemini link back to sources. AI referral traffic is up 52% year-over-year for ChatGPT alone. (For the mechanics behind which sources get cited, see how AI search engines choose citations.)
- AI visibility compounds. Once AI systems know about your product, they cite you in future answers. Early visibility creates a durable advantage.
- Your competitors are doing it. If you block AI crawlers and your competitors don't, AI systems will recommend them instead of you.
- The content is already public. If your content is accessible via a web browser, blocking AI crawlers is a policy statement more than a security measure. Determined systems can access public content through other means.
When blocking makes sense
- Paid content behind paywalls. If your business model depends on content access being gated, blocking AI training crawlers is reasonable.
- Large content publishers. Media companies with millions of articles may want compensation for AI training use of their content.
- Sensitive information. If your site contains information that shouldn't be freely reproduced in AI answers.
For most SaaS companies, startups, and solopreneurs — allow AI crawlers. The visibility benefit far outweighs any downside.
How to check your current robots.txt
Test 1: View it directly
curl https://your-site.com/robots.txt
Test 2: Check for AI bot blocks
curl -s https://your-site.com/robots.txt | grep -i -E -A1 "gptbot|claudebot|perplexitybot|cohere|ccbot"
If you see Disallow: / after any AI bot user-agent, that bot is blocked.
Test 3: Check for accidental blocks
Some configurations accidentally block bots:
# This blocks EVERYTHING including all bots:
User-agent: *
Disallow: /
This is more common than you'd think, especially on staging sites that get promoted to production.
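The grep check only looks at the line after each user-agent, so it can miss a wildcard block or a Disallow further down a group. As a complement, here is a small sketch that asks Python's standard-library parser which AI bots are blocked from the homepage. `AI_AGENTS` and `blocked_agents` are hypothetical names. The parser is an approximation, and real crawlers may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the tables in this guide.
AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
    "cohere-ai", "CCBot", "Google-Extended",
]


def blocked_agents(robots_txt: str, agents=AI_AGENTS, path: str = "/") -> list[str]:
    """Return the agents that this robots.txt text blocks from `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    # A leftover staging file: the wildcard rule blocks every bot.
    staging = "User-agent: *\nDisallow: /\n"
    print(blocked_agents(staging))
```

To run it against a live site, pass in the text you get back from the curl command in Test 1.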
Common mistakes
Mistake 1: No robots.txt at all
If your site has no robots.txt, most crawlers will assume everything is allowed — which is fine. But having an explicit robots.txt with a sitemap reference helps crawlers discover your content more efficiently.
Mistake 2: Blocking everything by default
User-agent: *
Disallow: /
This blocks all crawlers. Sometimes left over from development or staging environments.
Mistake 3: CMS/hosting platform defaults
Some platforms add AI bot blocks by default. WordPress security plugins, Cloudflare settings, and server configurations can all inject robots.txt rules you didn't explicitly set. Check regularly.
Mistake 4: Inconsistent casing
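RFC 9309 tells crawlers to match user-agent names case-insensitively, but not every crawler follows the standard, and paths in Allow and Disallow rules are case-sensitive. Use the exact casing from the bot documentation: GPTBot, not gptbot or GPTBOT.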
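You can check how a given parser treats casing directly. Below is a short sketch with Python's standard-library parser, which compares user-agent names case-insensitively but compares paths exactly. Other crawlers may behave differently, and the robots.txt text here is invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule with mixed-case agent name and path.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /Private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Same path, different casing of the user-agent name.
agent_case = {
    agent: parser.can_fetch(agent, "/Private/report")
    for agent in ("GPTBot", "gptbot", "GPTBOT")
}

# Same agent, different casing of the path.
path_case = {
    path: parser.can_fetch("GPTBot", path)
    for path in ("/Private/report", "/private/report")
}

if __name__ == "__main__":
    print(agent_case)
    print(path_case)
```

With this parser all three spellings of the agent name are blocked from `/Private/report`, while `/private/report` is allowed because the path casing does not match the rule.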
Mistake 5: Forgetting the sitemap
Always include your sitemap URL:
Sitemap: https://your-site.com/sitemap.xml
This helps crawlers discover all your pages, especially important for SPAs where internal links might not be visible without JavaScript.
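To confirm the sitemap line is actually being picked up, you can parse the file and list what it declares. This is a minimal sketch using `site_maps()` from Python's standard-library parser (available in Python 3.8 and later). `sitemap_urls` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the Sitemap URLs declared in robots.txt text (empty if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemaps.
    return parser.site_maps() or []


if __name__ == "__main__":
    text = "User-agent: *\nAllow: /\nSitemap: https://your-site.com/sitemap.xml\n"
    print(sitemap_urls(text))
```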
robots.txt is necessary but not sufficient
Here's the critical point: allowing AI crawlers in robots.txt doesn't mean they can read your content.
If your site is a JavaScript SPA, AI crawlers are allowed to visit — but when they do, they see an empty HTML shell. The door is open, but the house is empty.
This is the "split visibility problem" we see constantly in CrawlReady audits: robots.txt allows all bots, but the content visibility gap is 90%+ because everything is rendered by JavaScript.
You need both:
- robots.txt that allows AI crawlers (this guide)
- HTML that contains your actual content (pre-rendering or SSR)
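A rough way to see the difference is to look at the text present in the raw HTML, before any JavaScript runs. The sketch below uses only Python's standard library. `visible_text` is a hypothetical helper, and skipping `script`, `style`, `template`, and `title` is a simplifying assumption rather than a description of how any particular crawler works.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text that appears outside script/style/template/title tags."""

    SKIP = {"script", "style", "template", "title"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the body text a non-JavaScript client could read from `html`."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


if __name__ == "__main__":
    spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
    print(repr(visible_text(spa_shell)))
```

Feed it the response body from a plain HTTP request (for example, the output of curl). If the result is empty or only a few words long, crawlers that don't run JavaScript are likely seeing the same thing.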
Next steps
- Check your robots.txt using the commands above
- Run a CrawlReady audit to see if your content is actually visible to crawlers, regardless of robots.txt
- Read our guide on what crawlers actually see for the full picture
- Check our 5 signs of AI visibility problems for a quick diagnostic
AI crawler user-agent strings are current as of March 2026. New AI bots appear regularly. Check CrawlReady's documentation for the latest bot detection list.