robots.txt for AI Crawlers: The 2026 Guide to GPTBot, ClaudeBot, and PerplexityBot

Your robots.txt tells crawlers which parts of your site they may fetch. Many sites get it wrong.

The robots.txt file is the first thing any well-behaved crawler checks before visiting your site. It's a simple text file that tells bots what they can and can't access.

In the age of AI search, your robots.txt configuration has become a strategic decision: do you want ChatGPT, Perplexity, Claude, and Gemini to know your product exists? It's a foundational part of answer engine optimization (AEO) — getting your content into AI-generated answers.

For most businesses, the answer is yes. But many sites accidentally block AI crawlers — and don't realize they're opting out of the fastest-growing discovery channel.

Every AI crawler you need to know about

Here are the main AI crawler user-agent tokens as of March 2026:

AI Search & Assistant Crawlers

Bot	User-Agent String	Operated By	Purpose
GPTBot	`GPTBot`	OpenAI	Collects content that may be used to train OpenAI's models
ChatGPT-User	`ChatGPT-User`	OpenAI	Real-time browsing in ChatGPT conversations
ClaudeBot	`ClaudeBot`	Anthropic	Trains Claude models
PerplexityBot	`PerplexityBot`	Perplexity	Powers Perplexity search answers
Cohere-ai	`cohere-ai`	Cohere	Trains Cohere language models
Meta-ExternalAgent	`Meta-ExternalAgent`	Meta	Powers Meta AI features
Bytespider	`Bytespider`	ByteDance/TikTok	AI training and content indexing
CCBot	`CCBot`	Common Crawl	Open dataset used by many AI companies
Google-Extended	`Google-Extended`	Google	Controls use of content for Gemini training (a robots.txt token; fetching is done by Google's existing crawlers)
Amazonbot	`Amazonbot`	Amazon	Powers Alexa and Amazon AI features

Traditional Search Engine Crawlers

Bot	User-Agent String	Operated By
Googlebot	`Googlebot`	Google
Bingbot	`bingbot`	Microsoft
DuckDuckBot	`DuckDuckBot`	DuckDuckGo
YandexBot	`YandexBot`	Yandex
Baiduspider	`Baiduspider`	Baidu

Social Media Bots

Bot	User-Agent String	Platform
FacebookExternalHit	`facebookexternalhit`	Facebook/Instagram
Twitterbot	`Twitterbot`	X (Twitter)
LinkedInBot	`LinkedInBot`	LinkedIn
WhatsApp	`WhatsApp`	WhatsApp
Slackbot	`Slackbot`	Slack
Discordbot	`Discordbot`	Discord

Copy-paste robots.txt configurations

Configuration 1: Allow everything (recommended for most sites)

User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml

This allows all crawlers — search engines, AI bots, and social bots. This is the best default for most businesses that want maximum visibility.

Configuration 2: Explicitly allow AI crawlers

User-agent: *
Allow: /

User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: cohere-ai
Allow: /

User-agent: Meta-ExternalAgent
Allow: /

Sitemap: https://your-site.com/sitemap.xml

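Explicitly listing AI bots makes your intent clear, but it changes how rules apply: under RFC 9309, a crawler follows only the most specific group that names it and ignores the wildcard group. Anything you disallow under * must therefore be repeated in each named group.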
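
To see how group selection plays out, here is a minimal sketch using Python's standard-library urllib.robotparser. The rules and the SomeOtherBot name are made up for illustration, and real crawlers may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules (not from any real site): the wildcard group hides
# /drafts/, while GPTBot has its own group that allows everything.
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/

User-agent: GPTBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group, so the wildcard Disallow does not apply to it.
gptbot_drafts = parser.can_fetch("GPTBot", "/drafts/post")

# A bot with no group of its own falls back to the wildcard group.
other_drafts = parser.can_fetch("SomeOtherBot", "/drafts/post")

print(gptbot_drafts, other_drafts)  # True False
```

The named group replaces the wildcard group for that bot rather than adding to it, which is easy to miss when a file mixes both.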

Configuration 3: Allow search + AI, protect private areas

User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

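# A named group replaces the wildcard group for that bot,
# so every Disallow has to be repeated here.
User-agent: GPTBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

User-agent: ClaudeBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/

User-agent: PerplexityBot
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /dashboard/
Disallow: /account/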

Sitemap: https://your-site.com/sitemap.xml

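This lets crawlers into public pages while asking them to stay out of authenticated and private routes. robots.txt is advisory and publicly readable, so it is not access control: those routes still need real authentication.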
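
Within a single group, RFC 9309 resolves conflicts by taking the longest matching rule, with Allow winning ties, so a broad Allow: / does not cancel a narrower Disallow: /admin/. The sketch below implements only that precedence rule. The is_allowed helper is hypothetical, and it ignores the * and $ wildcards that real parsers also support.

```python
def is_allowed(rules, path):
    """Return True if `path` may be crawled under one group's rules.

    `rules` is a list of (directive, prefix) pairs, e.g. ("disallow", "/admin/").
    The longest matching prefix wins; on a tie, "allow" wins.
    """
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The wildcard group from Configuration 3.
wildcard_rules = [
    ("allow", "/"),
    ("disallow", "/admin/"),
    ("disallow", "/api/"),
    ("disallow", "/dashboard/"),
    ("disallow", "/account/"),
]

print(is_allowed(wildcard_rules, "/pricing"))           # True
print(is_allowed(wildcard_rules, "/admin/users"))       # False
print(is_allowed(wildcard_rules, "/account/settings"))  # False
```

Not every parser implements longest-match precedence, so it is worth testing a file against the specific crawlers you care about.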

Configuration 4: Allow search engines, block AI training

User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Still allow real-time AI browsing and search
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://your-site.com/sitemap.xml

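This blocks crawlers that collect training data while allowing bots that fetch pages in real time for AI answers. Note: the distinction between "training" and "search" isn't always clear-cut, and it varies by vendor. OpenAI, for example, documents GPTBot as its training crawler and a separate OAI-SearchBot for ChatGPT search results, so check each vendor's current documentation before assuming what a block affects.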

Configuration 5: Block all AI crawlers (not recommended for most sites)

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml

This blocks the major AI crawlers named here, provided they honor robots.txt, while allowing traditional search engines. Some content publishers use this to protect their content from being used in AI training without compensation.

The strategic case for allowing AI crawlers

Why you should allow them

AI search is growing fast. GPTBot activity is up 305% year over year, and PerplexityBot's is up 157,490%. Blocking these bots means opting out of a rapidly growing discovery channel.
AI citations drive real traffic. Perplexity, ChatGPT, and Gemini link back to sources. AI referral traffic is up 52% year-over-year for ChatGPT alone. (For the mechanics behind which sources get cited, see how AI search engines choose citations.)
AI visibility compounds. Once AI systems know about your product, they cite you in future answers. Early visibility creates a durable advantage.
Your competitors are doing it. If you block AI crawlers and your competitors don't, AI systems will recommend them instead of you.
The content is already public. If your content is accessible via a web browser, blocking AI crawlers is a policy statement more than a security measure. Determined systems can access public content through other means.

When blocking makes sense

Paid content behind paywalls. If your business model depends on content access being gated, blocking AI training crawlers is reasonable.
Large content publishers. Media companies with millions of articles may want compensation for AI training use of their content.
Sensitive information. If your site contains information that shouldn't be freely reproduced in AI answers.

For most SaaS companies, startups, and solopreneurs — allow AI crawlers. The visibility benefit far outweighs any downside.

How to check your current robots.txt

Test 1: View it directly

curl https://your-site.com/robots.txt

Test 2: Check for AI bot blocks

curl -sL https://your-site.com/robots.txt | grep -i -E -A3 "gptbot|claudebot|perplexitybot|cohere|ccbot"

If you see Disallow: / after any AI bot user-agent, that bot is blocked.
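
Reading grep output by eye gets error-prone on long files. As an alternative, here is a minimal sketch that evaluates the rules with Python's standard-library urllib.robotparser. The blocked_bots helper, the AI_BOTS list, and the sample rules are assumptions for illustration; in practice you would pass in the text returned by the curl command.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical watch list; extend it with the bots you care about.
AI_BOTS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot",
           "cohere-ai", "CCBot", "Google-Extended"]


def blocked_bots(robots_txt, bots=AI_BOTS, path="/"):
    """Return the bots from `bots` that may not fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in bots if not parser.can_fetch(bot, path)]


# Sample rules for illustration only.
SAMPLE = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

print(blocked_bots(SAMPLE))  # ['GPTBot', 'CCBot']
```

A blanket User-agent: * with Disallow: / shows up here as every bot on the list being reported.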

Test 3: Check for accidental blocks

Some configurations accidentally block bots:

# This blocks EVERYTHING including all bots:
User-agent: *
Disallow: /

This is more common than you'd think, especially on staging sites that get promoted to production.

Common mistakes

Mistake 1: No robots.txt at all

If your site has no robots.txt, most crawlers will assume everything is allowed — which is fine. But having an explicit robots.txt with a sitemap reference helps crawlers discover your content more efficiently.

Mistake 2: Blocking everything by default

User-agent: *
Disallow: /

This blocks all compliant crawlers. It is often left over from a development or staging environment.

Mistake 3: CMS/hosting platform defaults

Some platforms add AI bot blocks by default. WordPress security plugins, Cloudflare settings, and server configurations can all inject robots.txt rules you didn't explicitly set. Check regularly.

Mistake 4: Inconsistent casing

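The robots.txt standard (RFC 9309) says user-agent names are matched case-insensitively, so compliant crawlers treat GPTBot, gptbot, and GPTBOT alike. Not every parser follows the standard, so use the exact casing from the bot's documentation anyway. Paths in Allow and Disallow rules are case-sensitive: /Admin/ and /admin/ are different rules.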
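
As one data point, Python's standard-library parser compares user-agent names case-insensitively, which is what RFC 9309 specifies, while comparing paths case-sensitively. A minimal sketch with made-up rules follows. Other parsers may behave differently, which is why matching the documented casing remains the safe choice.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules: the bot name is lowercase, the path is capitalized.
parser = RobotFileParser()
parser.parse(["User-agent: gptbot", "Disallow: /Private/"])

# The user-agent name matches regardless of case.
mixed_case = parser.can_fetch("GPTBot", "/Private/report")
upper_case = parser.can_fetch("GPTBOT", "/Private/report")

# The path does not: /private/ is not covered by a rule for /Private/.
lower_path = parser.can_fetch("GPTBot", "/private/report")

print(mixed_case, upper_case, lower_path)  # False False True
```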

Mistake 5: Forgetting the sitemap

Always include your sitemap URL:

Sitemap: https://your-site.com/sitemap.xml

This helps crawlers discover all your pages, especially important for SPAs where internal links might not be visible without JavaScript.
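
To confirm that a Sitemap line is actually being picked up, you can parse the file and list what a parser finds. This sketch uses site_maps() from Python's standard-library urllib.robotparser (available from Python 3.8). The robots.txt text is a placeholder using the same example URL as above.

```python
from urllib.robotparser import RobotFileParser

# Placeholder robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /

Sitemap: https://your-site.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of Sitemap URLs, or None when the file has none.
sitemaps = parser.site_maps()
print(sitemaps)  # ['https://your-site.com/sitemap.xml']
```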

robots.txt is necessary but not sufficient

Here's the critical point: allowing AI crawlers in robots.txt doesn't mean they can read your content.

If your site is a client-rendered JavaScript SPA, AI crawlers are allowed to visit, but any crawler that doesn't execute JavaScript sees only an empty HTML shell. The door is open, but the house is empty.

This is the "split visibility problem" we see constantly in CrawlReady audits: robots.txt allows all bots, but the content visibility gap is 90%+ because everything is rendered by JavaScript.

You need both:

robots.txt that allows AI crawlers (this guide)
HTML that contains your actual content (pre-rendering or SSR)
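
One rough way to check the second condition is to look at how much readable text the raw HTML contains before any JavaScript runs. The following is a sketch using Python's standard-library html.parser. The visible_text helper and the two sample pages are hypothetical, it deliberately ignores noscript fallbacks, and it is not how CrawlReady measures the visibility gap.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text outside <script>, <style>, and <noscript> elements."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return text present in raw HTML, ignoring scripts, styles, and noscript."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up pages: a client-rendered shell and a server-rendered page.
SPA_SHELL = '<html><body><div id="root"></div><script>renderApp()</script></body></html>'
SSR_PAGE = '<html><body><h1>Acme Pricing</h1><p>Plans start at $10.</p></body></html>'

print(len(visible_text(SPA_SHELL)))  # 0
print(visible_text(SSR_PAGE))        # Acme Pricing Plans start at $10.
```

If the result for your real pages is empty or only a few words, a crawler that skips JavaScript is probably seeing the same thing.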

Next steps

Check your robots.txt using the commands above
Run a CrawlReady audit to see if your content is actually visible to crawlers, regardless of robots.txt
Read our guide on what crawlers actually see for the full picture
Check our 5 signs of AI visibility problems for a quick diagnostic

AI crawler user-agent strings are current as of March 2026. New AI bots appear regularly. Check CrawlReady's documentation for the latest bot detection list.
