Guide · AI crawler protection
How to block AI crawlers with robots.txt
A practical, copy-paste guide to keeping GPTBot, Google-Extended, CCBot and other AI training bots out of your content — without hurting your search rankings.
Why this matters
Large language model providers crawl the open web to train and ground their models. If you publish original writing, illustrations, photography, code, or product copy, those bots are very likely scraping it. robots.txt is still the simplest, most widely respected way to say "no", but only if you use the right user-agent names.
GPTBot vs Google-Extended vs CCBot — what's the difference?
- GPTBot: OpenAI's training crawler for ChatGPT and future GPT models. Blocking it keeps newly crawled content out of OpenAI's training data going forward; it does not remove anything already collected. It also does not affect ChatGPT's live search and browsing features, which use OAI-SearchBot and ChatGPT-User.
- Google-Extended: Google's opt-out token for Gemini and Vertex AI training. It is not a separate bot; it is a robots.txt token that Google's existing crawlers honor. Blocking it does not affect your Google Search ranking, and Googlebot continues to index normally.
- CCBot: Common Crawl's crawler. Its dataset is the raw material for hundreds of AI models, including early GPT, LLaMA, and many open-source projects. Blocking CCBot is the single highest-leverage rule in this guide.
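One way to keep the training/retrieval distinction straight is to generate the rules from a small table of tokens instead of editing the file by hand. The following is a minimal sketch; the names TRAINING_TOKENS, RETRIEVAL_TOKENS, and build_robots_txt are hypothetical and chosen for illustration only.

```python
# Hypothetical helper: these names are illustrative, not part of any library.
TRAINING_TOKENS = ["GPTBot", "Google-Extended", "CCBot"]
RETRIEVAL_TOKENS = ["OAI-SearchBot", "ChatGPT-User"]


def build_robots_txt(blocked, allowed=()):
    """Return robots.txt text that disallows every token in `blocked` site-wide."""
    lines = []
    for token in blocked:
        lines += [f"User-agent: {token}", "Disallow: /", ""]
    for token in allowed:
        lines += [f"User-agent: {token}", "Allow: /", ""]
    # Everything else, including regular search crawlers, stays allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(TRAINING_TOKENS, RETRIEVAL_TOKENS))
```

Passing only the training tokens produces a block-everything-AI file; passing both lists produces a file that blocks training while explicitly allowing the live-retrieval agents.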
The copy-paste block (recommended)
Add this to your robots.txt at the root of your site (e.g. https://example.com/robots.txt). It blocks the major AI training crawlers while leaving Googlebot, Bingbot and other search crawlers untouched.
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: cohere-ai
Disallow: /
# Keep search engines welcome
User-agent: *
Allow: /
Sitemap: https://example.com/sitemap.xml
Block only training, allow AI search
Some publishers want to appear in AI answers (where they may be cited with a link) but not become training data. In that case, allow the live-retrieval agents and only block the training ones:
# Block AI training
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
# Allow AI live retrieval / answers
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: *
Allow: /
Partial blocks (protect only some content)
You can scope rules to specific paths — useful for shielding premium articles, gated docs, or member areas while leaving marketing pages open.
User-agent: GPTBot
Disallow: /members/
Disallow: /premium/
Disallow: /docs/internal/
User-agent: CCBot
Disallow: /members/
Disallow: /premium/
How to check if your website blocks AI bot crawlers
Three quick checks before you trust your rules:
- Open https://yourdomain.com/robots.txt in a browser. The file must load and show the directives above; a 404 or an HTML page means it isn't being served.
- Check that user-agent names match exactly (matching is case-insensitive, but spelling matters: GPTBot, not GPT-Bot; Google-Extended, not GoogleExtended).
- Run a free CrawlFence scan. It checks robots.txt rules, AI-bot directives, hosting, CDN, and legal notices (TDM reservations, Impressum) and tells you exactly which AI bots can still reach your content.
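The first two checks can also be scripted. The sketch below uses Python's standard-library urllib.robotparser to evaluate a set of rules against a few user-agent tokens. SAMPLE_ROBOTS_TXT and check_access are illustrative names, and the sample rules are an assumption made for the demo; paste in your own file's contents to test real rules.

```python
from urllib.robotparser import RobotFileParser

# Sample rules for illustration only; replace with your own robots.txt text.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /members/

User-agent: *
Allow: /
"""


def check_access(robots_txt, agents, url):
    """Map each user-agent token to True if the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "CCBot", "Googlebot"]
    for path in ("/", "/members/report"):
        print(path, check_access(SAMPLE_ROBOTS_TXT, agents, "https://example.com" + path))
```

With the sample rules, GPTBot is refused everywhere, CCBot is refused only under /members/, and Googlebot falls through to the wildcard group and is allowed. Note that urllib.robotparser applies the first matching rule in a group, so for files that mix overlapping Allow and Disallow lines its answer may differ from a crawler that uses longest-match precedence; treat the result as a sanity check rather than a guarantee.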
The limits of robots.txt
robots.txt is a voluntary standard. The major named bots above honor it; unethical scrapers don't. For comprehensive protection you also need:
- Cloudflare AI Bot Block — one-click network-level blocking that catches bots even when they ignore robots.txt.
- TDM Reservation language — a legal opt-out under EU copyright law (Article 4 of the DSM Directive) published in your terms and ai.txt.
- Meta tags like <meta name="robots" content="noai, noimageai"> for per-page signals (a non-standard convention that not every crawler honors).
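To illustrate what enforcement beyond robots.txt looks like, here is a minimal application-level sketch: a WSGI wrapper that returns 403 to requests whose User-Agent header names a blocked bot. This is not how Cloudflare implements its blocking; block_ai_bots, demo_app, status_for, and the BLOCKED_TOKENS list are hypothetical names chosen for this example.

```python
# Tokens taken from the copy-paste block earlier in this guide; extend as needed.
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


def block_ai_bots(app):
    """Wrap a WSGI app and answer 403 when the User-Agent names a blocked bot."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for a real site (hypothetical)."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status line."""
    seen = []
    wrapped = block_ai_bots(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent}, lambda status, headers: seen.append(status))
    return seen[0]


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(status_for("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```

A filter like this only catches bots that identify themselves honestly, since the User-Agent header is trivial to spoof. It also cannot act on Google-Extended, which is a robots.txt token only and never appears as a request user agent.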
Check your site in 30 seconds
CrawlFence scans your robots.txt, hosting, CDN, and legal notices and reports exactly which AI crawlers can still reach your content — free.