Does robots.txt actually block AI crawlers?

Reputable AI crawlers like OpenAI's GPTBot, Google-Extended, and Common Crawl's CCBot honor robots.txt directives. Less ethical scrapers may ignore them — for those you need network-level blocks (Cloudflare AI Bot Block, IP/WAF rules).

Does blocking Google-Extended hurt my Google search ranking?

No. Google-Extended is a standalone robots.txt control token, not a separate crawler. It only controls whether Google may use your content for Gemini and Vertex AI training. It does not affect Googlebot or your search rankings.

How do I check if my website blocks AI bot crawlers?

Run a free CrawlFence scan — it inspects your robots.txt, AI-bot directives, and hosting layer to confirm whether GPTBot, Google-Extended, CCBot and similar agents are blocked.

How to Block AI Crawlers with robots.txt (GPTBot, Google-Extended, CCBot)

A practical, copy-paste guide to keeping GPTBot, Google-Extended, CCBot and other AI training bots out of your content — without hurting your search rankings.

Why this matters

Large language model providers crawl the open web to train and ground their models. If you publish original writing, illustrations, photography, code, or product copy, those bots are very likely scraping it. robots.txt is still the simplest, most widely respected way to say "no", but only if you use the right user-agent names.

GPTBot vs Google-Extended vs CCBot — what's the difference?

GPTBot: OpenAI's training crawler for ChatGPT and future GPT models. Blocking it stops OpenAI from collecting your pages for training from that point on; it does not remove content that was already crawled. It does not affect ChatGPT's search and live browsing features, which use the separate OAI-SearchBot and ChatGPT-User agents.
Google-Extended: Google's opt-out token for Gemini and Vertex AI training. It is not a separate bot; it is a robots.txt token that Google's existing crawlers honor. Blocking it does not affect your Google Search ranking, and Googlebot continues to index normally.
CCBot: Common Crawl's crawler. Its dataset is the raw material for hundreds of AI models, including early GPT models, LLaMA, and many open-source projects. Blocking CCBot is the single highest-leverage rule in this guide.

The copy-paste block (recommended)

Add this to your robots.txt at the root of your site (e.g. https://example.com/robots.txt). It blocks the major AI crawlers, including PerplexityBot, which the next section treats as a live-retrieval agent, while leaving Googlebot, Bingbot and other search crawlers untouched.

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: cohere-ai
Disallow: /

# Keep search engines welcome
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
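
If the list of blocked agents changes often, it can be easier to generate the block above than to edit it by hand. The following is a minimal sketch, not part of any official tooling: the function name build_robots_txt and its parameters are hypothetical, and the output simply mirrors the layout of the copy-paste block.

```python
def build_robots_txt(blocked_agents, sitemap_url=None):
    """Return robots.txt text that disallows each listed agent site-wide.

    `blocked_agents` is a list of user-agent names such as "GPTBot".
    Every other crawler falls through to the permissive `User-agent: *` group.
    """
    lines = ["# Block AI training crawlers"]
    for agent in blocked_agents:
        # One group per agent, separated by a blank line.
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["# Keep search engines welcome", "User-agent: *", "Allow: /"]
    if sitemap_url:
        lines += ["", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"
```

Calling build_robots_txt(["GPTBot", "Google-Extended", "CCBot"], "https://example.com/sitemap.xml") returns text you can write to the robots.txt file at your site root.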

Block only training, allow AI search

Some publishers want to appear in AI answers (where they may be cited with a link) but not become training data. In that case, allow the live-retrieval agents and only block the training ones:

# Block AI training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI live retrieval / answers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
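
Before deploying a split policy like this, it may help to confirm that each agent lands on the side you intended. The sketch below uses Python's standard urllib.robotparser for that. The names ROBOTS_TXT, EXPECTED and policy_mismatches are illustrative assumptions, and Python's parser is only an approximation of how any individual crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rules from this section, pasted in as a string.
ROBOTS_TXT = """\
# Block AI training
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Allow AI live retrieval / answers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /
"""

# Intended outcome: True means the agent may fetch the site.
EXPECTED = {
    "GPTBot": False,
    "Google-Extended": False,
    "CCBot": False,
    "OAI-SearchBot": True,
    "ChatGPT-User": True,
    "PerplexityBot": True,
    "Googlebot": True,
}


def policy_mismatches(robots_txt, expected, path="/"):
    """Return the agents whose parsed permission differs from `expected`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: allowed
        for agent, allowed in expected.items()
        if parser.can_fetch(agent, path) != allowed
    }
```

An empty result from policy_mismatches(ROBOTS_TXT, EXPECTED) means the parsed rules agree with the intended policy.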

Partial blocks (protect only some content)

You can scope rules to specific paths — useful for shielding premium articles, gated docs, or member areas while leaving marketing pages open.

User-agent: GPTBot
Disallow: /members/
Disallow: /premium/
Disallow: /docs/internal/

User-agent: CCBot
Disallow: /members/
Disallow: /premium/
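
Path-scoped rules are easy to get subtly wrong, so a quick local check can be worthwhile. This sketch, again assuming Python's urllib.robotparser as a stand-in for a real crawler, tests a few sample paths against the rules above. The helper name is_allowed and the sample paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The partial-block rules from this section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /members/
Disallow: /premium/
Disallow: /docs/internal/

User-agent: CCBot
Disallow: /members/
Disallow: /premium/
"""

_parser = RobotFileParser()
_parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(agent, path):
    """Return True if `agent` may fetch `path` under the rules above."""
    return _parser.can_fetch(agent, path)
```

With these rules, is_allowed("GPTBot", "/premium/report.html") is False and is_allowed("GPTBot", "/blog/launch-post") is True. Note that is_allowed("CCBot", "/docs/internal/api") is also True, because the CCBot group above does not list /docs/internal/. Add that line to the CCBot group if you want both bots treated the same.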

How to check if your website blocks AI bot crawlers

Three quick checks before you trust your rules:

Open https://yourdomain.com/robots.txt in a browser. The file must load and show the directives above — a 404 or HTML page means it isn't being served.
Check that user-agent names match exactly (case-insensitive but spelling matters: GPTBot, not GPT-Bot; Google-Extended, not GoogleExtended).
Run a free CrawlFence scan — it checks robots.txt rules, AI-bot directives, hosting, CDN, and legal notices (TDM reservations, Impressum) and tells you exactly which AI bots can still reach your content.
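
The first two checks can also be scripted. The following is a rough sketch using only the Python standard library. The function names, the agent list, and the audit user-agent string are assumptions made for illustration, and the script inspects robots.txt only, not your hosting or CDN layer. Python's parser matches agent names by substring and applies the first matching rule, which may differ from how a given crawler interprets the same file.

```python
import urllib.request
from urllib.robotparser import RobotFileParser

# Agent names taken from the rules in this guide; extend as needed.
AI_AGENTS = ["GPTBot", "Google-Extended", "CCBot", "ClaudeBot", "PerplexityBot"]


def fetch_robots_txt(site_url, timeout=10):
    """Download /robots.txt from `site_url` and return its text.

    urlopen raises urllib.error.HTTPError on a 404. An HTML response is
    rejected because it usually means a fallback page was served instead.
    """
    url = site_url.rstrip("/") + "/robots.txt"
    request = urllib.request.Request(
        url, headers={"User-Agent": "robots-audit-sketch/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get_content_type()
        body = response.read().decode("utf-8", errors="replace")
    if content_type == "text/html":
        raise ValueError(f"{url} returned HTML, not a robots.txt file")
    return body


def blocked_agents(robots_txt, agents=AI_AGENTS, path="/"):
    """Map each agent name to True if the rules disallow `path` for it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, path) for agent in agents}
```

A typical call would be blocked_agents(fetch_robots_txt("https://yourdomain.com")). A misspelled name such as GPT-Bot in the file shows up here as GPTBot not being blocked.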

The limits of robots.txt

robots.txt is a voluntary standard. The major named bots above honor it; unethical scrapers don't. For comprehensive protection you also need:

Cloudflare AI Bot Block: one-click network-level blocking that catches bots even when they ignore robots.txt.
TDM Reservation language: a legal opt-out under EU copyright law (Article 4 of the DSM Directive, Directive (EU) 2019/790), published in your terms and ai.txt.
Meta tags like <meta name="robots" content="noai, noimageai"> for per-page signals. These values are a non-standard convention, so expect only some crawlers to honor them.
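
If you cannot put a CDN or WAF in front of your site, a similar refusal can be approximated in the application itself. The sketch below is a small WSGI middleware that returns 403 to requests whose User-Agent header contains a listed token. It assumes a WSGI-based Python site, and block_ai_bots and BLOCKED_TOKENS are hypothetical names. It only stops bots that identify themselves honestly, since the header is trivial to spoof, and it cannot target Google-Extended, which is a robots.txt token rather than a crawler with its own user-agent string.

```python
# Lowercase substrings to look for in the User-Agent header (illustrative list).
BLOCKED_TOKENS = ("gptbot", "ccbot", "claudebot", "bytespider")


def block_ai_bots(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so that matching user agents receive 403 Forbidden."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everyone else reaches the wrapped application unchanged.
        return app(environ, start_response)

    return middleware
```

Wrapping is a single line, for example application = block_ai_bots(application). Treat this as a complement to robots.txt and network-level rules, not a replacement for them.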

Check your site in 30 seconds

CrawlFence scans your robots.txt, hosting, CDN, and legal notices and reports exactly which AI crawlers can still reach your content — free.

Run a free scan
