The Complete Guide to robots.txt: Rules, Examples, and AI Crawlers

When a search engine or an AI model visits your website, the first file it looks for is robots.txt. This small text file at your domain root tells crawlers which parts of your site they can fetch and which parts to skip. The file has been around since 1994, but the arrival of AI crawlers like GPTBot, ClaudeBot, and PerplexityBot has given it new weight. This guide covers the syntax, the patterns that solve real problems, how to handle AI crawlers, and the mistakes that quietly break the file.
What robots.txt Is and What It Cannot Do
A robots.txt file is a plain text document served at https://yourdomain.com/robots.txt. When a crawler visits a site for the first time in a session, it fetches this file before touching anything else. The file lists rules like “skip this directory” or “do not visit this path”. Crawlers that follow the Robots Exclusion Protocol read the file and adjust their behavior.
The key word in that sentence is follow. robots.txt is a politeness mechanism, not a security mechanism. Well-behaved crawlers from Google, Bing, OpenAI, and Anthropic read the file and respect the rules. Malicious scrapers ignore it completely. If you have private data at a URL, putting that URL in a Disallow rule does not hide it. The URL is still public. Anyone who knows the path can open it in a browser.
A second common misconception is that robots.txt prevents indexing. It does not. It prevents crawling. If Googlebot cannot crawl a page (because it is disallowed) but finds the URL through external links, Google may still list the URL in search results, often with the text “No information is available for this page”. To prevent indexing, you need a noindex directive in a robots meta tag or an X-Robots-Tag HTTP header, which the bot can only see if it is allowed to crawl the page.
A third point worth clarifying: robots.txt does not replace XML sitemaps. It can reference them, but the two files do different jobs. robots.txt is about exclusion; a sitemap is about discovery.
So why use robots.txt at all? Because for good actors, which account for the vast majority of meaningful crawler traffic to your site, it offers clean, explicit control. It saves crawl budget on large sites by keeping bots away from infinite URL spaces like faceted navigation or search result pages. It tells crawlers where to find your sitemap. And as of 2024, it has become the main tool site owners use to signal consent or refusal to AI crawlers.
The Syntax in One Page

The syntax has four directives that cover nearly every real use case: User-agent, Disallow, Allow, and Sitemap.
A minimal valid file looks like this:
```
User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
```

The file is organized in groups. A group starts with one or more User-agent lines and ends at the next group or at the end of the file. The wildcard * matches any crawler not named elsewhere. Named user agents create their own group.
Disallow is a path prefix match. Disallow: /admin/ blocks any URL whose path starts with /admin/. Disallow: / blocks the entire site for that user agent. An empty Disallow: is equivalent to allowing everything.
Allow is the exception to a Disallow rule. It lets you carve out a subdirectory inside a disallowed parent. When Disallow and Allow conflict, most modern crawlers apply the most specific (longest) match. Parsers that follow the original 1994 convention used the first matching rule instead, so keep the more specific rule on top if you also need to support older parsers.
Two wildcards are widely supported: * matches any sequence of characters in a path, and $ anchors the end of a URL. For example, Disallow: /*.pdf$ blocks any URL ending in .pdf.
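To make the precedence and wildcard behavior concrete, here is a minimal Python sketch of longest-match evaluation, with ties going to Allow. The helper names (rule_to_regex, is_allowed) are made up for illustration; real crawlers use their own parsers, so treat this as a model of the rules rather than a reference implementation.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    * matches any character sequence, a trailing $ anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching pattern wins; ties go to allow; no match means allowed."""
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if pattern and rule_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, kind == "allow"
    return allowed

rules = [("disallow", "/admin/"), ("allow", "/admin/public/"), ("disallow", "/*.pdf$")]
print(is_allowed(rules, "/admin/settings"))          # False: /admin/ is the longest match
print(is_allowed(rules, "/admin/public/help.html"))  # True: the longer Allow rule wins
print(is_allowed(rules, "/files/report.pdf"))        # False: matches /*.pdf$
print(is_allowed(rules, "/files/report.pdf?x=1"))    # True: the $ anchor no longer matches
```

The last two lines show why the $ anchor matters: the same PDF path with a query string appended escapes the anchored rule.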
A few rules most people forget:
- User-agent matching is case insensitive. Googlebot and googlebot refer to the same bot.
- Path matching is case sensitive. /Admin/ and /admin/ are different paths.
- Sitemap directives live outside any group. Multiple Sitemap lines are allowed, one per line, using absolute URLs.
- Comments start with # and run to the end of the line.
If you want the full specification, Google Search Central’s introduction to robots.txt is the most accessible reference, and the IETF RFC 9309 specification is the official protocol document.
Common Patterns That Actually Solve Problems
Five patterns cover perhaps 90 percent of real use cases.
Block internal tools from all bots. Admin panels, dashboards, and internal APIs should not be crawled. These URLs often return HTML that is noindex by design, but keeping them out of the crawl budget is cleaner.
```
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
```

Block on-site search result pages. Search queries generate nearly infinite URL variants. Letting crawlers follow them wastes crawl budget and produces low-quality pages in search results.
```
User-agent: *
Disallow: /search
Disallow: /*?q=
```

Control faceted navigation. E-commerce sites with color, size, and brand filters can explode into millions of parameter combinations. Block the parameter patterns that have no SEO value. Learn more about how crawl budget works and when to restrict parameter URLs.
```
User-agent: *
Disallow: /*?color=
Disallow: /*?sort=
Disallow: /*?view=
```

Handle staging subdomains carefully. A live robots.txt that blocks everything is a common way to isolate staging. The risk is that the same file accidentally ships to production. A safer approach is HTTP basic auth or IP allowlists on the staging server. If you must use robots.txt:
```
# staging.example.com/robots.txt
User-agent: *
Disallow: /
```

Just remember to replace this file before the site goes live. Many teams have lost weeks of traffic to this single line.
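If you want to catch that mistake automatically, a small pre-deploy check can fail the build when the blanket rule ships. The snippet below is a rough sketch, not part of any deployment tool; the URL is a placeholder for your own production domain.

```python
import sys
import urllib.request

PROD_ROBOTS = "https://example.com/robots.txt"  # placeholder: your production URL

def blocks_everything(robots_text: str) -> bool:
    """True if a 'User-agent: *' group contains a bare 'Disallow: /' rule."""
    group_agents, collecting = [], True
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not collecting:                # a rule line ended the previous group
                group_agents = []
                collecting = True
            group_agents.append(value)
        else:
            collecting = False
            if field == "disallow" and value == "/" and "*" in group_agents:
                return True
    return False

with urllib.request.urlopen(PROD_ROBOTS, timeout=10) as resp:
    text = resp.read().decode("utf-8", errors="replace")
if blocks_everything(text):
    sys.exit("robots.txt blocks the whole site for all user agents")
print("robots.txt looks safe")
```

Run as the last step of a deploy pipeline; a non-zero exit stops the release before the staging file can do any damage.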
Point crawlers to your sitemap. One line at the end of the file saves a round trip for every good bot.
```
Sitemap: https://example.com/sitemap.xml
```

If you maintain multiple sitemaps, list each one. The full discovery chain, from robots.txt to the sitemap index to individual sitemap files, is explored in more depth in the complete guide to SEO crawlers.
AI Crawlers: GPTBot, ClaudeBot, and the Block vs Allow Decision

As of 2026, a new set of crawlers is visiting your site. They are not indexing for a traditional search engine. They are gathering content for large language models that answer questions directly. Whether you want this to happen is an editorial decision, and robots.txt is where you express it.
The main AI bots to know:
| Bot | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT |
| OAI-SearchBot | OpenAI | Search results in ChatGPT |
| ChatGPT-User | OpenAI | User-triggered page fetches inside ChatGPT |
| ClaudeBot | Anthropic | Training and search for Claude |
| Claude-Web | Anthropic | User-triggered fetches inside Claude |
| PerplexityBot | Perplexity | Search index for Perplexity answers |
| Perplexity-User | Perplexity | User-triggered fetches |
| Google-Extended | Google | Training for Gemini (separate from Googlebot) |
| CCBot | Common Crawl | Open web archive used by many AI models |
| Amazonbot | Amazon | Training for Amazon AI products |
| Bytespider | ByteDance | Training for ByteDance models |
| Applebot-Extended | Apple | Training for Apple Intelligence |
The block versus allow decision comes down to what you are optimizing for.
Block all AI training bots if your content has commercial or licensing value, or if you simply prefer not to contribute to model training. A common pattern:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Allow AI search but block training. Some bots are used for real-time answer generation, not for building training sets. If you want to appear in ChatGPT Search or Perplexity answers while staying out of training data, allow the search-focused bots and block the training ones:
```
# Allow AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Block AI training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Allow everything if your content strategy depends on maximum visibility across both traditional search and AI answers. This is increasingly the choice for publishers, SaaS marketing sites, and documentation. The reasoning is straightforward: AI search is a growing channel, and being cited in AI answers sends qualified traffic. The tradeoff is that your content becomes part of the knowledge these models surface with or without attribution.
Seodisias checks your site against 14 known AI bots as part of its AI Ready analysis, so you can see which ones you have allowed, which ones you have blocked, and where your robots.txt is silent (which implicitly allows). The broader picture of optimizing for AI search is covered in the Generative Engine Optimization playbook.
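If you want to script a similar check yourself, Python's standard library ships a robots.txt parser. The sketch below asks, for each bot in the table above, whether the site root may be fetched. The domain is a placeholder, and note that urllib.robotparser does not implement the wildcard extensions, so treat the output as a rough check rather than a Google-accurate verdict.

```python
from urllib import robotparser

# Bot names mirror the table above; trim or extend the list as needed.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot",
    "Amazonbot", "Bytespider", "Applebot-Extended",
]

SITE = "https://example.com"  # placeholder: your own domain

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live file

for bot in AI_BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot:<20} {verdict}")
```

A bot that appears as "allowed" here may simply be unmentioned in your file; silence is implicit permission, which is exactly the gap this kind of check makes visible.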
One nuance worth noting: some of these bots identify themselves in request headers but do not always use a User-agent token you can target in robots.txt. robots.txt only works for bots that announce themselves and honor the protocol. A bot that wants to scrape will scrape. The goal here is to manage the well-behaved majority, not to build a fortress. For the authoritative bot names and behavior, check each vendor’s documentation, such as OpenAI’s GPTBot page and Anthropic’s Claude crawler reference.
Validation and Common Mistakes
A broken robots.txt often fails silently. The file is still served, bots still read it, but a single typo can change the meaning entirely. A few tools and habits reduce the risk.
Test your file before trusting it.
- Google Search Console used to offer a dedicated robots.txt Tester. The equivalent today is URL Inspection, which shows how Googlebot parses and applies the rules to a specific URL on your site.
- Public validators like the one at technicalseo.com/tools/robots-txt parse the file and highlight syntax errors.
- For quick checks, curl https://yourdomain.com/robots.txt is enough to confirm the file is served and returns HTTP 200.
Common mistakes to scan for:
- Accidentally blocking the whole site. A single Disallow: / under User-agent: * removes every page from every honest bot. This usually happens when a staging file ships to production.
- Blocking CSS and JavaScript. Modern crawlers, including Googlebot, render pages. If you block /static/, /assets/, or /js/, the renderer sees a broken page and rankings can suffer.
- Case sensitivity errors in paths. Disallow: /Admin/ does not block /admin/. Match the case your URLs actually use.
- Missing trailing slash. Disallow: /private blocks /private, /private/page, and also /private-stuff. Disallow: /private/ is more surgical and blocks only paths under the /private/ directory.
- Wildcard in the wrong place. Disallow: /*.pdf blocks any URL containing .pdf, which is almost never what you meant. Disallow: /*.pdf$ blocks URLs that end in .pdf, which usually is.
- Syntax that looks right but is not. An extra space before the colon, a smart quote character copied from a document, a line ending in the wrong format on Windows: any of these can cause parsers to skip an entire group. Always author robots.txt in a plain text editor.
- Forgetting the Sitemap directive. Omitting this line is not an error, but it is a missed opportunity. Bots find sitemaps elsewhere, but listing it in robots.txt is the fastest path.
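The invisible problems on that list (smart quotes, stray whitespace before a colon, a missing Sitemap line) are easy to lint for. Here is a rough sketch that assumes a local copy of the file named robots.txt sits next to the script; adjust the path and the checks to your own conventions.

```python
from pathlib import Path

text = Path("robots.txt").read_text(encoding="utf-8")  # assumed local copy
problems = []

for lineno, line in enumerate(text.splitlines(), start=1):
    if not line.isascii():
        problems.append(f"line {lineno}: non-ASCII character (smart quote or similar)")
    directive = line.split("#", 1)[0]          # ignore comments
    if ":" in directive:
        field = directive.split(":", 1)[0]
        if field != field.rstrip():
            problems.append(f"line {lineno}: whitespace before the colon")

if "sitemap:" not in text.lower():
    problems.append("no Sitemap directive found")

print("\n".join(problems) if problems else "no obvious problems found")
```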
A good habit is to re-run your full site audit after any change to robots.txt and confirm that disallowing a section did not cascade into unintended side effects.
Conclusion
robots.txt is the smallest file on your site with the biggest potential to change how search engines and AI models see you. It is one line away from blocking a critical section, one directive away from welcoming every training crawler, one typo away from silently undoing your SEO. The habit that protects you is simple: edit carefully, validate with a tool, and audit the effect on the full site after every change. If you want to automate the audit and see exactly which of the 14 major AI bots you have allowed or blocked, download Seodisias and run a crawl on your own machine. No sign-up, no upload, all data stays with you.