The Complete Guide to robots.txt: Rules, Examples, and AI Crawlers

When a search engine or an AI model visits your website, the first file it looks for is robots.txt. This small text file at your domain root tells crawlers which parts of your site they can fetch and which parts to skip. The file has been around since 1994, but the arrival of AI crawlers like GPTBot, ClaudeBot, and PerplexityBot has given it new weight. This guide covers the syntax, the patterns that solve real problems, how to handle AI crawlers, and the mistakes that quietly break the file.
What robots.txt Is and What It Cannot Do
A robots.txt file is a plain text document served at https://yourdomain.com/robots.txt. When a crawler visits a site for the first time in a session, it fetches this file before touching anything else. The file lists rules like “skip this directory” or “do not visit this path”. Crawlers that follow the Robots Exclusion Protocol read the file and adjust their behavior.
The key word in that sentence is follow. robots.txt is a politeness mechanism, not a security mechanism. Well-behaved crawlers from Google, Bing, OpenAI, and Anthropic read the file and respect the rules. Malicious scrapers ignore it completely. If you have private data at a URL, putting that URL in a Disallow rule does not hide it. The URL is still public. Anyone who knows the path can open it in a browser.
A second common misconception is that robots.txt prevents indexing. It does not. It prevents crawling. If Googlebot cannot crawl a page (because it is disallowed) but finds the URL through external links, Google may still list the URL in search results, often with the text “No information is available for this page”. To prevent indexing, you need a noindex directive in a robots meta tag or an X-Robots-Tag HTTP header, which the bot can only see if it is allowed to crawl the page.
A third point worth clarifying: robots.txt does not replace XML sitemaps. It can reference them, but the two files do different jobs. robots.txt is about exclusion; a sitemap is about discovery.
So why use robots.txt at all? Because for good actors, which account for the vast majority of meaningful crawler traffic to your site, it offers clean, explicit control. It saves crawl budget on large sites by keeping bots away from infinite URL spaces like faceted navigation or search result pages. It tells crawlers where to find your sitemap. And as of 2024, it has become the main tool site owners use to signal consent or refusal to AI crawlers.
The Syntax in One Page

The syntax has four directives that cover nearly every real use case: User-agent, Disallow, Allow, and Sitemap.
A minimal valid file looks like this:
```
User-agent: *
Disallow: /admin/
Allow: /admin/public/

User-agent: Googlebot
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
```

The file is organized in groups. A group starts with one or more User-agent lines and ends at the next group or at the end of the file. The wildcard * matches any crawler not named elsewhere. Named user agents create their own group.
Disallow is a path prefix match. Disallow: /admin/ blocks any URL whose path starts with /admin/. Disallow: / blocks the entire site for that user agent. An empty Disallow: is equivalent to allowing everything.
Allow is the exception to a Disallow rule. It lets you carve out a subdirectory inside a disallowed parent. When Disallow and Allow conflict, most modern crawlers apply the most specific (longest) match. Parsers that follow the original 1994 convention used the first matching rule instead, so keep the more specific rule on top if you also need to support older parsers.
Two wildcards are widely supported: * matches any sequence of characters in a path, and $ anchors the end of a URL. For example, Disallow: /*.pdf$ blocks any URL ending in .pdf.
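To make the precedence and wildcard behavior concrete, here is a minimal Python sketch of longest-match evaluation, with ties going to Allow. The helper names (rule_to_regex, is_allowed) are made up for illustration; real crawlers use their own parsers, so treat this as a model of the rules rather than a reference implementation.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    * matches any character sequence, a trailing $ anchors the end of the URL."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile(regex + ("$" if anchored else ""))

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching pattern wins; ties go to allow; no match means allowed."""
    best_len, allowed = -1, True
    for kind, pattern in rules:
        if pattern and rule_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best_len or (length == best_len and kind == "allow"):
                best_len, allowed = length, kind == "allow"
    return allowed

rules = [("disallow", "/admin/"), ("allow", "/admin/public/"), ("disallow", "/*.pdf$")]
print(is_allowed(rules, "/admin/settings"))          # False: /admin/ is the longest match
print(is_allowed(rules, "/admin/public/help.html"))  # True: the longer Allow rule wins
print(is_allowed(rules, "/files/report.pdf"))        # False: matches /*.pdf$
print(is_allowed(rules, "/files/report.pdf?x=1"))    # True: the $ anchor no longer matches
```

The last two lines show why the $ anchor matters: the same PDF path with a query string appended escapes the anchored rule.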
A few rules most people forget:
- User-agent matching is case insensitive. Googlebot and googlebot refer to the same bot.
- Path matching is case sensitive. /Admin/ and /admin/ are different paths.
- Sitemap directives live outside any group. Multiple Sitemap lines are allowed, one per line, using absolute URLs.
- Comments start with # and run to the end of the line.
If you want the full specification, Google Search Central’s introduction to robots.txt is the most accessible reference, and the IETF RFC 9309 specification is the official protocol document.
Common Patterns That Actually Solve Problems
Five patterns cover perhaps 90 percent of real use cases.
Block internal tools from all bots. Admin panels, dashboards, and internal APIs should not be crawled. These URLs often return HTML that is noindex by design, but keeping them out of the crawl budget is cleaner.
```
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
```

Block on-site search result pages. Search queries generate nearly infinite URL variants. Letting crawlers follow them wastes crawl budget and produces low-quality pages in search results.
```
User-agent: *
Disallow: /search
Disallow: /*?q=
```

Control faceted navigation. E-commerce sites with color, size, and brand filters can explode into millions of parameter combinations. Block the parameter patterns that have no SEO value. Learn more about how crawl budget works and when to restrict parameter URLs.
```
User-agent: *
Disallow: /*?color=
Disallow: /*?sort=
Disallow: /*?view=
```

Handle staging subdomains carefully. A live robots.txt that blocks everything is a common way to isolate staging. The risk is that the same file accidentally ships to production. A safer approach is HTTP basic auth or IP allowlists on the staging server. If you must use robots.txt:
```
# staging.example.com/robots.txt
User-agent: *
Disallow: /
```

Just remember to replace this file before the site goes live. Many teams have lost weeks of traffic to this single line.
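If you want to catch that mistake automatically, a small pre-deploy check can fail the build when the blanket rule ships. The snippet below is a rough sketch, not part of any deployment tool; the URL is a placeholder for your own production domain.

```python
import sys
import urllib.request

PROD_ROBOTS = "https://example.com/robots.txt"  # placeholder: your production URL

def blocks_everything(robots_text: str) -> bool:
    """True if a 'User-agent: *' group contains a bare 'Disallow: /' rule."""
    group_agents, collecting = [], True
    for raw in robots_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not collecting:                # a rule line ended the previous group
                group_agents = []
                collecting = True
            group_agents.append(value)
        else:
            collecting = False
            if field == "disallow" and value == "/" and "*" in group_agents:
                return True
    return False

with urllib.request.urlopen(PROD_ROBOTS, timeout=10) as resp:
    text = resp.read().decode("utf-8", errors="replace")
if blocks_everything(text):
    sys.exit("robots.txt blocks the whole site for all user agents")
print("robots.txt looks safe")
```

Run as the last step of a deploy pipeline; a non-zero exit stops the release before the staging file can do any damage.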
Point crawlers to your sitemap. One line at the end of the file saves a round trip for every good bot.
```
Sitemap: https://example.com/sitemap.xml
```

If you maintain multiple sitemaps, list each one. The full discovery chain, from robots.txt to the sitemap index to individual sitemap files, is explored in more depth in the complete guide to SEO crawlers.
AI Crawlers: GPTBot, ClaudeBot, and the Block vs Allow Decision

As of 2026, a new set of crawlers is visiting your site. They are not indexing for a traditional search engine. They are gathering content for large language models that answer questions directly. Whether you want this to happen is an editorial decision, and robots.txt is where you express it.
The main AI bots to know:
| Bot | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for ChatGPT |
| OAI-SearchBot | OpenAI | Search results in ChatGPT |
| ChatGPT-User | OpenAI | User-triggered page fetches inside ChatGPT |
| ClaudeBot | Anthropic | Training and search for Claude |
| Claude-Web | Anthropic | User-triggered fetches inside Claude |
| PerplexityBot | Perplexity | Search index for Perplexity answers |
| Perplexity-User | Perplexity | User-triggered fetches |
| Google-Extended | Google | Training for Gemini (separate from Googlebot) |
| CCBot | Common Crawl | Open web archive used by many AI models |
| Amazonbot | Amazon | Training for Amazon AI products |
| Bytespider | ByteDance | Training for ByteDance models |
| Applebot-Extended | Apple | Training for Apple Intelligence |
The block versus allow decision comes down to what you are optimizing for.
Block all AI training bots if your content has commercial or licensing value, or if you simply prefer not to contribute to model training. A common pattern:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```

Allow AI search but block training. Some bots are used for real-time answer generation, not for building training sets. If you want to appear in ChatGPT Search or Perplexity answers while staying out of training data, allow the search-focused bots and block the training ones:
```
# Allow AI search
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Perplexity-User
Allow: /

# Block AI training
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Allow everything if your content strategy depends on maximum visibility across both traditional search and AI answers. This is increasingly the choice for publishers, SaaS marketing sites, and documentation. The reasoning is straightforward: AI search is a growing channel, and being cited in AI answers sends qualified traffic. The tradeoff is that your content becomes part of the knowledge these models surface with or without attribution.
Seodisias checks your site against 14 known AI bots as part of its AI Ready analysis, so you can see which ones you have allowed, which ones you have blocked, and where your robots.txt is silent (which implicitly allows). The broader picture of optimizing for AI search is covered in the Generative Engine Optimization playbook.
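If you want to script a similar check yourself, Python's standard library ships a robots.txt parser. The sketch below asks, for each bot in the table above, whether the site root may be fetched. The domain is a placeholder, and note that urllib.robotparser does not implement the wildcard extensions, so treat the output as a rough check rather than a Google-accurate verdict.

```python
from urllib import robotparser

# Bot names mirror the table above; trim or extend the list as needed.
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "Claude-Web",
    "PerplexityBot", "Perplexity-User", "Google-Extended", "CCBot",
    "Amazonbot", "Bytespider", "Applebot-Extended",
]

SITE = "https://example.com"  # placeholder: your own domain

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetch and parse the live file

for bot in AI_BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "blocked"
    print(f"{bot:<20} {verdict}")
```

A bot that appears as "allowed" here may simply be unmentioned in your file; silence is implicit permission, which is exactly the gap this kind of check makes visible.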
One nuance worth noting: some of these bots identify themselves in request headers but do not always use a User-agent token you can target in robots.txt. robots.txt only works for bots that announce themselves and honor the protocol. A bot that wants to scrape will scrape. The goal here is to manage the well-behaved majority, not to build a fortress. For the authoritative bot names and behavior, check each vendor’s documentation, such as OpenAI’s GPTBot page and Anthropic’s Claude crawler reference.
Validation and Common Mistakes
A broken robots.txt often fails silently. The file is still served, bots still read it, but a single typo can change the meaning entirely. A few tools and habits reduce the risk.
Test your file before trusting it.
- Google Search Console used to offer a dedicated robots.txt Tester. The equivalent today is URL Inspection, which shows how Googlebot parses and applies the rules to a specific URL on your site.
- Public validators like the one at technicalseo.com/tools/robots-txt parse the file and highlight syntax errors.
- For quick checks, curl https://yourdomain.com/robots.txt is enough to confirm the file is served and returns HTTP 200.
Common mistakes to scan for:
- Accidentally blocking the whole site. A single Disallow: / under User-agent: * removes every page from every honest bot. This usually happens when a staging file ships to production.
- Blocking CSS and JavaScript. Modern crawlers, including Googlebot, render pages. If you block /static/, /assets/, or /js/, the renderer sees a broken page and rankings can suffer.
- Case sensitivity errors in paths. Disallow: /Admin/ does not block /admin/. Match the case your URLs actually use.
- Missing trailing slash. Disallow: /private blocks /private, /private/page, and also /private-stuff. Disallow: /private/ is more surgical and blocks only paths under the /private/ directory.
- Wildcard in the wrong place. Disallow: /*.pdf blocks any URL containing .pdf, which is almost never what you meant. Disallow: /*.pdf$ blocks URLs that end in .pdf, which usually is.
- Syntax that looks right but is not. An extra space before the colon, a smart quote character copied from a document, a line ending in the wrong format on Windows: any of these can cause parsers to skip an entire group. Always author robots.txt in a plain text editor.
- Forgetting the Sitemap directive. Omitting this line is not an error, but it is a missed opportunity. Bots find sitemaps elsewhere, but listing it in robots.txt is the fastest path.
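The invisible problems on that list (smart quotes, stray whitespace before a colon, a missing Sitemap line) are easy to lint for. Here is a rough sketch that assumes a local copy of the file named robots.txt sits next to the script; adjust the path and the checks to your own conventions.

```python
from pathlib import Path

text = Path("robots.txt").read_text(encoding="utf-8")  # assumed local copy
problems = []

for lineno, line in enumerate(text.splitlines(), start=1):
    if not line.isascii():
        problems.append(f"line {lineno}: non-ASCII character (smart quote or similar)")
    directive = line.split("#", 1)[0]          # ignore comments
    if ":" in directive:
        field = directive.split(":", 1)[0]
        if field != field.rstrip():
            problems.append(f"line {lineno}: whitespace before the colon")

if "sitemap:" not in text.lower():
    problems.append("no Sitemap directive found")

print("\n".join(problems) if problems else "no obvious problems found")
```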
A good habit is to re-run your full site audit after any change to robots.txt and confirm that disallowing a section did not cascade into unintended side effects.
Conclusion
robots.txt is the smallest file on your site with the biggest potential to change how search engines and AI models see you. It is one line away from blocking a critical section, one directive away from welcoming every training crawler, one typo away from silently undoing your SEO. The habit that protects you is simple: edit carefully, validate with a tool, and audit the effect on the full site after every change. If you want to automate the audit and see exactly which of the 14 major AI bots you have allowed or blocked, download Seodisias and run a crawl on your own machine. No sign-up, no upload, all data stays with you.