robots.txt: The Complete Guide to Controlling Search Engine Crawlers
The robots.txt file is one of the oldest and most important tools in a webmaster's SEO toolkit. It sits at the root of your domain and tells search engine crawlers which parts of your site they should or should not access. Despite its simplicity, misconfiguring robots.txt can accidentally hide your site from Google, waste crawl budget on unimportant pages, or expose private content to unwanted bots. This guide covers everything you need to know to write effective robots.txt rules.
What is robots.txt?
robots.txt is a plain text file placed at your domain root (e.g. https://example.com/robots.txt). It follows the Robots Exclusion Protocol (REP), a standard that web crawlers voluntarily obey. When a crawler visits your site, it checks robots.txt first to see which paths are allowed or disallowed before crawling any pages.
It is important to understand that robots.txt is a guideline, not a security mechanism. Well-behaved bots like Googlebot and Bingbot respect it, but malicious scrapers may ignore it entirely. Never use robots.txt to hide sensitive data. Use authentication and access controls instead.
Basic Syntax
A robots.txt file consists of one or more rule groups. Each group starts with a User-agent line and contains Disallow and/or Allow directives:
User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/
Sitemap: https://example.com/sitemap.xml
- User-agent: specifies which crawler the rules apply to. Use * for all bots, or a specific name like Googlebot.
- Disallow: blocks crawling of the specified path prefix. Disallow: / blocks everything.
- Allow: overrides a broader Disallow rule for a specific path. Useful for whitelisting subdirectories.
- Sitemap: points crawlers to your XML sitemap for better content discovery.
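When Allow and Disallow rules overlap, Google documents that the longest (most specific) matching rule wins, with Allow winning ties. The helper below is a minimal sketch of that resolution logic, prefix-only and ignoring wildcards; the function name and rule list are illustrative, not a real API:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Resolve Allow/Disallow by longest matching prefix.

    Allow wins a tie, and a path no rule matches is allowed by default.
    Sketch only: real matching also handles * and $ wildcards.
    """
    best_len, best_allow = -1, True
    for directive, pattern in rules:
        if pattern and path.startswith(pattern):
            allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow

# The rule group from the example above:
rules = [("Disallow", "/admin/"), ("Disallow", "/api/"), ("Allow", "/api/public/")]
print(is_allowed("/api/public/data.json", rules))  # True: Allow is the longest match
print(is_allowed("/api/internal", rules))          # False: Disallow /api/ applies
print(is_allowed("/blog/post", rules))             # True: no rule matches
```

This is why Allow: /api/public/ can carve an exception out of Disallow: /api/ — its path is longer, so it takes precedence for URLs under it.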
Common robots.txt Examples
Allow All Crawling
User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml
An empty Disallow means nothing is blocked. This is the most permissive configuration.
Block Everything
User-agent: *
Disallow: /
This blocks all crawlers from all pages. Useful for staging environments that should not appear in search results.
Block Specific Directories
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /tmp/
Allow: /
Sitemap: https://example.com/sitemap.xml
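You can sanity-check a rule set like this before deploying it, using Python's built-in urllib.robotparser (the example URLs are illustrative):

```python
from urllib.robotparser import RobotFileParser

# The directory-blocking rules from above, fed straight to the stdlib parser.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /tmp/
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(rp.can_fetch("Googlebot", "https://example.com/admin/users")) # False
```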
Blocking AI Training Bots
With the rise of large language models, many site owners want to prevent AI companies from using their content as training data. Several AI companies have stated that their crawlers respect robots.txt. Here is how to block the most common ones:
# Block OpenAI
User-agent: GPTBot
Disallow: /
# Block OpenAI browsing
User-agent: ChatGPT-User
Disallow: /
# Block Common Crawl (used by many AI models)
User-agent: CCBot
Disallow: /
# Block Google AI training
User-agent: Google-Extended
Disallow: /
Note that blocking Google-Extended only prevents Google from using your content for AI training. It does not affect regular Google Search indexing, which uses the Googlebot user agent.
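Because each bot gets its own rule group, regular search crawlers are unaffected. A quick check with Python's urllib.robotparser illustrates this for two of the groups above:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False: its group blocks it
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True: no group matches it
```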
Crawl-Delay Directive
The Crawl-delay directive tells crawlers to wait a specified number of seconds between requests. This helps reduce server load from aggressive bots:
User-agent: *
Crawl-delay: 10
Google does not officially support Crawl-delay (use Google Search Console instead), but Bing, Yandex and many other crawlers respect it.
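Python's urllib.robotparser (3.6+) also exposes this value via crawl_delay, so a polite crawler can throttle itself to match. A minimal sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Crawl-delay: 10"])

delay = rp.crawl_delay("MyCrawler")  # returns None if no Crawl-delay applies
print(delay)  # 10

# A polite crawler would then sleep between requests:
# time.sleep(delay or 0)
```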
Pattern Matching
robots.txt supports two wildcard characters for pattern matching:
- * (asterisk) matches any sequence of characters. Disallow: /*.json blocks all URLs ending in .json.
- $ (dollar sign) matches the end of the URL. Disallow: /*.pdf$ blocks URLs that end exactly with .pdf but not /pdf-guide/.
Disallow: /search → blocks /search, /search?q=test, /search/results
Disallow: /search$ → blocks only /search exactly
Disallow: /*.xml$ → blocks all .xml files
Disallow: /*/admin/ → blocks /en/admin/, /fr/admin/, etc.
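Note that Python's urllib.robotparser does not implement these wildcards (it follows the original REP and treats patterns as literal prefixes). A sketch of how Google-style wildcard matching works, translating a pattern into a regular expression, is below; the helper name is illustrative:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style robots.txt pattern match (sketch).

    '*' matches any character sequence; a trailing '$' anchors the end.
    Everything else is matched literally as a prefix.
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(rule_matches("/search", "/search?q=test"))   # True: prefix match
print(rule_matches("/search$", "/search?q=test"))  # False: anchored to the end
print(rule_matches("/search$", "/search"))         # True: exact match
print(rule_matches("/*.xml$", "/sitemap.xml"))     # True
print(rule_matches("/*/admin/", "/en/admin/x"))    # True
```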
Common Mistakes to Avoid
- Blocking CSS and JS files. Google needs to render your pages. If you block stylesheets or scripts, Google cannot properly index your content and your rankings may suffer.
- Using robots.txt for security. robots.txt is publicly visible and does not prevent access. Use authentication, firewalls and access control lists to protect sensitive content.
- Forgetting the trailing slash. Disallow: /admin blocks /admin, /admin/, and /administrator. If you only want to block the /admin/ directory, use Disallow: /admin/.
- Not including a Sitemap. Always add a Sitemap directive. It helps crawlers discover all your pages, especially ones that might not be linked from your navigation.
- Blocking your entire site accidentally. A single Disallow: / under User-agent: * blocks crawling of your entire site, and your pages will eventually drop from search results. Always double-check before deploying.
- Not testing after changes. Use Google Search Console's robots.txt report, or another testing tool, to verify your rules work as expected before pushing to production.
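The trailing-slash behavior is easy to verify with Python's urllib.robotparser; the blocked helper here is hypothetical, written just for the demonstration:

```python
from urllib.robotparser import RobotFileParser

def blocked(disallow: str, path: str) -> bool:
    """Return True if a single Disallow rule blocks the given path."""
    rp = RobotFileParser()
    rp.parse(["User-agent: *", f"Disallow: {disallow}"])
    return not rp.can_fetch("Googlebot", f"https://example.com{path}")

print(blocked("/admin", "/administrator"))   # True:  prefix match catches it
print(blocked("/admin/", "/administrator")) # False: trailing slash limits the scope
print(blocked("/admin/", "/admin/users"))   # True:  inside the blocked directory
```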
Where to Place robots.txt
robots.txt must be placed at the root of your domain. It will not work in subdirectories. For different subdomains, you need separate robots.txt files:
example.com/robots.txt → controls crawling for example.com
blog.example.com/robots.txt → controls crawling for blog.example.com
example.com/blog/robots.txt → this will NOT work