robots.txt: The Complete Guide to Controlling Search Engine Crawlers

The robots.txt file is one of the oldest and most important tools in a webmaster's SEO toolkit. It sits at the root of your domain and tells search engine crawlers which parts of your site they should or should not access. Despite its simplicity, misconfiguring robots.txt can accidentally hide your site from Google, waste crawl budget on unimportant pages, or expose private content to unwanted bots. This guide covers everything you need to know to write effective robots.txt rules.

What is robots.txt?

robots.txt is a plain text file placed at your domain root (e.g. https://example.com/robots.txt). It follows the Robots Exclusion Protocol (REP), a standard that web crawlers voluntarily obey. When a crawler visits your site, it checks robots.txt first to see which paths are allowed or disallowed before crawling any pages.

It is important to understand that robots.txt is a guideline, not a security mechanism. Well-behaved bots like Googlebot and Bingbot respect it, but malicious scrapers may ignore it entirely. Never use robots.txt to hide sensitive data. Use authentication and access controls instead.

Basic Syntax

A robots.txt file consists of one or more rule groups. Each group starts with a User-agent line and contains Disallow and/or Allow directives:

User-agent: *
Disallow: /admin/
Disallow: /api/
Allow: /api/public/

Sitemap: https://example.com/sitemap.xml

  • User-agent: specifies which crawler the rules apply to. Use * for all bots, or a specific name like Googlebot.
  • Disallow: blocks crawling of the specified path prefix. Disallow: / blocks everything.
  • Allow: overrides a broader Disallow rule for a specific path. Useful for whitelisting subdirectories.
  • Sitemap: points crawlers to your XML sitemap for better content discovery.
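Rules like these can be checked programmatically with Python's standard-library urllib.robotparser. A minimal sketch follows; note that this parser applies rules in file order (first match wins) rather than Google's longest-match precedence, so the Allow line is listed before the broader Disallow here:

```python
import urllib.robotparser

# The rule group from above. The Allow line comes first because
# urllib.robotparser checks rules in file order (first match wins),
# unlike Google, which prefers the longest matching rule.
rules = """\
User-agent: *
Allow: /api/public/
Disallow: /admin/
Disallow: /api/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("*", "https://example.com/admin/users"))      # False
print(rp.can_fetch("*", "https://example.com/api/public/data"))  # True
print(rp.can_fetch("*", "https://example.com/blog/post"))        # True
```

In production you would call `rp.set_url(".../robots.txt")` and `rp.read()` instead of `parse()`, but parsing a literal string makes the rule behavior easy to unit-test.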

Common robots.txt Examples

Allow All Crawling

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

An empty Disallow means nothing is blocked. This is the most permissive configuration.

Block Everything

User-agent: *
Disallow: /

This blocks all compliant crawlers from every page. It is commonly used for staging environments that should not appear in search results, but note that a blocked URL can still show up in results (without a snippet) if other sites link to it. For a real guarantee, put staging behind HTTP authentication.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /private/
Disallow: /tmp/
Allow: /

Sitemap: https://example.com/sitemap.xml

Blocking AI Training Bots

With the rise of large language models, many site owners want to prevent AI companies from using their content for training data. Several AI crawlers have announced that they respect robots.txt. Here is how to block the most common ones:

# Block OpenAI
User-agent: GPTBot
Disallow: /

# Block OpenAI browsing
User-agent: ChatGPT-User
Disallow: /

# Block Common Crawl (used by many AI models)
User-agent: CCBot
Disallow: /

# Block Google AI training
User-agent: Google-Extended
Disallow: /

Note that blocking Google-Extended only prevents Google from using your content for AI training. It does not affect regular Google Search indexing, which uses the Googlebot user agent.
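Whether a particular bot is actually blocked by a set of rules can be verified with Python's urllib.robotparser. A small sketch, using a hypothetical file that blocks GPTBot but leaves the site open to everyone else:

```python
import urllib.robotparser

# Hypothetical robots.txt: block GPTBot entirely, allow all other bots.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# The user-agent string is matched against the group names above.
print(rp.can_fetch("GPTBot", "https://example.com/article"))     # False
print(rp.can_fetch("Googlebot", "https://example.com/article"))  # True
```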

Crawl-Delay Directive

The Crawl-delay directive tells crawlers to wait a specified number of seconds between requests. This helps reduce server load from aggressive bots:

User-agent: *
Crawl-delay: 10

Google does not officially support Crawl-delay (use Google Search Console instead), but Bing, Yandex and many other crawlers respect it.
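If you are writing a polite crawler of your own, Python's urllib.robotparser exposes the parsed value through its crawl_delay() method (available since Python 3.6). A minimal sketch:

```python
import urllib.robotparser

# Hypothetical robots.txt asking all bots to wait 10 seconds between requests.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# crawl_delay() returns the delay for the matching group, or None if unset.
delay = rp.crawl_delay("*")
print(delay)  # 10
```

A crawler would typically `time.sleep(delay or 0)` between fetches to honor the directive.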

Pattern Matching

robots.txt supports two wildcard characters for pattern matching:

  • * (asterisk) matches any sequence of characters. Disallow: /*.json blocks any URL whose path contains .json (not just URLs ending in it; add $ to anchor the match at the end).
  • $ (dollar sign) matches the end of the URL. Disallow: /*.pdf$ blocks URLs that end exactly with .pdf but not /pdf-guide/.

Disallow: /search   → blocks /search, /search?q=test, /search/results
Disallow: /search$  → blocks only /search exactly
Disallow: /*.xml$   → blocks all URLs ending in .xml
Disallow: /*/admin/ → blocks /en/admin/, /fr/admin/, etc.
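Python's standard-library robots.txt parser does not implement these wildcards, so a custom matcher is needed if you want Google-style pattern behavior. A sketch that translates a robots.txt path pattern into a regular expression (`robots_pattern_to_regex` is a name invented here):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern ('*' and '$') into a regex.

    Matching is prefix-based, as in robots.txt: a pattern without a
    trailing '$' matches any path that begins with (or contains, via
    '*') the given sequence.
    """
    # Escape regex metacharacters, then restore the two robots wildcards.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[: -len(r"\$")] + "$"
    return re.compile(regex)

pdf_rule = robots_pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/files/report.pdf")))  # True
print(bool(pdf_rule.match("/pdf-guide/")))        # False

search_rule = robots_pattern_to_regex("/search$")
print(bool(search_rule.match("/search")))          # True
print(bool(search_rule.match("/search/results")))  # False
```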

Common Mistakes to Avoid

  • Blocking CSS and JS files. Google needs to render your pages. If you block stylesheets or scripts, Google cannot properly index your content and your rankings may suffer.
  • Using robots.txt for security. robots.txt is publicly visible and does not prevent access. Use authentication, firewalls and access control lists to protect sensitive content.
  • Forgetting the trailing slash. Disallow: /admin blocks /admin, /admin/, and /administrator. If you only want to block the /admin/ directory, use Disallow: /admin/.
  • Not including a Sitemap. Always add a Sitemap directive. It helps crawlers discover all your pages, especially ones that might not be linked from your navigation.
  • Blocking your entire site accidentally. A single Disallow: / under User-agent: * stops all compliant crawlers from fetching any page, and your site will drop out of search results over time. Always double-check before deploying.
  • Not testing after changes. Use the robots.txt report in Google Search Console (the successor to the retired robots.txt tester) to verify your rules work as expected before pushing to production.

Where to Place robots.txt

robots.txt must be placed at the root of your domain. It will not work in subdirectories. For different subdomains, you need separate robots.txt files:

example.com/robots.txt      → controls crawling for example.com
blog.example.com/robots.txt → controls crawling for blog.example.com
example.com/blog/robots.txt → this will NOT work
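Because the file always lives at the root of the scheme-plus-host, the governing robots.txt URL for any page can be derived mechanically. A small sketch using the standard library (`robots_url` is a hypothetical helper name):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs crawling of page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; discard path, query, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/2024/hello"))
# https://blog.example.com/robots.txt
```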

Generate your robots.txt in seconds

Use our visual editor to configure user agents, allow/disallow rules, crawl delays and sitemap URL. Includes presets for blocking AI bots and common configurations.

Open robots.txt Generator