Robots.txt Generator
Build a robots.txt with a live URL tester and an RFC 9309 (Robots Exclusion Protocol) compliance linter. WordPress, e-commerce & blog presets included.
About Robots.txt Generator
A robots.txt generator that helps you create, validate and test robots.txt files for your website. Control how search engine crawlers access your site with an easy-to-use interface. Useful for SEO and crawl management.
What is robots.txt?
Robots.txt is a plain-text file placed in your website's root directory that tells search engine crawlers which pages or sections of your site they should not crawl. It follows the Robots Exclusion Protocol (REP) and is one of the fundamental tools for managing your site's relationship with search engines.
Key purposes:
• Control crawler access to prevent server overload
• Keep crawlers away from duplicate or low-value pages
• Manage crawl budget on large sites
• Keep well-behaved crawlers out of private or staging areas
• Stop crawlers from fetching internal search results or filtered pages
Note: robots.txt is NOT a security mechanism - it only provides guidance to well-behaved bots. Use proper authentication for truly private content.
How does robots.txt work?
When a search engine bot visits your website, it first checks for robots.txt at:
https://yoursite.com/robots.txt
The file contains directives that specify:
• User-agent: Which bots the rules apply to (* means all bots)
• Disallow: Paths that should NOT be crawled
• Allow: Paths that CAN be crawled (overrides a less specific Disallow)
• Sitemap: Location of your XML sitemap
• Crawl-delay: Delay between requests in seconds (non-standard: not part of RFC 9309 and ignored by Googlebot)
Example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public.html
Sitemap: https://yoursite.com/sitemap.xml
Most reputable search engines respect these directives, but malicious bots may ignore them.
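To see these directives from a crawler's point of view, here is a minimal Python sketch that feeds the example above to the standard library's `urllib.robotparser`. The bot name `ExampleBot` is made up for illustration. One caveat: the standard-library parser applies the first rule that matches rather than the most specific one, so the sketch only checks paths where a single rule applies.

```python
from urllib.robotparser import RobotFileParser

# The example file from above, as a string.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /private/public.html
Sitemap: https://yoursite.com/sitemap.xml
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" group.
admin_ok = rules.can_fetch("ExampleBot", "https://yoursite.com/admin/settings")
blog_ok = rules.can_fetch("ExampleBot", "https://yoursite.com/blog/post")
sitemaps = rules.site_maps()  # available in Python 3.8+

print(admin_ok, blog_ok, sitemaps)
```

A crawler would run a check like `can_fetch` before every request and skip any URL for which it returns False.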
What should I block in robots.txt?
Common items to block:
**Administrative Areas:**
• /admin/, /administrator/, /wp-admin/
• /login/, /signin/, /account/
• Control panels and backend systems
**Technical Folders:**
• /cgi-bin/, /tmp/, /temp/
• /includes/, /scripts/
• Development and staging areas
**Duplicate Content:**
• Search result pages (/search/, /?s=)
• Filtered or sorted product pages
• Printer-friendly versions
• Session ID URLs
**Private Data (also protect with authentication, since robots.txt is public):**
• /private/, /confidential/
• Customer data directories
• Internal documents
**Resource Files (Sometimes; first check that no render-critical CSS/JS lives there):**
• /wp-content/plugins/ (WordPress)
• /wp-includes/ (WordPress core)
DO NOT BLOCK:
• CSS and JavaScript files needed for rendering
• Important content pages
• Your sitemap
• Product/category pages
What are User-agents?
User-agents identify specific bots or crawlers. Common ones:
**Search Engines:**
• Googlebot - Google's web crawler
• Bingbot - Microsoft Bing crawler
• Slurp - Yahoo crawler
• DuckDuckBot - DuckDuckGo crawler
• Baiduspider - Baidu (Chinese search engine)
• YandexBot - Yandex (Russian search engine)
**Social Media:**
• facebookexternalhit - Facebook crawler
• Twitterbot - Twitter crawler
• LinkedInBot - LinkedIn crawler
**SEO Tools:**
• AhrefsBot - Ahrefs SEO tool
• SemrushBot - SEMrush SEO tool
• MJ12bot - Majestic SEO
**Others:**
• * - Wildcard for all bots
You can set different rules for different user-agents:
User-agent: Googlebot
Disallow: /private/
User-agent: *
Disallow: /admin/
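One detail worth spelling out: under RFC 9309 a crawler follows only the group that matches its own name and falls back to the `*` group when no named group matches. Groups do not add up, so in the example above Googlebot is NOT kept out of /admin/. The sketch below illustrates that selection step; `select_rules` and the `GROUPS` layout are hypothetical names for illustration, and comparing whole bot names is a simplification of how real crawlers match their product token.

```python
def select_rules(groups: dict[str, list[str]], bot_name: str) -> list[str]:
    """Return the rule lines that apply to bot_name.

    A named group wins over the "*" group; the "*" group is used only
    when no named group matches (names compare case-insensitively).
    """
    wanted = bot_name.lower()
    for agent, rules in groups.items():
        if agent != "*" and agent.lower() == wanted:
            return rules
    return groups.get("*", [])


# The two groups from the example above.
GROUPS = {
    "Googlebot": ["Disallow: /private/"],
    "*": ["Disallow: /admin/"],
}

googlebot_rules = select_rules(GROUPS, "googlebot")
bingbot_rules = select_rules(GROUPS, "Bingbot")
print(googlebot_rules)  # only the Googlebot group
print(bingbot_rules)    # falls back to the "*" group
```

If a rule should apply to every bot, repeat it inside each named group as well as under `*`.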
What is the difference between Allow and Disallow?
**Disallow:**
• Tells bots NOT to crawl specified paths
• More commonly used
• Example: Disallow: /admin/ (blocks all admin pages)
**Allow:**
• Explicitly permits access to specified paths
• Used to override broader Disallow rules
• Creates exceptions to blocked sections
Example use case:
User-agent: *
Disallow: /private/
Allow: /private/blog/
This blocks the /private/ directory but allows /private/blog/ to be crawled.
**Important notes:**
• Allow takes precedence over Disallow when both match with the same path length
• More specific (longer) paths override more general ones
• An empty Disallow means allow everything
• Order does not matter - the most specific matching rule wins
Should I include my sitemap in robots.txt?
Yes, absolutely! Including your sitemap URL in robots.txt is a best practice:
Sitemap: https://yoursite.com/sitemap.xml
**Benefits:**
• Helps search engines discover all your pages
• Improves crawl efficiency
• Ensures new content is found quickly
• Works alongside sitemap submission in Search Console
• Can include multiple sitemaps if needed
**You can list multiple sitemaps:**
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/sitemap-images.xml
Sitemap: https://yoursite.com/sitemap-news.xml
The Sitemap line is advisory - search engines will still crawl your site without one, but including it can speed up discovery of new pages.
How do I test my robots.txt?
**Testing Methods:**
1. **Manual Testing:**
• Visit https://yoursite.com/robots.txt directly
• Verify it loads correctly
• Check for syntax errors
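2. **Google Search Console:**
• Open Settings > robots.txt report (it replaced the legacy robots.txt Tester)
• See which robots.txt files Google found, when they were last fetched, and any errors or warnings
• Use the URL Inspection tool to check whether a specific URL is blocked
• Request a recrawl of robots.txt after fixing errors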
3. **Bing Webmaster Tools:**
• Similar testing functionality
• Verify Bingbot access
4. **Online Validators:**
• Use third-party robots.txt validators
• Check syntax and logic
5. **This Tool:**
• Use the built-in URL tester
• Test specific paths against rules
• Verify bot-specific behavior
**Testing Best Practices:**
• Test critical pages first
• Verify both blocked and allowed paths
• Test with different user-agents
• Monitor crawl stats after deployment
• Regular audits (quarterly recommended)
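The "verify both blocked and allowed paths" practice can be automated as a small regression check that runs before each deployment. The sketch below is one possible shape, not how this tool's tester works: `audit` and the sample paths are hypothetical, and it relies on Python's `urllib.robotparser`, which uses first-match rather than longest-match, so keep the expectations to paths governed by a single rule.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_txt: str, expectations: dict[str, bool], bot: str = "*") -> list[str]:
    """Return the paths whose crawlability differs from what was expected.

    expectations maps a path to True (should be crawlable) or False
    (should be blocked). An empty result means every check passed.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path
        for path, should_allow in expectations.items()
        if parser.can_fetch(bot, path) != should_allow
    ]


RULES = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""

# Critical pages first, then the paths that must stay blocked.
EXPECTED = {
    "/": True,
    "/products/widget": True,
    "/admin/users": False,
    "/search/?q=shoes": False,
}

problems = audit(RULES, EXPECTED)
print(problems or "all checks passed")
```

Keeping the expectations next to the robots.txt source makes a change that accidentally blocks an important page show up as a failed check rather than a drop in traffic.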

What common robots.txt mistakes should I avoid?
**Critical Mistakes:**
1. **Blocking Important Resources:**
✗ Disallow: /css/
✗ Disallow: /js/
✓ These are needed for Google to render pages correctly
2. **Blocking Entire Site:**
✗ User-agent: *
✗ Disallow: /
✓ This blocks everything - only use temporarily
3. **Security Misconception:**
✗ Using robots.txt to hide sensitive data
✓ Robots.txt is PUBLIC - use authentication instead
4. **Syntax Errors:**
✗ Wrong case in paths (directive names are case-insensitive, but /Admin/ and /admin/ are different paths)
✗ Missing colons or slashes
✗ Spaces in the wrong places
5. **Wrong Location:**
✗ Placing robots.txt in subdirectories
✓ Must be in root: https://site.com/robots.txt
6. **Blocking Canonical Pages:**
✗ Blocking a page that has canonical tags pointing to it
7. **Conflicting Rules:**
✗ Having contradictory Allow/Disallow statements
8. **Not Updating:**
✗ Leaving old development blocks in production
**Prevention:**
• Always test before deployment
• Regular audits
• Document your rules
• Use this generator tool!
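Several of these mistakes can be caught mechanically. The following is a rough sketch of such a check, assuming a simple line-by-line scan; it is not the linter used by this tool, the name `lint` is hypothetical, and the CSS/JS test is a heuristic that only looks for path segments named "css" or "js". It also flags `Disallow: /` in any group, even one aimed at a single bot.

```python
MAX_BYTES = 500 * 1024  # Google's documented robots.txt size limit (500 KiB)


def lint(robots_txt: str) -> list[str]:
    """Return human-readable warnings for a few common robots.txt mistakes."""
    warnings = []
    if len(robots_txt.encode("utf-8")) > MAX_BYTES:
        warnings.append("file exceeds 500 KiB; content past the limit is ignored")

    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            warnings.append(f"line {number}: missing colon")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() != "disallow":
            continue
        segments = value.lower().strip("/").split("/")
        if value == "/":
            warnings.append(f"line {number}: blocks the whole site")
        elif "css" in segments or "js" in segments:
            warnings.append(f"line {number}: may block CSS/JS needed for rendering")
    return warnings


sample = "User-agent: *\nDisallow: /\nDisallow: /css/\nAllow /x\n"
for warning in lint(sample):
    print(warning)
```

A check like this is cheap enough to run on every change, which also helps catch leftover development blocks before they reach production.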
Does robots.txt affect SEO rankings?
Robots.txt itself doesn't directly affect rankings, but it impacts SEO in important ways:
**Positive SEO Effects:**
• **Crawl Budget Optimization** - Direct bots to important pages
• **Prevent Duplicate Content** - Block search results, filters, etc.
• **Improve Site Quality** - Stop crawlers from spending time on low-value pages
• **Better Resource Allocation** - Focus crawler on valuable content
**Negative SEO Effects (if misconfigured):**
• Blocking important pages = they won't rank
• Blocking CSS/JS = poor rendering in search results
• Blocking entire site = no visibility
• Blocking sitemap = slower indexing
**Important Notes:**
• Blocked pages can still appear in results (without descriptions)
• Use a meta robots noindex tag or the X-Robots-Tag header for true de-indexing
• Robots.txt affects what's crawled, not what's indexed
• Combine with other SEO tools for best results
**Best Practice:**
Use robots.txt as part of a comprehensive SEO strategy, not as a standalone solution.
robots.txt Disallow vs noindex and X-Robots-Tag: which truly de-indexes a page?
These solve different problems, and confusing them is a common SEO mistake, even among professionals.
**robots.txt Disallow** only controls CRAWLING. It tells a bot not to fetch a URL. It does NOT remove a page from the index. In fact, a Disallowed URL can still appear in Google results (often with no description, labelled "No information is available") if other sites link to it, because Google never crawled it to see a noindex.
**meta noindex** (`<meta name="robots" content="noindex">`) and the **X-Robots-Tag** HTTP response header DO control INDEXING. They tell Google to drop the page from the index.
The critical catch: for Google to SEE a noindex (in the HTML or in the header), the page must NOT be blocked in robots.txt. If you Disallow the URL, the crawler never fetches it and never reads the noindex, so the page can stay indexed.
**Rules of thumb:**
• Want a page gone from search results? Use noindex (meta tag or X-Robots-Tag) and DO NOT block it in robots.txt.
• Want to save crawl budget on worthless URLs (faceted filters, infinite calendars)? Use robots.txt Disallow.
• X-Robots-Tag is ideal for non-HTML files (PDFs, images) where you cannot add a meta tag: send `X-Robots-Tag: noindex` in the HTTP header.
• robots.txt is PUBLIC and advisory - never use it to hide sensitive data; use authentication.
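For the X-Robots-Tag case, the header has to be added by whatever serves the file. As an illustration only, here is a minimal WSGI sketch in Python that attaches `X-Robots-Tag: noindex` to PDF responses; `indexing_headers` and `app` are hypothetical names, the response body is a placeholder, and in practice the same header is usually set in the web server or CDN configuration.

```python
def indexing_headers(path: str) -> list[tuple[str, str]]:
    """Return extra response headers that control indexing for this path."""
    if path.lower().endswith(".pdf"):
        # PDFs cannot carry a meta tag, so the header does the job instead.
        return [("X-Robots-Tag", "noindex")]
    return []


def app(environ, start_response):
    """Tiny WSGI app: serves a placeholder body with the indexing headers."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/plain; charset=utf-8")]
    headers += indexing_headers(path)
    start_response("200 OK", headers)
    return [b"placeholder body\n"]


print(indexing_headers("/docs/report.pdf"))
print(indexing_headers("/index.html"))
```

To try it locally, the app can be served with the standard library: `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Remember that the PDF URLs must stay crawlable in robots.txt, otherwise the header is never seen.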
What is the robots.txt size limit and how does rule precedence work?
**Size limit:** Google enforces a maximum robots.txt size of 500 kibibytes (512,000 bytes). Content beyond that limit is ignored. Keep your file lean - this generator reports the byte size in the compliance check so you can confirm you are well under the cap.
**Rule precedence (RFC 9309 - longest match wins):** When multiple Allow and Disallow rules match the same URL, the bot does NOT simply use the first or last rule in the file. It selects the rule whose path pattern matches the LARGEST number of characters. Order in the file is irrelevant.
Example:
User-agent: *
Disallow: /folder/
Allow: /folder/public/
For the URL /folder/public/page.html, both rules match. "/folder/public/" (15 chars) is longer than "/folder/" (8 chars), so the Allow wins and the page is crawlable.
**Tie-breaker:** If an Allow and a Disallow have an equally specific (equal-length) match, the Allow wins - the least restrictive rule is applied. The URL tester in this tool implements exactly this longest-match logic, including '*' and '$' wildcards, so its verdict matches how Googlebot actually behaves.
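The longest-match rule and the tie-breaker fit in a few lines. The sketch below is a simplified illustration, not the tool's actual tester: `is_allowed` is a hypothetical name, rules are given as (kind, path) pairs for a single user-agent group, and only plain path prefixes are handled, with wildcards left out to keep it short.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Decide crawlability using longest-match precedence.

    rules holds ("allow" | "disallow", path_prefix) pairs from one group.
    The longest matching prefix wins; on equal length, Allow wins.
    A path that matches no rule is allowed.
    """
    best_length = -1
    best_allow = True
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty values and non-matching rules are skipped
        allow = kind.lower() == "allow"
        longer = len(prefix) > best_length
        tie_for_allow = len(prefix) == best_length and allow
        if longer or tie_for_allow:
            best_length, best_allow = len(prefix), allow
    return best_allow


RULES = [("disallow", "/folder/"), ("allow", "/folder/public/")]

print(is_allowed(RULES, "/folder/public/page.html"))  # Allow is the longer match
print(is_allowed(RULES, "/folder/secret.html"))       # only the Disallow matches
print(is_allowed(list(reversed(RULES)), "/folder/public/page.html"))  # order is irrelevant
```

Because the decision depends only on match length, shuffling the rules never changes the verdict, which is why file order is irrelevant.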
How do the '*' wildcard and '$' anchor work in robots.txt paths?
Google and other major crawlers support two pattern characters in path values (defined in RFC 9309):
**'*' (asterisk)** matches any sequence of characters, including none. Use it to match variable segments or query strings.
• `Disallow: /*?sort=` blocks any URL whose path and query string contain "?sort=", e.g. /products?sort=asc.
• `Disallow: /private*/` blocks /private/, /private-data/, etc.
**'$' (dollar sign)** anchors the match to the END of the URL (path plus query string). Use it to target a specific file type or exact URL.
• `Disallow: /*.pdf$` blocks every URL that ends in .pdf, but not /file.pdf?download=1 (which does not end in .pdf).
• `Allow: /$` matches ONLY the homepage; pair it with `Disallow: /` to allow the homepage and nothing else.
**Combined example:**
User-agent: *
Disallow: /*?
Allow: /*?id=$
This blocks all query-string URLs except those ending in ?id=. Because the presets in this tool emit wildcard rules like /*?sort= and /*?filter=, the built-in URL tester converts these patterns to real matchers - so testing https://example.com/products?sort=asc correctly reports Blocked.
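One way to turn such patterns into real matchers is to translate them into regular expressions. The Python sketch below shows the idea under simple assumptions: `pattern_to_regex` and `matches` are hypothetical names, the target is the URL's path plus query string, and percent-encoding normalization is ignored. It is an illustration, not this tool's implementation.

```python
import re


def pattern_to_regex(pattern: str) -> tuple[re.Pattern, bool]:
    """Translate a robots.txt path pattern into (compiled regex, anchored)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each "*" match any characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body), anchored


def matches(pattern: str, target: str) -> bool:
    """True if target (path plus query string) matches the pattern."""
    regex, anchored = pattern_to_regex(pattern)
    if anchored:
        return regex.fullmatch(target) is not None  # "$": must match to the end
    return regex.match(target) is not None          # otherwise a prefix match


print(matches("/*?sort=", "/products?sort=asc"))
print(matches("/*.pdf$", "/docs/file.pdf"))
print(matches("/*.pdf$", "/file.pdf?download=1"))
print(matches("/$", "/about"))
```

The four calls print True, True, False and False, in line with the bullet examples above. Combining this matcher with longest-match precedence gives a complete verdict for a URL.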
What is Crawl-delay and should I use it?
Crawl-delay specifies the number of seconds a bot should wait between requests:
User-agent: *
Crawl-delay: 10
**Pros:**
• Prevents server overload
• Controls bandwidth usage
• Useful for slow or shared hosting
• Can limit aggressive bots
**Cons:**
• Can slow down indexing significantly
• Ignored by Googlebot, which adjusts its crawl rate automatically
• May harm SEO if set too high
• Different bots interpret it differently
**Recommendations:**
**Don't use if:**
• You have good hosting/CDN
• You want fast indexing
• Your site is small-medium sized
**Consider using if:**
• Experiencing server issues from bots
• Shared hosting with limited resources
• Very large site with crawl budget concerns
• Targeting specific problematic bots
**Alternatives:**
• Upgrade hosting
• Use CDN
• Optimize site performance
• For Googlebot, temporarily return 429 or 503 responses during overload (the Search Console crawl-rate setting has been retired)
• Use server-level rate limiting
**Safe Values:**
• 1-5 seconds: Minimal impact
• 10-30 seconds: Moderate slowdown
• 60+ seconds: Significant delay, avoid unless necessary
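From the crawler's side, honoring Crawl-delay means reading the value and pausing between requests. The sketch below shows how a polite bot might look it up with Python's `urllib.robotparser`; `delay_for` and the bot name `ExampleBot` are hypothetical, and the fallback default is an assumption for the case where no delay is set.

```python
from urllib.robotparser import RobotFileParser


def delay_for(robots_txt: str, bot: str, default: float = 0.0) -> float:
    """Return the Crawl-delay in seconds that applies to bot, or default."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(bot)  # None when no Crawl-delay applies
    return float(delay) if delay is not None else default


WITH_DELAY = "User-agent: *\nCrawl-delay: 10\n"
WITHOUT_DELAY = "User-agent: *\nDisallow: /admin/\n"

print(delay_for(WITH_DELAY, "ExampleBot"))
print(delay_for(WITHOUT_DELAY, "ExampleBot", default=1.0))
```

A crawler would then call something like `time.sleep(delay)` between requests. Bots that do not implement this lookup simply ignore the directive, which is why server-level rate limiting is the more dependable control.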
Key Features
- Easy-to-use interface for creating robots.txt
- Support for all major search engine bots
- Quick presets for common website types (WordPress, E-commerce, Blog)
- Custom rules with Allow/Disallow directives
- Common path suggestions (admin, wp-admin, login, etc.)
- Sitemap URL integration
- Crawl-delay configuration
- Host preference specification
- Real-time preview of generated file
- URL tester with real RFC 9309 longest-match and '*'/'$' wildcard support
- Compliance linter checks size limit, whole-site blocks and CSS/JS blocking
- Copy to clipboard with one click
- Download as robots.txt file
- File size and statistics display
- Syntax validation
- Best practices guide included
- Multiple user-agent support
- 100% free, no registration required
- Works completely in browser - no server upload needed
- Mobile-friendly responsive design
