A user agent is a computer program representing a person, for example, a browser in a Web context.
Besides a browser, a user agent could be a bot scraping webpages, a download manager, or another app accessing the Web. Along with each request they make to the server, browsers include a self-identifying User-Agent HTTP header called a user agent (UA) string. This string often identifies the browser, its version number, and its host operating system.
Spam bots, download managers, and some browsers often send a fake UA string to announce themselves as a different client. This is known as user agent spoofing.
A typical user agent string looks like this: "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0".
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
One common rule in a robots.txt file tells all web crawlers not to crawl any pages on www.example.com, including the homepage. Another tells web crawlers that they may crawl all pages on www.example.com, including the homepage.
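The two rules just described were presumably shown as snippets in the original; their standard forms are:

```
# Block all crawlers from the entire site
User-agent: *
Disallow: /

# Allow all crawlers full access (an empty Disallow blocks nothing)
User-agent: *
Disallow:
```

Note how close the two forms are: the only difference is whether a path follows the Disallow field.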
When a crawler arrives at a site and finds a robots.txt file, it will read that file first before continuing through the site's pages. Some crawlers, however, ignore the file entirely; this is especially common with more nefarious crawlers like malware robots or email address scrapers.
Two pattern characters are commonly supported: * is a wildcard that represents any sequence of characters, and $ matches the end of the URL. Whenever they come to a site, search engines and other web-crawling robots (like Facebook’s crawler, Facebot) know to look for a robots.txt file.
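As a sketch of those two pattern characters (the paths and file extension here are purely illustrative):

```
User-agent: *
# * matches any sequence of characters in the path
Disallow: /private*/
# $ anchors the match to the end of the URL
Disallow: /*.pdf$
```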
But, they’ll only look for that file in one specific place: the main directory (typically your root domain or homepage). Even if a robots.txt page did exist at, say, example.com/index/robots.txt or www.example.com/homepage/robots.txt, it would not be discovered by user agents, and thus the site would be treated as if it had no robots file at all.
If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. Do not use robots.txt to prevent sensitive data (like private user information) from appearing in SERP results.
If you want to block your page from search results, use a different method like password protection or the noindex meta directive. Most user agents from the same search engine follow the same rules, so there’s no need to specify directives for each of a search engine’s multiple crawlers, but having the ability to do so does allow you to fine-tune how your site content is crawled.
If you change the file and want Google to pick up the update more quickly, you can submit your robots.txt URL to Google. Moz Pro can identify whether your robots.txt file is blocking our access to your website.
Robots.txt is a plain text file that follows the Robots Exclusion Standard. Each rule blocks (or allows) access for a given crawler to a specified file path in that website.
Don't use a word processor; word processors often save files in a proprietary format and can add unexpected characters, such as curly quotes, which can cause problems for crawlers. Use the robots.txt Tester tool to write or edit robots.txt files for your site.
This tool enables you to test the syntax and behavior against your site. The robots.txt file must be located at the root of the website host to which it applies.
If you're unsure about how to access your website root, or need permissions to do so, contact your web hosting service provider. If you can't access your website root, use an alternative blocking method such as meta tags.
Groups are processed from top to bottom, and a user agent can match only one rule set: the first, most specific group that matches that user agent. The default assumption is that a user agent can crawl any page or directory not blocked by a Disallow: rule.
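A minimal illustration of that group selection (the paths are hypothetical): a crawler identifying itself as Googlebot-News matches the first, more specific group and ignores the second; all other crawlers fall through to the * group.

```
User-agent: Googlebot-News
Disallow: /archive/

User-agent: *
Disallow: /tmp/
```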
Disallow: A directory or page, relative to the root domain, that should not be crawled by the user agent just mentioned. Supports the * wildcard for a path prefix, suffix, or entire string.
Allow: A directory or page, relative to the root domain, that should be crawled by the user agent just mentioned. Supports the * wildcard for a path prefix, suffix, or entire string.
Please read the full documentation, as the robots.txt syntax has a few tricky parts that are important to learn. Keep in mind that in some situations URLs from the website may still be indexed, even if they haven't been crawled.
Note: this does not match the various AdsBot crawlers, which must be named explicitly. Remember that you shouldn't use robots.txt to block access to private content: use proper authentication instead.
Disallow crawling of a single webpage by listing the page after the slash. Note that this implementation hides your pages from search results, but the Mediapartners-Google web crawler can still analyze them to decide what ads to show visitors to your site.
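For instance, a single page can be disallowed by its full path (the file name here is hypothetical):

```
User-agent: *
Disallow: /useless_file.html
```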
For instance, the sample code blocks any URLs that end with .xls. This document details how Google handles the robots.txt file, which allows you to control how Google's website crawlers crawl and index publicly accessible websites.
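The sample code referred to presumably looked like this, using the $ wildcard to anchor the match at the file extension:

```
User-agent: Googlebot
Disallow: /*.xls$
```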
For 5xx, if the robots.txt is unreachable for more than 30 days, the last cached copy of the robots.txt is used, or if unavailable, Google assumes that there are no crawl restrictions. Google treats unsuccessful requests or incomplete data as a server error.
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, or malware analysis), these guidelines need not apply.
The robots.txt file must be in the top-level directory of the host, accessible through the appropriate protocol and port number. Generally accepted protocols for robots.txt are all URI-based, and for Google Search specifically (for example, crawling of websites) are “HTTP” and “HTTPS”.
Google-specific: Google also accepts and follows robots.txt files for FTP sites. FTP-based robots.txt files are accessed via the FTP protocol, using an anonymous login.
It is valid for all files in all subdirectories on the same host, protocol, and port number. Standard port numbers (80 for HTTP, 443 for HTTPS, 21 for FTP) are equivalent to their default host names.
Conditional allow: the directives in the robots.txt file determine the ability to crawl certain content. Handling HTTP result codes: 2xx (successful) result codes signal success and result in a “conditional allow” of crawling. 3xx (redirection): Google follows at least five redirect hops, as defined by RFC 1945 for HTTP/1.0, and then stops and treats it as a 404.
The request is retried until a non-server-error HTTP result code is obtained. A 503 (Service Unavailable) error results in fairly frequent retrying.
If unavailable, Google assumes that there are no crawl restrictions. To temporarily suspend crawling, it is recommended to serve a 503 HTTP result code.
Unsuccessful requests or incomplete data: a robots.txt file which cannot be fetched due to DNS or networking issues, such as timeouts, invalid responses, reset or hung-up connections, and HTTP chunking errors, is treated as a server error. Caching: robots.txt content is generally cached for up to 24 hours, but may be cached longer in situations where refreshing the cached version is not possible (for example, due to timeouts or 5xx errors). Google may increase or decrease the cache lifetime based on max-age Cache-Control HTTP headers.
For example, if the resulting document is an HTML page, only valid text lines are taken into account, the rest are discarded without warning or error. An optional Unicode BOM (byte order mark) at the beginning of the robots.txt file is ignored.
Each valid line consists of a field, a colon, and a value. Spaces are optional (but recommended, to improve readability).
Google currently enforces a size limit of 500 kibibytes (KiB). To keep a robots.txt file under that limit, consolidate directives; for example, place excluded material in a separate directory.
Note the optional use of white space and empty lines to improve readability. Even if there is an entry for a related crawler, it is only valid if it specifically matches.
The value, if specified, is to be seen relative from the root of the website for which the robots.txt file was fetched (using the same protocol, port number, host and domain names). The path value must start with “/” to designate the root.
More information can be found in the section “URL matching based on path values” below. The disallow directive specifies paths that must not be accessed by the designated crawlers.
The allow directive specifies paths that may be accessed by the designated crawlers. The path value is used as a basis to determine whether a rule applies to a specific URL on a site.
Google, Bing, and other major search engines support a limited form of “wildcards” for path values. Google, Bing, and other major search engines also support the sitemap directive, as defined by sitemaps.org.
As non-group-member lines, these are not tied to any specific user agents and may be followed by all crawlers, provided the path is not disallowed. At a group-member level, in particular for allow and disallow directives, the most specific rule based on the length of the path entry trumps the less specific (shorter) rule.
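A sketch of that precedence rule (paths are illustrative): for the URL /folder/page, the Allow rule below is the longer, more specific match, so it trumps the shorter Disallow rule and the page may be crawled.

```
User-agent: *
Allow: /folder/page
Disallow: /folder
```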
Sample situations cover URLs such as http://example.com/page, http://example.com/folder/page, and http://example.com/page.htm. Google offers two options for testing robots.txt markup. Among other things, we will discuss different ways of creating the file and cover the best practices regarding its directives.
It serves to let search engine bots know which pages on your website should be crawled and which shouldn’t. The search engine bots will crawl your website regardless of whether you have the robots.txt file or not.
However, there are quite a few benefits of having a robots.txt file with proper directives once your WordPress website is finished. Moreover, well-written WordPress robots.txt directives can reduce the negative effects of bad bots by disallowing them access.
Bad bots often ignore these directives so using a good security plugin is highly advised, particularly if your website is experiencing issues caused by bad bots. Finally, it is a common misconception that the robots.txt file can prevent some pages on your website from being indexed.
And, even if a page isn’t crawled, it can still be indexed through external links leading to it. If you want to avoid indexing a specific page, you should use the noindex meta tag instead of the directives in the robots.txt file.
In this section, we will cover how to create and edit a robots.txt file, some good practices regarding its content, and how to test it for errors. It has a lot of site optimization tools, including a feature that allows users to create and edit robots.txt files.
Within the page that opens, click on the File editor link near the top. Afterward, you will also see a success message stating that the options have been updated.
In the left-hand section of your FTP client (we’re using FileZilla), locate the robots.txt file you previously created and saved on your computer. Then, in the right-hand section, navigate to your root WordPress directory, often called public_html.
If you wish to edit the uploaded robots.txt file afterward, find it in the root WordPress directory, right-click on it and select the View/Edit option. The Disallow directive tells the bot not to access a specific part of your website.
You don’t need to use it as often as Disallow, because bots are given access to your website by default. More precisely, it serves to allow access to a file or subfolder belonging to a disallowed folder.
The Crawl-delay directive is used to prevent server overload due to excessive crawling requests. The Sitemap directive, in turn, is highly advised, as it can help with submitting the XML sitemap you create to Google Search Console or Bing Webmaster Tools.
In the following section, we will show you two example snippets, to illustrate the use of the robots.txt directives that we mentioned above. This example snippet disallows access to the entire /admin/ directory to all bots, except for the /admin/admin-ajax.php file found within.
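The snippet being described presumably looked like this:

```
User-agent: *
Disallow: /admin/
Allow: /admin/admin-ajax.php
```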
The example below shows the proper way of writing multiple directives; whether they are of the same or different type, one per row is a must. Additionally, this snippet example lets you reference your sitemap file by stating its absolute URL.
If you opt to use it, make sure to replace the www.example.com part with your actual website URL. After you add the directives that fit your website’s requirements, you should test your WordPress robots.txt file.
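Put together, a multi-directive snippet with a sitemap reference might look like this (the paths and domain are illustrative):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://www.example.com/sitemap.xml
```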
By doing so, you are both verifying that there aren’t any syntax errors in the file, and making sure that the appropriate areas of your website have been properly allowed or disallowed. It contains directives for crawlers telling them which parts of your website they should or shouldn’t crawl.
While this file is virtual by default, knowing how to create it on your own can be very useful for your SEO efforts. That’s why we covered various ways of making a physical version of the file and shared instructions on how to edit it.
Moreover, we touched on the main directives a WordPress robots.txt file should contain and how to test that you set them properly. Now that you’ve mastered all this, you can consider how to sort out your site’s other SEO aspects.
Just one character out of place can wreak havoc on your SEO and prevent search engines from accessing important content on your site. Primarily, it lists all the content you want to lock away from search engines like Google.
Directives are rules that you want the declared user agents to follow. Disallow: use this directive to instruct search engines not to access files and pages that fall under a specific path.
For example, if you wanted to block all search engines from accessing your blog and all its posts, you would disallow the blog’s path. If you fail to define a path after the disallow directive, search engines will ignore it.
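Blocking the blog and all its posts might look like this (assuming posts live under /blog/):

```
User-agent: *
Disallow: /blog
```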
For example, if you wanted to prevent search engines from accessing every post on your blog except for one, you would pair a disallow rule with a more specific allow rule. Unless you’re careful, disallow and allow directives can easily conflict with one another.
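Allowing a single post back through a blog-wide block might look like this (the post slug is hypothetical):

```
User-agent: *
Disallow: /blog
Allow: /blog/allowed-post
```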
If you’re unfamiliar with sitemaps, they generally include the pages that you want search engines to crawl and index. However, it does tell other search engines like Bing where to find your sitemap, so it’s still good practice.
Google no longer supports this directive, but Bing and Yandex do. If you set a crawl-delay of 5 seconds, then you’re limiting bots to crawling a maximum of 17,280 URLs a day (86,400 seconds in a day divided by 5).
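For example, a hypothetical 5-second delay would be written as:

```
User-agent: *
Crawl-delay: 5
```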
If you want to exclude a page or file from search engines, use the meta robots tag or robots HTTP header instead. Nofollow: this is another directive that Google never officially supported, which was used to instruct search engines not to follow links on pages and files under a specific path.
If you want to nofollow all links on a page now, you should use the robots meta tag or robots header. Having a robots.txt file isn’t crucial for a lot of websites, especially small ones.
Note that while Google doesn’t typically index web pages that are blocked in robots.txt, there’s no way to guarantee exclusion from search results using the robots.txt file. Just open a blank .txt document and begin typing directives.
For example, to control crawling behavior on domain.com, the robots.txt file should be accessible at domain.com/robots.txt. If you want to control crawling on a subdomain like blog.domain.com, then the robots.txt file should be accessible at blog.domain.com/robots.txt.
For example, if you wanted to prevent search engines from accessing parameterized product category URLs on your site, you could block every such URL with a wildcard rule. That example blocks search engines from crawling all URLs under the /product/ subfolder that contain a question mark.
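Such a rule might be written with a wildcard and a literal question mark (the /product/ path is illustrative):

```
User-agent: *
Disallow: /product/*?
```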
Failure to provide specific instructions when setting directives can result in easily-missed mistakes that can have a catastrophic impact on your SEO. For example, let’s assume that you have a multilingual site, and you’re working on a German version that will be available under the /de/ subdirectory.
If you want to control crawling on a different subdomain, you’ll need a separate robots.txt file. These are mainly for inspiration but if one happens to match your requirements, copy-paste it into a text document, save it as robots.txt and upload it to the appropriate directory.
Robots.txt mistakes can slip through the net fairly easily, so it pays to keep an eye out for issues. To do this, regularly check for issues related to robots.txt in the “Coverage” report in Search Console.
It’s easy to make mistakes that affect other pages and files. This means you have content blocked by robots.txt that isn’t currently indexed in Google.
If this content is important and should be indexed, remove the crawl block in robots.txt. Removing the crawl block also matters when you’re attempting to exclude a page from the search results.
Remove the crawl block and instead use a meta robots tag or x-robots-tag HTTP header to prevent indexing. This may help to improve the visibility of the content in Google search.
Here are a few frequently asked questions that didn’t fit naturally elsewhere in our guide. The robots.txt file is one of the main ways of telling a search engine where it can and can’t go on your website.
All major search engines support the basic functionality it offers, but some of them respond to some extra rules which can be useful too. This guide covers all the ways to use robots.txt on your website, but, while it looks simple, any mistakes you make in your robots.txt file can seriously harm your site.
Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus among early search engine spider developers. Separately, web developers later created the humans.txt standard as a way of letting people know who worked on a website, amongst other things.
Before a search engine spiders any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file, which tells the search engine which URLs on that site it’s allowed to crawl. Search engines typically cache the contents of the robots.txt file, but will usually refresh it several times a day, so changes will be reflected fairly quickly.
Blocking all query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create. If you want to reliably block a page from showing up in the search results, you need to use a meta robots no index tag.
Directives like Allow and Disallow are not case-sensitive, so it’s up to you whether you write them lowercase or capitalize them. We like to capitalize directives because it makes the file easier (for humans) to read.
They will use a specific spider for their normal index, for their ad programs, for images, for videos, etc. Search engines will always choose the most specific block of directives they can find.
“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards, however, all major search engines do understand it. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.
The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the admin folder. But because only Yandex supports the host directive, we wouldn’t advise you to rely on it, especially as it doesn’t allow you to define a scheme (HTTP or HTTPS) either.
A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. And while these search engines have slightly different ways of reading the directive, the end result is basically the same.
By setting a crawl delay of 10 seconds you’re only allowing these search engines to access 8,640 pages a day. On the other hand, if you get next to no traffic from these search engines, it’s a good way to save some bandwidth.
You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions, and we strongly recommend you do, because search engine webmaster tools programs will give you lots of valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick alternative.
You wouldn’t be the first to accidentally use robots.txt to block your entire site, and to slip into search engine oblivion! In July 2019, Google announced that they were making their robots.txt Parser open source.
That means that, if you really want to get into the nuts and bolts, you can go and see how their code works (and, even use it yourself, or propose modifications to it). Search engines robots are programs that visit your site and follow the links on it to learn about your pages.
The best way to edit it is to log in to your web host via a free FTP client like FileZilla, then edit the file with a text editor like Notepad (Windows) or TextEdit (Mac). If you don’t know how to log in to your server via FTP, contact your web hosting company to ask for instructions.
In effect, this will tell all robots and web crawlers that they are not allowed to access or crawl your site. Important: Disallowing all robots on a live website can lead to your site being removed from search engines and can result in a loss of traffic and revenue.
You exclude the files and folders that you don’t want to be accessed, everything else is considered to be allowed. You simply put a separate line for each file or folder that you want to disallow.
The reason for this setting is that Google Search Console used to report an error if it wasn’t able to crawl the admin-ajax.php file. This sitemap should contain a list of all the pages on your site, so it makes it easier for the web crawlers to find them all.
If you want to block your entire site or specific pages from being shown in search engines like Google, then robots.txt is not the best way to do it. Search engines can still index files that are blocked by robots.txt; they just won’t show some useful metadata.
On WordPress, if you go to Settings → Reading and check “Discourage search engines from indexing this site”, then a noindex tag will be added to all your pages. In some cases, you may want to block your entire site from being accessed, both by bots and people.
Keep in mind that robots can ignore your robots.txt file, especially abusive bots like those run by hackers looking for security vulnerabilities. Also, if you are trying to hide a folder from your website, then just putting it in the robots.txt file may not be a smart approach.
If you want to make sure that your robots.txt file is working, you can use Google Search Console to test it. Using it can be useful to block certain areas of your website, or to prevent certain bots from crawling your site.
If you are going to edit your robots.txt file, then be careful because a small mistake can have disastrous consequences. For example, if you misplace a single forward slash then it can block all robots and literally remove all of your search traffic until it gets fixed.
“Crawler” is a generic term for any program (such as a robot or spider) that is used to automatically discover and scan websites by following links from one webpage to another. This table lists information about the common Google crawlers you may see in your referrer logs, and how they should be specified in robots.txt, the robots meta tags, and the X-Robots-Tag HTTP directives.
Some pages use multiple robots meta tags to specify directives for different crawlers. Website owners use the /robots.txt file to give instructions about their site to web robots; this is called the Robots Exclusion Protocol.
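A page carrying one meta tag per crawler might, for instance, look like this (the noindex and nofollow values are purely illustrative):

```html
<meta name="robots" content="noindex">
<meta name="googlebot" content="nofollow">
```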
The Disallow: / tells the robot that it should not visit any pages on the site. Note, though, that robots can ignore this: especially malware robots that scan the web for security vulnerabilities, and email address harvesters used by spammers, will pay no attention.
The rest of this page gives an overview of how to use /robots.txt on your server, with some simple recipes. Where to put it? The short answer: in the top-level directory of your web server.
When a robot looks for the “/robots.txt” file for a URL, it strips the path component from the URL (everything from the first single slash) and puts “/robots.txt” in its place. Usually that is the same place where you put your website’s main index.html welcome page.
To exclude all files except one is currently a bit awkward, as the original standard has no “Allow” field. The easy way is to put all files to be disallowed into a separate directory, and leave the one file in the level above this directory.
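The workaround just described might look like this, assuming the files to hide have been moved into a hypothetical /stuff/ directory, with the one allowed file left in the level above it:

```
User-agent: *
Disallow: /stuff/
```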