

What is a robots.txt file and how to use it

  • Category : SEO
  • Posted on : May 21, 2020
  • Views : 1,087
  • By : HostSEO

Contents:
  • Robots.txt - General information
  • Robots.txt and SEO
  • Hotfixes and workarounds
  • Robots.txt for WordPress

Robots.txt - General information

Robots.txt is a text file located in a website's root directory that specifies which website pages and files you want (or don't want) search engine crawlers and spiders to visit. Usually, website owners want to be noticed by search engines; however, there are cases when it's not needed: for instance, if you store sensitive data, or if you want to save bandwidth by not indexing heavy pages with images.

Search engines index websites using keywords and metadata in order to provide the most relevant results to Internet users looking for something online. Reaching the top of the search results list is especially important for e-commerce shop owners, as customers rarely browse further than the first few pages of suggested matches in a search engine.
For indexing purposes, so-called spiders or crawlers are used. These are bots that search engine companies use to fetch and index the content of all the websites that are open to them.

When a crawler accesses a website, it first requests a file named /robots.txt. If such a file is found, the crawler checks it for the website's indexation instructions. A bot that does not find any directives follows its own algorithm, which basically indexes everything. Not only does this overload the website with needless requests, but indexing itself also becomes a lot less effective.

NOTE: There can be only one robots.txt file per website. A robots.txt file for an addon domain name needs to be placed in the corresponding document root. For example, if your domain name is www.domain.com, the file should be found at https://www.domain.com/robots.txt.
It's also very important that your robots.txt file is actually called robots.txt. The name is case-sensitive, so make sure to get that right or it won't work.

Google's official stance on the robots.txt file

A robots.txt file consists of lines that contain two fields:
  1. The User-agent name (the search engine crawler). You can find the list of all user-agents' names here.
  2. One or more lines starting with the Disallow: directive to block indexing.


Robots.txt has to be created in the UNIX text format. It's possible to create such a .txt file directly in the File Manager in cPanel. More detailed instructions can be found here.


Basics of robots.txt syntax

Usually, a robots.txt file contains a code like this:
 
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/


In this example, three directories, /cgi-bin/, /tmp/, and /~different/, are excluded from indexation.

PLEASE NOTE:
  • Every directory is written on a separate line. You should not list all the directories on one line, nor should you break a single directive into several lines. Use a new line to separate directives from each other.
  • The asterisk (*) in the User-agent field means "any web crawler." Consequently, directives such as Disallow: *.gif or User-agent: Mozilla* are not supported. Pay attention to these logical mistakes, as they are the most common ones.
  • Another common mistake is an accidental typo: misspelled directories or user-agents, missing colons after User-agent and Disallow, etc. As your robots.txt files get more and more complicated, it's easier for an error to slip in, so there are validation tools that come in handy.



Examples of usage

Here are some useful examples of robots.txt usage:

Example 1
 
Prevent the whole site from indexation by all web crawlers:

 

User-agent: *
Disallow: / 
 
Fully blocking crawling might be needed when the website is under a heavy load of requests, or when the content is being updated and should not come up in the search results yet. Sometimes the settings of an SEO campaign are too aggressive, and the bots basically overload the website with requests to its pages.
 
Example 2
 
Allow all web crawlers to index the whole site:

 

User-agent: *
Disallow:

There is actually no need to crawl the whole website. It's unlikely that visitors will be looking up terms of use or login pages via Google Search, for example. Excluding some pages or types of content from indexing is beneficial for the security, speed, and ranking relevance of the website.
 
Below are examples of how to control which content is indexed on your website.
 
Example 1
 
Prevent only several directories from indexation:

 

User-agent: *
Disallow: /cgi-bin/

Example 2

 
Prevent a specific page from indexation:

 

User-agent: *
Disallow: /page_url

 

The page is usually specified not by its full URL but by the part of the path that follows http://www.yourdomain.com/. When such a rule is used, any page whose URL starts with that name is blocked from indexing. For example, both /page_url and /page_url_new will be excluded. To avoid this, the following code can be used:
 
User-agent: *
Disallow: /page_url$
 
Example 3
 
Prevent the website's indexation by a specific web crawler:

 

User-agent: Bot1
Disallow: /


Even with a list of known user-agents at hand, bot identities might change over time. When the load on the website is extremely high and it's not possible to find out which exact bot is overusing the resources, it's better to block all of them temporarily.

 
Example 4

Allow indexation by a specific web crawler and prevent it for all others:

User-agent: Opera 9
Disallow:
User-agent: *
Disallow: /
Example 5
 
Prevent all the files from indexation except a single one.

There is also the Allow: directive. It is not recognized by all crawlers, however, and might get ignored by a number of them. Currently, it's supported by Bing and Google. The following example shows how to allow only one file from a specific folder; use it at your own risk:

User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/

Instead, you can move all the files you do not want indexed into a separate subdirectory and disallow that subdirectory, keeping the one file that should be indexed outside of it:

 

User-agent: *
Disallow: /docs/


This setup requires a specific website structure. It's also possible to create a separate landing page that redirects to the actual home page. This way, you can block the directory containing the actual website and allow only the landing index page. It's better to have such changes performed by a website developer to avoid any issues.
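As a rough sketch of that approach, assuming the actual website files were moved into a hypothetical /app/ subdirectory while a simple landing index page stays in the document root, the robots.txt could look like this:

User-agent: *
# /app/ is a hypothetical subdirectory holding the actual website files
Disallow: /app/

The landing page in the root remains crawlable, while everything under /app/ stays out of the index.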

 
You can also use an online robots.txt file generator here. Keep in mind that it produces a default setup that does not take into account the sophisticated structure of custom-coded websites.
 


Robots.txt and SEO


Removing exclusions of images

The default robots.txt file in some CMS versions is set up to exclude your images folder. This issue doesn't occur in the latest CMS versions, but older versions need to be checked.
This exclusion means your images will not be indexed and included in Google's Image Search. Having images appear in search results is something you would want, as it increases your SEO rankings. However, you need to look out for an issue called "hotlinking": when someone reposts an image uploaded to your website elsewhere, it is your server that gets loaded with the requests. To learn how to prevent hotlinking, read our corresponding Knowledgebase article.
 
If you would like to change this, open your robots.txt file and remove the line that says:

Disallow: /images/

If your website has a lot of private content, or if the media files are not stored permanently but are uploaded and deleted daily, it's better to exclude the images from search results. The first case is a matter of personal privacy; the latter concerns the possible overload from crawler activity when the bots check each new image again and again.
 
 
Adding reference to your sitemap.xml file

If you have a sitemap.xml file (and you should, as it increases your SEO rankings), it's good to include the following line in your robots.txt file:
 
Sitemap: http://www.domain.com/sitemap.xml
 
Do not forget to replace http://www.domain.com/sitemap.xml with the actual path to your sitemap.
Guidelines on how to create a sitemap.xml file for your website can be found here.


Miscellaneous remarks

  • Don't block CSS, JavaScript, and other resource files by default. This prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
  • You can also use the file to prevent specific pages from being indexed, such as login or 404 pages, but this is better done with the robots meta tag.
  • Adding disallow statements to a robots.txt file does not remove content. It simply blocks access for spiders. If there is content that you want removed, it's better to use a meta noindex tag.
  • As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, such as a rel=canonical tag, which is part of the HTML head of a webpage.
  • Always keep in mind that your robots.txt file must be accurate so that your website can be indexed correctly by search engines.
 


Hotfixes and workarounds


Setting pages to 'noindex' while keeping links followed

The noindex meta tag prevents a whole page from being indexed by a search engine. This might not be the desired outcome, since you may still want the URLs on that page to be followed by bots for better results. To ensure this happens, you can add the following line to the <head> of your page:
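<meta name="robots" content="noindex, follow">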

This line will prevent the page itself from being indexed by a search engine, but thanks to the follow part of the tag, the links posted on this page will still be retrieved. This allows the spider to move around the website and its linked content. The benefit of this type of setup is called link juice: the connection between different pages and the relevance of their content to each other.
If nofollow is used instead, the crawler will stop when it reaches this page and will not move further to the interlinked content:
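<meta name="robots" content="noindex, nofollow">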

From an SEO perspective, this is not recommended, but it's up to you to decide.

Removed content
 
Some pages might be removed from the website permanently and therefore no longer have any real value. Any outdated content should be removed from the robots.txt and .htaccess files; the latter might contain redirects for pages that are no longer relevant.
Simply blocking expired content is not effective. Instead, 301 redirects should be applied, either in the .htaccess file or via plugins. If there is no adequate replacement for the removed page, it may be redirected to the homepage.
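As a minimal sketch, assuming an Apache server and a hypothetical /removed-page URL, such a 301 redirect in the .htaccess file could look like this:

# Hypothetical example: permanently redirect a removed page to the homepage
Redirect 301 /removed-page https://www.domain.com/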
 

Security Leaks
 
It's better to prevent pages with sensitive data on them from being indexed. The most common examples are:
  • Login pages
  • Administration area
  • Personal accounts information
To improve website security, please keep in mind the following:
  • The fact that a URL appears in the search results does not mean that anyone without the credentials can access it. Still, you may want to have a custom administrative dashboard and login URLs that are known only to you.
  • It's recommended not only to exclude certain folders but also to protect them with passwords (a minimal example follows at the end of this section).
  • If certain content on your website should be available to registered users only, make sure to apply these settings to the pages in question. Password-only access can be set up as described here. Examples are websites with a premium membership, where certain pages and articles are available only after logging in.
  • The robots.txt file and its content can be checked online by anyone. This is why it's advised to avoid putting in any names or data that might give away unwanted information about your business.
For example, if you have pages for your colleagues that each reside in a separate folder and you want to exclude them from search results, the folders should not be named "johndoe" or "janedoe", etc. Disallowing these folder names would openly publicize your colleagues' names. Instead, you can create a "profiles" folder and place all the personal accounts there. The URL in the browser would be https://yourdomain.com/profiles/johndoe and the robots.txt rule will look like this:

User-agent: *
Disallow: /profiles/
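As a minimal sketch of the password protection mentioned above, assuming an Apache server and a hypothetical .htpasswd file location, an .htaccess file placed inside the /profiles/ directory could look like this:

# Hypothetical example: require a login for everything under /profiles/
AuthType Basic
AuthName "Restricted area"
# Password file created with the htpasswd utility (assumed path)
AuthUserFile /home/username/.htpasswd
Require valid-user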


System and website configuration content
Not only as a security measure but also to save your hosting resources, you might want to exclude content that is irrelevant to your website visitors from the search results. For example, this might include theme and background images, buttons, seasonal banners, etc. However, using the Disallow directive for a whole /theme directory is not advised.
 
This is why it's advised to implement the theme and layout fully through CSS instead of inserting backgrounds via HTML tags, for example. Hiding a specific style folder might prevent crawlers from fetching the content and presenting it properly to users in the respective search results.
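As a simple illustration of that advice, assuming a hypothetical background.png file, a decorative background would be declared in the stylesheet rather than embedded as an image tag in the HTML:

/* style.css - hypothetical example: decorative images belong in CSS, not in the page markup */
body {
  background-image: url("/images/theme/background.png");
}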

 
Bot overload

Some search engines are too eager to check for content after the slightest update. They do it too often and create a heavy load on the website. Nobody wants to see their pages loading slowly because of hungry crawlers, but blocking them completely every time might be too extreme. Instead, it's possible to slow them down by using the following directive:

User-agent: *
Crawl-delay: 10

In this case, there's a 10-second delay for the search bots that support this directive.
 
 

Robots.txt for WordPress

WordPress creates a virtual robots.txt file once you publish your first post. However, if you already have a real robots.txt file on your server, WordPress won't add a virtual one.

A virtual robots.txt doesn't exist on the server, and you can only access it via the following link: http://www.yoursite.com/robots.txt

By default, it will have Google's Mediabot allowed, a bunch of spambots disallowed and some standard WordPress folders and files disallowed.

So, in case you haven't created a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP. As a best practice, you can also use one of the many available SEO plugins. For the most up-to-date and trustworthy plugins, check out WordPress' official SEO guide.


Blocking main WordPress directories

There are three standard directories in every WordPress installation – wp-content, wp-admin, and wp-includes – that don't need to be indexed.

Don't disallow the whole wp-content folder, though, as it contains an 'uploads' subfolder with your site's media files that you don't want to be blocked. That's why you need to proceed as follows:

Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/


Blocking on the basis of your site structure

Every blog can be structured in various ways:

 a) On the basis of categories
 b) On the basis of tags
 c) On the basis of both, or neither of them
 d) On the basis of date-based archives

a) If your site is category-structured, you don't need to have the tag archives indexed. Find your tag base on the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply 'tag':

Disallow: /tag/

b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:

Disallow: /category/

c) If you use both categories and tags, you don't need to add any directives. If you use neither of them, you need to block both:

Disallow: /tag/
Disallow: /category/

d) If your site is structured on the basis of date-based archives, you can block those in the following ways:

Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/

PLEASE NOTE: You can't use Disallow: /20*/ here, as such a directive will block every single blog post or page that starts with the number '20'.


Duplicate content issues in WordPress

By default, WordPress has duplicate pages, which do no good to your SEO rankings. To fix this, we would advise you not to use robots.txt but instead to go with a subtler way: the rel=canonical tag, which you use to place the only correct canonical URL in the <head> section of your site. This way, web crawlers will only crawl the canonical version of a page. A more detailed description from Google about what a canonical tag is and why you should use it can be found here.
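As a short illustration, assuming a hypothetical post URL, the canonical tag placed in the <head> section of the duplicate pages could look like this:

<link rel="canonical" href="https://www.yourdomain.com/sample-post/">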


That's it!

