What is a robots.txt file and how to use it
- Category : SEO
- Posted on : May 21, 2020
- By : HostSEO
Robots.txt - General information
- blocking main WordPress directories
- blocking on the basis of your site structure
- duplicate content issues in WordPress
Robots.txt - General information
Robots.txt is a text file located in a website's root directory that specifies which website pages and files you want (or don't want) search engine crawlers and spiders to visit. Usually, website owners want to be noticed by search engines; however, there are cases when it's not needed. For instance, you may store sensitive data, or you may want to save bandwidth by keeping crawlers away from heavy pages with images.
Search engines index websites using keywords and metadata in order to provide the most relevant results to Internet users looking for something online. Reaching the top of the search results is especially important for e-commerce shop owners, because customers rarely browse further than the first few pages of suggested matches.
For indexing purposes, so-called spiders or crawlers are used. These are bots that the search engine companies use to fetch and index the content of all the websites that are open to them.
When a crawler accesses a website, it first requests a file named /robots.txt. If such a file is found, the crawler checks it for the website's indexation instructions. A bot that does not find any directives follows its own algorithm, which basically means indexing everything. Not only does this overload the website with needless requests, but the indexing itself also becomes a lot less effective.
NOTE: There can be only one robots.txt file for the website. A robots.txt file for an addon domain name needs to be placed in the corresponding document root. For example, if your domain name is www.domain.com, it should be found at https://www.domain.com/robots.txt.
It's also very important that your robots.txt file is actually called robots.txt. The name is case sensitive, so make sure to get that right or it won't work.
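Because the file always lives at the root of the host, its address can be derived from any page URL on that host. The following Python sketch illustrates this; the function name robots_txt_url is an assumption made for this example, not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path with the fixed,
    # lowercase file name and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.domain.com/shop/item?id=1"))
```

Note that an addon domain or subdomain is a different host, so it resolves to its own robots.txt.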
Google's official stance on the robots.txt file
- User-agent name (search engine crawlers). Find the list with all user-agents' names here.
- Line(s) starting with the Disallow: directive to block indexing.
Robots.txt has to be created in the UNIX text format. It's possible to create such a .txt file directly in the File Manager in cPanel. More detailed instructions can be found here.
Basics of robots.txt syntax
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
In this example three directories: /cgi-bin/, /tmp/ and /~different/ are excluded from indexation.
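One way to see how a crawler might interpret these rules is to feed them to the robots.txt parser in Python's standard library. This is only a sketch: the crawler name ExampleBot and the sample paths are made up for illustration, and real search engine crawlers use their own parsers.

```python
from urllib.robotparser import RobotFileParser

# The three-directory example, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~different/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
for path in ("/tmp/cache.html", "/~different/page.html", "/blog/post.html"):
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, "->", verdict)
```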
- Every directory is written on a separate line. You should not write all the directories in one line, nor should you break up one directive into several lines. Instead, use a new line to separate directives from each other.
- A star (*) in the User-agent field means "any web crawler." It is not a general-purpose wildcard there, so a line such as User-agent: Mozilla* is not supported. Pattern matching in paths was not part of the original robots.txt convention either: major crawlers such as Googlebot and Bingbot understand * and $ inside a path, but other bots may ignore a rule such as Disallow: *.gif. Pay attention to these logical mistakes as they are the most common ones.
- Another common mistake is an accidental typo: misspelled directories or user-agents, missing colons after User-agent and Disallow, etc. As your robots.txt file gets more complicated, it's easier for an error to slip in, so validation tools come in handy.
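The typos described above are easy to catch mechanically. Below is a minimal Python sketch of such a check, not a replacement for a real validation tool. The function name lint_robots and the set of accepted directive names are assumptions made for this illustration.

```python
# Directive names this sketch accepts (an assumption, not an official list).
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}


def lint_robots(text):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing colon")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{field}'")
    return problems


sample = "User-agent: *\nDisallow /tmp/\nDisalow: /cgi-bin/\n"
for problem in lint_robots(sample):
    print(problem)
```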
Examples of usage
Here are some useful examples of robots.txt usage:
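Prevent all web crawlers from indexing the whole site:
User-agent: *
Disallow: /
Allow all web crawlers to index the whole site:
User-agent: *
Disallow:
Prevent indexation of a single directory:
User-agent: *
Disallow: /cgi-bin/
Prevent indexation of a specific page. The rule is a prefix match, so it also blocks any URL that begins with /page_url:
User-agent: *
Disallow: /page_url
Block only the exact URL, for crawlers that support the $ end-of-URL marker:
User-agent: *
Disallow: /page_url$
Prevent indexation by a specific web crawler (here called Bot1):
User-agent: Bot1
Disallow: /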
Even with a list of known user-agent names, some identities might change over time. When the load on the website is extremely high and it's not possible to identify the exact bot overusing the resources, it's better to block all of them temporarily.
Allow indexation to a specific web crawler and prevent indexation from others:
User-agent: Opera 9
Disallow:

User-agent: *
Disallow: /
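The effect of the two groups can be checked with Python's standard-library parser. This is a sketch for illustration only; SomeOtherBot is a made-up crawler name, and each real crawler decides for itself which group applies to it.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Opera 9
Disallow:

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The named crawler matches its own group, whose empty Disallow allows everything.
print(parser.can_fetch("Opera 9", "/index.html"))
# Any other crawler (hypothetical name) falls back to the "*" group.
print(parser.can_fetch("SomeOtherBot", "/index.html"))
```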
There is also the Allow: directive. However, it is not recognized by all crawlers and might be ignored by a number of them. Currently, it's supported by Bing and Google. The following example, which allows only one file from a specific folder, should be used at your own risk:
User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/
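As a rough check of how the Allow and Disallow lines interact, the rules can be run through Python's standard-library parser. Keep in mind this is a sketch: that parser applies the first matching rule in file order, while Google documents that it picks the most specific (longest) matching rule, so results can differ between crawlers when the lines are ordered differently. ExampleBot is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Allow: /docs/file.jpeg
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# The Allow line comes first, so the single file stays reachable.
print(parser.can_fetch("ExampleBot", "/docs/file.jpeg"))
# Everything else under /docs/ is caught by the Disallow line.
print(parser.can_fetch("ExampleBot", "/docs/other.jpeg"))
```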
Alternatively, you can move all the files except the one you want indexed into a certain subdirectory and prevent indexation of that subdirectory:
User-agent: *
Disallow: /docs/
This setup requires a specific website structure. It's also possible to create a separate landing page that redirects to the actual user's home page. This way you can block the actual directory with the website and allow the landing index page only. It's better when such changes are performed by a website developer to avoid any issues.
It's also possible to keep crawlers out of the directory where your images are stored:
Disallow: /images/
This exclusion means your images will not be indexed and included in Google's Image Search. Having your images appear in search results is usually something you want, as it helps your SEO rankings. However, you need to look out for an issue called "hotlinking." When someone reposts an image uploaded to your website elsewhere, it is your server that gets loaded with the requests. To prevent hotlinking, read more in our corresponding Knowledgebase article.
If your website has a lot of private content, or the media files are not stored permanently but uploaded and deleted daily, it's better to exclude the images from search results. In the first case, it's a matter of personal privacy. The latter concerns the possible overload from crawler activity when bots check each new image again and again.
If you have a sitemap.xml file (and you should have one, as it helps your SEO rankings), it's a good idea to include the following line in your robots.txt file:
Sitemap: http://www.domain.com/sitemap.xml
Guidelines on how to create a sitemap.xml for your website can be found here.
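To confirm that the Sitemap line is picked up, it can be read back with Python's standard-library parser. A small sketch, assuming Python 3.8 or later (where RobotFileParser.site_maps() is available); the rules shown are illustrative.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /tmp/
Sitemap: http://www.domain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of sitemap URLs, or None if no Sitemap line exists.
print(parser.site_maps())
```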
Miscellaneous remarks
- Don't block CSS, JavaScript and other resource files by default. Blocking them prevents Googlebot from properly rendering the page and understanding that your site is mobile-optimized.
- You can also use the file to prevent specific pages from being indexed, like login or 404 pages, but this is better done using the robots meta tag.
- Adding disallow statements to a robots.txt file does not remove content. It simply blocks access to spiders. If there is content that you want to remove from search results, it's better to use a meta noindex.
- As a rule, the robots.txt file should never be used to handle duplicate content. There are better ways, like a rel=canonical tag, which is a part of the HTML head of a webpage.
- Always keep in mind that robots.txt must be accurate for your website to be indexed correctly by the search engines.
Including URL indexing to 'noindex'
The noindex meta tag prevents the whole page from being indexed by a search engine. This might not be a desirable situation, since you would still want the URLs on that page to be followed and indexed by bots for better results. To ensure this happens, you can edit your page header with the following line:
<meta name="robots" content="noindex, follow">
If nofollow is added, the crawler will stop when it reaches this page and will not move further to the interlinked content:
<meta name="robots" content="noindex, nofollow">
From an SEO perspective, this is not recommended, but it's up to you to decide.
Simply blocking expired content is not effective. Instead, 301 redirects should be applied either in the .htaccess file or via plugins. If there is no adequate replacement for the removed page, it may be redirected to the homepage.
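Returning to the robots meta tag discussed in this section: to check which directives a page actually declares, the page's HTML can be inspected with Python's built-in HTML parser. This is a minimal sketch; the class name RobotsMetaParser, the helper robots_directives and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content attribute is a comma-separated list of directives.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page markup used only for this demonstration.
sample = '<html><head><meta name="robots" content="noindex, follow"></head><body></body></html>'
print(sorted(robots_directives(sample)))
```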
Pages that are commonly excluded from search results include:
- Login pages
- Administration area
- Personal accounts information
- The fact that such a URL appears in the search results does not mean that anyone without the credentials can access it. Still, you may want to have a custom administrative dashboard and login URLs that are only known to you.
- It's recommended to not only exclude certain folders but also protect them using passwords.
- If certain content on your website should be available to registered users only, make sure to apply these settings to the pages. The password-only access can be set up as described here. Examples are websites with premium membership, where certain pages and articles are available only after logging in.
- The robots.txt file and its content can be viewed online by anyone. This is why it's advised to avoid inputting any names or data that might give unwanted information about your business.
User-agent: *
Disallow: /profiles/
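Some crawlers also honor a Crawl-delay directive that limits how often they request pages:
Crawl-delay: 10
In this case, there's a 10-second delay between requests for search bots that support the directive. Support varies between crawlers; Googlebot, for example, ignores Crawl-delay.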
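A well-behaved bot written in Python could read the delay with the standard-library parser, as in the sketch below. It assumes Python 3.6 or later (where RobotFileParser.crawl_delay() is available); ExampleBot is a hypothetical crawler name.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /profiles/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns the delay in seconds for the matching group,
# or None when no Crawl-delay line applies to this user agent.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```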
Robots.txt for WordPress
WordPress creates a virtual robots.txt file once you publish your first post. However, if a real robots.txt file already exists on your server, WordPress won't add a virtual one.
A virtual robots.txt doesn't exist as a file on the server, and you can only access it via the following link: http://www.yoursite.com/robots.txt
By default, it will have Google's Mediabot allowed, a bunch of spambots disallowed and some standard WordPress folders and files disallowed.
So in case you haven't created a real robots.txt yet, create one with any text editor and upload it to the root directory of your server via FTP. As a best practice, you can also use one of the many SEO plugins on offer. For the most updated and trustworthy plugins, check out WordPress' official SEO guide.
Blocking main WordPress directories
There are three standard directories in every WordPress installation (wp-content, wp-admin and wp-includes) that don't need to be indexed.
Don't disallow the whole wp-content folder though, as it contains an 'uploads' subfolder with your site's media files that you don't want to be blocked. That's why you need to proceed as follows:
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
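The sketch below runs these four rules through Python's standard-library parser to confirm that the uploads folder stays reachable. Two things are assumptions added for the example: the User-agent: * line (Disallow lines only take effect inside a user-agent group) and the sample file paths.

```python
from urllib.robotparser import RobotFileParser

# "User-agent: *" is added here because rules must belong to a group.
RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/themes/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Hypothetical paths used only to exercise the rules.
for path in (
    "/wp-admin/options.php",
    "/wp-content/plugins/example/readme.txt",
    "/wp-content/uploads/photo.jpg",
):
    verdict = "allowed" if parser.can_fetch("ExampleBot", path) else "blocked"
    print(path, "->", verdict)
```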
Blocking on the basis of your site structure
Every blog can be structured in various ways:
a) On the basis of categories
b) On the basis of tags
c) On the basis of both, or of neither
d) On the basis of date-based archives
a) If your site is category-structured, you don't need to have the Tag archives indexed. Find your tag base in the Permalinks options page under the Settings menu. If the field is left blank, the tag base is simply 'tag':
Disallow: /tag/
b) If your site is tag-structured, you need to block the category archives. Find your category base and use the following directive:
Disallow: /category/
c) If you use both categories and tags, you don't need to use any directives. In case you use none of them, you need to block both of them:
Disallow: /tag/
Disallow: /category/
d) If your site is structured on the basis of date-based archives, you can block those in the following ways:
Disallow: /2010/
Disallow: /2011/
Disallow: /2012/
Disallow: /2013/
PLEASE NOTE: You can't use Disallow: /20*/ here, as such a directive will block every single blog post or page whose URL starts with the number '20'.
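To see why, here is a rough Python sketch of how crawlers that support wildcards (Google documents * and $ for Googlebot) match a rule against a URL path. The helper rule_matches is hypothetical and simplified; real crawlers also normalize and percent-encode paths before comparing.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches the given URL path.

    '*' stands for any sequence of characters, a trailing '$' anchors the
    pattern to the end of the path, and everything else is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" for each wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    return re.match(regex, path) is not None


# The wildcard rule catches the date archive and an unrelated post alike.
print(rule_matches("/20*/", "/2012/05/my-post/"))
print(rule_matches("/20*/", "/20-tips-for-seo/"))
# An explicit year only matches that archive.
print(rule_matches("/2012/", "/20-tips-for-seo/"))
```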
Duplicate content issues in WordPress
By default, WordPress has duplicate pages which do no good to your SEO rankings. To fix this, we would advise you not to use robots.txt, but instead go with a subtler way: the rel="canonical" tag, which you place in the <head> section of your site to specify the only correct canonical URL. This way, search engines are pointed to the canonical version of a page. A more detailed description from Google about what a canonical tag is and why you should be using it can be found here.
That's it!