One of the most overlooked files on a website is the robots.txt file - and yet, it’s arguably one of the most powerful tools your website can have!
This post aims to give you a basic understanding of the robots.txt file - one of the many items we look at when completing an SEO audit of a client’s website.
In basic terms, a robots.txt file informs robots, including Googlebot, where they can and can’t go on a website.
This can be useful for keeping crawlers out of private areas, and it can also be used to optimise your website’s crawl budget by ensuring Googlebot only spends time on the pages you actually want to be regularly crawled.
A very basic example of a robots.txt file would be something like the below:
User-agent: *
Allow: /
Disallow: /admin
The “User-agent” part of the file dictates which type of robot should listen to the rules that follow. In this case, the asterisk indicates that the rules apply to any and all types of robot visiting the website.
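For example, if you wanted a group of rules to apply only to Google’s main crawler, the group would begin with the crawler’s name instead of an asterisk (Googlebot is used here purely for illustration):
User-agent: Googlebot
Disallow: /admin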
The “Allow” part of the file tells the robot where it’s permitted to go. Often the “allow” line is left out of a robots.txt file altogether, and “disallow” rules alone are used to filter where a robot can visit. In this case, the “allow” line in our robots.txt permits every type of robot to crawl all pages of the site.
The “Disallow” part of the file, on the other hand, shows which URLs are not permitted to be crawled. In this case, the “/admin” line indicates that any URL beginning with /admin should not be crawled - presumably an admin area that would not be accessible to the public anyway.
Note that using the “Disallow” rule in a robots.txt file is not a secure method of hiding a page from end users. A robots.txt file can be seen by anyone, and “disallowed” pages are still accessible to a regular user - they can even appear in search results if other pages link to them. If your content is intended to be private, hide it properly behind a login area, for example.
It’s also best practice to include the location of your website’s XML sitemap (assuming you have one) in the robots.txt file.
This means our robots.txt example file would now look as follows:
User-agent: *
Allow: /
Disallow: /admin
Sitemap: https://www.yourwebsite.com/sitemap.xml
Remember that if you have a multilingual website - for example, one that caters for both English and French speakers in different sections of the site - you should list each available sitemap in the robots.txt file.
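For instance, a site with separate English and French sitemaps (the filenames below are placeholders - use whatever your site actually generates) could list both:
Sitemap: https://www.yourwebsite.com/sitemap-en.xml
Sitemap: https://www.yourwebsite.com/sitemap-fr.xml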
Often when working on SEO for ecommerce websites, we will find pages which can be sorted by price and filtered by colour, for example. Or, functionality will exist where products can be added to a wishlist or comparison tool - all of which can create long, complicated, and possibly unnecessary, URLs.
This is not just the case with ecommerce websites: most websites, built on any type of popular CMS, will have their own crawl issues, which can often be easily addressed with robots.txt.
Many of the reports in Google Search Console will flag up potential URLs that are slowing down Google’s crawl of the website. Or, try using a “site:” search in Google to spot URLs which are indexed but probably shouldn’t be.
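For example, a search along these lines (borrowing the wishlist parameter from the example below purely as an illustration) will surface any indexed URLs containing that parameter:
site:yourwebsite.com inurl:add-to-wishlist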
An example of a bad URL might come from a wishlist feature on an ecommerce website, where adding a product to the wishlist could leave you with a URL looking like:
https://www.yourwebsite.com/product-category/product-name?add-to-wishlist=yes
Inevitably, this results in every product generating duplicate URLs - hundreds, if not thousands, of addresses that simply waste Googlebot’s time and are of no use for Google to see.
To stop Googlebot from crawling these URLs, our robots.txt file might now look something like the below:
User-agent: *
Allow: /
Disallow: /admin
Disallow: *wishlist*
Sitemap: https://www.yourwebsite.com/sitemap.xml
The new “Disallow: *wishlist*” rule will simply block access to any URL with “wishlist” in the address.
How you set up your robots.txt file depends on the structure of the rest of your website. For example, you wouldn’t want to block URLs in this format if the word “wishlist” is a genuine part of the URL structure on other pages!
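In that situation, a more targeted rule aimed at the query parameter itself (assuming the parameter is called add-to-wishlist, as in the example URL above) would be a safer option:
Disallow: /*add-to-wishlist=
This blocks any URL containing the add-to-wishlist parameter while leaving pages that simply have “wishlist” elsewhere in their address untouched.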
You can now start to see how the use of a robots.txt file can really affect the technical SEO performance of your website!
When it comes to making changes to a robots.txt file, it’s important to test your changes before putting anything live. This file is a powerful tool, and can make a mess of your rankings if implemented incorrectly!
For example, a rogue “Disallow: /” line in your robots.txt file would stop Googlebot from crawling your website entirely - not ideal for keyword positions!
So, what tools are there for testing your robots.txt file?
Tools such as the robots.txt Tester in Google Search Console allow you to amend your robots.txt code on the fly and test validation against specific URLs.
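If you’d also like a quick way to sanity-check rules locally before uploading anything, Python’s standard library includes urllib.robotparser, which can parse a set of rules and report whether a given URL would be allowed or blocked. Below is a minimal sketch using the placeholder domain from the examples above and a cut-down set of rules; note the caveats in the comments - Google’s own tester remains the authority, especially for wildcard rules.
from urllib import robotparser

# Draft rules pasted in as a string (placeholder domain from the examples above).
# urllib.robotparser follows the original robots.txt specification - the first
# matching rule wins and wildcards aren't supported - so treat this as a quick
# sanity check and confirm Allow/Disallow combinations and wildcard rules in
# Google's own tester.
draft_rules = """User-agent: *
Disallow: /admin
Sitemap: https://www.yourwebsite.com/sitemap.xml""".splitlines()

parser = robotparser.RobotFileParser()
parser.parse(draft_rules)

# Check whether Googlebot would be allowed to fetch each URL under the draft rules.
for url in ("https://www.yourwebsite.com/product-category/product-name",
            "https://www.yourwebsite.com/admin/login"):
    print(url, "->", "allowed" if parser.can_fetch("Googlebot", url) else "blocked")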
WordPress is one of the most commonly used content management systems available, and its robots.txt file is easy to edit using the popular Yoast SEO plugin.
Every content management system is different, but editing a robots.txt file should be possible through most popular CMSs.
If this isn’t possible, speak to your web developer about changing your robots.txt file manually instead.