What’s the first thing a search engine looks for upon arriving at your site? It’s robots.txt, a text file kept in the root directory of a website that instructs spiders which directories can and cannot be crawled.
With simple “disallow” directives, robots.txt is where you can block crawling of:
- Private directories you don’t want the public to find
- Temporary or auto-generated pages (such as search results pages)
- Advertisements you may host (such as AdSense ads)
- Under-construction sections of your site
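As a sketch of what this looks like in practice, here is a minimal robots.txt covering cases like those above. The directory paths are hypothetical stand-ins; you would substitute your own:

```
# Applies to all compliant crawlers
User-agent: *

# Keep private and temporary areas out of the crawl
Disallow: /private/
Disallow: /search-results/
Disallow: /ads/
Disallow: /under-construction/
```

Note that the file only works from the root of the host (e.g., `example.com/robots.txt`); spiders won’t look for it in a subdirectory.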
Every site should have a robots.txt file in its root directory, even a blank one, since that’s the first thing on the spiders’ checklist.
But handle your robots.txt with great care; it’s like a small rudder capable of steering a huge ship. A single disallow command applied to the root directory can stop all crawling — which is very useful, for instance, for a staging site or a brand new version of your site that isn’t ready for prime time yet. However, we’ve seen entire websites inadvertently sink without a trace in the SERPs simply because the webmaster forgot to remove that disallow command when the site went live.
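That rudder-sized command is just two lines. A robots.txt that tells all compliant crawlers to stay off the entire site looks like this:

```
User-agent: *
Disallow: /
```

When the staging site goes live, this rule must be removed (or the `Disallow:` value left empty, which permits full crawling) — otherwise the whole site stays invisible to search engines.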
SEO Tip: Some content management systems (e.g., WordPress) come with a prefabricated robots.txt file. Make sure that you update it to meet your site’s needs.
Google offers a robots.txt Tester in Google Search Console that checks your robots.txt file to make sure it’s working as you intend. We also suggest running the Fetch as Google tool if there’s any question about how a particular URL may be indexed. This tool simulates how Google crawls URLs on your website, even rendering your pages to show you whether the spiders can correctly process the various types of code and elements on your page.