One day after Google announced it was submitting a Robots Exclusion Protocol draft to be formalized as an international standard, it announced another surprise for webmasters. Gary Illyes, a software engineer at Google, wrote that Google was dropping support for unsupported rules such as noindex in robots.txt.
It doesn’t appear that these decisions were made frivolously. The removed rules either have other methods that accomplish the same result or have inherent flaws with regard to how crawlers work. Noindex is the best example of a flawed rule, and Google highlights the problem in its “Remove information from Google” help article:
If you use a robots.txt file on your website, you can tell Google not to crawl a page. However, if Google finds a link to your page on another site, with descriptive text, we might generate a search result from that. If you have included a “noindex” tag on the page, Google won’t see it, because Google must crawl (fetch) the page in order to see that tag, but Google won’t fetch your page if there’s a robots.txt file blocking it! Therefore, you should let Google crawl the page and see the “noindex” tag or header. It sounds counterintuitive, but you need to let Google try to fetch the page and fail (because of password protection) or see the “noindex” tag to ensure it’s omitted from search results.
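The interaction Google describes can be checked mechanically. The following is a minimal sketch using Python's standard `urllib.robotparser`; the function name `noindex_will_be_seen`, the example URL, and the sample robots.txt contents are assumptions made for illustration, and the standard-library parser only approximates how Googlebot itself interprets a robots.txt file.

```python
from urllib.robotparser import RobotFileParser


def noindex_will_be_seen(robots_txt, url, page_has_noindex, user_agent="Googlebot"):
    """Return True only if the crawler may fetch the page AND the page
    carries a noindex signal (meta tag or X-Robots-Tag header)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)
    # A blocked page is never fetched, so its noindex signal is never read.
    return crawlable and page_has_noindex


# Hypothetical page and robots.txt files, for illustration only.
PAGE = "https://example.com/private/report.html"
BLOCKING = "User-agent: *\nDisallow: /private/"
OPEN = "User-agent: *\nDisallow:"

print(noindex_will_be_seen(BLOCKING, PAGE, page_has_noindex=True))  # False
print(noindex_will_be_seen(OPEN, PAGE, page_has_noindex=True))      # True
```

In the first case the page can still surface in search results through external links, which is exactly the counterintuitive outcome the help article warns about.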
Crawl-delay is another rule that has been used extensively in robots.txt; it helps webmasters keep aggressive crawlers from overwhelming their sites. Bing used to be particularly aggressive with specific sites and, in some circumstances, would cause them to become inaccessible. In the past, the only way a webmaster could control Bing was to use the crawl-delay rule. Today, Bing is less aggressive, and using crawl-delay is no longer necessary.
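For readers who have not seen the rule, here is a small sketch of what a crawl-delay entry looks like and how a crawler that honors it might read the value. It relies on `RobotFileParser.crawl_delay` from Python's standard library (available since Python 3.6); the 10-second value and the helper name `crawl_delay_for` are assumptions chosen for the example, not recommendations.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking Bing's crawler to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
"""


def crawl_delay_for(robots_txt, user_agent):
    """Return the crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


print(crawl_delay_for(ROBOTS_TXT, "bingbot"))    # 10
print(crawl_delay_for(ROBOTS_TXT, "Googlebot"))  # None: no rule names this crawler
```

The rule is only a request: a crawler that does not implement it simply ignores the line.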
As Illyes suggested in the announcement, webmasters should find other ways to accomplish the desired outcome. He didn’t include any examples of how to address aggressive crawlers, but webmasters do have options besides crawl-delay. For example, Google’s Search Console has a setting that lets you limit Google’s maximum crawl rate. However, that option is only available in the old Search Console, and it’s uncertain whether it will be ported to the new one.
Since crawl-delay doesn’t seem necessary for major search engines anymore, another solution is to disallow any crawler that’s not a major search engine. Other solutions include using a CDN or a service like Cloudflare that can manage and block aggressive crawlers for you.
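The allowlist approach mentioned above can be sketched as a robots.txt file that admits named crawlers and disallows everyone else. The example below embeds such a file in Python and checks it with the standard `urllib.robotparser`. The choice of `Googlebot` and `bingbot` as the allowed crawlers is an assumption about which engines a site cares about, and `SomeScraperBot` is a made-up user agent. Keep in mind that robots.txt is advisory, so a badly behaved crawler can ignore it; that is where a CDN or firewall comes in.

```python
from urllib.robotparser import RobotFileParser

# Allow two named search engine crawlers; disallow every other user agent.
ALLOWLIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow:

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt, user_agent, url="https://example.com/"):
    """Return True if user_agent may fetch url under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("Googlebot", "bingbot", "SomeScraperBot"):
    print(agent, is_allowed(ALLOWLIST_ROBOTS_TXT, agent))
```

An empty `Disallow:` line means "nothing is disallowed" for that crawler, while `Disallow: /` under `User-agent: *` blocks every crawler that has no entry of its own.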
These changes will ultimately be good for the webmaster community by creating an international standard for robots.txt. Aside from reestablishing best practices, the standard will also give crawlers an extensible architecture for rules that are not part of it.