Wednesday, July 24, 2019

Google Open Sources robots.txt Parser

Google (via Hacker News):

We’re here to help: we open sourced the C++ library that our production systems use for parsing and matching rules in robots.txt files. This library has been around for 20 years and it contains pieces of code that were written in the 90’s. Since then, the library evolved; we learned a lot about how webmasters write robots.txt files and corner cases that we had to cover for, and added what we learned over the years also to the internet draft when it made sense.

We also included a testing tool in the open source package to help you test a few rules.
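
To give a sense of what that parsing and matching involves, here is a minimal, illustrative sketch of the core idea: Allow and Disallow rules where the most specific (longest) matching path wins, with ties going to Allow. This is not Google's code; it skips wildcards, user-agent grouping, and the many corner cases the real library exists to handle.

    // Illustrative sketch only: a tiny robots.txt rule matcher that applies
    // the "most specific (longest) rule wins" logic for Allow/Disallow paths.
    // It ignores wildcards (* and $), user-agent grouping, and the corner
    // cases a production parser has to cover.
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    struct Rule {
      bool allow;        // true for Allow:, false for Disallow:
      std::string path;  // path prefix the rule applies to
    };

    // Pull only Allow/Disallow lines out of a robots.txt body.
    std::vector<Rule> ParseRules(const std::string& body) {
      std::vector<Rule> rules;
      std::istringstream in(body);
      std::string line;
      while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CR LF
        auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::string key = line.substr(0, colon);
        std::string value = line.substr(colon + 1);
        value.erase(0, value.find_first_not_of(" \t"));  // trim leading whitespace
        if (key == "Allow") rules.push_back({true, value});
        else if (key == "Disallow") rules.push_back({false, value});
      }
      return rules;
    }

    // A URL path is allowed unless the longest matching rule is a Disallow.
    bool Allowed(const std::vector<Rule>& rules, const std::string& path) {
      bool allow = true;  // default: allowed when nothing matches
      size_t best_len = 0;
      for (const auto& r : rules) {
        if (r.path.empty()) continue;  // an empty "Disallow:" allows everything
        if (path.compare(0, r.path.size(), r.path) != 0) continue;  // prefix match
        if (r.path.size() > best_len || (r.path.size() == best_len && r.allow)) {
          best_len = r.path.size();
          allow = r.allow;
        }
      }
      return allow;
    }

    int main() {
      const std::string robots =
          "User-agent: *\r\n"
          "Disallow: /forum/\r\n"
          "Allow: /forum/announcements/\r\n";
      const auto rules = ParseRules(robots);
      std::cout << Allowed(rules, "/forum/topic/42") << "\n";            // 0 (disallowed)
      std::cout << Allowed(rules, "/forum/announcements/2019") << "\n";  // 1 (allowed)
      std::cout << Allowed(rules, "/blog/") << "\n";                     // 1 (allowed)
    }

Even this toy version has to decide how to handle line endings and empty values; the value of the open-sourced library is the twenty years of such decisions baked into it.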

My sites have recently been hammered by bots—hundreds of thousands of hits from search engines and directories I’d never heard of—causing the server to run out of memory (I think due to the PHP-based vBulletin forum) and reboot. If you’ve seen this site go down for a couple of minutes every now and then, I think that’s why.

The bots all claimed to follow the Robots Exclusion Protocol, but they were not respecting my requests to crawl more slowly and to avoid the forum. Eventually I figured out that the specification calls for lines to be separated by CR LF, but my robots.txt files were only using CR.
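
For reference, the sort of rules involved look roughly like this (the bot name and path are placeholders, and Crawl-delay is a non-standard directive that only some crawlers honor), with each line ending in CR LF:

    User-agent: ExampleBot
    Crawl-delay: 10
    Disallow: /forum/

    User-agent: *
    Disallow: /forum/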

bhartzer:

Google has been very clear lately (via John Mueller) regarding getting pages indexed or removed from the index.

If you want to make sure a URL is not in their index then you have to ‘allow’ them to crawl the page in robots.txt and use a noindex meta tag on the page to stop indexing. Simply disallowing the page from being crawled in robots.txt will not keep it out of the index.
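
In other words, the page has to remain crawlable so that the crawler can actually see the noindex directive. A minimal sketch, with a placeholder URL:

    # robots.txt: do not disallow the page you want removed,
    # or the crawler will never see its noindex tag
    User-agent: *
    Disallow:

    <!-- in the <head> of https://example.com/private.html -->
    <meta name="robots" content="noindex">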
