Wednesday, April 26, 2017

The Internet Archive and Robots.txt

Mark Graham:

Over time we have observed that the robots.txt files that are geared toward search engine crawlers do not necessarily serve our archival purposes. Internet Archive’s goal is to create complete “snapshots” of web pages, including the duplicate content and the large versions of files. We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine. In other words, a site goes out of business and then the parked domain is “blocked” from search engines and no one can look at the history of that site in the Wayback Machine anymore.

[…]

We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.

Nick Heer:

I get where Graham is coming from here. The Internet Archive is supposed to be a snapshot of the web as it was at any given time, and if a robots.txt file prevents them from capturing a page or a section of a website that would normally be visible to a user, that impairs their mission.

But, much as I love the Internet Archive, I think Summers’ criticism is entirely valid: ignoring robots.txt files would violate website publishers’ wishes.
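For readers unfamiliar with the mechanics being debated: a conventional, standards-following crawler checks the robots.txt file a domain serves right now before fetching (or, under the Wayback Machine's old policy, replaying) a URL, which is why a parked domain's blanket "Disallow: /" can retroactively hide pages that were public when they were captured. Here is a minimal sketch using Python's standard urllib.robotparser; the example.com URLs are placeholders and this is an illustration of the convention, not the Internet Archive's actual crawler code.

    # Sketch of the robots.txt convention at issue (illustrative only).
    from urllib import robotparser

    parser = robotparser.RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # fetch and parse whatever robots.txt the domain serves today

    # A crawler that honors the file skips any path the *current* owner disallows,
    # even if the page was publicly visible when it was originally archived.
    # "ia_archiver" is the user-agent string historically associated with the
    # Wayback Machine's crawler.
    if parser.can_fetch("ia_archiver", "https://example.com/some-old-page"):
        print("allowed: fetch or replay the page")
    else:
        print("disallowed: today's robots.txt hides this URL")

Graham's proposal amounts to no longer running this check (at least not against the present-day robots.txt) when deciding what the Wayback Machine may display.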
