Tuesday, August 12, 2025

Reddit Will Block the Internet Archive

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.

Nick Heer:

Unfortunately for many publishers, the Archive seems to be perfectly happy with scrapers and is unbothered if its collection is used to train artificial intelligence.

Why doesn’t the Internet Archive repeat the crawling policy of the original site? Otherwise, it essentially becomes a Napster for data laundering.
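To make the question concrete, here is a rough sketch of what "repeating the original site's policy" could look like for robots.txt rules only. It assumes archived pages live at paths shaped like `/web/<timestamp>/<original-url>`. The function names and the `ExampleAIBot` user agent are hypothetical, and none of this reflects how the Internet Archive actually works. It uses Python's standard `urllib.robotparser` and takes the origin's robots.txt as a string, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser


def original_url(archive_path: str) -> str:
    """Pull the original URL out of an archive path.

    Assumes a layout like /web/<timestamp>/<original-url>; this layout is an
    assumption made for the sketch.
    """
    prefix, _, rest = archive_path.lstrip("/").partition("/")
    if prefix != "web":
        raise ValueError("not an archived-page path")
    _timestamp, _, url = rest.partition("/")
    if not url:
        raise ValueError("no original URL in path")
    return url


def allowed_by_origin(robots_txt: str, user_agent: str, archive_path: str) -> bool:
    """Return True if the origin's robots.txt would let user_agent fetch the
    page that archive_path is a copy of."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, original_url(archive_path))


# Hypothetical origin policy: block one named bot, allow everyone else.
ORIGIN_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

path = "/web/20250812000000/https://example.com/r/test/comments/abc"
blocked = allowed_by_origin(ORIGIN_ROBOTS, "ExampleAIBot", path)
permitted = allowed_by_origin(ORIGIN_ROBOTS, "SomeOtherBot", path)
```

A check like this only covers what the origin publishes in robots.txt. It would have to run per request on the archive's side, and it cannot reproduce any server-side blocking the origin applies.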


Artificial Intelligence Internet Archive Reddit Web Web Crawlers

5 Comments RSS · Twitter · Mastodon

bart

August 12, 2025 5:27 PM

I wondered the same thing about their crawling policy. They really don't care? They just have money to burn letting AI crawlers light up their servers?

Also, Huffman's reign of terror/path of destruction continues.

Will Richardson

August 12, 2025 8:35 PM

> repeat the crawling policy of the original site

I don't think this is possible. robots.txt is site-wide, so you'd have to generate a humongous file that contains all the policies of every archived website. And since crawlers may disregard robots.txt, you'd somehow have to know what server-side blocking rules the source website would enforce (they might be blocking IP ranges, etc.), and this is unknowable by IA.

Léo Natan

August 12, 2025 9:51 PM

If the Internet Archive were on the side of its readers, it would block scrapers altogether, so that it stays on the good side of every content site. But I fear they care more about ad/“AI” money.

Michael Tsai

August 12, 2025 10:00 PM

@Will Yeah, perhaps robots.txt is too limited, but they could at least list major sites or use subdomains or just opt out for everything. I don’t see what the point is of default-allowing crawling, especially for sites that still exist and whose cooperation they want. Unless they are purposely trying to encourage the data to be copied far and wide.

Manx

August 12, 2025 10:18 PM

I donate monthly to the Internet Archive, and I think these are great questions. There's no reason the Internet Archive should enable data laundering or change the scraping policy of the original site. I'm going to contact them about this, and I suggest other people do the same.
