{"id":48877,"date":"2025-08-12T15:24:58","date_gmt":"2025-08-12T19:24:58","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=48877"},"modified":"2025-08-12T15:24:58","modified_gmt":"2025-08-12T19:24:58","slug":"reddit-will-block-the-internet-archive","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2025\/08\/12\/reddit-will-block-the-internet-archive\/","title":{"rendered":"Reddit Will Block the Internet Archive"},"content":{"rendered":"<p><a href=\"https:\/\/www.theverge.com\/news\/757538\/reddit-internet-archive-wayback-machine-block-limit\">Jay Peters<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/news\/757538\/reddit-internet-archive-wayback-machine-block-limit\">\n<p>Reddit says that it has caught AI companies scraping its data from the Internet Archive&rsquo;s Wayback Machine, so it&rsquo;s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/pxlnv.com\/linklog\/reddit-blocks-internet-archive\/\">Nick Heer<\/a>:<\/p>\n<blockquote cite=\"https:\/\/pxlnv.com\/linklog\/reddit-blocks-internet-archive\/\">\n<p>Unfortunately for many publishers, the Archive seems to be <a href=\"https:\/\/archive.org\/robots.txt\">perfectly happy<\/a> with scrapers and is <a href=\"https:\/\/blog.archive.org\/2023\/04\/28\/internet-archive-weighs-in-on-artificial-intelligence-at-the-copyright-office\/\">unbothered<\/a> if its collection is used to train artificial intelligence.<\/p>\n<\/blockquote>\n\n<p>Why doesn&rsquo;t the Internet Archive repeat the crawling policy of the original site? Otherwise, it essentially becomes a Napster for data laundering.<\/p>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/11\/04\/reddit-is-finally-profitable\/\">Reddit Is Finally Profitable<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/07\/25\/only-google-can-crawl-reddit\/\">Only Google Can Crawl Reddit<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/24\/ai-companies-ignoring-robots-txt\/\">AI Companies Ignoring Robots.txt<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/02\/27\/reddit-ai-training-data-and-ipo\/\">Reddit AI Training Data and IPO<\/a><\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Jay Peters: Reddit says that it has caught AI companies scraping its data from the Internet Archive&rsquo;s Wayback Machine, so it&rsquo;s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2025-08-12T19:25:00Z","apple_news_api_id":"ea069b84-cb77-48cb-a7d2-f5b21b96d27f","apple_news_api_modified_at":"2025-08-12T19:25:00Z","apple_news_api_revision":"AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/w==","apple_news_api_share_url":"https:\/\/apple.news\/A6gabhMt3SMun0vWyG5bSfw","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[1351,1127,1366,96,2612],"class_list":["post-48877","post","type-post","status-publish","format-standard","hentry","category-technology","tag-artificial-intelligence","tag-internet-archive","tag-reddit","tag-web","tag-web-crawlers"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/48877","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=48877"}],"version-history":[{"count":1,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/48877\/revisions"}],"predecessor-version":[{"id":48878,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/48877\/revisions\/48878"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=48877"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=48877"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=48877"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}