{"id":44219,"date":"2024-07-25T14:30:44","date_gmt":"2024-07-25T18:30:44","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=44219"},"modified":"2024-08-08T10:08:10","modified_gmt":"2024-08-08T14:08:10","slug":"only-google-can-crawl-reddit","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2024\/07\/25\/only-google-can-crawl-reddit\/","title":{"rendered":"Only Google Can Crawl Reddit"},"content":{"rendered":"<p><a href=\"https:\/\/www.404media.co\/google-is-the-only-search-engine-that-works-on-reddit-now-thanks-to-ai-deal\/\">Emanuel Maiberg<\/a> (<a href=\"https:\/\/news.ycombinator.com\/item?id=41057033\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/www.404media.co\/google-is-the-only-search-engine-that-works-on-reddit-now-thanks-to-ai-deal\/\">\n<p>Google is now the only search engine that can surface results from Reddit, making one of the web&rsquo;s most valuable repositories of user generated content exclusive to the internet&rsquo;s already dominant search engine.\nIf you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn&rsquo;t rely on Google&rsquo;s indexing and search Reddit by using &ldquo;site:reddit.com,&rdquo; you will not see any results from the last week.<\/p>\n<p>DuckDuckGo is currently turning up seven links when searching Reddit, but provides no data on where the links go or why, instead only saying that &ldquo;We would like to show you a description here but the site won't allow us.&rdquo; Older results will still show up, but these search engines are no longer able to &ldquo;crawl&rdquo; Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward. Searching for Reddit still works on Kagi, an independent, paid search engine that buys part of its search index from Google.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/simonwillison.net\/2024\/Jul\/24\/google-reddit\/\">Simon Willison<\/a>:<\/p>\n<blockquote cite=\"https:\/\/simonwillison.net\/2024\/Jul\/24\/google-reddit\/\"><p>Is this a direct result of Google&rsquo;s deal to license Reddit content for AI training, rumored <a href=\"https:\/\/www.reuters.com\/technology\/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22\/\">at $60 million<\/a>? That&rsquo;s not been confirmed but it looks likely, especially since accessing that <code>robots.txt<\/code> using the <a href=\"https:\/\/search.google.com\/test\/rich-results\">Google Rich Results testing tool<\/a> (hence proxied via their IP) appears to return a different file, via <a href=\"https:\/\/news.ycombinator.com\/item?id=41057033#41058375\">this comment<\/a>, <a href=\"https:\/\/gist.github.com\/simonw\/be0e8e595178207b1b3dce3b81eacfb3\">my copy here<\/a>.<\/p><\/blockquote>\n\n<p>As he says, this is depressing.<\/p>\n\n<p><a href=\"https:\/\/mas.to\/@carnage4life\/112845212291815664\">Dare Obasanjo<\/a>:<\/p>\n<blockquote cite=\"https:\/\/mas.to\/@carnage4life\/112845212291815664\">\n<p>The pay-to-play internet is here. [&#8230;] This pretty much kills any chance of disrupting Google with AI as they can outspend everyone on content exclusivity.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/infosec.exchange\/@skarra\/112845281046711009\">Sriram Karra<\/a>:<\/p>\n<blockquote cite=\"https:\/\/infosec.exchange\/@skarra\/112845281046711009\"><p>&ldquo;Pay to play&rdquo; arrived  years ago&#8230; Just that folks were not paying attention..<\/p><p>Microsoft did this with GitHub. You haven&rsquo;t been able to find any GitHub responses in Google searches for years.<\/p><\/blockquote>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/07\/25\/searchgpt\/\">SearchGPT<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/02\/29\/tumblr-and-wordpress-to-sell-users-data-to-train-ai-tools\/\">Tumblr and WordPress to Sell Users&rsquo; Data to Train AI Tools<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/02\/27\/reddit-ai-training-data-and-ipo\/\">Reddit AI Training Data and IPO<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/11\/googles-gemini\/\">Google&rsquo;s Gemini<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/06\/14\/reddit-api-ama-and-user-revolt\/\">Reddit API AMA and User Revolt<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/06\/01\/reddit-to-charge-for-api\/\">Reddit to Charge for API<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2022\/02\/17\/google-search-is-dying\/\">Google Search Is Dying<\/a><\/li>\n<\/ul>\n\n<p id=\"only-google-can-crawl-reddit-update-2024-08-08\">Update (2024-08-08): <a href=\"https:\/\/pxlnv.com\/blog\/reddit-google-pairing\/\">Nick Heer<\/a>:<\/p>\n<blockquote cite=\"https:\/\/pxlnv.com\/blog\/reddit-google-pairing\/\">\n<p>It is unclear to me whether this is a deal only available to Google, or if it is open to any search engine that wants to pay. Even if it was intended to be exclusive, I have a feeling it <a href=\"https:\/\/www.bbc.com\/news\/articles\/c0k44x6mge3o\">might not be for much longer<\/a>. But it seems like something Reddit would only <em>care<\/em> about doing with Google because other search engines basically do not matter in the <a href=\"https:\/\/gs.statcounter.com\/search-engine-market-share\/all\/united-states-of-america\/#quarterly-201403-202403\">United States<\/a> or <a href=\"https:\/\/gs.statcounter.com\/search-engine-market-share#quarterly-201403-202403\">worldwide<\/a>.<sup id=\"fnref:1\"><a href=\"#fn:1\" rel=\"footnote\">1<\/a><\/sup> What amount of money do you think Microsoft would need to pay for Bing to be the sole permitted crawler of Reddit in exchange for traffic from its measly market share? I bet it is a <em>lot<\/em> more than $60 million.<\/p>\n\n<p>Maybe that is one reason this agreement feels uncomfortable to me. Search engines are marketed as finding results across the entire web but, of course, that is not true: they most often obey rules declared in robots.txt files, but they also <a href=\"https:\/\/www.vincentschmalbach.com\/google-now-defaults-to-not-indexing-your-content\/\">do not necessarily<\/a> index everything they are able to, either. These are not explicit limitations. Yet it feels like it violates the premise of a search engine to say that it will be allowed to crawl and link to other webpages. The whole thing about the web is that the links are free. There is no guarantee the actual page will be freely accessible, but the link itself is not restricted. It is the central problem with <a href=\"https:\/\/pxlnv.com\/linklog\/bill-c-18-link-tax\/\">link tax laws<\/a>, and this pay-to-index scheme is similarly restrictive.<\/p>\n<p>[&#8230;]<\/p>\n<p>The government attorneys said Bing is required to pay for structured data owing to its smaller size, while Google is able to obtain structured data for free because it sends partners so much traffic. The judge ultimately rejected their argument Microsoft struggled to sign these agreements or it was impeded in doing so, but did not dispute the difference in negotiating power between the two companies.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/www.404media.co\/microsoft-and-reddit-are-fighting-about-why-bings-crawler-is-blocked-on-reddit\/\">Emanuel Maiberg<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.404media.co\/microsoft-and-reddit-are-fighting-about-why-bings-crawler-is-blocked-on-reddit\/\">\n<p>Microsoft and Reddit are offering conflicting explanations for why Microsoft&rsquo;s search engine, Bing, is currently blocked from crawling Reddit and offering links from the site in its search results.<\/p>\n<p>Reddit, which now demands payment from anyone crawling the site and using its data to train AI products, claims that Bing&rsquo;s crawler is being used to power AI products. Microsoft claims it has made it easy for any site to block its crawler that&rsquo;s used for AI products, while still allowing a crawler that is only used for search results, and that Reddit&rsquo;s decision to block Bing is &ldquo;impacting competition&rdquo; in the search engine space.<\/p>\n<p>The conflicting reasonings behind the block are further proof that the massive, indiscriminate scraping of the internet to create AI training data in a way that violates long-respected norms about how to access information on the web are eroding trust, making the internet less open, and causing tech companies to beef about this issue in public.<\/p>\n<\/blockquote>\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/08\/06\/google-search-and-ads-monopoly\/\">Google Search and Ads Monopoly<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/24\/ai-companies-ignoring-robots-txt\/\">AI Companies Ignoring Robots.txt<\/a><\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Emanuel Maiberg (Hacker News): Google is now the only search engine that can surface results from Reddit, making one of the web&rsquo;s most valuable repositories of user generated content exclusive to the internet&rsquo;s already dominant search engine. If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn&rsquo;t rely on Google&rsquo;s [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2024-07-25T18:30:47Z","apple_news_api_id":"22501e93-dbb1-4b00-88ab-25f5ab39d4be","apple_news_api_modified_at":"2024-08-08T14:08:13Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAABA==","apple_news_api_share_url":"https:\/\/apple.news\/AIlAek9uxSwCIqyX1qznUvg","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[1351,313,524,2286,1366,343,96,2612],"class_list":["post-44219","post","type-post","status-publish","format-standard","hentry","category-technology","tag-artificial-intelligence","tag-bing","tag-github","tag-google-search","tag-reddit","tag-search","tag-web","tag-web-crawlers"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/44219","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=44219"}],"version-history":[{"count":6,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/44219\/revisions"}],"predecessor-version":[{"id":44401,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/44219\/revisions\/44401"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=44219"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=44219"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=44219"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}