Only Google Can Crawl Reddit
Emanuel Maiberg (Hacker News):
Google is now the only search engine that can surface results from Reddit, making one of the web’s most valuable repositories of user generated content exclusive to the internet’s already dominant search engine. If you use Bing, DuckDuckGo, Mojeek, Qwant or any other alternative search engine that doesn’t rely on Google’s indexing and search Reddit by using “site:reddit.com,” you will not see any results from the last week.
DuckDuckGo is currently turning up seven links when searching Reddit, but provides no data on where the links go or why, instead only saying that “We would like to show you a description here but the site won't allow us.” Older results will still show up, but these search engines are no longer able to “crawl” Reddit, meaning that Google is the only search engine that will turn up results from Reddit going forward. Searching for Reddit still works on Kagi, an independent, paid search engine that buys part of its search index from Google.
Is this a direct result of Google’s deal to license Reddit content for AI training, rumored at $60 million? That’s not been confirmed but it looks likely, especially since accessing that robots.txt using the Google Rich Results testing tool (hence proxied via their IP) appears to return a different file, via this comment, my copy here.
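To make the mechanism concrete, here is a minimal sketch of how a rule-following crawler would react if a server handed out different robots.txt bodies to different requesters. Both file bodies, the URL, the agent name, and the `can_crawl` helper are illustrative assumptions, not copies of anything Reddit actually serves; only Python's standard `urllib.robotparser` is real.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies, for illustration only.
ROBOTS_FOR_MOST_CLIENTS = "User-agent: *\nDisallow: /\n"
ROBOTS_FOR_FAVORED_CLIENT = "User-agent: *\nAllow: /\n"

# Hypothetical page a crawler might want to fetch.
URL = "https://www.example.com/r/some_community/"


def can_crawl(robots_text, user_agent, url):
    """Return True if the given robots.txt body permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# The same crawler reaches opposite conclusions depending on which file it was served.
print(can_crawl(ROBOTS_FOR_MOST_CLIENTS, "ExampleBot", URL))    # False
print(can_crawl(ROBOTS_FOR_FAVORED_CLIENT, "ExampleBot", URL))  # True
```

The point of the sketch is that a crawler only ever sees the file it is given, so a blocked search engine has no way to tell from robots.txt alone that another requester was served more permissive rules.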
As he says, this is depressing.
The pay-to-play internet is here. […] This pretty much kills any chance of disrupting Google with AI as they can outspend everyone on content exclusivity.
“Pay to play” arrived years ago… It’s just that folks were not paying attention.
Microsoft did this with GitHub. You haven’t been able to find any GitHub responses in Google searches for years.
Previously:
- SearchGPT
- Tumblr and WordPress to Sell Users’ Data to Train AI Tools
- Reddit AI Training Data and IPO
- Google’s Gemini
- Reddit API AMA and User Revolt
- Reddit to Charge for API
- Google Search Is Dying
Update (2024-08-08): Nick Heer:
It is unclear to me whether this is a deal only available to Google, or if it is open to any search engine that wants to pay. Even if it was intended to be exclusive, I have a feeling it might not be for much longer. But it seems like something Reddit would only care about doing with Google because other search engines basically do not matter in the United States or worldwide. What amount of money do you think Microsoft would need to pay for Bing to be the sole permitted crawler of Reddit in exchange for traffic from its measly market share? I bet it is a lot more than $60 million.
Maybe that is one reason this agreement feels uncomfortable to me. Search engines are marketed as finding results across the entire web but, of course, that is not true: they most often obey rules declared in robots.txt files, but they also do not necessarily index everything they are able to, either. These are not explicit limitations. Yet it feels like it violates the premise of a search engine to say that it will be allowed to crawl and link to other webpages. The whole thing about the web is that the links are free. There is no guarantee the actual page will be freely accessible, but the link itself is not restricted. It is the central problem with link tax laws, and this pay-to-index scheme is similarly restrictive.
[…]
The government attorneys said Bing is required to pay for structured data owing to its smaller size, while Google is able to obtain structured data for free because it sends partners so much traffic. The judge ultimately rejected their argument that Microsoft struggled to sign these agreements or that it was impeded in doing so, but did not dispute the difference in negotiating power between the two companies.
Microsoft and Reddit are offering conflicting explanations for why Microsoft’s search engine, Bing, is currently blocked from crawling Reddit and offering links from the site in its search results.
Reddit, which now demands payment from anyone crawling the site and using its data to train AI products, claims that Bing’s crawler is being used to power AI products. Microsoft claims it has made it easy for any site to block its crawler that’s used for AI products, while still allowing a crawler that is only used for search results, and that Reddit’s decision to block Bing is “impacting competition” in the search engine space.
The conflicting reasonings behind the block are further proof that the massive, indiscriminate scraping of the internet to create AI training data in a way that violates long-respected norms about how to access information on the web is eroding trust, making the internet less open, and causing tech companies to beef about this issue in public.
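The distinction Microsoft describes, blocking a crawler used for AI while still admitting one used for search, is the kind of thing robots.txt can express with per-agent rules. A rough sketch follows, assuming two made-up crawler names (`ExampleSearchBot` and `ExampleAIBot`) and a hypothetical `is_allowed` helper; these are not Microsoft's or Reddit's actual user agents or rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: admit a search crawler, refuse an AI-training crawler.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: ExampleAIBot
Disallow: /
"""

# Hypothetical page used for the checks below.
PAGE = "https://www.example.com/r/some_community/"


def is_allowed(user_agent, url=PAGE):
    """Check a user agent against the per-agent rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("ExampleSearchBot"))  # True
print(is_allowed("ExampleAIBot"))      # False
print(is_allowed("SomeOtherBot"))      # True: no rule names it and there is no "*" group
```

This only works when the search and AI crawlers identify themselves differently and when crawlers choose to honor the file, which is exactly the point the two companies disagree on.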
7 Comments
> Microsoft did this with GitHub. You haven’t been able to find any GitHub responses in Google searches for years.
Holy crap this is true! I just tried this and it's absolutely right! I have a very unique software repo name (made relatively recently) with some traffic that should absolutely be indexed. When I look it up with or without quotes, even with my handle, it doesn't show up! I get older but less relevant results, with nothing newer than 2021. This is crazy; I had no idea! It works with some of my colleagues' repos as well!
Is there a way to permanently hide the Reddit results? Those are one reason of many which make me want to switch my search engine.
Somebody needs to tell the EU about this.
"Is there a way to permanently hide the Reddit results?"
There are some queries where only Reddit provides useful, reasonably trustworthy answers. Usually things that are somewhat illegal.
I used to hate reddit. Then I got into niche hobbies, or found niche communities in bigger hobbies, and now I love the parts of reddit that I frequent.
I hate the official book reddit. I hate the biggest boardgaming reddit. The issue isn't the ugly but that things just droooown in thousands of posts.
It's ironic that what started as a hack to get a semblance of quality from Google (just add reddit to your search) is now money in someone's pocket PLUS a cause for ridicule of Google's "ai" search.
The reason why this is a big deal is that normal search results are worthless now. I don't know what they did to the algorithm, but they are useless. Even worse, they just repeat after page 3 or something.
But if you add "reddit" to the end of your searches, you actually get useful answers to your query, because you skip the bad algorithm and just are searching on a single site.
So this sucks for trying to continue to find good answers (and I'm not going back to Google to reward this behavior - I haven't used them for 10 years).
Honestly, places should just start ignoring robots.txt if it's going to be used in this fashion.
As I wrote in another forum, Reddit is playing a dangerous game.
They are assuming that this will force users to switch their preferred search engine to Google in order to access Reddit content. But most people who are not using Google switched away because they don't like Google for a wide variety of reasons and may decide that if Reddit is going to take sides, they're taking the wrong side.
Reddit may quickly find themselves losing a lot of readers due to their pages no longer appearing in searches. That's not good for any web site. Google won't care, but Reddit should.