Monday, June 24, 2024

AI Companies Ignoring Robots.txt

The AI search startup Perplexity is in hot water in the wake of a Wired investigation revealing that the startup has been crawling content from websites that don’t want to be crawled.
[…]
“Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,” said Perplexity cofounder and CEO Aravind Srinivas in a phone interview Friday. “I think there is a basic misunderstanding of the way this works,” Srinivas said. “We don’t just rely on our own web crawlers, we rely on third-party web crawlers as well.”
Srinivas said the mysterious web crawler that Wired identified was not owned by Perplexity, but by a third-party provider of web crawling and indexing services. Srinivas would not say the name of the third-party provider, citing a Nondisclosure Agreement. Asked if Perplexity immediately called the third-party crawler to tell them to stop crawling Wired content, Srinivas was non-committal. “It’s complicated,” he said.
Srinivas also noted that the Robot Exclusion Protocol, which was first proposed in 1994, is “not a legal framework.” He suggested that the emergence of AI requires a new kind of working relationship between content creators, or publishers, and sites like his.

Nick Heer (Mastodon, Hacker News):

Srinivas is creating a clear difference between laws and principles because the legal implications are so far undecided, but it sure looks unethical that its service ignores the requests of publishers — no matter whether that is through first- or third-party means.

Tim Marchman:

Earlier this week, WIRED published a story about the AI-powered search startup Perplexity, which Forbes has accused of plagiarism. In it, my colleague Dhruv Mehrotra and I reported that the company was surreptitiously scraping, using crawlers to visit and download parts of websites from which developers had tried to block it, in violation of its own publicly stated policy of honoring the Robots Exclusion Protocol.
[…]
After we published the story, I prompted three leading chatbots to tell me about the story. OpenAI’s ChatGPT and Anthropic’s Claude generated text offering hypotheses about the story’s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them. (According to WIRED’s server logs, the same bot observed in our and Knight’s findings, which is almost certainly linked to Perplexity but is not in its publicly listed IP range, attempted to access the article the day it was published, but was met with a 404 response. The company doesn’t retain all its traffic logs, so this is not necessarily a complete picture of the bot’s activity, or that of other Perplexity agents.) The original story is linked at the top of the generated text, and a small gray circle links out to the original following each of the last five paragraphs. The last third of the fifth paragraph exactly reproduces a sentence from the original: “Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods.”
This struck me and my colleagues as plagiarism.

Kali Hays (via John Voorhees):

OpenAI and Anthropic have said publicly they respect robots.txt and blocks to their web crawlers.

Yet, both companies are ignoring or circumventing such blocks, BI has learned.

Katie Paul:

TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of their sites can be crawled.
“What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites,” TollBit wrote. “The more publisher logs we ingest, the more this pattern emerges.”
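For readers unfamiliar with the mechanics, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard urllib.robotparser module. The robots.txt contents, the ExampleBot and OtherBot names, and the example.com URLs are all hypothetical and not taken from any of the sites or companies discussed here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is blocked entirely,
# everyone else is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("ExampleBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/article"),
        ("OtherBot", "https://example.com/private/notes"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the server. The crawler itself has to run it and then choose to respect the answer, which is why bypassing the protocol is technically trivial.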

Update (2024-06-28): Elizabeth Lopatto:

“Someone else did it” is a fine argument for a five-year-old. And consider the response further. If Srinivas wanted to be ethical, he had some options here. Option one is to terminate the contract with the third-party scraper. Option two is to try to convince the scraper to honor robots.txt. Srinivas didn’t commit to either, and it seems to me, there’s a clear reason why. Even if Perplexity itself isn’t violating the code, it is reliant on someone else violating the code for its “answer engine” to work.

Update (2024-07-05): See also: Accidental Tech Podcast.

Update (2025-04-29): Nick Heer:

Alex Heath, of the Verge, spoke with Aravind Srinivas, CEO of Perplexity, earlier this week, and they had quite the conversation.

Many publishers have been upset with you for scraping their content. You’ve started cutting some of them checks. Do you feel like you’re in a good place with publishers now, or do you feel there’s still more work to be done?

I’m sure there’s more work to be done, but it’s in a way better place than it was last time we spoke. We are scraping but respecting robots.txt. We only use third-party data providers for anything that doesn’t allow us to scrape.

[…]

Perplexity is another careless business. It does not care if a website has specifically prohibited it from scraping; Perplexity will simply rely on a third-party scraper.

Update (2025-08-04): Cloudflare (Hacker News):

We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website’s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source ASNs to hide their crawling activity, as well as ignoring — or sometimes failing to even fetch — robots.txt files.

Update (2025-08-06): Perplexity (The Register):

Cloudflare’s recent blog post managed to get almost everything wrong about how modern AI assistants actually work. In addition to misunderstanding 20-25M user agent requests are not scrapers, Cloudflare claimed that Perplexity was engaging in “stealth crawling,” using hidden bots and impersonation tactics to bypass website restrictions. But the technical facts tell a different story.

Via John Gruber:

And nothing in Perplexity’s response attempts to explain Cloudflare’s accusation that Perplexity is adopting a false generic user-agent when their own declared user-agents are disallowed.
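To illustrate why a generic user-agent matters, the sketch below evaluates the same robots.txt against a declared crawler name and against a browser-like string. It uses Python's standard urllib.robotparser. The ExampleBot name, the robots.txt contents, and the browser string are hypothetical stand-ins, not Perplexity's actual identifiers or anything from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Allow: /
"""

# The name the (hypothetical) crawler publicly declares.
DECLARED = "ExampleBot"
# A browser-like string, assumed here purely for illustration.
GENERIC = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def verdicts(robots_txt: str, url: str, user_agents: list[str]) -> dict[str, bool]:
    """Map each user-agent string to whether robots_txt lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in user_agents}


if __name__ == "__main__":
    result = verdicts(ROBOTS_TXT, "https://example.com/article", [DECLARED, GENERIC])
    for ua, allowed in result.items():
        print(f"{ua[:20]!r}: allowed={allowed}")
```

The rule is keyed entirely on the name the client reports about itself, so a crawler that swaps its declared name for a browser-like string falls through to the wildcard group and is no longer covered by the block.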


8 Comments

Plume

June 25, 2024 2:42 AM

"Srinivas also noted that the Robot Exclusion Protocol, which was first proposed in 1994, is not a legal framework."

But the publishers own the copyright, they're not licensing their content to you, but they're telling you what you are allowed to do via their robots.txt. So how is it legal to just ignore that?

Scineram

June 25, 2024 4:40 AM

Um no. You cannot sue people for looking at your shitty public website.

Nate

June 25, 2024 3:15 PM

You’re putting your content on the public Internet. Scrapers obeying the robots.txt is just a courtesy, there’s no requirement for them to ignore completely public information.

And, just as with a book you might read, the publisher doesn’t get to tell you how you can or can’t use that information. You can’t quote the entire publication, but you can quote parts of it within the bounds of fair use. You are free, without limitation, to incorporate the information learned into your own thought process. You can summarize the information. You can use the information to inform decisions.

Plume

June 25, 2024 3:35 PM

"Scrapers obeying the robots.txt is just a courtesy, there’s no requirement for them to ignore completely public information."

This is just factually incorrect. You are not giving up your copyright by putting things on the Internet.

"You can’t quote the entire publication, but you can quote parts of it within the bounds of fair use"

Fair use is not a law, it's an interpretation of the law that a judge has to make for you.

Nate

June 25, 2024 4:27 PM

> You are not giving up your copyright by putting things on the Internet.

For example, this web site is copyrighted by Michael Tsai. It grants certain rights to the author. Despite this web site being copyrighted, it is legal for you to read the site, or even to remember or to memorize exactly what this web site said. It would be illegal for you to republish the site’s articles without Michael’s permission.

Old Unix Geek

June 25, 2024 6:28 PM

Putting materials on the internet does not magically make it public domain. Furthermore, computers are not people.

Copyright has to do with people, who by default are unable to reproduce large texts verbatim. Copyright assumes that people read texts, and integrate the knowledge they derive into their models of the world. However even if you have eidetic memory, you still can't recite the text you read exactly to an audience without violating copyright. Some exception is granted to short snippets because the probability that you came up with them yourself is quite high.

Copyright has nothing to do with machines, which by default reproduce everything verbatim. In fact it's quite difficult (witness enormous LLM training costs) to make a machine paraphrase your text correctly. Therefore, it seems pretty clear that machines should not ignore robots.txt, even if the people wearing wigs have yet to update the law.

However, if Silicon Valley continues plundering the web and other people's books, claiming "fair use", the knowledge sphere provided by these sources may degrade substantially. No one likes spending time making something only for someone else to grab it and profit from it, depriving one of the fruit of one's labors, so people will stop producing this kind of work.

Plume

June 26, 2024 3:30 AM

"it is legal for you to read the site, or even to remember or to memorize exactly what this web site said"

This does not support your claim at all, Nate. In fact, it contradicts it.

Sören

August 7, 2025 5:34 PM

Perplexity is making an interesting case (is it still “crawling” if it isn’t fetching entire swaths of the Web wholesale?), and perhaps robots.txt inadequately distinguishes there.

But whether they call it a “bot”, “agent”, or the latest buzzword doesn’t change that this isn’t a web browser interactively used by a human. It’s closer to “scraping” than crawling, but there’s still a good chance the website owner has unfavorable opinions on it, and Perplexity knows that; that’s why they hide behind stealth user agents. Which they haven’t addressed at all.
