Monday, June 24, 2024

AI Companies Ignoring Robots.txt

Mark Sullivan:

The AI search startup Perplexity is in hot water in the wake of a Wired investigation revealing that the startup has been crawling content from websites that don’t want to be crawled.

[…]

“Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,” said Perplexity cofounder and CEO Aravind Srinivas in a phone interview Friday. “I think there is a basic misunderstanding of the way this works,” Srinivas said. “We don’t just rely on our own web crawlers, we rely on third-party web crawlers as well.”

Srinivas said the mysterious web crawler that Wired identified was not owned by Perplexity, but by a third-party provider of web crawling and indexing services. Srinivas would not say the name of the third-party provider, citing a Nondisclosure Agreement. Asked if Perplexity immediately called the third-party crawler to tell them to stop crawling Wired content, Srinivas was non-committal. “It’s complicated,” he said.

Srinivas also noted that the Robot Exclusion Protocol, which was first proposed in 1994, is “not a legal framework.” He suggested that the emergence of AI requires a new kind of working relationship between content creators, or publishers, and sites like his.

Nick Heer (Mastodon, Hacker News):

Srinivas is creating a clear difference between laws and principles because the legal implications are so far undecided, but it sure looks unethical that its service ignores the requests of publishers — no matter whether that is through first- or third-party means.

Tim Marchman:

Earlier this week, WIRED published a story about the AI-powered search startup Perplexity, which Forbes has accused of plagiarism. In it, my colleague Dhruv Mehrotra and I reported that the company was surreptitiously scraping, using crawlers to visit and download parts of websites from which developers had tried to block it, in violation of its own publicly stated policy of honoring the Robots Exclusion Protocol.

[…]

After we published the story, I prompted three leading chatbots to tell me about the story. OpenAI’s ChatGPT and Anthropic’s Claude generated text offering hypotheses about the story’s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, 287-word text closely summarizing the conclusions of the story and the evidence used to reach them. (According to WIRED’s server logs, the same bot observed in our and Knight’s findings, which is almost certainly linked to Perplexity but is not in its publicly listed IP range, attempted to access the article the day it was published, but was met with a 404 response. The company doesn’t retain all its traffic logs, so this is not necessarily a complete picture of the bot’s activity, or that of other Perplexity agents.) The original story is linked at the top of the generated text, and a small gray circle links out to the original following each of the last five paragraphs. The last third of the fifth paragraph exactly reproduces a sentence from the original: “Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods.”

This struck me and my colleagues as plagiarism.

Kali Hays (via John Voorhees):

OpenAI and Anthropic have said publicly they respect robots.txt and blocks to their web crawlers.

Yet, both companies are ignoring or circumventing such blocks, BI has learned.

Katie Paul:

TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.

“What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites,” TollBit wrote. “The more publisher logs we ingest, the more this pattern emerges.”
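
For readers who haven't looked at the protocol itself: robots.txt is a plain-text file of per-user-agent rules that a crawler is expected to check before fetching a URL, and nothing in the protocol enforces the answer. Here is a minimal sketch of that check using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` and `SomeOtherBot` names, the `example.com` URLs, and the `is_allowed` helper are all hypothetical, made up for illustration and not taken from any site or crawler mentioned above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative publisher (not a real site's file).
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A well-behaved crawler would run this check and skip disallowed URLs.
    for agent in ("ExampleAIBot", "SomeOtherBot"):
        for url in ("https://example.com/story", "https://example.com/private/x"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

The check runs entirely on the crawler's side. A crawler that skips it, or that identifies itself with a different user-agent string, is not stopped by the file, which is why the reports quoted here rely on server logs rather than on robots.txt itself.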

Update (2024-06-28): Elizabeth Lopatto:

“Someone else did it” is a fine argument for a five-year-old. And consider the response further. If Srinivas wanted to be ethical, he had some options here. Option one is to terminate the contract with the third-party scraper. Option two is to try to convince the scraper to honor robots.txt. Srinivas didn’t commit to either, and it seems to me, there’s a clear reason why. Even if Perplexity itself isn’t violating the code, it is reliant on someone else violating the code for its “answer engine” to work.

Update (2024-07-05): See also: Accidental Tech Podcast.

7 Comments


"Srinivas also noted that the Robot Exclusion Protocol, which was first proposed in 1994, is not a legal framework."

But the publishers own the copyright, they're not licensing their content to you, but they're telling you what you are allowed to do via their robots.txt. So how is it legal to just ignore that?


Um no. You cannot sue people for looking at your shitty public website.


You’re putting your content on the public Internet. Scrapers obeying the robots.txt is just a courtesy, there’s no requirement for them to ignore completely public information.

And, just as with a book you might read, the publisher doesn’t get to tell you how you can or can’t use that information. You can’t quote the entire publication, but you can quote parts of it within the bounds of fair use. You are free, without limitation, to incorporate the information learned into your own thought process. You can summarize the information. You can use the information to inform decisions.


"Scrapers obeying the robots.txt is just a courtesy, there’s no requirement for them to ignore completely public information."

This is just factually incorrect. You are not giving up your copyright by putting things on the Internet.

"You can’t quote the entire publication, but you can quote parts of it within the bounds of fair use"

Fair use is not a law, it's an interpretation of the law that a judge has to make for you.


> You are not giving up your copyright by putting things on the Internet.

For example, this web site is copyright by Michael Tsai. It grants certain rights to the author. Despite this web site being copyrighted, it is legal for you to read the site, or even to remember or to memorize exactly what this web site said. It would be illegal for you to republish the site’s articles without Michael’s permission.


Old Unix Geek

Putting materials on the internet does not magically make it public domain. Furthermore, computers are not people.

Copyright has to do with people, who by default are unable to reproduce large texts verbatim. Copyright assumes that people read texts, and integrate the knowledge they derive into their models of the world. However even if you have eidetic memory, you still can't recite the text you read exactly to an audience without violating copyright. Some exception is granted to short snippets because the probability that you came up with them yourself is quite high.

Copyright has nothing to do with machines, which by default reproduce everything verbatim. In fact it's quite difficult (witness enormous LLM training costs) to make a machine paraphrase your text correctly. Therefore, it seems pretty clear that machines should not ignore robots.txt, even if the people wearing wigs have yet to update the law.

However if Silicon Valley continues plundering the web and other people's books, claiming "fair use", the knowledge sphere provided by these sources may degrade substantially. No one likes spending time making something only for someone else to grab it and profit from it, depriving one from the fruit of one's labors, so people will stop producing this kind of work.


"it is legal for you to read the site, or even to remember or to memorize exactly what this web site said"

This does not support your claim at all, Nate. In fact, it contradicts it.
