{"id":43837,"date":"2024-06-24T15:53:28","date_gmt":"2024-06-24T19:53:28","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=43837"},"modified":"2025-08-06T14:52:33","modified_gmt":"2025-08-06T18:52:33","slug":"ai-companies-ignoring-robots-txt","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2024\/06\/24\/ai-companies-ignoring-robots-txt\/","title":{"rendered":"AI Companies Ignoring Robots.txt"},"content":{"rendered":"<p><a href=\"https:\/\/www.fastcompany.com\/91144894\/perplexity-ai-ceo-aravind-srinivas-on-plagiarism-accusations\">Mark Sullivan<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.fastcompany.com\/91144894\/perplexity-ai-ceo-aravind-srinivas-on-plagiarism-accusations\"><p>The AI search startup Perplexity is in hot water in the wake of a <em>Wired<\/em> <a href=\"https:\/\/www.wired.com\/story\/perplexity-is-a-bullshit-machine\/\">investigation<\/a> revealing that the startup has been crawling content from websites that don&rsquo;t want to be crawled.<\/p><p>[&#8230;]<\/p><p>&ldquo;Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,&rdquo; said Perplexity cofounder and CEO Aravind Srinivas in a phone interview Friday. &ldquo;I think there is a basic misunderstanding of the way this works,&rdquo; Srinivas said. &ldquo;We don&rsquo;t just rely on our own web crawlers, we rely on third-party web crawlers as well.&rdquo;<\/p><p>Srinivas said the mysterious web crawler that <em>Wired<\/em> identified was not owned by Perplexity, but by a third-party provider of web crawling and indexing services. Srinivas would not say the name of the third-party provider, citing a Nondisclosure Agreement. Asked if Perplexity immediately called the third-parter crawler to tell them to stop crawling <em>Wired<\/em> content, Srinivas was non-committal. &ldquo;It&rsquo;s complicated,&rdquo; he said.<\/p><p>Srinivas also noted that the Robot Exclusion Protocol, which was first proposed in 1994, is &ldquo;not a legal framework.&rdquo; He suggested that the emergence of AI requires a new kind of working relationship between content creators, or publishers, and sites like his.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/pxlnv.com\/linklog\/perplexity-ceo-responds\/\">Nick Heer<\/a> (<a href=\"https:\/\/c.im\/@nickheer\/112671815471138598\">Mastodon<\/a>, <a href=\"https:\/\/news.ycombinator.com\/item?id=40779108\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/pxlnv.com\/linklog\/perplexity-ceo-responds\/\">\n<p>Srinivas is creating a clear difference between laws and principles because the legal implications are so far undecided, but it sure looks unethical that its service ignores the requests of publishers &mdash; no matter whether that is through first- or third-party means.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/www.wired.com\/story\/perplexity-plagiarized-our-story-about-how-perplexity-is-a-bullshit-machine\/\">Tim Marchman<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.wired.com\/story\/perplexity-plagiarized-our-story-about-how-perplexity-is-a-bullshit-machine\/\"><p>Earlier this week, WIRED published a <a href=\"https:\/\/www.wired.com\/story\/perplexity-is-a-bullshit-machine\/\">story<\/a> about the AI-powered search startup Perplexity, which Forbes has <a href=\"https:\/\/www.forbes.com\/sites\/randalllane\/2024\/06\/11\/why-perplexitys-cynical-theft-represents-everything-that-could-go-wrong-with-ai\/\">accused<\/a> of plagiarism. In it, my colleague Dhruv Mehrotra and I reported that the company was surreptitiously scraping, using crawlers to visit and download parts of websites from which developers had tried to block it, in violation of its own publicly stated <a href=\"https:\/\/docs.perplexity.ai\/docs\/perplexitybot\">policy<\/a> of honoring the Robots Exclusion Protocol.<\/p><p>[&#8230;]<\/p><p>After we published the story, I prompted three leading chatbots to tell me about the story. <a href=\"https:\/\/www.wired.com\/tag\/chatgpt\/\">OpenAI&rsquo;s ChatGPT<\/a> and <a href=\"https:\/\/www.wired.com\/story\/six-practical-tips-for-using-anthropic-claude-chatbot\/\">Anthropic&rsquo;s Claude<\/a> generated text offering hypotheses about the story&rsquo;s subject but noted that they had no access to the article. The Perplexity chatbot produced a six-paragraph, <a href=\"https:\/\/www.perplexity.ai\/search\/perplexity-is-a-41uH2h6JT0qazoM87BO.kw\">287-word text<\/a> closely summarizing the conclusions of the story and the evidence used to reach them. (According to WIRED&rsquo;s server logs, the same bot observed in our and Knight&rsquo;s findings, which is almost certainly linked to Perplexity but is not in its publicly listed IP range, attempted to access the article the day it was published, but was met with a 404 response. The company doesn&rsquo;t retain all its traffic logs, so this is not necessarily a complete picture of the bot&rsquo;s activity, or that of other Perplexity agents.) The original story is linked at the top of the generated text, and a small gray circle links out to the original following each of the last five paragraphs. The last third of the fifth paragraph exactly reproduces a sentence from the original: &ldquo;Instead, it invented a story about a young girl named Amelia who follows a trail of glowing mushrooms in a magical forest called Whisper Woods.&rdquo;<\/p><p>This struck me and my colleagues as plagiarism.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/www.businessinsider.com\/openai-anthropic-ai-ignore-rule-scraping-web-contect-robotstxt\">Kali Hays<\/a> (via <a href=\"https:\/\/mastodon.macstories.net\/@johnvoorhees\/112657270550067067\">John Voorhees<\/a>):<\/p>\n<blockquote cite=\"https:\/\/www.businessinsider.com\/openai-anthropic-ai-ignore-rule-scraping-web-contect-robotstxt\">\n<p>OpenAI and Anthropic have said publicly they respect robots.txt and blocks to their web crawlers.<\/p>\n<p>Yet, both companies are ignoring or circumventing such blocks, BI has learned.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/www.reuters.com\/technology\/artificial-intelligence\/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21\/\">Katie Paul<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.reuters.com\/technology\/artificial-intelligence\/multiple-ai-companies-bypassing-web-standard-scrape-publisher-sites-licensing-2024-06-21\/\"><p>TollBit said its analytics indicate &ldquo;numerous&rdquo; AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.<\/p><p>&ldquo;What this means in practical terms is that AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites,&rdquo; TollBit wrote. &ldquo;The more publisher logs we ingest, the more this pattern emerges.&rdquo;<\/p><\/blockquote>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/19\/apple-intelligence-training\/\">Apple Intelligence Training<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/28\/the-new-york-times-sues-openai\/\">The New York Times Sues OpenAI<\/a><\/li>\n<\/ul>\n\n<p id=\"ai-companies-ignoring-robots-txt-update-2024-06-28\">Update (2024-06-28): <a href=\"https:\/\/www.theverge.com\/2024\/6\/27\/24187405\/perplexity-ai-twitter-lie-plagiarism\">Elizabeth Lopatto<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/2024\/6\/27\/24187405\/perplexity-ai-twitter-lie-plagiarism\">\n<p>&ldquo;Someone else did it&rdquo; is a fine argument for a five-year-old. And consider the response further. If Srinivas wanted to be ethical, he had some options here. Option one is to terminate the contract with the third-party scraper. Option two is to try to convince the scraper to honor robots.txt. Srinivas didn&rsquo;t commit to either, and it seems to me, there&rsquo;s a clear reason why. Even if Perplexity itself isn&rsquo;t violating the code, it is reliant on someone else violating the code for its &ldquo;answer engine&rdquo; to work.<\/p>\n<\/blockquote>\n\n<p id=\"ai-companies-ignoring-robots-txt-update-2024-07-05\">Update (2024-07-05): See also: <a href=\"https:\/\/atp.fm\/594\">Accidental Tech Podcast<\/a>.<\/p>\n\n<p id=\"ai-companies-ignoring-robots-txt-update-2025-04-29\">Update (<a href=\"#ai-companies-ignoring-robots-txt-update-2025-04-29\">2025-04-29<\/a>): <a href=\"https:\/\/pxlnv.com\/blog\/carelessness-of-perplexity\/\">Nick Heer<\/a>:<\/p>\n<blockquote cite=\"https:\/\/pxlnv.com\/blog\/carelessness-of-perplexity\/\">\n<p><a href=\"https:\/\/www.theverge.com\/command-line-newsletter\/656599\/perplexitys-ceo-on-fighting-google-and-the-coming-ai-browser-war\">Alex Heath<\/a>, of the <em>Verge<\/em>, spoke with Aravind Srinivas, CEO of Perplexity, earlier this week, and they had quite the conversation.<\/p>\n\n<blockquote>\n  <p><strong>Many publishers have been upset with you for scraping their content. You&rsquo;ve started cutting some of them checks. Do you feel like you&rsquo;re in a good place with publishers now, or do you feel there&rsquo;s still more work to be done?<\/strong><\/p>\n  \n  <p>I&rsquo;m sure there&rsquo;s more work to be done, but it&rsquo;s in a way better place than it was last time we spoke. We are scraping but respecting robots.txt. We only use third-party data providers for anything that doesn&rsquo;t allow us to scrape.<\/p><\/blockquote>\n<p>[&#8230;]<\/p>\n<p>Perplexity is another careless business. It does not care if a website has specifically prohibited it from scraping; Perplexity will simply rely on a third-party scraper.<\/p>\n<\/blockquote>\n\n<p id=\"ai-companies-ignoring-robots-txt-update-2025-08-04\">Update (<a href=\"#ai-companies-ignoring-robots-txt-update-2025-08-04\">2025-08-04<\/a>): <a href=\"https:\/\/blog.cloudflare.com\/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives\/\">Cloudflare<\/a> (<a href=\"https:\/\/news.ycombinator.com\/item?id=44785636\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/blog.cloudflare.com\/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives\/\">\n<p>We are observing stealth crawling behavior from Perplexity, an AI-powered answer engine. Although Perplexity initially crawls from their declared user agent, when they are presented with a network block, they appear to obscure their crawling identity in an attempt to circumvent the website&rsquo;s preferences. We see continued evidence that Perplexity is repeatedly modifying their user agent and changing their source <a href=\"https:\/\/www.cloudflare.com\/learning\/network-layer\/what-is-an-autonomous-system\/\"><u>ASNs<\/u><\/a> to hide their crawling activity, as well as ignoring &mdash; or sometimes failing to even fetch &mdash; <a href=\"https:\/\/www.cloudflare.com\/learning\/bots\/what-is-robots-txt\/\"><u>robots.txt<\/u> <\/a>files.<\/p>\n<\/blockquote>\n\n<p id=\"ai-companies-ignoring-robots-txt-update-2025-08-06\">Update (<a href=\"#ai-companies-ignoring-robots-txt-update-2025-08-06\">2025-08-06<\/a>): <a href=\"https:\/\/www.perplexity.ai\/hub\/blog\/agents-or-bots-making-sense-of-ai-on-the-open-web\">Perplexity<\/a> (<a href=\"https:\/\/www.theregister.com\/2025\/08\/05\/perplexity_vexed_by_cloudflares_claims\/\">The Register<\/a>):<\/p>\n<blockquote cite=\"https:\/\/www.perplexity.ai\/hub\/blog\/agents-or-bots-making-sense-of-ai-on-the-open-web\"><p>Cloudflare&rsquo;s recent blog post managed to get almost everything wrong about how modern AI assistants actually work.\nIn addition to misunderstanding 20-25M user agent requests are not scrapers, Cloudflare claimed that Perplexity was engaging in &ldquo;stealth crawling,&rdquo; using hidden bots and impersonation tactics to bypass website restrictions. But the technical facts tell a different story.<\/p><\/blockquote>\n\n<p>Via <a href=\"https:\/\/daringfireball.net\/linked\/2025\/08\/05\/cloudflare-perplexity\">John Gruber<\/a>:<\/p>\n<blockquote cite=\"https:\/\/daringfireball.net\/linked\/2025\/08\/05\/cloudflare-perplexity\"><p>And nothing in Perplexity&rsquo;s response attempts to explain Cloudflare&rsquo;s accusation that Perplexity is adopting a false generic user-agent when their own declared user-agents are disallowed.<\/p><\/blockquote>","protected":false},"excerpt":{"rendered":"<p>Mark Sullivan: The AI search startup Perplexity is in hot water in the wake of a Wired investigation revealing that the startup has been crawling content from websites that don&rsquo;t want to be crawled.[&#8230;]&ldquo;Perplexity is not ignoring the Robot Exclusions Protocol and then lying about it,&rdquo; said Perplexity cofounder and CEO Aravind Srinivas in a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2024-06-24T19:53:31Z","apple_news_api_id":"4d5c3610-ddc3-4c09-aec4-0dd38e69f554","apple_news_api_modified_at":"2025-08-06T18:52:36Z","apple_news_api_revision":"AAAAAAAAAAAAAAAAAAAABQ==","apple_news_api_share_url":"https:\/\/apple.news\/ATVw2EN3DTAmuxA3Tjmn1VA","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[2615,1351,167,2361,2613,96,2612],"class_list":["post-43837","post","type-post","status-publish","format-standard","hentry","category-technology","tag-anthropic","tag-artificial-intelligence","tag-copyright","tag-openai","tag-perplexity","tag-web","tag-web-crawlers"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43837","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=43837"}],"version-history":[{"count":7,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43837\/revisions"}],"predecessor-version":[{"id":48795,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43837\/revisions\/48795"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=43837"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=43837"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=43837"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}