{"id":43916,"date":"2024-07-01T17:21:34","date_gmt":"2024-07-01T21:21:34","guid":{"rendered":"https:\/\/mjtsai.com\/blog\/?p=43916"},"modified":"2024-07-01T17:21:34","modified_gmt":"2024-07-01T21:21:34","slug":"microsofts-suleyman-on-ai-scraping","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2024\/07\/01\/microsofts-suleyman-on-ai-scraping\/","title":{"rendered":"Microsoft&rsquo;s Suleyman on AI Scraping"},"content":{"rendered":"<p><a href=\"https:\/\/www.theregister.com\/2024\/06\/28\/microsoft_ceo_ai\/\">Thomas Claburn<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.theregister.com\/2024\/06\/28\/microsoft_ceo_ai\/\"><p>Mustafa Suleyman, the CEO of Microsoft AI, said this week that machine-learning companies can scrape most content published online and use it to train neural networks because it&rsquo;s essentially &ldquo;freeware.&rdquo;<\/p><p>Shortly afterwards the Center for Investigative Reporting <a href=\"https:\/\/revealnews.org\/press\/cir-sues-openai\/\">sued OpenAI<\/a> and its largest investor Microsoft &ldquo;for using the nonprofit news organization&rsquo;s content without permission or offering compensation.&rdquo;<\/p><p>[&#8230;]<\/p><p>Asked in <a href=\"https:\/\/youtu.be\/lPvqvt55l3A?feature=shared&amp;t=872\">an interview<\/a> with CNBC&rsquo;s Andrew Ross Sorkin at the Aspen Ideas Festival whether AI companies have effectively stolen the world&rsquo;s intellectual property, Suleyman acknowledged the controversy and attempted to draw a distinction between content people put online and content backed by corporate copyright holders.<\/p><p>&ldquo;I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use,&rdquo; he opined. &ldquo;Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. 
That&rsquo;s been the understanding.&rdquo;<\/p><\/blockquote>\n\n<p>He also refers to <tt>robots.txt<\/tt> as a &ldquo;grey area&rdquo; that will &ldquo;work its way through the courts.&rdquo;<\/p>\n\n<p><a href=\"https:\/\/www.threads.net\/@kalihays1\/post\/C8fnWgHygmn?hl=en\">Kali Hays<\/a>:<\/p>\n<blockquote cite=\"https:\/\/www.threads.net\/@kalihays1\/post\/C8fnWgHygmn?hl=en\">\n<p>OpenAI and Anthropic are two big names found to be ignoring robots.txt, put in place by news publishers to block their web content being freely scraped for AI training data, I learned today.<\/p>\n<\/blockquote>\n\n<p><a href=\"https:\/\/www.theverge.com\/2024\/6\/28\/24188391\/microsoft-ai-suleyman-social-contract-freeware\">Sean Hollister<\/a> (via <a href=\"https:\/\/zeppelin.flights\/@dmoren\/112696246172728830\">Dan Moren<\/a>, <a href=\"https:\/\/news.ycombinator.com\/item?id=40833323\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/www.theverge.com\/2024\/6\/28\/24188391\/microsoft-ai-suleyman-social-contract-freeware\">\n<p>I am not a lawyer, but even I can tell you that <a href=\"https:\/\/www.copyright.gov\/help\/faq\/faq-general.html#:~:text=Your%20work%20is%20under%20copyright%20protection%20the%20moment%20it%20is%20created\">the moment you create a work<\/a>, it&rsquo;s automatically protected by copyright in the US. You don&rsquo;t even need to apply for it, and you certainly don&rsquo;t void your rights just by publishing it on the web. In fact, it&rsquo;s <a href=\"https:\/\/creativecommons.org\/public-domain\/cc0\/\">so difficult to waive your rights<\/a> that lawyers had to come up with <a href=\"https:\/\/en.wikipedia.org\/wiki\/Copyleft#:~:text=Copyleft%20is%20the%20legal%20technique,be%20preserved%20in%20derivative%20works.\">special web licenses<\/a> to help!<\/p>\n<p>Fair use, meanwhile, is not granted by a &ldquo;social contract&rdquo; &mdash; it&rsquo;s granted by a court. 
It&rsquo;s a legal defense that allows <em>some<\/em> uses of copyrighted material once that court <a href=\"https:\/\/www.copyright.gov\/fair-use\/#:~:text=Fair%20use%20is%20a%20legal,protected%20works%20in%20certain%20circumstances.\">weighs what you&rsquo;re copying, why, how much, and whether it&rsquo;ll harm the copyright owner<\/a>. <\/p>\n<\/blockquote>\n\n<p>As Claburn notes, many people have &ldquo;compromised their rights&rdquo; by posting their content on social media sites.<\/p>\n\n<p>I don&rsquo;t think that training an AI to the point where it can reproduce an article is fair use any more than photocopying a whole book or <a href=\"https:\/\/news.ycombinator.com\/item?id=40834284\">using a camera<\/a> to record a movie is. But, as a practical matter, it seems like the AI companies are going to keep scraping and no one is going to stop them, except for the big names that will make licensing deals.<\/p>\n\n<p>Previously:<\/p>\n<ul>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/24\/ai-companies-ignoring-robots-txt\/\">AI Companies Ignoring Robots.txt<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/06\/19\/apple-intelligence-training\/\">Apple Intelligence Training<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2024\/02\/27\/reddit-ai-training-data-and-ipo\/\">Reddit AI Training Data and IPO<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/12\/28\/the-new-york-times-sues-openai\/\">The New York Times Sues OpenAI<\/a><\/li>\n<li><a href=\"https:\/\/mjtsai.com\/blog\/2023\/07\/11\/suing-openai-and-meta-for-copyright-infringement\/\">Suing OpenAI and Meta for Copyright Infringement<\/a><\/li>\n<\/ul>","protected":false},"excerpt":{"rendered":"<p>Thomas Claburn: Mustafa Suleyman, the CEO of Microsoft AI, said this week that machine-learning companies can scrape most content published online and use it to train neural networks because it&rsquo;s essentially &ldquo;freeware.&rdquo; Shortly afterwards the Center for 
Investigative Reporting sued OpenAI and its largest investor Microsoft &ldquo;for using the nonprofit news organization&rsquo;s content without permission or [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2024-07-01T21:21:37Z","apple_news_api_id":"78188ce4-1dd9-4540-b934-f8f51dd7907d","apple_news_api_modified_at":"2024-07-01T21:21:37Z","apple_news_api_revision":"AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/w==","apple_news_api_share_url":"https:\/\/apple.news\/AeBiM5B3ZRUC5NPj1HdeQfQ","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[1351,101,167,209,37,2361,96,2612],"class_list":["post-43916","post","type-post","status-publish","format-standard","hentry","category-technology","tag-artificial-intelligence","tag-business","tag-copyright","tag-legal","tag-microsoft","tag-openai","tag-web","tag-web-crawlers"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43916","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=43916"}],"version-history":[{"count":1,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43916\/revisions"}],"predecessor-v
ersion":[{"id":43917,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/43916\/revisions\/43917"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=43916"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=43916"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=43916"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}