Microsoft’s Suleyman on AI Scraping
Mustafa Suleyman, the CEO of Microsoft AI, said this week that machine-learning companies can scrape most content published online and use it to train neural networks because it’s essentially “freeware.”
Shortly afterwards the Center for Investigative Reporting sued OpenAI and its largest investor Microsoft “for using the nonprofit news organization’s content without permission or offering compensation.”
[…]
Asked in an interview with CNBC’s Andrew Ross Sorkin at the Aspen Ideas Festival whether AI companies have effectively stolen the world’s intellectual property, Suleyman acknowledged the controversy and attempted to draw a distinction between content people put online and content backed by corporate copyright holders.
“I think that with respect to content that is already on the open web, the social contract of that content since the 1990s has been it is fair use,” he opined. “Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding.”
He also referred to robots.txt as a “grey area” that will “work its way through the courts.”
OpenAI and Anthropic are two big names found to be ignoring robots.txt, put in place by news publishers to block their web content being freely scraped for AI training data, I learned today.
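For what it’s worth, robots.txt is purely advisory: a well-behaved crawler checks the publisher’s rules before fetching a page, but nothing enforces that check. Here’s a minimal sketch using Python’s standard urllib.robotparser, with a hypothetical publisher URL and OpenAI’s documented “GPTBot” crawler user agent as the example:

```python
from urllib import robotparser

# The kind of rule news publishers have been adding to robots.txt.
# "GPTBot" is OpenAI's documented crawler user agent, used as an example;
# the publisher URL below is hypothetical.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler runs this check before fetching; an ignoring one just fetches anyway.
print(rp.can_fetch("GPTBot", "https://publisher.example/articles/some-story"))       # False: blocked
print(rp.can_fetch("SomeOtherBot", "https://publisher.example/articles/some-story"))  # True: no rule applies
```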
Sean Hollister (via Dan Moren, Hacker News):
I am not a lawyer, but even I can tell you that the moment you create a work, it’s automatically protected by copyright in the US. You don’t even need to apply for it, and you certainly don’t void your rights just by publishing it on the web. In fact, it’s so difficult to waive your rights that lawyers had to come up with special web licenses to help!
Fair use, meanwhile, is not granted by a “social contract” — it’s granted by a court. It’s a legal defense that allows some uses of copyrighted material once that court weighs what you’re copying, why, how much, and whether it’ll harm the copyright owner.
As Claburn notes, many people have “compromised their rights” by posting their content on social media sites.
I don’t think that training an AI to the point where it can reproduce an article is fair use any more than photocopying a whole book or using a camera to record a movie is. But, as a practical matter, it seems like the AI companies are going to keep scraping and no one is going to stop them, except for the big names that will make licensing deals.
Previously:
- AI Companies Ignoring Robots.txt
- Apple Intelligence Training
- Reddit AI Training Data and IPO
- The New York Times Sues OpenAI
- Suing OpenAI and Meta for Copyright Infringement
4 Comments
Microsoft might wish to remember that it owes its existence to a teenager who argued that copying software was wrong because it takes people time to write software... and if people were not compensated, they wouldn't do it.
Oddly, when it no longer suits Microsoft, the time other people take to write books/articles/posts/comments actually is "freeware", and has no value.
The irony is off the scale, unless one concludes that Microsoft lies when it says it has a moral stance, and that its principle is simply taking what it can and making as much money as it can, the consequences be damned.
“I don’t think that training an AI to the point where it can reproduce an article is fair use any more than photocopying a whole book or using a camera to record a movie is.”

Precisely. Indeed, what bothers me is that they’re probably pushing the point so far that it will no longer be legal to use texts for ML models to learn simple things like grammar, and that does strike me as fair use, since there is no unfair competition with the original authors. Yes, overparameterized models are easier to train, but they should probably not be allowed, since they obtain 0% training error (i.e. they reproduce their training materials perfectly).
Oh, and for all of you who think the AI is "learning" or "reading" the way we do, enjoy this surreal video. It's what happens if you train on too little data. Children don't make this kind of mistake.
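A minimal sketch of that memorization point (purely illustrative, not from the comment): give a model at least as many parameters as training points and it can drive training error to zero, which amounts to storing the training data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)  # noisy "training data"

# Degree-9 polynomial: 10 coefficients for 10 points -> exact interpolation.
coeffs = np.polyfit(x, y, deg=9)
train_error = np.max(np.abs(np.polyval(coeffs, x) - y))
print(f"max training error: {train_error:.2e}")  # ~0: the model has memorized y
```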
The distinction between fair use and copyright violation is so difficult to draw in this scenario that I think the best solution, albeit one that will not be achievable, would be to drastically reduce the duration of copyright.
Make it five years. That is plenty of time to make money from photographs, books, games, or articles, and texts and images more than five years old would then be perfectly fine to use for training LLMs.
I suspect this is all going to lead to a lot of people questioning whether they should publish their content online. Why bother, if it's going to end up regurgitated by some huge corporation for $20 a month?