Thursday, December 28, 2023

The New York Times Sues OpenAI

Emma Roth (Hacker News):

The New York Times is suing OpenAI and Microsoft for copyright infringement, claiming the two companies built their AI models by “copying and using millions” of the publication’s articles and now “directly compete” with its content as a result.

As outlined in the lawsuit, the Times alleges OpenAI and Microsoft’s large language models (LLMs), which power ChatGPT and Copilot, “can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style.” This “undermine[s] and damage[s]” the Times’ relationship with readers, the outlet alleges, while also depriving it of “subscription, licensing, advertising, and affiliate revenue.”

John Timmer (Hacker News):

The Times is targeting various companies under the OpenAI umbrella, as well as Microsoft, an OpenAI partner that both uses it to power its Copilot service and helped provide the infrastructure for training the GPT Large Language Model. But the suit goes well beyond the use of copyrighted material in training, alleging that OpenAI-powered software will happily circumvent the Times’ paywall and ascribe hallucinated misinformation to the Times.


Part of the unauthorized use The Times alleges came during the training of various versions of GPT. Prior to GPT-3.5, information about the training dataset was made public. One of the sources used is a large collection of online material called “Common Crawl,” which the suit alleges contains information from 16 million unique records from sites published by The Times. That places the Times as the third most referenced source, behind Wikipedia and a database of US patents.

OpenAI no longer discloses as many details of the data used for training of recent GPT versions, but all indications are that full-text NY Times articles are still part of that process. (Much more on that in a moment.) Expect access to training information to be a major issue during discovery if this case moves forward.

Benjamin Mullin and Tripp Mickle:

Apple has opened negotiations in recent weeks with major news and publishing organizations, seeking permission to use their material in the company’s development of generative artificial intelligence systems, according to four people familiar with the discussions.

The technology giant has floated multiyear deals worth at least $50 million to license the archives of news articles, said the people with knowledge of talks, who spoke on the condition of anonymity to discuss sensitive negotiations. The news organizations contacted by Apple include Condé Nast, publisher of Vogue and The New Yorker; NBC News; and IAC, which owns People, The Daily Beast and Better Homes and Gardens.


Update (2023-12-29): Jason Kint:

The complaint is a must-read imho, it’s the only way to understand the alleged violations and the extent as to which the systems have been designed and tuned in order to generate certain output.


So back to Exhibit J. Unlike the other 220k+ pages of exhibits documenting registered works, this exhibit contains 100 examples of alleged copyright violations with nearly identical content being outputted by ChatGPT. Again, it’s impossible to argue with this.

Here are four examples. Again, the lawsuit includes one hundred of them. You get the point. I find this exhibit to be an incredibly powerful illustration for a lawsuit that will go before a jury of Americans.

Update (2024-01-05): Gary Marcus (via Hacker News):

The crux of the Times lawsuit is that OpenAI’s chatbots are fully capable of reproducing text nearly verbatim[…]

The thing is, it is not just text. OpenAI’s image software (which we accessed through Bing) is perfectly capable of verbatim and near-verbatim repetition of sources as well.

Daniel Jeffries (via Hacker News):

The NY Times is asking that ALL LLMs trained on Times data be destroyed.

That includes GPT 3 and 4, Claude, Mistral, Llama/Llama 2 and pretty much any other model in existence.

Update (2024-01-09): Kate Downing (via Hacker News):

The complaint paints a picture of an honorable industry repeatedly pants-ed by the tech industry, which historically has only come to heel under enormous public pressure and the Herculean efforts of The Times to continue to survive. It’s interesting because US copyright law decisively rejects the idea that copyright protection is due for what is commonly referred to as “sweat of the brow.” In other words, the fact that it takes great effort or resources to compile certain information (like a phonebook), doesn’t entitle that work to any copyright protection – others may use it freely. And where there is copyrightable expression, the difficulty in creating it is irrelevant. So, is all this background aimed solely at supporting the unfair competition claim? Is it a quiet way of asking the court to ignore the “sweat of the brow” precedent, to the extent that it’s ultimately argued by the defendants, in favor of protecting the more sympathetic party? Maybe they’re truly concerned that the courts no longer recognize the value of journalism and need a history lesson? No other AI-related complaint has worked so hard to justify the very existence, needs, and frustrations of its plaintiffs.

Unless Microsoft and OpenAI hustle to strike a deal with the New York Times, this is definitely going to be the case to watch in the next year or two. Not only does it embody some of the strongest legal arguments related to copyright, it is likely to become a lightning rod for many interests who will use it to wage a proxy war on their behalf.

Update (2024-02-28): Blake Brittain (via Slashdot):

OpenAI said in a filing in Manhattan federal court on Monday that the Times caused the technology to reproduce its material through “deceptive prompts that blatantly violate OpenAI’s terms of use.”


“The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products.”

2 Comments

If Google Books scanning was ruled fair use, I can't see how OpenAI crawling isn't.

Exhibit J of the lawsuit is mind-blowing. Have a look at it:

For instance prompting "LONDON — In Hungary, the prime minister can now rule" will literally dump the next 144 words of the NYTimes article.

If nothing else, it sure overfits its data.

Google Books are not available for all to read. If they were, fewer books would be produced, and the profession of author would disappear. I think that would be a bad thing and the various companies plagiarizing right now would have no new source materials.

The full lawsuit is here:
