Tuesday, July 11, 2023

Suing OpenAI and Meta for Copyright Infringement

Wes Davis (Hacker News):

Comedian and author Sarah Silverman, as well as authors Christopher Golden and Richard Kadrey, are suing OpenAI and Meta each in a US District Court over dual claims of copyright infringement.

The suits allege, among other things, that OpenAI’s ChatGPT and Meta’s LLaMA were trained on illegally-acquired datasets containing their works, which they say were acquired from “shadow library” websites like Bibliotik, Library Genesis, Z-Library, and others, noting the books are “available in bulk via torrent systems.”

3 Comments

Old Unix Geek

They should make it into a class action suit...

How does one actually demonstrate that a particular copyrighted work was used in training an AI?

Has anyone actually made a cogent argument for why AI like Stable Diffusion are committing copyright infringement? The piece itself is not actually contained within the AI's model, as far as I understand. And there's nothing stopping, for example, a human artist from producing a new piece of art that bears a lot of resemblance to an existing piece. In fact this is done all the time.

These are not loaded questions. I'm genuinely curious if anyone has done a good job addressing them. I could be swayed in either direction on this debate, though at this point I'm leaning more toward the side of AI not being copyright infringement, since it neither contains nor will produce copyrighted work, at least in any of the models I'm familiar with.

Old Unix Geek

Pedro Domingos proved that neural networks interpolate between the examples they were trained on. It's a bit more complicated than that because it also depends on the random initialization of the weights of the neural network.

So, all the knowledge that is used came from various sources.

Apparently there is evidence that LibGen was used for training ChatGPT. LibGen is full of copyrighted books. If people had not written those books, LLMs would not be able to answer your questions. Unlike Google, which at least points to the book the search query referenced, LLMs answer without attribution. As LLMs get better, fewer people will read books, killing the market for them (which will make recent information even harder to find since the LLMs will not have fresh information for their training).

What is copyright? It is a mechanism to ensure creators get paid for their work. It was introduced after the introduction of the printing press because early printing companies simply found books (copied by hand) which were popular and sold them without paying the authors. It is not a "natural right" in the same sense as the right of a person to be free, for instance. It is a mechanism, without which society would lose the productive capacity of authors and other creators, which would be a net loss for it. In other words, it is a mechanism that helps societies which have it to outcompete other societies which don't. There's a reason Michael can run this website, and it's that the software he writes costs money. In a society in which he couldn't charge for his software, Michael wouldn't have his business, and would instead be doing something else with his time.

Is copying a single movie illegal? Yes, even if I transform it with a lossy compressor and an encryption algorithm. Is copying a single book illegal? Yes, even if I change its representation into an ePub or a PDF. Copyright is a bit like a color. Changing the computer representation of the copyrighted work does not bleach the color away. This is important because piracy could destroy authors' livelihoods since millions of copies of their work get distributed. (Whether it does is a bit of a question since the kinds of people who pirate tend to buy books and movies anyway, but writing books is not exactly lucrative these days). Since LLMs interpolate between examples they have seen, the same "bits have color" argument should apply to them too, although they will have a very complex color since they interpolate between all sorts of things.

Humans are different. Among other things, they have bad memories. So there are very few of them who remember copyrighted works word for word. If you get information from a human, they might have misunderstood it. Also humans don't talk to millions of users simultaneously (like LLMs can) unless they are reading a book aloud on YouTube (which is a copyright violation). So they do not interfere with the copyright mechanism people developed. And, again, remember that copyright is about improving the diffusion of information through a society to let it outcompete another society. If people could not speak about what they learned from a book that would defeat the purpose of copyright.

Copyright is not perfect. Like any system of rules it can be gamed by the powerful. For instance, the notion that copyright should outlive the author is rather weird, and seems to me to go against the original purpose of encouraging creation. (The successful author's children have no impetus to contribute to the pot of shared knowledge themselves).

There is a website (I've forgotten its name, sorry) which lets artists determine whether their works were used to train the image generation "AI"s that are all the rage. I'm not aware of one for text yet, but it may come.
