Wednesday, July 7, 2021

GitHub Copilot and Copyright

Rian Hunter (via Hacker News):

I do not agree with GitHub’s unauthorized and unlicensed use of copyrighted source code as training data for their ML-powered GitHub Copilot product. This product injects source code derived from copyrighted sources into the software of their customers without informing them of the license of the original source code. This significantly eases unauthorized and unlicensed use of a copyright holder’s work.

Julia Reda (tweet):

Since Copilot also uses the numerous GitHub repositories under copyleft licences such as the GPL as training material, some commentators accuse GitHub of copyright infringement, because Copilot itself is not released under a copyleft licence, but is to be offered as a paid service after a test phase. The controversy touches on several thorny copyright issues at once. What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community.

[…]

In the US, scraping falls under fair use; this has been clear at least since the Google Books case.

[…]

The short code snippets that Copilot reproduces from training data are unlikely to reach the threshold of originality. Precisely because copyright only protects original excerpts, press publishers in the EU have successfully lobbied for their own ancillary copyright that does not require originality as a precondition for protection. Their aim is to prohibit the display of individual sentences from press articles by search engines.

[…]

On the other hand, the argument that the outputs of GitHub Copilot are derivative works of the training data is based on the assumption that a machine can produce works. This assumption is wrong and counterproductive. Copyright law has only ever applied to intellectual creations – where there is no creator, there is no work. This means that machine-generated code like that of GitHub Copilot is not a work under copyright law at all, so it is not a derivative work either.

Luis Villa:

“independent creation” is a doctrine in US law that protects you if you write the same thing without knowing about the first thing. May or may not apply here, but I mention it because it is non-intuitive and speaks directly to “but what if the code is the same”.

There is an observable trend in US law, based on fair use and older notions in US copyright law of the need for creativity, that judges give a looooot of leeway to “machines that read”. Copilot fits pretty squarely in that tradition.

[…]

Article 4 of the 2019 Directive seems to make Copilot’s training unambiguously legal in the EU, but authors can explicitly opt out.

[…]

Note that this is an interesting example of what I wrote about in the context of databases, where rights are not the same across countries, making it hard to write a generic global license.

James Grimmelmann:

Almost by accident, copyright law has concluded that it is for humans only: reading performed by computers doesn’t count as infringement. Conceptually, this makes sense: Copyright’s ideal of romantic readership involves humans writing for other humans. But in an age when more and more manipulation of copyrighted works is carried out by automated processes, this split between human reading (infringement) and robotic reading (exempt) has odd consequences: it pulls us toward a copyright system in which humans occupy a surprisingly peripheral place. This Article describes the shifts in fair use law that brought us here and reflects on the role of robots in copyright’s cosmology.

[…]

Infringement is for humans only; when computers do it, it’s fair use.

Previously:

GitHub Support just straight up confirmed in an email that yes, they used all public GitHub code for Codex/Copilot, regardless of license.

Adam Jacob:

Those of us who remember when open source was the novel underdog, allowing us to learn, grow, and build things our proprietary peers could not - we tend to see the relationship to corp $ in OSS as a net benefit, pretty much always.

That’s because we remember when it wasn’t so, and it took a lot of work to make it legit. But if you started your career with that as the ground truth, you’re much more likely to see the problematic aspects of it: that your open code can be used by folks in ways you dislike.

The Free Software Foundation has received numerous inquiries about our position on these questions. We can see that Copilot’s use of freely licensed software has many implications for an incredibly large portion of the free software community. Developers want to know whether training a neural network on their software can really be considered fair use. Others who may be interested in using Copilot wonder if the code snippets and other elements copied from GitHub-hosted repositories could result in copyright infringement. And even if everything might be legally copacetic, activists wonder if there isn’t something fundamentally unfair about a proprietary software company building a service off their work.

With all these questions, many of them with legal implications that at first glance may not have been previously tested in a court of law, there aren’t many simple answers. To get the answers the community needs, and to identify the best opportunities for defending user freedom in this space, the FSF is announcing a funded call for white papers to address Copilot, copyright, machine learning, and free software.


Pretty crazy that they don't respect the licenses of the code they used as training data. I think they're even aware that they're doing something shady; they have a weird evasive answer to the "training data" question on their site. Regardless of legality, it's just stupid to do this.

IANAL, and I've seen suggestions that this may be covered by fair use.

Julia also makes a good "be careful what you wish for" point.

That said, legal or not,

1) I just find it _gross_ to take (snippets of) someone else's code without even attributing them. Like, I don't care if I'm allowed to. Just don't do that! This seems like something Copilot could mitigate by generating an ATTRIBUTIONS-COPILOT.md file.

2) I also find it counterproductive as a developer. When I paste code verbatim from a Stack Overflow answer, I try to make sure to prefix it with a comment that at least links the answer. The source control history lies in such cases (and with Copilot as well): it claims I wrote that. But I didn't. When encountering bugs / being confused by a behavior, I find that to be a highly useful piece of information: "oh yeah, I didn't come up with that at all; maybe it wasn't such a good fit for my app after all and I need to make adjustments or outright replace it". Again, Copilot could help resolve this by generating a comment above, pointing to the line in the attributions file.
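
For illustration, here's the kind of breadcrumb I mean, plus a hypothetical Copilot-generated variant (the URL and the attribution format are placeholders I made up; Copilot offers nothing like this today):

```python
# Adapted from a Stack Overflow answer (placeholder URL):
# https://stackoverflow.com/a/...
def chunks(lst, n):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), n):
        yield lst[i:i + n]

# Hypothetical Copilot equivalent, pointing into a generated file:
# copilot-attribution: ATTRIBUTIONS-COPILOT.md, entry 3
```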

Of course, as far as copyright and licenses are concerned, that only solves the attribution issue. GitHub presumably took basic steps to avoid code with particularly restrictive licenses, but more clarification would be appreciated.

"What is astonishing about the current debate is that the calls for the broadest possible interpretation of copyright are now coming from within the Free Software community."

I don’t think that’s astonishing. Especially the FSF has always supported strong copyright as a weapon against people stripping freedoms from their work.

I fully understand that they consider it immoral to take copyleft work, put it through a digital mixer (like a Bitcoin tumbler), and then build unfree software from the building blocks.

Kevin Schumacher

Copilot is not intended to output verbatim code from its training set, so I’m not sure what attribution you’re looking for in the vast majority of cases. It is quite literally not supposed to be a copy and paste from SO. It’s supposed to generate new code based on your situation and what it has seen before. There’s a section on the Copilot website about the limited times it does regurgitate directly (about 0.1% of the time, IIRC). They’re working on making the tool identify times when that happens and then notifying the developer somehow of the source of it, as also noted on their site.

One question from a guy late to the party: does Copilot use source code only from public repositories, or from private ones as well?

(Disclaimer: I am not a machine learning expert.)

> Copilot is not intended to output verbatim code from its training set

It doesn't _literally_ output verbatim code from a repo, but I don't see how it could possibly stray too far from that.

You start typing new code, and it looks at existing code from its training set that looks like a close match. This can't be _too_ fuzzy without producing code that flat-out doesn't compile or (worse?) behaves incorrectly at runtime. It might be smart about renaming symbols, but that doesn't really avoid plagiarism.

> It is quite literally not supposed to be a copy and paste from SO. It’s supposed to generate new code based on your situation and what it has seen before.

I fail to see how that distinction works in practice. At the end of the day, they may be taking five, ten, fifty implementations that look vaguely like what you're trying to achieve, and if those are all identical, they'll use it. It's not like they can average them without risking broken code.

@ Dragan: they claim public only.

https://copilot.github.com, scroll down to "Frequently asked questions", click "Training set". (Alas, no anchors that I can see.)

Old Unix Geek

In NLP, GPT-3 is trained by next word prediction: predict the next word given all previous words in the text. So it is trained to predict the existing text. Because of how the neural network is set up, it tends to learn common expressions rather than just printing out War And Peace verbatim. These expressions are relatively common and therefore learned as features of the Language space.
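
As a minimal sketch of what next-word-prediction training means (purely illustrative; GPT-3 is a huge Transformer, but the objective has this shape):

```python
import torch
import torch.nn as nn

# Toy next-token language model. The training signal is simply: given
# tokens t1..t(n-1), assign high probability to the actual next token.
vocab_size, dim = 1000, 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):            # tokens: (batch, seq)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)               # logits: (batch, seq, vocab)

model = TinyLM()
tokens = torch.randint(0, vocab_size, (8, 32))  # stand-in for a corpus batch
logits = model(tokens[:, :-1])                  # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()  # sequences seen often get low loss first: "memorization"
```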

In the context of code, the predictions are probably somewhat abstracted (replacing specific variable names with tokens "var1", "var2" for the N variables appearing in the function, plus a prediction of common names for said tokens) rather than direct predictions of the next token. However, the basic idea of training by predicting the next syntactic element should be the same.
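
A crude sketch of the kind of normalization being speculated about here (my guess at a plausible preprocessing step, not anything GitHub has documented):

```python
import re

KEYWORDS = {"def", "return", "for", "in", "if", "else", "while"}

def normalize_identifiers(code: str) -> str:
    """Replace each distinct identifier with var1, var2, ... in order of
    first appearance. A real tokenizer would respect strings and scope."""
    mapping = {}

    def repl(match):
        name = match.group(0)
        if name in KEYWORDS:
            return name
        mapping.setdefault(name, f"var{len(mapping) + 1}")
        return mapping[name]

    return re.sub(r"[A-Za-z_]\w*", repl, code)

print(normalize_identifiers("def area(w, h): return w * h"))
# -> def var1(var2, var3): return var2 * var3
```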

If a construction occurs many times, GPT-3 will learn it as a feature of the landscape of programming, just as it will learn common expressions in English. The Quake example has been copied verbatim by many people (most likely because few people understand it). It's also odd, in that there's not much code that resembles it. Because it occurs identically many times, it is learned as an atom, a thing of low entropy, just like expressions such as "the rain in Spain stays mainly on the plain".

Most low-entropy things are of low originality (oh, look, yet another for loop; oh, look, yet another cliché). That might be why GitHub thought it would not be a problem to use GPL'd code in their training set. However, this is not always the case: a pattern that reappears in many GPL'd code bases might have required an expert to craft it. It reappears often precisely because the GPL is designed to allow people to share such code with other GPL'd code bases.

GitHub also has a problem. They probably need to use GPL'd code to create a large enough training set to train their tool. GPT-3 was trained on something like 2 TB of English text downloaded from the internet. These algorithms need a lot of data. That's probably why GitHub just grabbed anything in a public repository.

> Copilot is not intended to output verbatim code from its training set

It is trained to output code verbatim from its training set. Its architecture is supposed to prevent it from outputting large sections of code verbatim. But algorithms like GPT-3 are known to output "common expressions".

I don't know what the intentions of the people who worked on Copilot were. I also don't care. What matters is what the tool does.

The fact that this type of algorithm would output code verbatim was known beforehand, and it seems very unlikely that Github's engineers were unaware of that fact.

Robert Berger

It may not be illegal, but Copilot is immoral: it is owned by Microsoft, closed source, only available in Visual Studio Code, and offers no way to opt your code out.

IMHO, if Copilot were open source, with the ML corpus and models publicly available, community governance, and no tie to Visual Studio Code, it could be a good thing.

Old Unix Geek

A paper just came out on how CoPilot works:

https://arxiv.org/abs/2107.03374

I have yet to read it fully.

It mentions that training is next-token prediction, and that if the preceding code is broken, the algorithm will tend to predict buggy code to follow, which is amusing.

It also states that "the generated code consisted of common expressions or conventions within the programming language that appeared over and over again in the training data". Apparently their evaluation did not notice entire chunks of code being copied such as the Quake excerpt, which suggests the evaluation was lacking.

CoPilot's evidence that this is legal is testimony from OpenAI's lawyer about GPT-3 in an NLP context: https://perma.cc/ZS7G-2QWF I find such an argument weak in that language is very different from source code. If it weren't, everyone could program, and so many people wouldn't use Stack Overflow as a crutch. Indeed, the very reason for this tool's existence is that programming isn't as trivial as talking.

Kevin Schumacher

I note that you left off both the sentence prior to what you quoted and the words "In the rare instances" at the beginning of the sentence you did quote. Those make clear that the quote you ran with (which certainly seems much worse the way you quoted it, as if all of the generated code were verbatim from training data) refers to 0.1% of nearly 500,000 suggestions in the evaluation set.

> Apparently their evaluation did not notice entire chunks of code being copied such as the Quake excerpt

Given that the Quake excerpt is written in C, and the evaluation that is being referenced in your quote was limited to Python, it would be pretty impressive/spectacularly awful if the Quake code was a suggestion that occurred in the evaluation set.

Additionally, my understanding is that the Quake example is on shaky copyright ground anyway, because it is essentially a math formula, and math formulas cannot be copyrighted. Also, the code in question was not originally created by the company that GPLed it, which means that even if it could be copyrighted, the license being claimed does not apply, because you cannot enforce a license on something you do not own or hold delegated rights to.

More to the point, the evaluation (which I linked in a prior discussion; it's unclear whether you've bothered reading it, though given the statements you keep making, it seems you have not) does note that the more the system has seen something, the more likely it is to repeat it verbatim. It even shows that CoPilot can be made to regurgitate The Zen of Python wholesale. (Which is in the public domain, by the way.) It also suggested a GPL license when starting with a blank file.

While not directly addressing the Quake example, that does explain why it would be presented: the Quake code has been widely disseminated and would be likely to appear often in the training corpus. And a huge part of the reason it's been widely disseminated is that it is considered to be a standard way of doing that thing performantly. Sometimes, there are only so many ways to express a particular bit of logic.

---

You can find the argument weak all you want, but the current legal consensus (more generally, not even specifically for CoPilot) is that humans have to be pretty directly involved for a work to be sufficiently creative to attract copyright protection or to create what copyright law deems a derivative of another work. This is well-established in the US through the courts, and seems to be on its way to becoming actual law in Europe.

Human languages being different from programming languages is a red herring. You don't start hearing source code while you're still in utero, constantly surrounded by it as you grow up and learn to speak, so of course nobody is fluent in it from day one, and many people never learn it. Programming languages are languages, though (almost like the name is there for a reason). And just as with spoken language, some are more inscrutable than others.

There are two much better analogies: a person who grew up without other humans around, or a person learning a foreign language.

In the latter case, one can choose to pursue learning a foreign language, just as one can choose to pursue learning to program. If the human language is sufficiently different from their own (such as Japanese or Russian to an English speaker), it can be just as foreign as learning C or Ruby.

In the former case, which I think is a much closer fit, the person can learn to speak later in life (whether that's later childhood or as an adult), just as a person can learn to program (ditto).

There is a writing SO site, as well as ones for various languages, including two different ones for English (one aimed at ESL and one for the language and usage of it). People use these to ask questions that lead to a better understanding of how a language works, nuances of it, meanings of words, and so forth. Just as people use SO proper to ask questions about how a programming language works; nuances of it; meanings of functions, constants, etc.; and so forth. Your artificial distinction here makes no sense.

Old Unix Geek

I so appreciate being educated by the ignorant, Kevin. Thank you.

No, the Quake code is not a math formula. It's a hack based on the IEEE representation of floating point numbers. If this were unpatentable, so would be the hardware patents I obtained in a previous life. It's clearly copyrightable too.
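
For those who haven't seen it, the idea can be paraphrased in a few lines (my own sketch of the bit-level trick, not the GPL'd C original):

```python
import struct

def fast_inverse_sqrt(x: float) -> float:
    """Approximate x ** -0.5 via the Quake III trick: reinterpret the
    float's IEEE-754 bits as an integer, subtract from a magic constant,
    then refine with one Newton-Raphson step."""
    i = struct.unpack("<I", struct.pack("<f", x))[0]   # float bits as uint32
    i = 0x5F3759DF - (i >> 1)                          # the magic constant
    y = struct.unpack("<f", struct.pack("<I", i))[0]   # bits back to float
    return y * (1.5 - 0.5 * x * y * y)                 # one Newton step

print(fast_inverse_sqrt(4.0))  # ~0.499, close to the exact 0.5
```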

Brain scans show a different part of the brain lights up when programming versus when speaking a language. So no, programming languages are not remotely analogous to natural languages. Anyway, my point was that the perplexity of language is much greater than the perplexity of code. https://en.wikipedia.org/wiki/Perplexity
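
Concretely, perplexity is the exponentiated average negative log-likelihood of the observed tokens; a quick sketch (the probabilities below are made up to illustrate the contrast):

```python
import math

def perplexity(token_probs):
    """exp(mean negative log-likelihood) of the model's probability for
    each observed token. Higher means harder to predict."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Hypothetical per-token probabilities:
print(perplexity([0.5, 0.6, 0.4, 0.7]))    # ~1.9  (predictable, code-like)
print(perplexity([0.05, 0.1, 0.02, 0.2]))  # ~15.0 (surprising, prose-like)
```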

You are correct that the paper says 0.1% of code generated by this version of Codex was copied verbatim, and that this version of Codex only applies to Python code. However, the description of GitHub Copilot makes the same claim, and Copilot is supposed to generate C & C++. They both cite the same study: https://docs.github.com/en/github/copilot/research-recitation

Finally, I find it amusing that the "legal analysis" paper they quote says that the court should not worry about infringement because those who do not wish their content used for this can use a "robots.txt" file. I would suggest the GPL, which clearly indicates that people don't want their work used in this manner. Bait and switch.

But whatever. Good luck in your brave new world, Kevin. Enjoy it. You deserve it. I shall leave you in peace in your ignorance.

I appreciate the sharing of different perspectives here, but no personal attacks, please.

Old Unix Geek

You're right Michael. I should not let my limbic system go anywhere near the keyboard. No doubt Kevin is very knowledgeable on many (other) topics, and I am likely quite ignorant about some of them.

For my part, I consider calling people's contributions "rants", talking down to others, and straw-manning other points of view to also be attacks. It has put me off commenting.

Anyway, thank you for your forum. It has been nice to have a voice here.

Yes, I’ve been uncomfortable with some of the replies to you, too. It is hard to draw a line, but I hope that everyone will try to make others feel welcome and respected even when they disagree.

Leave a Comment