Why I'm Ditching GitHub, or Microsoft Loves Open-Source the Way Hans Reiser Loves His Wife

I'm not a particularly big fan of how intellectual property (hereafter referred to as IP) works in 21st-century America. Copyright is probably the most egregious form of IP; in the United States a copyright lasts until 70 years after the author of the copyrighted work dies. It was originally meant to be a lot shorter, but corporations have used their lobbying power with Congress to extend it several times. That lets them keep profiting for a century off of old works that eventually become so ingrained in American culture that the copyright owner effectively has a monopoly on a piece of history.

Copyright is a system that almost exclusively benefits megacorporations like Microsoft, Disney, Sony, Warner Bros., Google, etc. It's been a major point of contention between individuals and corporations ever since high-speed internet connections made it possible to transmit flawless duplicates of music, movies, books and video games to the other side of the planet in hours, minutes or just seconds. For at least a quarter of a century, media corporations have aggressively defended their own copyrights in just about every way imaginable: multimillion-dollar lawsuits against people who use filesharing programs, advanced DRM systems incorporating complicated encryption schemes and sometimes even kernel modules that spy on users, and laws like the DMCA which actually make it a crime to circumvent the aforementioned DRM.

Some of these corporations (especially Microsoft and Google) are engaged in cutting-edge machine-learning/AI research, which has very recently produced some absolutely incredible breakthroughs. Those breakthroughs have led to new technologies that seemed like a science-fiction pipe dream a mere half a decade ago. These systems process extremely large sets of data to find patterns in language, art and sound that are too large and complicated for any human to understand, and using these patterns they can produce beautiful works of art and intelligent conversation based on simple English-language prompts from users with no experience programming computers. One promising application of this technology is Microsoft's new Copilot[1], a service powered by machine learning which enhances a programmer's productivity by scanning code as he types it, understanding what the code is going to do, and finishing it faster than the programmer can think.

It is because of Copilot that Microsoft has found itself in a very surprising situation, one in which the copyright laws that ordinarily protect it from individuals are instead protecting individuals from Microsoft. Surprisingly (or perhaps unsurprisingly), Microsoft has taken a stance on copyright opposite to the one it takes when Windows, Office, Xbox, or any of its other products is concerned: they really hate it and would rather it just go away.

It takes a lot of data to train something as complicated as Copilot, so Microsoft turned to what is probably the largest repository of source code in the world. In 2018 they acquired GitHub, a company which provides free git repository hosting to open-source projects. Most of these projects come with a software license stored in a file with a name such as COPYING, LICENSE, README, or something of that sort. This is a long tradition in the open-source community, one that is older than I am: the authors of open-source projects will usually include a license with the project which dictates how the project can be used and distributed to third parties. Underlying this license is the concept of copyright: the authors hold the copyright to the code, and that is what gives them the authority to set the terms which users agree to when they use or distribute the project.

The requirements of each license vary, but one of the most common requirements is to reproduce the copyright statement and/or a copy of the license itself whenever source code from the project is used. This is true of the GNU GPL that WashingtonDC/washDC is currently licensed under, as well as the BSD license it previously used and many of the other more permissive licenses such as the MIT and zlib licenses. This is a problem for Copilot because it has a tendency to reproduce code from its training set[2][3], and it does not even try to comply with licenses that require it to include copyright statements and copies of the license. You can find a comprehensive listing of several open-source licenses on choosealicense.com; notice how many of them have this requirement.

Most of these licenses don't have any clauses that prohibit code from being used as training data for a machine-learning system, so Microsoft is actually in the clear to use most of the code on GitHub as long as they obey the other terms of the license. However, Microsoft has chosen not to obey the terms of those licenses. They do not attribute code to its original author or reproduce its license. Instead they present it as if it were generated by Copilot entirely on its own. Users are not given any indication of where the code comes from or what terms it is distributed under; this will undoubtedly lead to open-source code being stolen by Copilot's users.

As I stated at the beginning of this article, Microsoft is a fierce supporter of copyright when it's their copyright that's on the line. It is extremely hypocritical of them to blatantly disregard the IP rights of open-source projects like this. They're currently being sued over this, and their response[4] was long and bizarre, including claims such as "Copilot withdraws nothing from the body of open source code available to the public", which contradicts the statements they made on their own website[5]. Most infuriatingly, they're even trying to argue that requiring them to adhere to open-source licenses would somehow "undermine open source principles"[4].

I hope the plaintiffs in that suit are successful. I don't have any intention of filing a lawsuit because that would be time-consuming and expensive, and I'd rather spend my resources recovering from cancer and developing my Dreamcast emulator. Instead I'm going to be moving WashingtonDC/washDC from GitHub to one of its competitors. This doesn't necessarily prevent it from being used as training data for Copilot or from having its license violated by Microsoft, but at least this way I won't be complicit. I really wouldn't be surprised if somewhere down the line Microsoft tried to argue that anybody who hosts their code on GitHub while knowing that its license may be violated by Copilot is tacitly giving them permission to do so. I'm not saying that I think that's a viable argument (because it's obviously not), but I am saying that the greedy bastards at Microsoft will probably try to make that argument at some point in the future.

This is a problem that's only going to get bigger as time goes on. The successes of the machine learning/AI community are nothing short of groundbreaking, and I expect they will revolutionize several industries in very little time. The machine learning community (and ESPECIALLY the megacorps like Microsoft) needs to extend common courtesy to the people upon whose work it relies. If something you made is used to train an AI model, then you have contributed to that model's creation and you deserve a say in how your content is used.

As I said at the beginning of this article, I actually don't like copyright or IP in general. However, my not wanting something to exist doesn't make it go away. Corporations like Microsoft just spent the last two decades using copyright to stop people from sharing files, to stop users from making modifications to computers and software they bought, and even to stop users from merely having access to out-of-print materials. If they hadn't just spent over two decades abusing copyright like that, I don't think I'd have a problem with them training AIs like Copilot on my emulator. If Microsoft were going to use its lobbying power to reform America's copyright system and strengthen fair use doctrines, I think I might be pretty amenable to them using my source code as training material for their AI models. But that's not the reality we live in. Nobody's trying to fix copyright. They just want it to not apply to them in this narrow instance because, for the first time ever, copyright is working against them instead of for them.

If Microsoft gets its way, you still won't be able to legally share copies of Xbox games, Windows, Office, etc. with your friends online. If Microsoft gets its way, you still won't be allowed to legally reverse-engineer the DRM that they ship on their software. If Microsoft gets its way, you still won't legally be able to download old software that they don't even sell anymore. They don't want to fix copyright; they just want a special exemption from it.

I think it's time for newer versions of open-source licenses to explicitly address this problem by outlining what rights and responsibilities corporations have if they want to train their machine learning models on open-source software. GPLv3 was created largely in response to TiVo selling DVRs which ran open-source software that users weren't allowed to replace or modify. What's happening with Copilot is far more egregious than anything TiVo ever did. I'm just a lone emulation developer with no authority to make this happen, but if anybody reading this happens to have pull with GNU, I really hope you'll convince the powers that be to get the ball rolling on a GPLv4 that addresses Copilot and other "GPL Violations as a Service" platforms that think they deserve a special exemption because they're based on machine learning technology.

With regard to WashingtonDC/washDC itself, nothing has changed except that the name is now washDC. (This name change is unrelated to the Copilot fiasco; I've been contemplating it for several years, and as long as I'm moving the source hosting I might as well finally change the name.) WashingtonDC/washDC is still open-source software, and it is still available under the terms of the GNU GPL. The official website of the project is still http://www.washemu.org. I will continue to support Microsoft Windows as a platform even though it's made by the same company as Copilot. The only other thing that's changing is that the official upstream git repo will be hosted at https://gitlab.com/washemu/washdc.

Footnotes:

[5] From https://github.com/features/copilot/: "Our latest internal research shows that about 1% of the time, a suggestion may contain some code snippets longer than ~150 characters that matches the training set." This would seem to be an admission that it does happen even if they're trying to minimize it.