The Era of the Slopfork
I am a frequent and happy user of open-source code. I started using Linux in the mid-90s and have never looked back. I love the ethic of community and sharing that I find in Linux, GNU, and other open-source software. My chosen field would almost certainly not exist without open-source software, and I absolutely would never have had the chance to learn all I have over the years without it.
I am also very interested in AI. I think there is tremendous potential for AI to make me more efficient, and I have enjoyed finding the ways it can make my life easier. I see it as a tool like any other, and I think, after some growing pains, we will find a way to live with it as we have with every technology developed in human history.
Recently, however, I have been watching one of those growing pains with concern, and I am increasingly worried about what it means for the long-term viability of open-source software. It is a variant of a familiar AI problem, but one aimed very specifically at code.
I think most people are familiar with the issue of copying artists with AI. An image generator will happily comply if you ask it to generate a Picasso painting of your cat or a selfie in the style of Miyazaki. There is a healthy debate over the copyright implications of that sort of thing. If I clone your art style with AI, have I violated your copyright? Do I own the copyright on my derived work? What if I trained my AI specifically on your work?
In the world of open-source, though, there is a deeper problem. The reason licenses like the GPL succeed in forcing people to release their code is that re-implementing that code is more work than most people are willing to do. If I have to choose between re-implementing all of Postgres and open-sourcing my small tweaks, it essentially never makes sense to re-implement it.
Recently, Dan Blanchard, the maintainer of the Python library chardet, completely re-implemented it and changed the license from the LGPL to 0BSD, in spite of public objections from the original author, Mark Pilgrim. He did so using Claude to perform a "clean-room implementation," arguing that this means his new implementation is not subject to the original copyright and that he need not get approval from the original author.
However, the purpose of clean-room methodology is to ensure the resulting code is not a derivative work of the original. It is a means to an end, not the end itself. In this case, I can demonstrate that the end result is the same — the new code is structurally independent of the old code — through direct measurement rather than process guarantees alone.
— Dan Blanchard
This argument seems very fishy to me. It is virtually certain that Claude ingested the original chardet code when it was trained. The whole point of a "clean-room" is that the work is conducted without any exposure to the original code. Here, not only was Claude exposed to the code, the code was explicitly used to train it. That's an awfully dirty clean-room.
This is troubling because it illustrates the severe blow that LLM ingestion of open-source code may deal to open-source itself. Since you can't open up the AI and point to where your library's code lives inside it, users of the AI can duplicate open-source libraries cheaply and release them under any license they like. They can smuggle code out of open-source and never need to release anything. Would a company like Red Hat really be releasing source if they didn't have to? In 10 years, will Red Hat have an ABI-compatible kernel reimplemented in a "clean-room"?
To nail this point home even further, we now have meme sites like MALUS promising "Clean Room as a Service." See this article from Plagiarism Today for more discussion on the topic. I think it is a sign of the times that I had to read the site carefully to tell that MALUS was just a joke. How long until the real MALUS shows up?
We are going to have to find a way to cope with the copyright implications of the way AI works. This kind of copying doesn't seem quite like a Xerox machine, but it also doesn't seem like fair use. I'm not sure what the long-term implications of these issues will be, but I feel pretty safe saying we are in for a bumpy few years.