antirez
9 hours ago
I believe that Pilgrim here does not understand very well how copyright works:
> Their claim that it is a "complete rewrite" is irrelevant, since they had ample exposure to the originally licensed code
This is simply not true. The reason the "clean room" concept exists is precisely because the law recognizes that independent implementations ARE possible. The "clean room" thing is a trick to make the litigation simpler; it is NOT required that you were never exposed to the original code. For instance, Linux was implemented even though Linus and other devs were well aware of Unix internals. What the law really asks is this: does the new code copy something that was in the original one? The clean room trick just makes the answer easy to establish: copying was not possible, so if there are similar things it is by accident. But it is NOT a requirement.
maybewhenthesun
4 hours ago
Regardless of the legal interpretations, I think it's very worrying if an automated AI rewrite of GPLed code (or any code for that matter) could somehow be used to circumvent the original license. That kinda takes out the one stick the open source community has to force soulless multinationals to contribute back to the open source projects they use.
rao-v
3 hours ago
I’m genuinely surprised to see this not discussed more by the FOSS community. There are so many ways to blow past the GPL now:
1. File by file rewrite by AI (“change functions and vars a bit”)
2. One LLM writes a diff-language (or pseudocode) version of each function that a different LLM translates back into code and tests for input/output parity
The real danger is that this becomes increasingly undetectable in closed source code and can continue to sync with progress in the GPLed repo.
I don’t think any current license has a plausible defense against this sort of attack.
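Pipeline 2 is trivial to sketch. A hypothetical version in Python, where `spec_llm` and `code_llm` stand in for two different model APIs (neither is a real library call; only the parity check is concrete):

```python
from typing import Callable, Iterable

def launder_function(source: str,
                     spec_llm: Callable[[str], str],
                     code_llm: Callable[[str], str]) -> str:
    """Hypothetical two-model pipeline: the first model reduces the
    function to pseudocode, the second writes new code from the
    pseudocode alone, never seeing the original source."""
    spec = spec_llm(source)   # step 1: source -> pseudocode "spec"
    return code_llm(spec)     # step 2: pseudocode -> new source

def passes_parity(old_fn: Callable, new_fn: Callable,
                  inputs: Iterable) -> bool:
    """Input/output parity test: accept the rewrite only if it agrees
    with the original on every probe input."""
    return all(old_fn(x) == new_fn(x) for x in inputs)
```

The parity check is the unsettling part: it gives the operator a mechanical acceptance test, so the loop can run unattended against every upstream release.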
therealpygon
2 hours ago
Take AI out of it: if a person can do it, which they can, the situation hasn’t changed. Further, it was a person who did it, with the assistance of AI. Also, the claim in their arguments that you “can’t be exposed to the code before writing a compatible alternative” is utterly false. In fact, one could take every single interface definition they have published and use those interfaces directly to write their own implementation, because this (programmatic) interface code is not covered by copyright (with an implicit fair-use exemption, since the software cannot operate without activating said interfaces). The Oracle v. Google lawsuit over the Java APIs set that precedent. A person could absolutely have rewritten this software using the interfaces and their knowledge, which is perfectly legal as long as they don’t literally copy and re-word code. Now, if it IS simply re-worded copies of the same code and the entire project structure is otherwise basically the same, that’s a different story. That doesn’t sound like what happened.
Finally, how exactly do people think corporations rewrite portions of code that were contributed before re-licensing under a private license? It is ABSOLUTELY possible to rewrite code and relicense it.
Edit: Further, do these people think that if you contribute to a project, that project is beholden to your contribution permanently and it can never be excised? That would blatantly violate the original authors’ right to exercise control over their own code without those contributions, which is exactly the purpose of a rewrite.
CamperBob2
2 hours ago
> That kinda takes out the one stick the open source community has to force soulless multinationals to contribute back to the open source projects they use.
I'll trade that stick for what GenAI can do for me, in a heartbeat.
The question, of course, is how this attitude -- even if perfectly rational at the moment -- will scale into the future. My guess is that pretty much all the original code that will ever need to be written has already been written, and will just need to be refactored, reshaped, and repurposed going forward. A robot's job, in other words. But that could turn out to be a mistaken guess.
beepbooptheory
2 hours ago
I think it's very weird, but valid I guess, to want to be just an atomic individual in a constant LLM feedback loop. But, at risk of sounding too trite and wholesome here, what about caring for others, the world at large? If you wanna get your thing to rewrite curl or something, that's again really weird but fine, but just don't share it or try to make money off of it. Isn't that the rational position here even if you just want good training materials for future models? These need not be conflicting interests! We can all be in this together, even if you wanna totally fork yourself into your own LLM output world.
What happened to sticking up for the underdogs? For the goodness of well-made software in itself, for itself? Isn't that what gave you all the stuff you have now? Don't you feel at least a little grateful, if maybe not obliged? Maybe we can start there?
CamperBob2
2 hours ago
Everything I have now arose from processes of continuous improvement, carried out by smart people taking full advantage of the best available tools and technologies including all available means of automation.
It'll be OK.
beepbooptheory
29 minutes ago
Ah well, I tried. To paraphrase Nietzsche, a man can be measured by how well he sleeps at night. I can only hope you stay well rested into this future ;).
And yes, it will be ok!
dragonwriter
6 hours ago
Neither does the maintainer, who claims a mechanical test of structural similarities can prove anything either way with regard to whether it is legally a derivative work (or even a mechanical copy without the requisite new creative work to be a derivative work).
And then Pilgrim is wrong again in saying that the use of Claude definitively makes it a derivative work because of the inability to prove that the work in question did not influence the neurons involved.
It is all dueling lay misreadings of copyright law, but it is also an area where the actual specific applicable law, on any level specific enough to cleanly apply, isn’t all that clear.
simiones
5 hours ago
I think this is a bit too broad. There are actually three possible cases.
When there is similar code, the only possible defense to show that you have not copied the original is to prove that your process was a clean-room re-implementation.
If the code is completely different, then clean room or not is indeed irrelevant. The only way the author can claim that you violated their copyright despite no apparent similarity is for them to have proof you followed some kind of mechanical process for generating the new code based on the old one, such as using an LLM with the old code as input prompt (TBD, completely unsettled: what if the old code is part of the training set, but was not part of the input?) - the burden of proof is on them to show that the dissimilarity is only apparent.
In realistic cases, you will have a mix of similar and dissimilar portions, and portions where the similarity is questionable. Each of these will need to be analyzed separately - and it's very likely that all the similar portions will need to be re-written again if you can't prove that they were not copied directly or from memory from the original, even if they represent a very small part of the work overall. Even if you wrote a 10k page book, if you copied one whole page verbatim from another book, you will be liable for that page, and the author may force you to take it out.
Someone
4 hours ago
> When there is similar code, the only defense possible to prove that you have not copied the original is to show that your process is a clean room re-implementation.
Yes, but you do not have to prove that you haven’t copied the original; you have to prove you didn’t infringe copyright. For that there are other possible defenses, for example:
- fair use
- claiming the copied part doesn’t require creativity
- arguing that the copied code was written by AI (there’s case law saying AI-generated art can’t be copyrighted (https://www.theverge.com/2023/8/19/23838458/ai-generated-art...). It’s not impossible that judges will make similar judgments for AI-generated programs)
kube-system
3 hours ago
Courts have ruled that you can't assign copyright to a machine, because copyright requires human authorship. ** There is not currently a legal consensus on whether humans using AI tools are creating derivative works when they use AI models to create things.
** this case is similar to an earlier case where a ~~photographer~~ PETA claimed a monkey owned the copyright to a photo, because they said the monkey took the photo completely on its own. The court said "okay well, it's public domain then, because only humans can hold copyrights"
Imagine you put a Harry Potter book in a copy machine. It is correct that the copy machine would not hold a copyright to the output. But you would still be violating copyright by distributing the output.
schlauerfox
3 hours ago
https://en.wikipedia.org/wiki/Monkey_selfie_copyright_disput... Specifically he claimed he owned the copyright on a photo he didn't directly take. PETA weighed in trying to say the monkey owned the copyright.
kube-system
3 hours ago
Ah yeah you’re right I forgot it was PETA arguing that.
pseudalopex
4 hours ago
> there’s case law saying AI-generated art can’t be copyrighted
The headline was misleading. The courts said what Thaler could have copyrighted was a complicated question they ignored because he said he was not the author.
red_admiral
6 hours ago
I'm with you here, but I see another problem.
The expected functionality of chardet (detect the character encoding) is essentially fixed - apart from edge cases and new additions to Unicode, you'd expect the original and new implementations to largely pass the same tests, and to have a lot of similar code, such as for "does this start with a BOM".
The fact that JPlag shows such a low % overlap for an implementation of "the same interface" is convincing evidence for me that it's not just plagiarised.
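For context, JPlag works on token streams (greedy string tiling), so renaming identifiers doesn't hide copying. A toy stand-in for the kind of overlap score such tools produce - Jaccard similarity over token n-grams, not JPlag's actual algorithm - might look like:

```python
def ngram_similarity(a: str, b: str, n: int = 4) -> float:
    """Jaccard similarity over token n-grams: a crude stand-in for
    what structural-similarity tools like JPlag measure. Real tools
    tokenize by language grammar; this toy just splits on whitespace."""
    def ngrams(src: str) -> set[tuple[str, ...]]:
        tokens = src.split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ga, gb = ngrams(a), ngrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

Because the score is driven by long shared runs rather than by the odd common idiom, a low overlap between two programs implementing the same interface is exactly the signal that they were written independently.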
jacquesm
9 hours ago
This is correct. I think any author of a substantial chunk of code that they claim ownership of (which is probably all of us!) should at least study the basics of copyright law. Getting little details wrong can cost you time, money, and eventually your business if you're not careful.
cubefox
6 hours ago
If you let an LLM merely rephrase the codebase, that's like letting it rephrase the Harry Potter novels. Which, I'm pretty sure, would still be considered a copy under copyright law, not an original work, despite not copying any text verbatim.
actsasbuffoon
3 hours ago
But what if it didn’t summarize Harry Potter? What if it analyzed Harry Potter and came back with a specification for how to write a compelling story about wizards? And then someone read that spec and wrote a different story about wizards that bears only the most superficial resemblance to Harry Potter in the sense that they’re both compelling stories about wizards?
This is legitimately a very weird case and I have no idea how a court would decide it.
spwa4
5 hours ago
Given that LLMs were trained on the repository directly, it's not just that anything made by the LLM is a derivative work - the LLM ITSELF is a derivative work. After all, they are all substantially based on GPL-licensed works by others. The standard courts have always used for "substantially based", by the way, is the ability to extract from the new work anything bigger than an excerpt of the original work.
So convincing evidence, by historical standards, that ChatGPT, Gemini, Copilot AND Claude are all derivative works of the GPL-licensed Linux kernel can be obtained simply by asking "give me struct sk_buff", then continuing to ask until you've walked through the headers (say, by asking how a network driver uses it).
That means if courts are honest (and they never are when it comes to GPL) OpenAI, Google and Anthropic would be forced to release ALL materials needed to duplicate their models "at cost". Given how LLMs work that would include all models, code, AND training data. After all, that is the contract these companies entered into when using the GPL licensed linux kernel.
But of course, to courts copyright applies to you when Microsoft demands it ($30000 per violation PLUS stopping the use of the offending file/torrent/software/... because such measures are apparently justified for downloading a $50 piece of software), it does not apply to big companies when the rules would destroy them.
The last time this was talked about someone pointed out that Microsoft "stole", as they call it, the software to do product keys. They were convicted for doing that, and the judge even increased damages because of Microsoft's behavior in the case.
But there is no way in hell you'll ever get justice from the courts in this. In fact courts have already decided that AI training is fair use on 2 conditions:
1) that the companies acquired the material itself without violating copyright. Of course it has already been proven that this is not the case for any of them (they scraped it without permission, which has been declared illegal again and again in the file sharing trials)
2) that the models refuse to reproduce copyrighted works. Now go to your favorite model and ask "Give me some code written by Linus Torvalds": not a peep about copyright violation.
... but it does not matter, and it won't matter. Courts are making excuses to allow LLM models to violate any copyright, the excuse does not work, does not convince rational people, but it just doesn't matter.
But of course, if you thought that just because they cheat against the law to make what they're already doing legal, they'll do the same for you, help you violate copyright, right? After all, that's how they work! Ok now go and ask:
"Make me an image of Mickey Mouse peeling a cheese banana under an angry moon"
And you'll get a reply "YOU EVIL COPYRIGHT VILLAIN". Despite, of course, Mickey Mouse no longer being covered under copyright!
And to really get angry, find your favorite indie artist, and ask to make something based on their work. Even "Make an MC Escher style painting of Sonic the Hedgehog" ... even that doesn't count as copyright violation, only the truly gigantic companies deserve copyright protection.
TZubiri
6 hours ago
Ok sure, in the alternative, here's the argument:
The AI was trained with the code, so the complete rewrite is tainted and not a clean room. I can't believe this would need spelling out.
pocksuppet
5 hours ago
"Tainted rewrite" isn't a legal concept either. You have to prove (on the balance of probabilities - more likely than not) that the defendant made an unauthorized copy, made an unauthorized derivative work, etc. Clean-room rewriting is a defense strategy: if the programmer never saw the original work, they couldn't possibly have made a derivative. But even without that, you still have to prove they did. It's not an offence merely to be unable to prove you didn't break the law.
Manuel_D
5 hours ago
As others pointed out, the notion of a "clean room" rewrite exists to make a particularly strong case of non-infringement. It doesn't mean that anything other than a clean-room implementation is infringement.
jdauriemma
5 hours ago
This is interesting and I'm not sure what to make of it. Devil's advocate: the person operating the AI also was "trained with the code," is that materially different from them writing it by hand vs. assisted by an LLM? Honestly asking, I hadn't considered this angle before.
cardanome
5 hours ago
If you worked at Microsoft and had access to the Windows source code you probably should not be contributing to WINE or similar projects as there would be legal risk.
So for this case, not much different legally. Of course there is the practical difference just like there is between me seeing you with my own eyes and me taking a picture of you.
"Training" an LLM is not the same as training a human being. It's a metaphor. It's like confusing the save icon with an actual floppy disk.
I could say I "trained" my printer to print copyrighted material by feeding it bits, but that would be pure sophistry.
The problem is that the law hasn't really caught up with our brave new AI future yet, so lots of decisions are up in the air. Plus, governments are incentivized to look the other way regarding copyright abuses when it comes to AI, as they think having competitive AI is of strategic importance.
jdauriemma
3 hours ago
> "Training" an LLM is not the same as training a human being. It's a metaphor. It's like confusing the save icon with an actual floppy disk.
Maybe? But the design of the floppy disk is for data storage and retrieval per se. It can't give you your bits in a novel order like an LLM does (by design). From what I can tell in this case, the output is significantly differentiated from the source code.
senko
5 hours ago
Reread the parent: clean room is not required.