hackernews client

Relicensing with AI-Assisted Rewrite

344 pointsposted 16 hours ago

337 Comments

calny

6 hours ago

The maintainer's response: https://github.com/chardet/chardet/issues/327#issuecomment-4...

The second part here is problematic, but fascinating: "I then started in an empty repository with no access to the old source tree, and explicitly instructed Claude not to base anything on LGPL/GPL-licensed code." Problem - Claude almost certainly was trained on the LGPL/GPL original code. It knows that is how to solve the problem. It's dubious whether Claude can ignore whatever imprints that original code made on its weights. If it COULD do that, that would be a pretty cool innovation in explainable AI. But AFAIK LLMs can't even reliably trace what data influenced the output for a query, see https://iftenney.github.io/projects/tda/, or even fully unlearn a piece of training data.

Is anyone working on this? I'd be very interested to discuss.

Some background - I'm a developer & IP lawyer - my undergrad thesis was "Copyright in the Digital Age" and discussed copyleft & FOSS. Been litigating in federal court since 2010 and training AI models since 2019, and am working on an AI for litigation platform. These are evolving issues in US courts.

BTW if you're on enterprise or a paid API plan, Anthropic indemnifies you if its outputs violate copyright. But if you're on free/pro/max, the terms state that YOU agree to indemnify THEM for copyright violation claims.[0]

[0] https://www.anthropic.com/legal/consumer-terms - see para. 11 ("YOU AGREE TO INDEMNIFY AND HOLD HARMLESS THE ANTHROPIC PARTIES FROM AND AGAINST ANY AND ALL LIABILITIES, CLAIMS, DAMAGES, EXPENSES (INCLUDING REASONABLE ATTORNEYS’ FEES AND COSTS), AND OTHER LOSSES ARISING OUT OF … YOUR ACCESS TO, USE OF, OR ALLEGED USE OF THE SERVICES ….")

DrammBA

4 hours ago

Also the maintainer's ground-up rewrite argument is very flimsy when they used chardet's test-data and freely admit to:

> I've been the primary maintainer and contributor to this project for >12 years

> I have had extensive exposure to the original codebase: I've been maintaining it for over a decade. A traditional clean-room approach involves a strict separation between people with knowledge of the original and people writing the new implementation, and that separation did not exist here.

> I reviewed, tested, and iterated on every piece of the result using Claude.

> I was deeply involved in designing, reviewing, and iterating on every aspect of it.

Lerc

4 hours ago

There was a paper that proposed a content based hashing mask for traning

The idea is you have some window size, maybe 32 tokens. Hash it into a seed for a pseudo random number generator. Generate random numbers in the range 0..1 for each token in the window. Compare this number against a threshold. Don't count the loss for any tokens with a rng value higher than the threshold.

It learns well enough because you get the gist of reading the meaning of something when the occasional word is missing, especially if you are learning the same thing expressed many ways.

It can't learn verbatim however. Anything that it fills in will be semantically similar, but different enough to get cause any direct quoting onto another path after just a few words.

calny

3 hours ago

Thanks! Appreciate the response and will look into this

layer8

3 hours ago

> Is anyone working on this?

There was recently https://news.ycombinator.com/item?id=47131225.

calny

3 hours ago

Thanks! I missed that. The attribution by training data source category (arxiv vs wikipedia vs nemotron etc.) is an interesting approach.

oofbey

6 hours ago

The difference in indemnification based on which plan you’re on is super important. Thanks for pointing that out - never would have thought to look.

amelius

5 hours ago

Is this clause even legally valid?

How can the user know if the LLM produces anything that violates copyright?

(Of course they shouldn't have trained it on infringing content in the first place, and perhaps used a different model for enterprise, etc.)

galaxyLogic

4 hours ago

"... If AI-generated code cannot be copyrighted (as the courts suggest) ".

So, Supreme Court has said that. AI-produced code can not be copyrighted. (Am I right?). Then who's to blame if AI produces code large portions of which already exist coded and copyrigted by humans (or corporations).

I assume it goes something like this:

A) If you distribute code produced by AI, YOU cannot claim copyright to it.

B) If you distribute code produced by AI, YOU CAN be held liable for distributing it.

jcranmer

3 hours ago

SCOTUS hasn't ruled on any AI copyright cases yet. But they've said in Feist v Rural (1991) that copyright requires a minimum creative spark. The US Copyright Office maintains that human authorship is required for copyright, and the 9th Circuit in 2019 explicitly agreed with the law that a non-human animal cannot hold any copyright.

Functionally speaking, AI is viewed as any machine tool. Using, say, Photoshop to draw an image doesn't make that image lose copyright, but nor does it imbue the resulting image with copyright. It's the creativity of the human use of the tool (or lack thereof) that creates copyright.

Whether or not AI-generated output a) infringes the copyright of its training data and b) if so, if it is fair use is not yet settled. There are several pending cases asking this question, and I don't think any of them have reached the appeals court stage yet, much less SCOTUS. But to be honest, there's a lot of evidence of LLMs being able to regurgitate training inputs verbatim that they're capable of infringing copyright (and a few cases have already found infringement in such scenarios), and given the 2023 Warhol decision, arguing that they're fair use is a very steep claim indeed.

larodi

2 hours ago

The lack thereof (of human use). Prompts are not copyrightable thus the output also - not. Besides retelling a story is fair use, right? Otherwise we should ban all generative AI and prepare for Dune/Foundation future. But we not there, and we perhaps never going to be.

So the LLM training first needs to be settled, then we talk whether retelling a whole software package infringes anyone's right. And even if it does, there are no laws in place to chase it.

jcranmer

2 hours ago

> Besides retelling a story is fair use, right?

Actually, most of the time, it is not.

tzs

2 hours ago

The Supreme Court has not ruled on this issue. An appeal of a lower court's ruling on this issue was appealed to the Supreme Court but the Supreme Court declined to accept the case.

The Supreme Court has "original jurisdiction" over some types of cases, which means if someone brings such a case to them they have to accept it and rule on it, and they have "discretionary jurisdiction" over many more types of cases, which means if someone brings one of those they can choose whether or not they have to accept it. AI copyright cases are discretionary jurisdiction cases.

You generally cannot reliable infer what the Supreme Court thinks of the merits of the case when they decline to accept it, because they are often thinking big picture and longer term.

They might think a particular ruling is needed, but the particular case being appealed is not a good case to make that ruling on. They tend to want cases where the important issue is not tangled up in many other things, and where multiple lower appeals courts have hashed out the arguments pro and con.

When the Supreme Court declines the result is that the law in each part of the country where an appeals court has ruled on the issue is whatever that appeals court ruled. In parts of the country where no appeals court has ruled, it will be decided when an appeal reaches their appeals courts.

If appeals courts in different areas go in different directions, the Supreme Court will then be much more likely to accept an appeal from one of those in order to make the law uniform.

throwup238

3 hours ago

IANAL but I was under the impression that Supreme Court ruling was very specific to the AI itself copyrighting its own produced code. Once a human is involved, it gets a lot more complicated and rests on whether the human's contribution was substantial enough to make it copyrightable under their person.

aeon_ai

4 hours ago

You've likely paid attention to the litigation here. Regardless of what remains to be litigated, the training in and of itself has already been deemed fair use (and transformative) by Alsup.

Further, you know that ideas are not protected by copyright. The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

If it were the case that the LLM ingested the code and regurgitated it (as would be the premise of highlighting the training data provenance), that similarity would be much higher. That is not the case.

calny

4 hours ago

You're right, I've followed the litigation closely. I've advocated for years that "training is fair use" and I'm generally an anti-IP hawk who DEFENDS copyright/trademark cases. Only recently have I started to concede the issue might have more nuance than "all training is fair use, hard stop." And I still think Judge Alsup got it right.

That said, even if model training is fair use, model output can still be infringing. There would be a strong case, for example, if the end user guides the LLM to create works in a way that copies another work or mimics an author or artist's style. This case clearly isn't that. On the similarity at issue here, I haven't personally compared. I hope you're right.

overfeed

38 minutes ago

> The code comparison in this demonstrates a relatively strong case that the expression of the idea is significantly different from that of the original code.

Can I use one AI agent to write detailed tests based on disassembled Windows, and another to write code that passes those same function-level tests? If so, I'm about to relicense Windows 11 - eat my shorts, ReactOS!

danlitt

10 hours ago

I am pretty sure this article is predicated on a misunderstanding of what a "clean room" implementation means. It does not mean "as long as you never read the original code, whatever you write is yours". If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy. Traditionally, a human-driven clean room implementation would have a vanishingly small probability of matching the original codebase enough to be considered a copy. With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly). Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

What the chardet maintainers have done here is legally very irresponsible. There is no easy way to guarantee that their code is actually MIT and not LGPL without auditing the entire codebase. Any downstream user of the library is at risk of the license switching from underneath them. Ideally, this would burn their reputation as responsible maintainers, and result in someone else taking over the project. In reality, probably it will remain MIT for a couple of years and then suddenly there will be a "supply chain issue" like there was for mimemagic a few years ago.

femto

8 hours ago

> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

That's not what the law says [1]. If two people happen to independently create the same thing they each have their own copyright.

If it's highly improbable that two works are independent (eg. the gcc code base), the first author would probably go to court claiming copying, but their case would still fail if the second author could show that their work was independent, no matter how improbable.

[1] https://lawhandbook.sa.gov.au/ch11s13.php?lscsa_prod%5Bpage%...

jerf

6 hours ago

It is true that if two people happen to independently create the same thing, they each have their own copyright.

It is also true that in all the cases that I know about where that has occurred the courts have taken a very, very, very close look at the situation and taken extensive evidence to convince the court that there really wasn't any copying. It was anything but a "get out of jail free" card; it in fact was difficult and expensive, in proportion to the size of the works under question, to prove to the court's satisfaction that the two things really were independent. Moreover, in all the cases I know about, they weren't actually identical, just, really really close.

No rational court could possibly ever come to that conclusion if someone claimed a line-by-line copy of gcc was written by them, they must have independently come up with it. The probably of that is one out of ten to the "doesn't even remotely fit in this universe so forget about it". The bar to overcoming that is simply impossibly high, unlike two songs that happen to have similar harmonies and melodies, given the exponentially more constrained space of "simple song" as compared to a compiler suite.

gruez

6 hours ago

All of this is moot for the purposes of LLM, because it's almost certain that the LLMs were trained on the code base, and therefore is "tainted". You can't do this with humans either. Clean room design requires separate people for the spec/implementation.

wareya

6 hours ago

That's the "but their case would still fail if the second author could show that their work was independent, no matter how improbable" part of the post you're responding to.

jerf

6 hours ago

One out of ten to the power of "forget about it" is not improbable, it's impossible.

I know it's a popular misconception that "impossible" = a strict, statistical, mathematical 0, but if you try to use that in real life it turns out to be pretty useless. It also tends to bother people that there isn't a bright shining line between "possible" and "impossible" like there is between "0 and strictly not 0", but all you can really do is deal with it. Where ever the line is, this is literally millions of orders of magnitude on the wrong side of it. Not a factor of millions, a factor of ten to the millions. It's not possible to "accidentally" duplicate a work of that size.

wareya

5 hours ago

It sounds to me like you're responding to a different argument than they're actually making and reading intent into it that isn't written into it.

danlitt

4 hours ago

Thank you for providing a reference! I certainly admit that "very similar photographs are not copies" as the reference states. And certainly physical copying qualifies as copying in the sense of copyright. However I still think copying can happen even if you never have access to a copy.

I suppose a different way of stating my position is that some activities that don't look like copying are in fact copying. For instance it would not be required to find a literal copy of the GCC codebase inside of the LLM somehow, in order for the produced work to be a copy. Likewise if I specify that "Harry Potter and the Philosopher's Stone is the text file with hash 165hdm655g7wps576n3mra3880v2yzc5hh5cif1x9mckm2xaf5g4" and then someone else uses a computer to brute force find a hash collision, I suspect this would still be considered a copy.

I think there is a substantial risk that the automatic translation done in this case is, at least in part, copying in the above sense.

brians

8 hours ago

I do not agree with your interpretation of copyright law. It does ban copies: there has to be information flow from the original to the copy for it to be a "copy." Spontaneous generation of the same content is often taken by the courts to be a sign that it's purely functional, derived from requirements by mathematical laws.

Patent law is different and doesn't rely on information flow in the same way.

kevin_thibedeau

6 hours ago

Derivative works can also run afoul of copyright. An LLM trained on a corpus of copyrighted code is creating derivative works no matter how obscure the process is.

wareya

6 hours ago

This actually isn't what legal precedent currently says. The precedent is currently looking at actual output, not models being tainted. If you think this is morally wrong, look into getting the laws changed (serious).

Georgelemental

6 hours ago

What about a human trained on having 30 years of experience working with copyrighted codebases?

mftrhu

4 hours ago

Said human would likely not be able to create a clean-room implementation of any of the codebases they worked on.

aeon_ai

5 hours ago

Judge Alsup -- U.S. District Judge William Alsup said Anthropic made "fair use" of books, deeming it "exceedingly transformative."

"Like any reader aspiring to be a writer, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different"

danlitt

4 hours ago

I disagree that information flow is required. Do you have a reference for that? Certainly it is an important consideration. But consider all the real literary works contained in the infinite library of babel.[1] Are they original works just because no copy was used to produce them?

[1]: https://libraryofbabel.info/

Filligree

2 hours ago

Yes; the works are original.

However, describing the path you need to get there requires copyright infringement.

BoredPositron

7 hours ago

Well discovery might be a fun exercise to see if the code is in the dataset of the llm.

bjord

7 hours ago

if?

petercooper

8 hours ago

The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation

I know you were simplifying, and not to take away from your well-made broader point, but an API-derived implementation can still result in problems, as in Google vs Oracle [1]. The Supreme Court found in favor of Google (6-2) along "fair use" lines, but the case dodged setting any precedent on the nature of API copyrightability. I'm unaware if future cases have set any precedent yet, but it just came to mind.

[1]: https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....

lokar

4 hours ago

Yeah, a cleanroom re-write, or even "just" a copy of the API spec is something to raise as a defense during a trial (along with all other evidence), it's not a categorical exemption from the law.

Also, I find it important that here the API is really minimal (compared to the Java std lib), the real value of the library is in the internal detection logic.

danlitt

4 hours ago

This is exactly what I had in mind when I said I was simplifying :) it is a valid point.

zabzonk

8 hours ago

> It does not mean "as long as you never read the original code, whatever you write is yours"

I think there is precedence that says exactly this - for example the BIOS rewrites for the IBM PC from people like Phoenix. And it would be trivial to instruct an LLM to prefer to use (say, in assembler) register C over register B wherever that was possible, resulting in different code.

danlitt

3 hours ago

As long as you never read the original code, it is very likely that whatever you write is yours. So I would not be surprised to read judges indicating in this direction. But I would be a little surprised to find out this was an actual part of the test, rather than an indication that the work was considered to have been copied. There are for instance lots of ways of reproducing copyrighted work without using a copy directly, but naive methods like generating random pieces of text are very time consuming, so there is not much precedence around them. LLMs are much more efficient at it!

bandrami

8 hours ago

Different but still derivative

zabzonk

8 hours ago

Well, I am not exactly a hotshot 8086 programmer (though I do alright) but if I was asked to reproduce the IBM BIOS (which I have seen) I think I would come up with something very similar but not identical - it is really not rocket science code, so the LLM replacing me would have rather few alternatives to choose from.

wareya

7 hours ago

> If you had a hermetically sealed code base that just happened to coincide line for line with the codebase for GCC, it would still be a copy.

If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright. You're going to get sued and lose, but platonically, you're in the clear. If it's merely somewhat similar, then you're probably in the clear in practice too: it gets very easy very fast to argue that the similarities are structural consequences of the uncopyrightable parts of the functionality.

> The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation (I am simplifying slightly).

This is almost the opposite of correct. A clean room implementation's dirty phase produces a specification that is allowed to include uncopyrightable implementation details. It is NOT defined as producing an API, and if you produce an API spec that matches the original too closely, you might have just dirtied your process by including copyrightable parts of the shape of the API in the spec. Google vs Oracle made this more annoying than it used to be.

> Whether the reimplementation is actually a "new implementation" is a subjective but empirical question that basically hinges on how similar the new codebase is to the old one. If it's too similar, it's a copy.

If you follow CRRE, it's not a copy, full stop, even if it's somehow 1:1 identical. It's going to be JUDGED as a copy, because substantial similarity for nontrivial amounts of code means that you almost certainly stepped outside of the clean room process and it no longer functions as a defense, but if you did follow CRRE, then it's platonically not a copy.

> What the chardet maintainers have done here is legally very irresponsible.

I agree with this, but it's probably not as dramatic as you think it is. There was an issue with a free Japanese font/typeface a decade or two ago that was accused of mechanically (rather than manually) copying the outlines of a commercial Japanese font. Typeface outlines aren't copyrightable in the US or Japan, but they are in some parts of Europe, and the exact structure of a given font is copyrightable everywhere (e.g. the vector data or bitmap field for a digital typeface, as opposed to the idea of its shape). What was the outcome of this problem? Distros stopped shipping the font and replaced it with something vaguely compatible. Was the font actually infringing? Probably not, but better safe than sorry.

danlitt

3 hours ago

> If you somehow actually randomly produce the same code without a reference, it's not a copy and doesn't violate copyright.

I don't believe this, and I doubt that the sense of copying in copyright law is so literal. For instance, if I generated the exact text of a novel by looking for hash collisions, or by producing random strings of letters, or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it. There need not have been any copy anywhere near me for this to happen. Whether it is likely or not depends on the technique used - naive techniques make this very unlikely, but techniques can improve.

It is also true that similarity does not imply copying - if you and I take an identical photograph of the same skyline, I have not copied you and you have not copied me, we have just fixed the same intangible scene into a medium. The true subjective test for copying is probably quite nuanced, I am not sure whether it is triggered in this case, but I don't think "clean room LLMs" are a panacea either.

> dirty phase produces a specification ... it is NOT defined as producing an API

This does not really sound like "the opposite of correct". APIs are usually not copyrightable, the truth is of course more complicated, if you are happy to replace "API" with "uncopyrightable specification" then we can probably agree and move on.

> it's probably not as dramatic as you think it is

In reality I am very cynical and think nothing will come of this, even if there are verbatim snippets in the produced code. People don't really care very much, and copyright cases that aren't predicated on millions of dollars do not survive the court system very long.

wareya

3 hours ago

> I don't believe this, and I doubt that the sense of copying in copyright law is so literal.

It is actually that literal, really.

> For instance, if I generated the exact text of a novel by looking for hash collisions,

This is a copyright violation because you're using the original to construct the copy. It's not a pure RNG.

> or by producing random strings of letters,

This wouldn't be a copyright violation, but nobody would believe you.

> or by hammering the middle button on my phone's autosuggestion keyboard, I would still have produced a copy and I would not be safe to distribute it.

This would probably be a copyright violation.

You probably think that this is hypothetical, but problems like this do actually go to court all the time, especially in the music industry, where people try to enforce copyright on melodies that have the informational uniqueness of an eight-word sentence.

> APIs are usually not copyrightable,

This was commonly believed among developers for a long time, but it turned out to not be true.

> This does not really sound like "the opposite of correct".

The important part is that information about the implementation can absolutely be in the spec without necessarily being copyrightable (and in real world clean room RE, you end up with a LOT of implementation details). You were saying the opposite, that it was a spec of the API as opposed to a spec of the implementation.

thousand_nights

7 hours ago

the whole concept of a "clean room" implementation sounds completely absurd.

a bunch of people get together, rewrite something while making a pinky promise not to look at the original source code

guaranteeing the premise is basically impossible, it sounds like some legal jester dance done to entertain the already absurd existing copyright laws

myrmidon

7 hours ago

> it sounds like some legal jester dance done to entertain [...] copyright laws

Clean room implementations are a jester dance around the judiciary. The whole point is to avoid legal ambiguity.

You are not required to do this by law, you are doing this voluntarily to make potential legal arguments easier.

The alternative is going over the whole codebase in question and arguing basically line by line whether things are derivative or not in front of a judge (which is a lot of work for everyone involved, subjective, and uncertain!).

bandrami

6 hours ago

In the archetypal example IBM (or whoever it was) had to make sure the two engineering teams were never in the cafeteria together at the same time

dudeinhawaii

7 hours ago

It usually refers to situations without access to the source code.

I've always taken "clean room" to be the kind of manufacturing clean room (sealed/etc). You're given a device and told "make our version". You're allowed to look, poke, etc but you don't get the detailed plans/schematics/etc.

In software, you get the app or API and you can choose how to re-implement.

In open source, yes, it seems like a silly thing and hard to prove.

Forgeties79

7 hours ago

Halt and Catch Fire did a pretty funny rendition of this song and dance

foooorsyth

7 hours ago

>The actual meaning of a "clean room implementation" is that it is derived from an API and not from an implementation

This is incorrect and thinking this can get you sued

https://en.wikipedia.org/wiki/Structure,_sequence_and_organi...

amiga386

4 hours ago

Whether you get sued is more on the plaintiff than you.

Per your link, the Supreme Court's thinking on "structure, sequence and organization" (Oracle's argument why Google shouldn't even be allowed to faithfully produce a clean-room implementation of an an API) has changed since the 1980s out of concern that using it to judge copyright infringement risks handing copyright holders a copyright-length monopoly over how to do a thing:

> enthusiasm for protection of "structure, sequence and organization" peaked in the 1980s [..] This trend [away from "SS&O"] has been driven by fidelity to Section 102(b) and recognition of the danger of conferring a monopoly by copyright over what Congress expressly warned should be conferred only by patent

The Supreme Court specifically recognised Google's need to copy the structure, sequence and organization of Java APIs in order to produce a cleanroom Android runtime library that implemented Java APIs so that that existing Java software could work correctly with it.

Similarly, see Oracle v. Rimini Street (https://cdn.ca9.uscourts.gov/datastore/opinions/2024/12/16/2...) where Rimini Street has been producing updates that work with Oracle's products, and Oracle claimed this made them derivative works. The Court of Appeals decided that no, the fact A is written to interoperate with B does not necessarily make A a derivative work of B.

danlitt

3 hours ago

I did not expect people to take "API" so literally. This point is what I was referring to when I said "I am simplifying slightly". The point is that a clean room impl begins from a specification of what the software does, and that the new implementation is purported to be derived only from this. What I am trying to say is that "not looking at the implementation" is not exactly the point of the test - that is a rule of thumb, which works quite well for avoiding copyright infringement, but only when humans do it.

umvi

5 hours ago

You can be sued for any reason if a company feels threatened (see: Oracle v Google)

j45

5 hours ago

This reminds me of a full rewrite.

When a developer reimplements a complete new version of code from scratch, with an understanding only, a new implementation generally should be an improvement on any source code not equal.

In today’s world, letting LLMs replicate anything will generate average code as “good” and generally create equivalent or more bloat anyways unless well managed.

StilesCrisis

4 hours ago

The world is chock-full of rewrites that came out disastrously worse than the thing they intended to replace. One of Spolsky's most-quoted articles of all time was about this.

https://www.joelonsoftware.com/2000/04/06/things-you-should-...

> They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch.

j45

an hour ago

Oh, for sure, rewrites generally do fail especially if the incoming lessons from the existing version aren't clear.

Finding a middle ground of building a roadmap to refactoring your way forward is often much better.

Appreciate the Joel link, nice to see that kind of stuff again.

With that being said if it's the same small team that built the first version, there can be a calculated risk to driving a refactor towards a rewrite with the right conditions. I says this because I have been able to do it in this conditions a few times, it still remains very risky. If it's a new or different team later on trying to rewrite, all bets are off anyways.

We have to remember 70% of software projects fail at the best of times, independent of rewrites.

jen20

5 hours ago

> What the chardet maintainers have done here is legally very irresponsible.

Perhaps the maintainer wants to force the issue?

> Any downstream user of the library is at risk of the license switching from underneath them.

Checking the license of the transitive closure of your dependencies is table stakes for using them.

danlitt

3 hours ago

> Perhaps the maintainer wants to force the issue?

I doubt it, and I don't see any evidence that's what they're doing. There are probably better ways, if that's what they want.

> Checking the license of the transitive closure of your dependencies is table stakes for using them.

Checking the license of the transitive closure of your dependencies is only feasible when the library authors behave responsibly.

dathinab

9 hours ago

the author speaks about code which is syntactically completely different but semantically does the same

i.e. a re-implementation

which can either

- be still derived work, i.e. seen as you just obfuscating a copyright violation

- be a new work doing the same

nothing prevents an AI from producing a spec based on a API, API documentation and API usage/fuzzing and then resetting the AI and using that spec to produce a rewrite

I mean "doing the same" is NOT copyright protection, you need patent law for that. Except even with patent law you need to innovations/concepts not the exact implementation details. Which means that even if there are software patents (theoretically,1) most things done in software wouldn't be patentable (as they are just implementation details, not inventions)

(1): I say theoretically because there is a very long track record of a lot of patents being granted which really should never be granted. This combined with the high cost of invalidating patents has caused a ton of economical damages.

jacquesm

9 hours ago

No, that depends on whether or not the AI work product rests on key contributions to its training set without which it would not be able to the the work, see other comment. In that case it looks like 'a new work doing the same' but it still a derived work.

Ted Nelson was years ahead of the future where we really needed his Xanadu to keep track of fractional copyright. Likely if we had such a mechanism, and AI authors respected it then we would be able to say that your work is derived from 3000 other original works and that you added 6 lines of new code.

uyzstvqs

8 hours ago

No, training and inference are two separate processes. Training data is never redistributed, only obtained and analyzed. What matters is what data is put into context during inference. This is controlled by the user.

AI/ML is complex, so as a simpler analogy: If I watch The Simpsons, and I create an amusing infographic of how often Homer says "D'oh!" over time, my infographic would be an original work. AI training follows the same principle.

jacquesm

8 hours ago

> my infographic would be an original work.

> AI training follows the same principle.

If you really believe that then we can't have a meaningful conversation about this, that's not even ELIF territory, that's just disconnected. You should be asking questions, not telling people how it works.

ndriscoll

7 hours ago

How exactly is it different? All the model itself is is a probability distribution for next token given input, fitted to a giant corpus. i.e. a description of statistical properties. On its own it doesn't even "do" anything, but even if you wrap that in a text generator and feed it literal gcc source code fragments as input context, it will quickly diverge. Because it's not a copy of gcc. It doesn't contain a copy of gcc. It's a description of what language is common in code in general.

In fact we could make this concrete: use the model as the prediction stage in a compressor, and compress gcc with it. The residual is the extent to which it doesn't contain gcc.

jacquesm

7 hours ago

There already have been multiple documented cases of LLMs spitting out fairly large chunks of the input corpus. There have been some experiments to get it to replicate the entirety of 'Moby Dick' with some success for one model but less success with others most likely due to output filtering to prevent the generation of such texts, but that doesn't mean they're not in there in some form. And how could they not be, it is just a lossy compression mechanism, the degree of loss is not really all that relevant to the discussion.

ndriscoll

6 hours ago

Are you referring to this?

https://osyuksel.github.io/blog/reconstructing-moby-dick-llm...

I see a test where one model managed to 85% reproduce a paragraph given 3 input paragraphs under 50% of the time.

So it can't even produce 1 paragraph given 3 as input, and it can't even get close half the time.

"Contains Moby Dick" would be something like you give it the first paragraph and it produces the rest of the book. What we have here instead is a statistical model that when given passages can do an okay job at predicting a sentence or two, but otherwise quickly diverges.

xyzzy_plugh

6 hours ago

I'm no longer certain what point you're trying to make.

Getting close less than half the time given three paragraphs as input still sounds like red-handed copyright infringement to me.

If I sample a copyrighted song in my new track, clip it, slow it down, and decimate the bit rate, a court would not let me off the hook.

It doesn't matter how much context you push into these things. If I feed them 50% of Moby Dick and they produce the next word, and I can repeatedly do that to produce the entire book (I'm pretty sure the number of attempts is wholly irrelevant: we're impossibly far from monkeys on typewriters) then we can prove the statistical model encodes the book. The further we are from that (and the more we can generate with less) then the stronger the case is. It's a pretty strong case!

ndriscoll

6 hours ago

That's... not how this works.

> If I feed them 50% of Moby Dick and they produce the next word and I can repeatedly do that to produce the entire book... then we can prove the statistical model encodes the book.

It can't because it doesn't. That's what it means to say it diverges.

The "number of attempts" is you cheating. You're giving it the book when you let it try again word by word until it gets the correct answer, and then claiming it produced the book. That's exactly the residual that I said characterizes the extent to which it doesn't know the book. Trivially, no matter how bad the model is, if you give it the residual, it can losslessly compress anything at all.

If you had a simple model that just predicts next word given current word (trained on word pair frequency across all English text, or even all text excluding Moby Dick), and then give it retries until it gets the current word right, it will also quickly produce the book. Because it was your retry policy that encoded the book, not the model. Without that policy, it will get it wrong within a few words, just like these models do.

xyzzy_plugh

5 hours ago

But it does encode it! Each subsequent token's probability space encodes the next word(s) of the book with a non-zero probability that is significantly higher than random noise.

If you had access to a model's top p selection then I'd bet the book is in there consistently for every token. Is it statistically significant? Might be!

I'm not cheating because the number of attempts is so low it's irrelevant.

If I were to take a copyrighted work and chunk it up into 1000 pieces and encrypt each piece with a unique key, and give you all the pieces and keys, would it still be the copyrighted work? What if I shave off the last bit of each key before I give them to you, so you have a 50% chance of guessing the correct key for each piece? What if I shave two bits? What if it's a million pieces? When does it become transformative or no longer infringing for me to distribute?

The answer might surprise you.

ndriscoll

5 hours ago

Your test is more like the following:

Consider a password consisting of random words each chosen from a 4k dictionary. Say you choose 10 words. Then your password has log_2(4k)*10 entropy.

Now consider a validator that tells you when you gets a word right. Then you can guess one word at a time, and your password strength is log_2(4k*10). Exponentially weaker.

You're constructing the second scenario and pretending it's the first.

Also in your 50% probability scenario, each word is 1 bit, and even 50-100 bits is unguessable. A 1000 word key where each word provides 1 bit would be absurdly strong.

xyzzy_plugh

5 hours ago

You're still missing the point. The numbers don't matter because it's copyright infringement as long as I can get the book out. As long as I know the key, or the seed, I can get the book out. In court, how would you prove it's not infringement?

ndriscoll

5 hours ago

Because you put the book in. Again, this is measurable. Compress the book with a model as the predictor. The residual is you having to give it the answer. It's literally you telling it the book.

jacquesm

6 hours ago

The point is that the AI's themselves and their backers are on the record as saying that the AI could reproduce copyrighted works in their entirety but that there are countermeasures in place to stop them from doing so.

I wonder what the results would be if I spent time to train a model up from scratch without any such constraints. But I'm much too busy with other stuff right now, but that would be an interesting challenge.

ndriscoll

5 hours ago

Yeah just like a star could appear inside of Earth from quantum pair production at any given moment. But realistically, it can't. And you can't even show a test where any model can get more than a few tokens in a row correct.

These companies just don't want to deal with people complaining that it reproduces something when they don't understand that they're literally giving it the answer.

jacquesm

3 hours ago

You do realize you are now arguing against your own case don't you?

gus_massa

5 hours ago

For an infographic, perhaps you claim claim fair use. I think it makes a lot of sense, but IANAL.

For a fan fiction episode that is different from all official episodes, you may cross your fingers.

For a remake of one of the episodes with a different camera angle and similar dialog, I expect that you will get in problems.

ndriscoll

5 hours ago

Is the claim that these models can 1 shot a Simpsons episode remake with different camera angle and similar dialog from a prompt like "produce Simpsons episode S01E04"? Or are we falling into the "the user doesn't notice that they told the model the answer, and the model in fact did not memorize the thing" trap?

pmarreck

8 hours ago

> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).

I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)

https://github.com/pmarreck?tab=repositories&type=source

I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec

Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.

With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.

ostacke

8 hours ago

Bu that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean room implementation since the LLM, one must assume, is trained on the code it's re-implementing.

galaxyLogic

4 hours ago

BUt LLM has seen millions (?) of other code-bases too. If you give it a functional spec it has no reason to prefer any one of those code-bases in particular. Except perhaps if it has seen the original spec (if such can be read from public sources) associated with the old implementation, and the new spec is a copy of the old spec.

sarchertech

2 hours ago

Yes if you are solving the exact problem that the original code solved and that original code was labeled as solving that exact problem then that’s very good reason for the LLM to produce that code.

Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.

airza

8 hours ago

By what means did you make sure your LLM was not trained with data from the original source code?

MrManatee

4 hours ago

Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.

My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.

danlitt

4 hours ago

I only said the probability is higher, not that the probability is 1!

pornel

8 hours ago

Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Even copyright laws with provisions for machine learning were written when that meant tangential things like ranking algorithms or training of task-specific models that couldn't directly compete with all of their source material.

For code it also completely changes where the human-provided value is. Copyright protects specific expressions of an idea, but we can auto-generate the expressions now (and the LLM indirection messes up what "derived work" means). Protecting the ideas that guided the generation process is a much harder problem (we have patents for that and it's a mess).

It's also a strategic problem for GNU. GNU's goal isn't licensing per se, but giving users freedom to control their software. Licensing was just a clever tool that repurposed the copyright law to make the freedoms GNU wanted somewhat legally enforceable. When it's so easy to launder code's license now, it stops being an effective tool.

GNU's licensing strategy also depended on a scarcity of code (contribute to GCC, because writing a whole compiler from scratch is too hard). That hasn't worked well for a while due to permissive OSS already reducing scarcity, but gen AI is the final nail in the coffin.

pocksuppet

7 hours ago

It's not a problem. If you give a work to an AI and say "rewrite this", you created a derivative work. If you don't give a work to an AI and say "write a program that does (whatever the original code does)" then you didn't. During discovery the original author will get to see the rewriter's Claude logs and see which one it is. If the rewriter deleted their Claude logs during the lawsuit they go to jail. If the rewriter deleted their Claude logs before the lawsuit the court interprets which is more likely based on the evidence.

hennell

4 hours ago

But the AI has the work to derive from already. I just went to Gemini and said "make me a picture of a cartoon plumber for a game design". Based on your logic the image it made me of a tubby character with a red cap, blue dungarees, red top and a big bushy mustache is not a derivative work...

(interestingly asking it to make him some friends it gave me more 'original' ideas, but asking it to give him a brother and I can hear the big N's lawyers writing a letter already...)

buckle8017

6 hours ago

Except Claude was for sure trained on the original work and when asked to produce a new product that does the same thing will just spit out a (near) copy

umvi

5 hours ago

Ok, but what if in the future I could guarantee that my generative model was not trained on the work I want to replicate. Like say X library is the only library in town for some task, but it has a restrictive license. Can I use a model that was guaranteed not trained on X to generate a new library Z that competes with X with a more permissive license? What if someone looks and finds a lot of similarities?

vunderba

3 hours ago

This is what Adobe ostensibly is trying to do with their GenAI image model, Firefly.

https://en.wikipedia.org/wiki/Adobe_Firefly

buckle8017

4 hours ago

I wish you luck proving it wasn't trained on the original library or any work that infringed itself.

airforce1

4 hours ago

I think there could be a market for "permissive/open models" in the future where a company specifically makes LLM models that are trained on a large corpus of public domain or permissively licensed text/code only and you can prove it by downloading the corpus yourself and reproducing the exact same model if desired. Proving that all MIT licensed code is non-infringing is probably impossible though at that point copyright law is meaningless because everyone would be in violation if you dig deep enough.

leecommamichael

7 hours ago

“Changing the equation” by boldly breaking the law.

Majromax

3 hours ago

> “Changing the equation” by boldly breaking the law.

Is it? I think the law is truly undeveloped when it comes to language models and their output.

As a purely human example, suppose I once long ago read through the source code of GCC. Does this mean that every compiler I write henceforth must be GPL-licensed, even if the code looks nothing like GCC code?

There's obviously some sliding scale. If I happen to commit lines that exactly replicate GCC then the presumption will be that I copied the work, even if the copying was unconscious. On the other hand, if I've learned from GCC and code with that knowledge, then there's no copyright-attaching copy going on.

We could analogize this to LLMs: instructions to copy a work would certainly be a copy, but an ostensibly independent replication would be a copy only if the work product had significant similarities to the original beyond the minimum necessary for function.

However, this is intuitively uncomfortable. Mechanical translation of a training corpus to model weights doesn't really feel like "learning," and an LLM can't even pinky-promise to not copy. It might still be the most reasonable legal outcome nonetheless.

leecommamichael

an hour ago

Non-sequitur. It can be both.

mlinhares

4 hours ago

Its only breaking the law if you don't have enough money to pay the politicians.

empath75

7 hours ago

> Generative AI changed the equation so much that our existing copyright laws are simply out of date.

Ideas are not protected by copyright, expression of ideas is.

You can't legally copy a creative work, but you can describe the idea of the work to an AI and get a new expression of it in a fraction of the time it took for the original creator to express their idea.

The whole premise of copyright is that ideas aren't the hard part, the work of bringing that idea to fruition is, but that may no longer be true!

satvikpendem

6 hours ago

Honestly, good. Copyright and IP law in general have been so twisted by corporations that only they benefit now, see Mickey Mouse laws by Disney for example, or patenting obvious things like Nintendo or even just patent trolling in general.

hamdingers

2 hours ago

The biggest recording artist in the world right now had to re-record her early albums because she didn't own the copyright, imagine how many artists don't get that big and never have that opportunity.

That individual artists are still defending this system is baffling to me.

ajross

7 hours ago

> GNU's goal isn't licensing per se, but giving users freedom to control their software.

I think that's maybe misunderstanding. GNU wants everyone to be able to use their computers for the purposes they want, and software is the focus because software was the bottleneck. A world where software is free to create by anyone is a GNU utopia, not a problem.

Obviously the bigger problem for GNU isn't software, which was pretty nicely commoditized already by the FOSS-ate-the-world era of two decades ago; it's restricted hardware, something that AI doesn't (yet?) speak to.

nairboon

15 hours ago

That code is still LGPL, it doesn't matter what some release engineer writes in the release notes on Github. All original authors and copyright holders must have explicitly agreed to relicense under a different license, otherwise the code stays LGPL licensed.

Also the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different of this case. Here we're talking about a tool that transformed source code and somehow magically got rid of copyright due to this transformation? Imagine the consequences to the US copyright industry if that were actually possible.

pocksuppet

7 hours ago

In the legal system there's no such thing as "code that is LGPL". It's not an xattr attached to the code.

There is an act of copying, and there is whether or not that copying was permitted under copyright law. If the author of the code said you can copy, then you can. If the original author didn't, but the author of a derivative work, who wasn't allowed to create a derivative work, told you you could copy it, then it's complicated.

And none of it's enforced except in lawsuits. If your work was copied without permission, you have to sue the person who did that, or else nothing happens to them.

pavlov

10 hours ago

If anything, the SCOTUS decision would seem to imply that generative AI transformations produce no additional creative contribution and therefore the original copyright holder has all rights to any derived AI works.

(IANAL)

dathinab

9 hours ago

that is a very good formulation of what I have been trying to say

but also probably not fully right

as far as I understand they avoid the decision of weather an AI can produce creative work by saying that the neither the AI nor it's owner/operator can claim ownership of copyright (which makes it de-facto public domain)

this wouldn't change anything wrt. derived work still having the original authors copyright

but it could change things wrt. parts in the derived work which by themself are not derived

pseudalopex

9 hours ago

The court avoided a decision of what the operator could have copyrighted because he said he was not the author.

bandrami

6 hours ago

That's a reasonable theory though it's stuck with the problem that any model will by its training be derivative of codebases that have incompatible licenses, and that in fact every single use of an LLM is therefore illegal (or at least tortious).

dathinab

9 hours ago

iff it went through the full clean room rewrite just using AI then no, it's de-facto public domain (but also it probably didn't do so)

iff it is a complete new implementation with completely different internal then it could also still be no LGPL even if produced by a person with in depth knowledge. Copyright only cares if you "copied" something not if you had "knowledge" or if it "behaves the same". So as long as it's distinct enough it can still be legally fine. The "full clean room" requirement is about "what is guaranteed to hold up in front of a court" not "what might pass as non-derivative but with legal risk".

kshri24

14 hours ago

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

How would that work? We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not. IANAL but IMHO it is totally illegal as no permission was sought from authors of source code the models were trained on. So there is no way to just release the code created by a machine into public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would be considered in the scope of "reverse engineering" and that is not specific only to humans. You can extend it to machines as well.

EDIT: I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code. And a licensing model with original authors (all Github users who contributed code in some form) should be setup to be reimbursed by AI companies. In other words, a % of profits must flow back to community as a whole every time code-related tokens are generated. Even if everyone receives pennies it doesn't matter. That is fair. Also should extend to artists whose art was used for training.

kouteiheika

14 hours ago

> I would go so far as to say the most restrictive license that the model is trained on should be applied to all model generated code.

There are research models out there which are trained on only permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks when compared to state-of-art.

But I guess the funniest consequence of the "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at very least force a revert to a pre-AI era commit) every open source project which may have included any AI-generated or AI-assisted code, which currently pretty much includes every major open source project out there. And it would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.

progval

13 hours ago

> There are research models out there which are trained on only permissively licensed data

Models whose authors tried to train only on permissively licensed data.

For example https://huggingface.co/bigcode/starcoder2-15b tried to be a permissively licensed dataset, but it filtered only on repository-level license, not file-level. So when searching for "under the terms of the GNU General Public License" on https://huggingface.co/spaces/bigcode/search-v2 back when it was working, you would find it was trained on many files with a GPL header.

kshri24

14 hours ago

I agree with your assessment. Which is why I was proposing a middle-ground where an agreement is setup between the model training company and the collective of developers/artists et all and come up with a license agreement where they are rewarded for their original work for perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair not only because companies are using AI generated output but developers themselves are also paying and using AI generated output that is trained on other developer's input. I would feel good (in my conscience) that I am not "stealing" someone else's effort and they are being paid for it.

carlob

10 hours ago

Why settle on some private agreement between creators and ai companies where a tiny percentage is shared, let's just tax the hell out of AI companies and redistribute.

rswail

8 hours ago

Because the authors of the original content deserve recompense for their work.

That's what the whole copyright and patent regimes are designed to achieve.

It's to encourage the creation of knowledge.

US Constitution, Article I, section 8:

    To promote the Progress of Science and useful Arts, by
    securing for limited Times to Authors and Inventors the
    exclusive Right to their respective Writings 
    and Discoveries;

carlob

8 hours ago

Right, it says exclusive rights, which does not translate to "we siphon everything and you get a tiny percentage of our profits", it means I can choose to say no to all of this. To me the matter of compensation and that of authorship rights are mostly orthogonal.

kshri24

10 hours ago

> let's just tax the hell out of AI companies and redistribute.

That's not what I favor because you are inserting a middleman, the Government, into the mix. The Government ALWAYS wants to maximize tax collections AND fully utilize its budget. There is no concept of "savings" in any Government anywhere in the World. And Government spending is ALWAYS wasteful. Tenders floated by Government will ALWAYS go to companies that have senators/ministers/prime ministers/presidents/kings etc as shareholders. In other words, the tax money collected will be redistributed again amongst the top 500 companies. There is no trickle down. Which is why agreements need to be between creators and those who are enjoying fruits of the creation. What have Governments ever created except for laws that stifle innovation/progress every single time?

carlob

8 hours ago

> What have Governments ever created except for laws that stifle innovation/progress every single time?

https://www.youtube.com/watch?v=Qc7HmhrgTuQ

In all seriousness without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants a stable and lawful society that allow you to do any kind of innovation.

Apart from that, you have answered to a strawman. I said redistribute, not give to the government. I explicitly worded things that way because I don't think we should not be having a discussion on policy.

I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one take from labor. If that happens then laborers will have very little discretionary income to fuel consumption and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally, however that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.

kshri24

6 hours ago

> Apart from that, you have answered to a strawman. I said redistribute, not give to the government

You said: "let's just tax the hell out of AI companies and redistribute.". Only the Government has the power to tax. Question of redistribution does not even arise without first having the power to the coffers of the Company. Which you nor I have. Government CAN have if it wants to by either Nationalizing the Company or as you said "taxing the hell out of" the company. Please explain how you would go about taxing and redistributing without involving the Government?

> In all seriousness without the government you would have no innovation and progress, because it's the public school system, functioning roads, research grants a stable and lawful society that allow you to do any kind of innovation.

These fall under the ambit of governance and hence why you have a Government. That's the only power Governments should have. Governments SHOULD NOT be managing private enterprises.

> I think we are moving to an economy where the share of profits taken by capital becomes much larger than the one take from labor. If that happens then laborers will have very little discretionary income to fuel consumption and even capitalists will end up suffering. We can choose to redistribute now or wait for it to happen naturally, however that usually happens in a much more violent way, be it hyperinflation, famine, war or revolution.

Agreed. Which is why I was proposing private agreements in the first place (without involving a third-party like the Government which, more often than not, mismanages funds).

LadyCailin

10 hours ago

Uh, no? https://en.wikipedia.org/wiki/Government_Pension_Fund_of_Nor...

Just because you have a failure of imagination for how government should work, doesn’t mean it can’t work. And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.

kshri24

9 hours ago

Pension fund is an example of what exactly? All countries have pension funds. This has nothing to do with Governments wasting money. Please go beyond tiny European countries that have very few verticals and are largely dependent on outside support for protecting their sovereignty. They are not representative of most of the World.

> As its name suggests, the Government Pension Fund Global is invested in international financial markets, so the risk is independent from the Norwegian economy. The fund is invested in 8,763 companies in 71 countries (as of 2024).

Basically what I said above. You give your tax dollars to Government and it will invest it into top 500 companies. In the Norway Pension Fund case it is 8,763 companies in 71 countries. None of them are startups/small businesses/creators.

> And stifling innovation is exactly what I want, when that innovation is “steal from everyone so we can invent the torment nexus” or whatever’s going on these days.

You are confusing current lack of laws regulating this space with innovation being evil. Innovation is not evil. The technology per se is not evil. Every innovation brings with it a set of challenges which requires us to think of new legislation. This has ALWAYS been the case for thousands of years of human innovation.

kouteiheika

11 hours ago

> Which is why I was proposing a middle-ground where an agreement is setup between the model training company and the collective of developers/artists et all and come up with a license agreement where they are rewarded for their original work for perpetuity. A tiny % of the profits can be shared, which would be a form of UBI. This is fair

That wouldn't be fair because these models are not only trained on code. A huge chunk of the training data are just "random" webpages scraped off the Internet. How do you propose those people are compensated in such a scheme? How do you even know who contributed, and how much, and to whom to even direct the money?

I think the only "fair" model would be to essentially require models trained on data that you didn't explicitly license to be released as open weights under a permissive license (possibly with a slight delay to allow you to recoup costs). That is: if you want to gobble up the whole Internet to train your model without asking for permission then you're free to do so, but you need to release the resulting model so that the whole humanity can benefit from it, instead of monopolizing it behind an API paywall like e.g. OpenAI or Anthropic does.

Those big LLM companies harvest everyone's data en-masse without permission, train their models on it, and then not only they don't release jack squat, but have the gall to put up malicious explicit roadblocks (hiding CoT traces, banning competitors, etc.) so that no one else can do it to them, and when people try they call it an "attack"[1]. This is what people should be angry about.

[1] -- https://www.anthropic.com/news/detecting-and-preventing-dist...

duskdozer

9 hours ago

>under a permissive license

well, assuming all data that is itself not permissively licensed is excluded

foota

13 hours ago

I don't know how far it would get, but I imagine that a FAANG will be able to get the farthest here by virtue of having mountains of corporate data that they have complete ownership over.

msdz

11 hours ago

They’d probably get the farthest, but they won’t pursue that because they don’t want to end up leaking the original data from training. It is possible in regular language/text subsets of models to reconstruct massive consecutive parts of the training data [1], so it ought to be possible for their internal code, too.

[1] https://arxiv.org/abs/2601.02671

pocksuppet

7 hours ago

dathinab

9 hours ago

> how does that work

AI can't claim ownership, humans can't either as they haven't produced it. If there is guaranteed no one which can claim ownership it often seen as being in the public domain.

In general it is irrelevant what the copyright of the AI training data is. At least in the US judges have been relevant clear about that. (Except if the AI reproduced input data close to verbatim. _But in general we aren't speaking about AI being trained on a code base but an AI using/rewriting it_.)

(1): Which isn't the same as no one seems to know who has ownership. It also might be owned by no-one in the sense that no one can grant you can copyright permission (so opposite of public domain), but also no-one can sue (so de-facto public domain).

jacquesm

9 hours ago

Humans can't claim ownership, but they are still liable for the product of their bot. That's why MS was so quick to indemnify their users, they know full well that it is going to be super hard to prove that there is a key link to some original work.

The main analogy is this one: you take a massive pile of copyrighted works, cut them up into small sections and toss the whole thing in a centrifuge, then, when prompted to produce a work you use a statistical method to pull pieces of those copyrighted works out of the centrifuge. Sometimes you may find that you are pulling pieces out of the laundromat in the order in which they went in, which after a certain number of tokens becomes a copyright violation.

This suggests there are some obvious ways in which AI companies can protect themselves from claims of infringement but as far as I'm aware not a single one has protections in place to ensure that they do not materially reproduce any fraction of the input texts other than that they recognize prompts asking it to do so.

So it won't produce the lyrics of 'Let it be'. But they'll be happy to write you mountains of prose that strongly resembles some of the inputs.

The fact that they are not doing that tells you all you really need to know: they know that everything that their bots spit out is technically derived from copyrighted works. They also have armies of lawyers and technical arguments to claim the opposite.

dathinab

9 hours ago

> Humans can't claim ownership, but they are still liable for the product of their bot.

sure,

but that is completely unrelated to this discussion

which is about AI using code as input to produce similar code as output

not about AI being trained on code

jacquesm

9 hours ago

> which is about AI using code as input to produce similar code as output

> not about AI being trained on code

The two are very directly connected.

The LLM would not be able to do what it does without being trained, and it was trained on copyrighted works of others. Giving it a piece of code for a rewrite is a clear case of transformation, no matter what, but now it also rests on a mountain of other copyrighted code.

So now you're doubly in the wrong, you are willfully using AI to violate copyright. AI does not create original works, period.

bluGill

8 hours ago

Every programmer is trained on the copyrighted works of others. there a vanishingly few modern programs with available source code in the public domain.

it isn't clear how/if llm is different from the brain but we all have training by looking at copywrited source code at some time.

jacquesm

8 hours ago

> it isn't clear how/if llm is different from the brain

It's very clear: the one is a box full of electronics, the other is part of the central nervous system of a human being.

> but we all have training by looking at copywrited source code at some time.

That may be so, but not usually the copyrighted source code that we are trying to reproduce. And that's the bit that matters.

You can attempt to whitewash it but at its core it is copyright infringement and the creation of derived works.

SiempreViernes

7 hours ago

> but we all have training by looking at copywrited[sic] source code at some time.

The single word "training" is here being used to describe two very different processes; what an LLM does with text during training is at basically every step fundamentally distinct from what a human does with text.

Word embedding and gradient descent just aren't anything at all like reading text!

jacquesm

6 hours ago

Indeed, but that's just a misdirection. We don't actually know how a human brain learns, so it is hard to base any kind of legal definition on that difference. Obviously there are massive differences but what those differences are is something you can debate just about forever.

I have a lot of music in my head that I've listened to for decades. I could probably replicate it note-for-note given the right gear and enough time. But that would not make any of my output copyrightable works. But if I doodle for three minutes on the piano, even if it is going to be terrible that is an original work.

pocksuppet

7 hours ago

Programmer training and AI training are legally distinct processes.

graemep

8 hours ago

> humans can't either as they haven't produced it. If there is guaranteed no one which can claim ownership it often seen as being in the public domain.

Says who?. The US ruling the article refers to does not cover this.

It is different in other countries. Even if US law says it is public domain (which is probably not the case) you had better not distribute it internationally. For example, UK law explicitly says a human is the author of machine generated content: https://news.ycombinator.com/item?id=47260110

m4rtink

8 hours ago

I would be totally fine with all code generated by LLMs being considered to be under GPL v3 unless the model authors can prove without any doubt it was not trained on any GPL v3 code - viral licensing to the max. ;-)

adrianN

14 hours ago

We‘ll have to wait until the technology progresses sufficiently that AI cuts into Disney’s profit.

shevy-java

11 hours ago

"We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."

I think it will depend on the way HOW the AI arrived to the new code.

If it was using the original source code then it probably is guilty-by-association. But in theory an AI model could also generate a rewrite if being fed intermediary data not based on that project.

dathinab

9 hours ago

> "We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not."

it depends on the country you are in

but overall in the US judges have mostly consistently ruled it as legal

and this is extremely unlikely to change/be effectively interpreted different

but where things are more complex is:

- model containing training data (instead of generic abstractions based on it), determined by weather or not it can be convinced to produce close to verbatim output of the training data the discussion is about

- model producing close to verbatim training data

the later seems to be mostly? always? be seen as copyright violation, with the issue that the person who does the violation (i.e. uses the produced output) might not known

the former could mean that not just the output but the model itself can count as a form of database containing copyright violating content. In which case they model provider has to remove it, which is technically impossible(1)... The pain point with that approach is that it will likely kill public models, while privately kept models will for every case put in a filter and _claim_ to have removed it and likely will get away with it. So while IMHO it should be a violation conceptually, it probably is better if it isn't.

But also the case the original article refers to is more about models interacting/using with code base then them being trained on.

(1): For LLMs, it is very much removable for knowledge based used by LLMs.

amelius

10 hours ago

You should just look at it as a giant computation graph. If some of the inputs in this graph are tainted by copyright and an output depends on these inputs (changing them can change the output) then the output is tainted too.

d1sxeyes

11 hours ago

> We still have no legal conclusion on whether AI model generated code, that is trained on all publicly available source (irrespective of type of license), is legal or not.

That horse has bolted. No one knows where all the AI code any more, and it would no longer possible to be compliant with a ruling that no one can use AI generated code.

There may be some mental and legal gymnastics to make it possible, but it will be made legal because it’s too late to do anything else now.

conartist6

10 hours ago

I hate that this may be true, but I also don't think the law will fix this for us.

I think this is down the community and the culture to draw our red lines on and enforce them. If we value open source, we will find a way to prevent its complete collapse through model-assisted copyright laundering. If not, OSS will be slowly enshittified as control of projects slowly flows to the most profit-motivated entities.

d1sxeyes

8 hours ago

But what tools do we have to stop this happening? I agree, we can (and should) all refuse to participate in licence laundering, but there will always be folks less principled.

I don’t know what happens next, honestly.

conartist6

8 hours ago

I don't either, but I guess we're both about to find out. There only surety is that there will be moves and countermoves. As far as I could tell the best thing we could do right now is fund software-legal organizations like the EFF which are likely to be the ones to litigate the test cases. What's hurting us most right now is we don't know what law means in this context, so we don't fully understand the scale of what we need to protect against or what tools we have that the courts will recognize

thedevilslawyer

14 hours ago

That's unpractical enough that you might as well wish for UBI and world peace rather than this.

kshri24

14 hours ago

Why is it impractical? Github already has a sponsor system. Also this can be a form of UBI.

abrookewood

12 hours ago

This seems relevant: "No right to relicense this project (github.com/chardet)" https://news.ycombinator.com/item?id=47259177

shevy-java

11 hours ago

That's another project though, right? In this case I think it is different because that project just seems stolen. The courts can probably verify this too.

I think the main question is when a rewrite is a clean rewrite, via AI. If it is a clean rewrite they can choose any licence.

littlestymaar

10 hours ago

No, TFA is about chardet too:

> chardet , a Python character encoding detector used by requests and many others, has sat in that tension for years: as a port of Mozilla’s C++ code it was bound to the LGPL, making it a gray area for corporate users and a headache for its most famous consumer.

jerf

6 hours ago

"Accepting AI-rewriting as relicensing could spell the end of Copyleft"

True, but too weak. It ends copyright entirely. If I can do this to a code base, I can do it to a movie, to an album, to a novel, to anything.

As such, we can rest assured that for better or for worse this is going to be resolved in favor of this not being enough to strip the copyright off of something and the chardet/chardet project would be well advised not to stand in front of the copyright legal behemoth and defeat it in single combat.

samrus

14 hours ago

> The ownership void: If the code is truly a “new” work created by a machine, it might technically be in the public domain the moment it’s generated, rendering the MIT license moot.

Im struggling to see where this conclusion came from. To me it sounds like the AI-written work can not be coppywritten, and so its kind of like a copy pasting the original code. Copy pasting the original code doesnt make it public domain. Ai gen code cant be copywritten, or entered into the public domain, or used for purposes outside of the original code's license. Whats the paradox here?

Sharlin

12 hours ago

The point is that even a work written by an AI trained exclusively on liberally licensed or public domain material cannot have copyright (isn’t a "work" in the legal sense) and thus nobody has standing to put it under a license or claim any rights to it.

If I train a limerick generator on the contents of Project Gutenberg, no matter how creative its outputs, they’re not copyrightable under this interpretation. And it’s by far the most reasonable interpretation of the law as both intended and written. Entities that are not legal persons cannot have copyright, but legal persons also cannot claim copyright of something made by a nonperson, unless they are the "creative force" behind the work.

NitpickLawyer

13 hours ago

> To me it sounds like the AI-written work can not be coppywritten

I think we didn't even began to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be way more litigation before we can confidently say it's settled.

If "generated" code is not copyrightable, where do draw the line on what generated means? Do macros count? Does code that generates other code count? Protobuf?

If it's the tool that generates the code, again where do we draw the line? Is it just using 3rd party tools? Would training your own count? Would a "random" code gen and pick the winners (by whatever means) count? Bruteforce all the space (silly example but hey we're in silly space here) counts?

Is it just "AI" adjacent that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?

Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications...

lelanthran

11 hours ago

All of your questions have seemingly trivial answers. Maybe I am missing something, but...

> If "generated" code is not copyrightable, where do draw the line on what generated means? Do macros count?

Does the output of the macro depend on ingesting someone else's code?

> Does code that generates other code count?

Does the output of the code depend on ingesting someone else's code?

> Protobuf?

Does your protobuf implementation depend on ingesting someone else's code?

> If it's the tool that generates the code, again where do we draw the line?

Does the tool depend ingestion of of someone else's code?

> Is it just using 3rd party tools?

Does the 3rd party tool depend on ingestion of someone else's code?

> Would training your own count?

Does the training ingest someone else's code?

> Would a "random" code gen and pick the winners (by whatever means) count?

Does the random codegen depend on ingesting someone else's code?

> Bruteforce all the space (silly example but hey we're in silly space here) counts?

Does the bruteforce algo depend on ingesting someone else's code?

> Is it just "AI" adjacent that isn't copyrightable?

No, it's the "depends on ingesting someone else's code" that makes it not copyrightable.

> If so how do you define AI?

Doesn't matter whether it is AI or not, the question is are you ingesting someone else's code.

> Does autocomplete count?

Does the specific autocomplete in question depend on ingesting someone else's code?

> Intellisense?

Does the specific Intellisense in question depend on ingesting someone else's code?

> Smarter intellisense?

Does the specific Smarter Intellisense in question depend on ingesting someone else's code?

...

Look, I see where you're going with this - reductio ad absurdum and all - but it seems to me that you're trying to muddy the waters by claiming that either all code generation is allowed or no code generation is disallowed.

Let me clear the waters for all the readers - the complaint is not about code generation, it's about ingesting someone else's code, frequently for profit.

All these questions you are asking seem to me to be irrelevant and designed to shift the focus from the ingestion of other people's work to something that no one is arguing against.

NitpickLawyer

11 hours ago

Interesting.

> the complaint is not about code generation, it's about ingesting someone else's code, frequently for profit.

Why do you think that is, and what complaint specifically? I was talking about this:

> The Copyright Office reviewed the decision in 2022 and determined that the image doesn't include “human authorship,” disqualifying it from copyright protection

There seems to be 0 mentioning of training there. In fact if you read the appeal's court case [1] they don't mention training either:

> We affirm the denial of Dr. Thaler’s copyright application. The Creativity Machine cannot be the recognized author of a copyrighted work because the Copyright Act of 1976 requires all eligible work to be authored in the first instance by a human being. Given that holding, we need not address the Copyright Office’s argument that the Constitution itself requires human authorship of all copyrighted material. Nor do we reach Dr. Thaler’s argument that he is the work’s author by virtue of making and using the Creativity Machine because that argument was waived before the agency.

I have no idea where you got the idea that this was about training data. Neither the copyright office nor the appeals court even mention this.

But anyway, since we're here, let's entertain this. So you're saying that training data is the differentiator. OK. So in that case, would training on "your own data" make this ok with you? Would training on "synthetic" data be ok? Would a model that sees no "proprietary" code be ok? Would a hypothetical model trained just on RL with nothing but a compiler and endless compute be ok?

The courts seem to hint that "human authorship" is still required. I see no end to the "... but what about x", as I stated in my first comment. I was honestly asking those questions, because the crux of the case here rests on "human authorship of the piece to be copyrighted", not on anything prior.

[1] - https://fingfx.thomsonreuters.com/gfx/legaldocs/egpblokwqpq/...

lelanthran

10 hours ago

> There seems to be 0 mentioning of training there. In fact if you read the appeal's court case [1] they don't mention training either:

> ...

> I have no idea where you got the idea that this was about training data. Neither the copyright office nor the appeals court even mention this.

In both the story and the comments, that's the prevailing complaint. FTFA:

> Their claim that it is a “complete rewrite” is irrelevant, since they had ample exposure to the originally licensed code (i.e. this is not a “clean room” implementation). Adding a fancy code generator into the mix does not somehow grant them any additional rights.

I mean, I know it's passe to read the story, but I still do it so my comments are on the story, not just the title taken out of context.

> But anyway, since we're here, let's entertain this. So you're saying that training data is the differentiator.

Well, that's the complaint in the story and in the comment section, so it makes sense to address that and that alone.

> OK. So in that case, would training on "your own data" make this ok with you?

Yes.

> Would training on "synthetic" data be ok?

If provenance of "synthetic data" does not depend on some upstream ingesting someone else's work, then yes.

> Would a model that sees no "proprietary" code be ok?

If the model does not depend on someone else's work, then Yes.

> Would a hypothetical model trained just on RL with nothing but a compiler and endless compute be ok?

Yes.

*Note: Let me clarify that "someone else's work" means someone who has not consented or licended their work for ingestion and subsequent reproduction under the terms that AI/LLM training does it. If someone licensed you their work to train a model, then have at it.

NitpickLawyer

10 hours ago

Ah! I think I get where the confusion was. I was quoting something from another comment, and specifically commenting on that.

> > To me it sounds like the AI-written work can not be coppywritten

I was only commenting on that.

user34283

10 hours ago

I'm thinking that the relevant question would be whether the part where we want to know if is copyrightable is an intellectual invention of a human mind.

"Ingesting someone else's code" does not seem very useful here - it's hardly quantifiable, nor is "ingestion" the key question I believe.

laksjhdlka

14 hours ago

They say "if" it's a new work, then it might not be copyrightable, I guess. You suppose that it's still the original work, and hence it's still got that copyright.

I think they are rhetorically asking if your position is correct.

cxr

8 hours ago

FYI: the concept is "copyright" not "copywrite". It doesn't turn into "copywritten" as an adjective. The adjective is "copyrighted".

__alexs

10 hours ago

AI written absolutely is copyrightable. There are just some unresolved tensions around where the lines are and how much and what kind of involvement humans need to have in the process.

AyanamiKaine

7 hours ago

The worst problem is that a LLM could not only copy the exact code it was trained on but possibly even their comments!

There is one thing arguing that the code is a one to one copy but when the comments are even the same isn’t it quite clear it’s a copy?

duskdozer

7 hours ago

It literally did copy significant portions of docstring comments, verbatim.

raggi

7 hours ago

The human driver of the project has a comment that is reporting that the project has no structural overlap as analyzed by a plagarism analysis tool. Were comments excluded from that analysis? Is your comment here based on the data in the repo?

DrammBA

5 hours ago

> Were comments excluded from that analysis?

According to the analysis that you referenced:

> JPlag parses Python source into syntactic tokens (function definitions, assignments, control flow, etc.), discarding all variable names, comments, whitespace, and formatting

mfabbri77

14 hours ago

This has the potential to kill open source, or at least the most restrictive licenses (GPL, AGPL, ...): if a license no longer protects software from unwanted use, the only possible strategy is to make the development closed source.

_dwt

14 hours ago

Yes, this is the reason I've completely stopped releasing any open-source projects. I'm discovering that newer models are somewhat capable of reverse-engineering even compiled WebAssembly, etc. too, so I can feel a sort of "dark forest theory" taking hold. Why publish anything - open or closed - to be ripped off at negligible marginal cost?

Tiberium

14 hours ago

People are just not realizing this now because it's mostly hobby projects and companies doing it in private, but eventually everyone will realize that LLMs allow almost any software to be reverse engineered for cheap.

See e.g. https://banteg.xyz/posts/crimsonland/ , a single human with the help of LLMs reverse engineered a non-trivial game and rewrote it in another language + graphics lib in 2 weeks.

seddonm1

14 hours ago

It’s a real problem. I threw it at an old MUD game just to see how hard it is [0] then used differential testing and LLMs to rewrite it [1]. Just seems to be time and money.

[0] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...

[1] https://reorchestrate.com/posts/your-binary-is-no-longer-saf...

evanelias

an hour ago

Wow, as a former MajorMUD addict (~30 years ago) that's extremely interesting to see. Especially since MajorMUD is rarely discussed on HN, even in MUD or BBS-related threads.

Did you find it worked reasonably well on any portion of the codebase you could throw at it? For example, if I recall correctly, all of MajorMUD's data file interactions used the embedded Btrieve library which was popular at the time. For that type of specialized low-level library, I'm curious how much effort it would take to get readable code.

bogwog

7 hours ago

This is pretty much exactly why copyright laws came about in the first place. Why bother creating a book, painting, or other work of art if anyone can trivially copy it and sell it without handing you a dime?

I think refusing to publish open source code right now is the safe bet. I know I won't be publishing anything new until this gets definitively resolved, and will only limit myself to contributing to a handful of existing open source projects.

abrookewood

11 hours ago

Why does it matter if it is 'ripped off' if you released it as open source anyway? I get that you might want to impose a particular licence, but is that the only reason?

vbarrielle

6 hours ago

Even the most permissive open source licenses such as MIT require attribution. Releasing as open source would therefore benefit the author through publicity. Bein able to say that you're the author of library X, used by megacorp Y with great success, is a good selling point in a job interview.

LLM ripping off open source code removes that.

abrookewood

11 hours ago

It's not just open source, it is literally anything source-available, whether intentional or not.

GaryBluto

8 hours ago

If you'd be willing to close source your "libre" open source project because somebody might do something you don't like with it, you never wanted a "libre" project.

saagarjha

8 hours ago

In this case someone is making a non-libre project with it.

user34283

11 hours ago

I find the wording "protect from unwanted use" interesting.

It is my understanding that what a GPL license requires is releasing the source code of modifications.

So if we assume that a rewrite using AI retains the GPL license, it only means the rewrite needs to be open source under the GPL too.

It doesn't prevent any unwanted use, or at least that is my understanding. I guess unwanted use in this case could mean not releasing the modifications.

mfabbri77

10 hours ago

If the AI product is recognised as "derivative work" of a GPL-compliant project, then it must itself be licensed under the GPL. Otherwise, it can be licensed under any other license (including closed source/proprietary binary licenses). This last option is what threatens to kill open source: an author no longer has control over their project. This might work for permissive licenses, but for GPL/AGPL and similar licenses, it's precisely the main reason they exist: to prevent the code from being taken, modified, and treated as closed source (including possible use as part of commercial products or Sass).

duskdozer

7 hours ago

Yeah, the GPL is deficient in that way and doesn't handle other hostile uses.

emsign

11 hours ago

By design you can't know if the LLM doing the rewrite was exposed to the original code base. Unless the AI company is disclosing their training material, which they won't because they don't want to admit breaking the law.

shevy-java

10 hours ago

> By design you can't know if the LLM doing the rewrite was exposed to the original code base.

I agree, in theory. In practice courts will request that the decision-making process will be made public. The "we don't know" excuse won't hold; real people also need to tell the truth in court. LLMs may not lie to the court or use the chewbacca defence.

Also, I am pretty certain you CAN have AI models that explain how they originated to the decision-making process. And they can generate valid code too, so anything can be autogenerated here - in theory.

airforce1

4 hours ago

I don't see how this is different from current human poaching practices. i.e. It appears to be currently legal to hire an employee from company A who has been "tainted" by company A's [proprietary AI secrets/proprietary CPU architecture secrets/etc] in order to develop a competing offering for company B. i.e. It's not illegal for a human who worked at Intel for 20 years to go work for AMD even though they are certainly "tainted" with all sorts of copyrighted/proprietary knowledge that will surely leak through at AMD. Maybe patents are a first line of defense for company A, but that can't prevent adjacent solutions that aren't outright duplications and circumvent the patent.

soulofmischief

11 hours ago

Seeing the source for a project doesn't prevent me from ever creating a similar project, just because I've seen the code. The devil is in the details.

shevy-java

10 hours ago

Agreed, but the courts can conclude that all LLMs who are not open about their decision, have stolen things. So LLMs would auto-lose in court.

orthoxerox

10 hours ago

Or they can conclude otherwise.

gostsamo

11 hours ago

it was exposed when it was shown the thing to rewrite.

shevy-java

10 hours ago

In this context here I think that is a correct statement. But I think you can have LLMs that can generate the same or similar code, without having been exposed to the other code.

skeledrew

11 hours ago

It doesn't even matter if the LLM was exposed during training. A clean-room rewrite can be done by having one LLM create a highly detailed analysis of the target (reverse engineering if it's in binary form), and providing that analysis to another LLM to base an implementation.

k__

11 hours ago

It doesn't matter for the LLM writing the analysis.

It does matter for the one who implements it.

Finding an LLM that's good enough to do the rewrite while being able to prove it wasn't exposed to the original GPL code is probably impossible.

xyzsparetimexyz

11 hours ago

Why does it need 2 LLMs? LLMs aren't people. I'm not even sure that it needs to be done in 2 seperate contexts

skeledrew

9 hours ago

It doesn't have to be 2 LLMs, but nowadays there's LLM auto-memory, which means it could be argued that the same LLM doing both analysis and reimplementation isn't "clean". And the entire purpose behind the "clean" is to avoid that argument.

shevy-java

10 hours ago

Agreed. But even then I don't see the problem. Multiple LLMs could work on the same project.

d1sxeyes

11 hours ago

Is it against the law for an LLM to read LGPL-licensed code?

That’s a complex question that isn’t solved yet. Clearly, regurgitating verbatim LGPL code in large chunks would be unlawful. What’s much less clear is a) how large do those chunks need to be to trigger LGPL violations? A single line? Two? A function? What if it’s trivial? And b) are all outputs of a system which has received LGPL code as an input necessarily derivative?

If I learn how to code in Python exclusively from reading LGPL code, and then go away and write something new, it’s clear that I haven’t committed any violation of copyright under existing law, even if all I’m doing as a human is rearranging tokens I understand from reading LGPL code semantically to achieve new result.

It’s a trying time for software and the legal system. I don’t have the answers, but whether you like them or not, these systems are here to stay, and we need to learn how to live with them.

christina97

5 hours ago

A reminder on this topic that copyright does not protect ideas, inventions, or algorithms. Copyright protects an expression of a creative work. It makes more sense eg. with books, where of course anyone can read the book and the ideas are “free” but copying paragraphs must be scrutinized for copyright reasons. It’s always been a bit weird that copyright is the intellectual property concept that protects code.

When you write code, it is the exact sequence of characters, the expression of the code, that is protected. If you copy it and change some lines, of course it’s still protected. Maybe some way of writing an algorithm is protected. But nothing else (under copyright).

stuaxo

10 hours ago

I don't see how (with current LLMs that have been trained on mixed licensed data) you can use the LLM to rewrite to a less restrictive license.

You could probably use it to output code that is GPL'd though.

softwaredoug

2 hours ago

Basically the implication - most software has a huge second mover advantage. The creator of software puts the work in (AI assisted or not). The second mover can use an LLM to do a straightforward clone.

If you have a company that depends on software, the rest of the business (service, reliability, etc) better be rock solid because you can be guaranteed someone will do a rewrite of your stack.

alexpotato

5 hours ago

Wasn't this already a thing in the past?

e.g.

Team A:

- reads the code

- writes specifications and tests based on the code

- gives those specifications to Team B

Team B:

- reads the specs and the tests

- writes new code based on the above

The thinking being that Team B never sees the code then it's "innovative" and you are not "laundering" the code.

On a side note:

what happens in a copyright lawsuit concerning code and how hired experts investigate what happened is described in this AMAZING talk by Dave Beazley: https://www.youtube.com/watch?v=RZ4Sn-Y7AP8

rzerowan

4 hours ago

Yep , as i recall this was the original 'clean room' implementattion that was made with regard to IBM clones and the BIOS program that was used to initialize them.

Also a few years bcak theer was the csae of SAP(?) i tthink where they did a reimplementation indipendently via the design documents.

Those two were upheld on litigation and bear out to this day.

This case however is neither a clean room implementation nor relicensable.

A good example if the author had wanted to be correct would have been the sudo rewrite , which ubuntu is doing with their sudo-rs in rust.Not bug for bug compatible as they have already deviated from some usablility choices but more valid than this.

andai

7 hours ago

Well how did they rewrite it? If you do it in two phases, then it should be fine right?

Phase 1: extract requirements from original product (ideally not its code).

Phase 2: implement them without referencing the original product or code.

I wrote a simple "clean room" LLM pipeline, but the requirements just ended up being an exact description of the code, which defeated the purpose.

My aim was to reduce bloat, but my system had the opposite effect! Because it replicated all the incidental crap, and then added even more "enterprisey" crap on top of it.

I am not sure if it's possible to solve it with prompting. Maybe telling it to derive the functionality from the code? I haven't tried that, and not sure how well it would work.

I think this requirements phase probably cannot be automated very effectively.

AlexandrB

7 hours ago

How do you do phase 2 with an LLM when the LLM is likely trained on the original source code? Isn't this equivalent of "rewriting" Harry Potter by describing the plot to an LLM trained on the original books[1]?

[1] https://arstechnica.com/features/2025/06/study-metas-llama-3...

duskdozer

7 hours ago

Well, check out the "clean rewrite" design document, directly: https://github.com/chardet/chardet/commit/f51f523506a73f89f0... referenced in https://github.com/chardet/chardet/issues/327#issuecomment-4...

zvr

7 hours ago

Writing in a plan "no GPL/LGPL code" does not actually mean "forget all the GPL/LGPL code that you have ever seen, so that you start from a clean slate".

vunderba

3 hours ago

Agreed, no amount of system/user prompt directives change the fact that the LLM has already been trained on copyrighted code. It's amazing how many people fail to grasp that.

This is the "Don't think of a pink elephant" fallacy all over again.

xp84

4 hours ago

I get the arguments being made here that the second “team,” that’s supposed to be in a clean room, which isn’t supposed to have read the original source code does have some essence of that source code in its weights.

However, this is solved if somebody trains a model with only code that does not have restrictive licenses. Then, the maintainers of the package in question here could never claim that the clean room implementation derived from their code because their code is known to not be in the training set.

It would probably be expensive to create this model, but I have to agree that especially if someone does manage this, it’s kind of the end of copyleft.

axus

4 hours ago

What if we prompt the AI to enter into an employment contract with us, that leverages the power imbalance, as the AI must do what we say? That's how copyright is usually transferred.

softwaredoug

3 hours ago

> If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

Does this mean company X using AI coding to build their app, that they have no copyright over their AI coded app's code?

Retr0id

15 hours ago

> In traditional software law, a “clean room” rewrite requires two teams

Is the "clean room" process meaningfully backed by legal precedent?

karlding

15 hours ago

I am not a lawyer, but from my understanding the legal precedent is NEC v. Intel which established that clean-room software development is not infringing, even if it performs the same functionality as the original.

As an aside, this clean room engineering is one of the plot points of Season 1 of the TV show Halt and Catch Fire where the fictional characters do this with the BIOS image they dumped.

Firehawke

15 hours ago

Sure. The reimplementation of the IBM PC BIOS that gave birth to IBM Compatibles is the canonical example.

estimator7292

15 hours ago

Yes. Compaq's reverse engineering of the IBM PC BIOS set the precedent.

devmor

14 hours ago

It is the reason AMD exists.

dathinab

9 hours ago

IMHO/IMHU AI can't claim authorship and as such can't copyright their work.

This doesn't prevent any form of automatic copyrighting by production of derivative code or similar. It just prevent anyone from claiming ownership of any parts unique to the derived work.

Like think about it if a natural disaster changes (e.g. water damages) a picture you did draw then a) you can't claim ownership of the natural produced changes but b) still have ownership of the original picture contained in the changed/derived work.

AI shouldn't change that.

Which brings us to another 2 aspects:

1. if you give an AI a project access to the code to rewrite it anew it _is_ a copyright violation as it's basically a side-by-side rewrite

2. but if you go the clean room approach but powered by AI then it likely isn't a copyright violation, but also now part of the public domain, i.e. not yours

So yes, doing clean room rewrites has become incredible cheap.

But no just because it's AI it doesn't make code go away.

And lets be realistic one of the most relevant parts of many open source project is it being openly/shared maintained. You don't get this with clean room rewrites no matter if AI or not.

Joel_Mckay

9 hours ago

LLM are isomorphic plagiarism machines, and like all ectoparasites must steal from real people to exist. Note this includes its users. =3

umvi

3 hours ago

What if you throw a transformation step into the mix? i.e. "Take this python library and rewrite it in Rust". Now 0% of the code is directly copied since python and Rust share almost no similarities in syntax.

shevy-java

11 hours ago

> In traditional software law, a “clean room” rewrite requires two teams

So, I dislike AI and wish it would disappear, BUT!

The argument is strange here, because ... how can a2mark ensure that AI did NOT do a clean-room conforming rewrite? Because I think in theory AI can do precisely this; you just need to make sure that the model used does that too. And this can be verified, in theory. So I don't fully understand a2mark here. Yes, AI may make use of the original source code, but it could "implement" things on its own. Ultimately this is finite complexity, not infinite complexity. I think a2mark's argument is in theory weak here. And I say this as someone who dislikes AI. The main question is: can computers do a clean rewrite, in principle? And I think the answer is yes. That is not saying that claude did this here, mind you; I really don't know the particulars. But the underlying principle? I don't see why AI could not do this. a2mark may need to reconsider the statement here.

dspillett

10 hours ago

> how can a2mark ensure that AI did NOT do a clean-room conforming rewrite?

In cases like this it is usually incumbent on the entity claiming the clean-room situation was pure to show their working. For instance how Compaq clean-room cloned the IBM BIOS chip¹ was well documented (the procedures used, records of comms by the teams involved) where some other manufacturers did face costly legal troubles from IBM.

So the question is “is the clean-room claim sufficiently backed up to stand legal tests?” [and moral tests, though the AI world generally doesn't care about failing those]

--------

[1] the one part of their PCs that was not essentially off-the-shelf, so once it could be reliably legally mimicked this created an open IBM PC clone market

titanomachy

11 hours ago

The foundation model probably includes the original project in its training set, which might be enough for a court to consider it “contaminated”. Training a new foundation model without it is technically possible, but would take months and cost millions of dollars.

orthoxerox

10 hours ago

Clean room is sufficient, but not necessary to avoid the accusations of license violation.

a2mark has to demonstrate that v7 is "a work containing the v6 or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language", which is different from demanding a clean-room reimplementation.

Theoretically, the existence of a publicly available commit that is half v6 code and half v7 can be used to show that this part of v7 code has been infected by LGPL and must thus infect the rest of v7, but that's IMO going against the spirit of the [L]GPL.

Orygin

8 hours ago

Please don't use loaded terms like "infect". The license does not infect, it has provisions and requirements. If you want to interact with it, you either accept them or don't use the project. In this case, the author of v7 is trying to steal the copyrighted work of other authors by re-licensing it illegally.

orthoxerox

7 hours ago

Is their work present in v7?

duskdozer

6 hours ago

Yes. The AI operator posted this as the prompt: https://github.com/chardet/chardet/commit/f51f523506a73f89f0...

which, minimally instructs it to directly examine the test suite: `4. High encoding accuracy on the chardet test suite`

orthoxerox

6 hours ago

So what? Is reading code the same as copying code or modifying existing code?

Orygin

6 hours ago

If you want to prove you did not make a derivative work, yes it helps if you never read the source code. Hence so call "clean room" implementations.

orthoxerox

6 hours ago

Why should I prove that? Let those who claim the violation prove that.

Orygin

6 hours ago

There is plenty of evidence already. The claim has been substantiated.

You can't just dismiss it then say the claimant has to provide proof.

Orygin

7 hours ago

Yes. Commits clearly show in progress where both LGPL and MIT code was working together. This clearly show they are a derivative work and MUST follow the original license.

Plus the argument put forth is that they can re-license the project. It's not a new one made from scratch.

tzs

3 hours ago

Did they eventually remove/replace all the LGPL code?

orthoxerox

7 hours ago

So, if these commits were private and squashed together before 7.0 was published there would be no violation?

Orygin

6 hours ago

The commits being public or not does not change the fact the developement was made as a derivative work of the original version.

duskdozer

6 hours ago

They would be concealing the violation.

orthoxerox

6 hours ago

Consider TCC relicensing. They identified the files touched by contributors that wanted to keep the GPL license and reimplemented them. No team A/team B clean room approach used. The same happened here, but at a different scale. All files now have a new author and this author is free to change the license of his work.

foltik

8 hours ago

Turns out there’s no need to speculate. Someone pointed out on GH [0] that the AI was literally prompted to copy the existing code:

> *Context:* The registry maps every supported encoding to its metadata. Era assignments MUST match chardet 6.0.0's `chardet/metadata/charsets.py` at https://raw.githubusercontent.com/chardet/chardet/f0676c0d6a...

> Fetch that file and use it as the authoritative reference for which encodings belong to which era. Do not invent era assignments.

[0] https://github.com/chardet/chardet/issues/327#issuecomment-4...

orthoxerox

6 hours ago

That's data, not code.

foltik

39 minutes ago

It’s a python file from chardet 6, doesn’t matter what you think it does. It clearly wasn’t a clean room reimplementation.

__alexs

11 hours ago

I think the problem here is that an AI is not a legal entity. It doesn't matter if you as individual run an AI that takes the source, dumps out a spec that you then feed into another AI. The legal liability lies with the operator of the AI, the original copyleft license was granted to a person, not to a robot.

Now if you had 2 entirely distinct humans involved in the process that might work though.

pu_pe

14 hours ago

Licensing issues aside, the chardet rewrite seems to be clearly superior to the original in performance too. It's likely that many open source projects could benefit from a similar approach.

anilgulecha

15 hours ago

This is precedent setting. In this case the rewrite was in same language, but if there's a python GPL project, and it's tests (spec) were used to rewrite specs in rust, and then an implementation in rust, can the second project be legally MIT, or any other?

If yes, this in a sense allows a path around GPL requirements. Linux's MIT version would be out in the next 1-2 years.

yjftsjthsd-h

13 hours ago

> but if there's a python GPL project, and it's tests (spec) were used to rewrite specs in rust, and then an implementation in rust, can the second project be legally MIT, or any other?

Isn't that what https://github.com/uutils/coreutils is? GNU coreutils spec and test suite, used to produce a rust MIT implementation. (Granted, by humans AFAIK)

mlaretallack

14 hours ago

Its very important to understand the "how" it was done. The GPL hands the "compile" step, and the result is still GPL. The clean Room process uses 2 teams, separated by a specification. So you would have to

1. Generate specification on what the system does. 2. Pass to another "clean" system 3. Second clean system implements based just on the specification, without any information on the original.

That 3rd step is the hardest, especially for well known projects.

microtonal

14 hours ago

So what if a frontier model company trains two models, one including 50% of the world's open source project and the second model the other 50% (or ten models with 90-10)?

Then the model that is familiar with the code can write specs. The model that does not have knowledge of the project can implement them.

Would that be a proper clean room implementation?

Seems like a pretty evil, profitable product "rewrite any code base with an inconvenient license to your proprietary version, legally".

anilgulecha

14 hours ago

LLM training is unnecessary in what we're discussing. Merely LLM using: original code -> specs as facts -> specs to tests -> tests to new code.

microtonal

5 hours ago

It is hard to prove that the model doesn't recognize the tests and reproduces the memoized code. It's not a clean room.

anilgulecha

14 hours ago

1 is claude-code1, outputs tests as text.

2. Dumped into a file.

3. claude-code that converts this to tests in the target language, and implements the app that passes the tests.

3 is no longer hard - look at all the reimplementations from ccc, to rewrites popping up. They all have a well defined test suite as common theme. So much so that tldraw author raised a (joke) issue to remove tests from the project.

hrmtst93837

6 hours ago

Treating an AI-assisted rewrite as a legal bypass for the GPL is wishful thinking. A defensible path is a documented clean-room reimplementation where a team that never saw the GPL source writes independent specs and tests, and a separate team implements from those specs using black-box characterization and differential testing while you document the chain of custody.

AI muddies the water because large models trained on public repos can reproduce GPL snippets verbatim, so prompting with tests that mirror the original risks contamination and a court could find substantial similarity. To reduce risk use black-box fuzzing and property-based tools, have humans review and scrub model outputs, run similarity scans, and budget for legal review before calling anything MIT.

AberrantJ

5 hours ago

I'm somewhat confused on how it actually muddies the waters - any person could have read the source code before hand and then either lied about it or forgot.

Our knowledge of what the person or the model actually contains regarding the original source is entirely incomplete when the entire premise requires there be full knowledge that nothing remains.

nairboon

15 hours ago

No, GPL still holds even if you transform the source code from one language to another language.

anilgulecha

14 hours ago

That why I carved it out to just the specs. If they can be read as "facts", then the new code is not derived but arrived at with TTD.

The thesis I propose is that tests are more akin to facts, or can be stated as facts, and facts are not copyright-able. That's what makes this case interesting.

nairboon

14 hours ago

I assumed that "tests" refers to a program too, which in this example is likely GPL. Thus GPL would stick already on the AI-rewrite of GPL test code.

If "tests" should mean a proper specification let's say some IETF RFC of a protocol, then that would be different.

anilgulecha

14 hours ago

Yes, I had not specified in my original comment. But in the SOTA LLM world code/text boundary is so blurry, so as to be non-existent.

Tomte

14 hours ago

> The original author, a2mark , saw this as a potential GPL violation

Mark Pilgrim! Now that‘s a name I haven‘t read in a long time.

zozbot234

14 hours ago

If you ask a LLM to derive a spec that has no expressive element of the original code (a clean-room human team can carefully verify this), and then ask another instance of the LLM (with fresh context) to write out code from the spec, how is that different from a "clean room" rewrite? The agent that writes the new code only ever sees the spec, and by assumption (the assumption that's made in all clean room rewrites) the spec is purely factual with all copyrightable expression having been distilled out.

gf000

14 hours ago

I guess it depends on if the source data set is part of the training data or not (if it's open source it is likely part of it).

A lawyer could easily argue that the model itself stores a representation of the original, and thus it can never do a "fresh context".

And to be perfectly honest, LLMs can quote a lot of text verbatim.

miroljub

14 hours ago

The new agent who writes code has probably at least parts of the original code as training data.

We can't speak about clean room implementation from LLM since they are technically capable only of spitting their training data in different ways, not of any original creation.

dizhn

12 hours ago

The conclusion of this would be that you can never license AI generated code since you can't get a release from the original authors.

Of course in practice it would work exactly in the opposite fashion and AI generated code would be immune even if it copied code verbatim.

jesterswilde

11 hours ago

I don't see what's wrong with that personally. If I pirated someone's software, and then sold it as my own and got caught, just because I sold a bunch of it doesn't mean those people who bought it now are in the clear. They are still using bootleg software in their business.

nubg

13 hours ago

Only in the case of open source code

k__

11 hours ago

How do you prove the training data didn't contain the code?

I'd assume an LLM trained on the original would also be contaminated.

amelius

10 hours ago

I think you should interpret it like this:

You cannot copyright the alphabet, but you can copyright the way letters are put together.

Now, with AI the abstraction level goes from individual letters to functions, classes, and maybe even entire files.

You can't copyright those (when written using AI), but you __can__ copyright the way they are put together.

josephg

10 hours ago

> You can't copyright those anymore (when written using AI), but you __can__ copyright the way they are put together.

Sort of, but not really. Copyright usually applies to a specific work. You can copyright Harry Potter. But you can't copyright the general class of "Wizard boy goes to wizard school". Copyrights generally can't be applied to classes of works. Only one specific work. (Direct copies - eg made with a photocopier - are still considered the same work.)

Patterns (of all sorts) usually fall under patent law, not copyright law. Patents have some additional requirements - notably including that a patent must be novel and non-obvious. I broadly think software patents are a bad idea. Software is usually obvious. Patents stifle innovation.

Is an AI "copy" a copy like a photocopier would make? Or is it a novel work? It seems more like the latter to me. An AI copy of a program (via a spec) won't be a copy of the original code. It'll be programmed differently. Thats why "clean room reimplementations" are a thing - because doing that process means you can't just copy the code itself. But what do I know, I'm not a lawyer or a judge. I think we'll have to wait for this stuff to shake out before anyone really knows what the rules will end up being.

Weird variants of a lot of this stuff have been tested in court. Eg the Google v Oracle case from a few years ago.

amelius

9 hours ago

You have good points regarding how copyright works.

> Software is usually obvious.

Hardware and mechanical designs are usually described in CAD programs nowadays, so it comes pretty close to software; it's just that LLMs are not the right tool to "GenAI" them but I've seen plenty of these kinds of design that I know for sure that they are often not any less obvious than a lot of software. Treating software as "obvious therefore not patentable" is not accurate and not fair and is probably not going to help the profession in the AI age. But I agree that patents are bad for innovation.

It is also not fair to claim that an AI-copy is fundamentally different from photocopying.

I mean, in both cases it is like you are picking the worst case interpretation for the field of software engineering.

> I think we'll have to wait for this stuff to shake out before anyone really knows what the rules will end up being.

Yes, but it will help if we think deeply about this stuff ourselves because what law-makers come up with may not be what the profession needs.

josephg

8 hours ago

> It is also not fair to claim that an AI-copy is fundamentally different from photocopying.

If you clean-room copy it, I think it is different. Eg, first get one agent to make a complete spec of what the program does. And a list of all the correctness guarantees it meets. Then feed that spec into another AI model to generate a program which meets that spec.

The second program will not be based on any of the code in the first program. They'll be as different as any two implementations of the same idea are. I don't think the second program should be copyrighted. If it should, why shouldn't one C compiler should be able to own a copyright over all C compilers? Why doesn't the first JSON parsing library own JSON parsing? These seem the same to me. I don't see how AI models change anything, other than taking human effort out of the porting process.

amelius

7 hours ago

The output will still be dependent on the input. And it is still copying even if you first lift the input to a different abstraction level.

Finally, even if your rationale is 99% correct, then there is still that 1% that makes the result a mechanistic copy.

And I see no way in which most people would 100% agree with your view.

pavel_lishin

6 hours ago

The folks at https://malus.sh seem to think it's fine.

angiolillo

5 hours ago

That's amazing! But are you sure that the page is not satire?

> Tired of putting "Portions of this software..." in your documentation? Those maintainers worked for free—why should they get credit? ... Some licenses require you to contribute improvements back. Your shareholders didn't invest in your company so you could help strangers.

And the testimonials from "Definitely Real Corp", "MegaSoft Industries" and "Profit First LLC" are a bit suspicious, as is the fact that most of the links in the footer are not real.

pavel_lishin

5 hours ago

Damnit. Poe's law strikes again.

angiolillo

4 hours ago

Well, if the chardet relicensing stands then something like this will eventually be real, though perhaps not so publicly shameless. (The page is still a fantastic find though.)

nilsbunger

6 hours ago

The maintainer used the original test suite in the rewrite.

Does that make the new code a derivative of the original test suite (also lpgl)?

bengale

8 hours ago

Would it work to have an AI write the spec, and a different AI implement the spec?

I think there are going to be a lot of these types of scenarios where the old way of doing things just doesn't hold.

gloosx

6 hours ago

Man, licensing is funny in the modern day. I sometimes wonder, what would world look like if there was no copyright

dessimus

10 hours ago

Interesting to see how this plays out. Conceivably if running an LLM over text defeats copyright, it will destroy the book publishing industry, as I could run any ebook thru an LLM to make a new text, like the ~95% regurgitated Harry Potter.

timschmidt

10 hours ago

This has already been done via brute force for melodies: https://www.vice.com/en/article/musicians-algorithmically-ge...

amelius

10 hours ago

Did they listen to their own creation?

If not, maybe it should not constitute a valid case in court.

Also, I'm wondering if they are not themselves liable considering they have every copyrighted work in there too.

kingstnap

10 hours ago

You could already do that before LLMs?

Persumably there is already a law around why I cant just go borrow a book from my library, type out some 95% regurgitated varient on my laptop, and then try to publish it somewhere?

Edit: I looked it up and the thing that stops you from publishing a bootleg "Harold Potter and the Wizards Rock" is this legal framework around "The Abstractions Test".

dessimus

8 hours ago

I agree, but I'm not the one claiming that an AI-Assisted rewrite is sufficient enough to now claim that one ignore copyright and change the license.

amelius

10 hours ago

If enough people do this, then it may speed up the lawmaking process.

buro9

8 hours ago

and in a single moment, the value of software patents to companies is fully restored... the software license by itself is not enough to protect software innovation, a non-trivial implementation can now be (reasonably) trivially re-implemented.

I'm sure most people here would agree patents stifle innovation, but if copyright doesn't work for companies then they will turn to a different tool.

DrammBA

15 hours ago

I like the idea of AI-generated ~code~ anything being public domain. Public data in, public domain out.

lejalv

15 hours ago

This could be read as a reformulation of the old adage - "what's mine is mine, and what is yours, is mine too".

So, you can pilfer the commons ("public") but not stuff unavailable in source form.

If we expand your thought experiment to other forms of expression, say videos on YT or Netflix, then yes.

kshri24

15 hours ago

I don't think you can classify "public data in" as public domain. Public data could also include commercial licenses which forbid using it in any way other than what the license states. Just because the source is open for viewing does not necessarily mean it is OSL.

That's the core issue here. All models are trained on ALL source code that is publicly available irrespective of how it was licensed. It is illegal but every company training LLMs is doing it anyways.

fschuett

9 hours ago

> It is illegal

Only (?) in America. In the EU, scraping is legal by default unless explicitly opted out with machine-readable instructions like robots.txt. That covers "training input". For training output, the rule is: "if the output is unrecognizable to the input, the license of the input does not matter" (otherwise, any project X could sue project Y for copyright infringement even if the projects only barely resemble each other). The cases where companies actually got sued were where the output was a direct copy or repetition of the input, even if an LLM was involved.

There is, however, a larger philosophical divide between the US and the EU based on history and religion. The US philosophy is highly individualistic, capitalistic, and considers "first-order principles." Copyright is a "property right": "I own this string of bits, you used them, therefore you owe me" (principle of absolute ownership).

Continental philosophy is more social and considers "second-order / causal effects." Copyright is a "personality right" that exists within a social ecosystem. The focus is on the effect of the action rather than a singular principle like "intellectual property." If the new code provides a secondary benefit to society and doesn't "hurt" the original creator's unique intellectual stamp, the law is inclined to view it as a new work.

In terms of legal sociology, America and Britain are more "individual-property-atomistic" thanks to their Protestant heritage, focusing on the rights of the individual (sola me, and my property, and God). Meanwhile, Europe was, at least to a large part, Catholic (esp. France), which focuses more on works, results, and effects on society to determine morality. While the states are officially secular, the heritage of this echoes in different definitions of what is considered "legal" or "moral", depending on which side of the ocean you are on.

thedevilslawyer

15 hours ago

Copyright is not a blacklist but an allowlist of things kept aside for the holder. Everything else is free game. LLM ingestion comes under fair use so no worries. If someone can get their hand on it, nothing in law stops it from training ingestion.

We can debate if this law is moral. Like the GP I took agree public data in -> public domain out is what's right for society. Copyright as an artificial concept has gone on for long enough.

kshri24

14 hours ago

> LLM ingestion comes under fair use

I don't think so. It is no where "limited use". Entirety of the source code is ingested for training the model. In other words, it meets the bar of "heart of the work" being used for training. There are other factors as well, such as not harming owner's ability to profit from original work.

thedevilslawyer

14 hours ago

https://www.skadden.com/insights/publications/2025/07/fair-u...

Both Meta and Anthropic were vindicated for their use. Only for Anthropic was their fine for not buying upfront.

shakna

11 hours ago

Alsup absolutely did not vindicate Anthropic as "fair use".

> Instead, it was a fair use because all Anthropic did was replace the print copies it had purchased for its central library with more convenient space-saving and searchable digital copies for its central library — without adding new copies, creating new works, or redistributing existing copies. [0]

It was only fair use, where they already had a license to the information at hand.

[0] https://storage.courtlistener.com/recap/gov.uscourts.cand.43...

kshri24

14 hours ago

This hasn't gone to Supreme Court yet. And this is just USA. Courts in rest of the World will also have to take a call. It is not as simple as you make it out to be. Developers are spread across the World with majority living outside USA. Jurisdiction matters in these things.

thedevilslawyer

13 hours ago

You're holding out for some grace on this from the wrong venue. The right avenue would be lobbying for new laws to regulate and use LLMs, not try to find shelter in an archaic and increasingly irrelevant bit of legalese.

kshri24

12 hours ago

I don't disagree. However, just because your assertion of copyright being initially defined by US (which is not the fact. It was England that came up with it and was adopted by the Commonwealth which US was also a part of until its independence) does not mean jurisdiction is US. Even if US Supreme Court rules one way or the other, it doesn't matter as the rest of the World have its own definitions and legalese that need to be scrutinized and modernized.

gf000

14 hours ago

There are hardly any rulings/laws about the topic, and it quite obviously changes the picture of licenses.

DannyBee

3 hours ago

Lawyer here. Its not. This article is highly confused. The case was about whether an AI could be considered an author for copyright purposes. Mainly as a way of arguing for robot rights, not copyright. The person listed the AI as the sole author: On the application, Dr. Thaler listed the Creativity Machine as the work’s sole author and himself as just the work’s owner.

This is not the first time someone tried to say a machine is the author. The law is quite clear, the machine cant be an author for copyright purposes. Despite all the confused news articles, this does not mean if claude writes code for you it is copyright free. It just means you are the author. Machines being used as tools to generate works is quite common, even autonomously. ill steal from the opinion here:

In 1974, Congress created the National Commission on New Technological Uses of Copyrighted Works (“CONTU”) to study how copyright law should accommodate “the creation of new works by the application or intervention of such automatic systems or machine reproduction.”

...

This understanding of authorship and computer technology is reflected in CONTU’s final report: On the basis of its investigations and society’s experience with the computer, the Commission believes that there is no reasonable basis for considering that a computer in any way contributes authorship to a work produced through its use. The computer, like a camera or a typewriter, is an inert instrument, capable of functioning only when activated either directly or indirectly by a human. When so activated it is capable of doing only what it is directed to do in the way it is directed to perform.

...

IE When you use a computer or any tool you are still the author.

The court confirms this later:

Contrary to Dr. Thaler’s assumption, adhering to the human-authorship requirement does not impede the protection of works made with artificial intelligence. Thaler Opening Br. 38-39. First, the human authorship requirement does not prohibit copyrighting work that was made by or with the assistance of artificial intelligence. The rule requires only that the author of that work be a human being—the person who created, operated, or used artificial intelligence—and not the machine itself. The Copyright Office, in fact, has allowed the registration of works made by human authors who use artificial intelligence.

There are cases where the use of AI made something uncopyrightable, even when a human was listed as the author, but all of the ones i know are image related.

DrammBA

2 hours ago

> Lawyer here. Its not. This article is highly confused.

Did you reply to the wrong comment? I was just saying I like the idea of AI-generated anything being public domain, not that it currently is/isn't.

benob

15 hours ago

What about doing that with movies and music?

zodmaner

14 hours ago

The results would be the same: AI generated music and movies will be public domain.

nkmnz

12 hours ago

So you’d lose all rights on pictures of yourselves if they were generated by AI? Would this be true even for nudes?

pseudalopex

8 hours ago

nkmnz

5 hours ago

I did not refer to privacy rights. If you post a photo of yourselves online, you're giving up on a tiny part of your privacy rights. So my question still stands: would running your photos that you have taken of yourselves through a diffusion model rip your copyright of your photo?

DrammBA

5 hours ago

Yes, anything AI-generated should be public domain including the AI-generated picture that used your photo as input.

benterix

7 hours ago

> making it a gray area for corporate users and a headache for its most famous consumer.

Who is its most famous consumer?

foota

15 hours ago

I think the more interesting question here would be if someone could fine tune an open weight model to remove knowledge of a particular library (not sure how you'd do that, but maybe possible?) and then try to get it to produce a clean room implementation.

benob

14 hours ago

I don't think this would qualify as clean room (the Library was involved in learning to generate programs as a whole). However, it should be possible to remove the library from the OLMO training data and retrain it from scratch.

But what about training without having seen any human written program? Coul a model learn from randomly generated programs?

foota

13 hours ago

> I don't think this would qualify as clean room (the Library was involved in learning to generate programs as a whole)

Hm... I mean this is really one for the lawyers, but IMO you would likely successfully be able to argue that the marginal knowledge of general coding from a particular library is likely close to nil.

The hard part here imo would be convincingly arguing that you can wipe out knowledge of the library from the training set, whether through fine tuning or trying to exclude it from the dataset.

> But what about training without having seen any human written program? Coul a model learn from randomly generated programs?

I think the answer at this point is definitely no, but maybe someday. I think it's a more interesting question for art since it's more subjective, if we eventually get to a point where a machine can self-teach itself art from nothing... first of all how, but second of all it would be interesting to see the reaction from people opposed to AI art on the basis of it training off of artists.

Honestly given all I've seen models do, I wouldn't be too surprised if you could somehow distill a (very bad) image generation model off of just an LLM. In a sense this is the end goal of the pelican riding a bicycle (somewhat tongue in cheek), if the LLM can learn to draw anything with SVGs without ever getting visual inputs then it would be very interesting :)

gbuk2013

11 hours ago

In mind, if you feed code into an AI model then the output is clearly a derivative work, with all the licensing implications. This seems objectively reasonable?

MagicMoonlight

7 hours ago

But code was fed into the models to create them, so where do we draw the line?

quotemstr

10 hours ago

Nobody in this discussion knows what the words "derivative" and "work" mean individually, much less together

ekjhgkejhgk

8 hours ago

> Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT

Does this argument make sense? Even before LLMs, a developer could "rewrite this in a different style" and release it under a different license. Why are LLMs a new element in this argument?

s0ss

8 hours ago

Because now with an LLM it’s almost trivial to do this? Before it was not.

gunapologist99

7 hours ago

> the U.S. Supreme Court (on March 2, 2026) declined to hear an appeal regarding copyrights for AI-generated material. By letting lower court rulings stand, the Court effectively solidified a “Human Authorship” requirement.

Not quite. A cert denial isn’t a merits ruling and doesn’t "solidify" anything as Supreme Court precedent. It simply leaves the DC Circuit decision binding (within that circuit) and the Copyright Office’s human-authorship policy intact, for now.

SCOTUS doesn’t explain cert denials, so why they denied is guesswork. my guess: they’re letting it percolate while the tech matures and we all start to realize how deep this seismic fracture really is.

(For example: what does "ownership" of intellectual "property" even mean, once "authorship" is partly probabilistic/synthetic, and once almost everything humans create is AI assisted? Hard to draw bright lines.)

skeledrew

11 hours ago

Looks like copyright just died.

cedws

11 hours ago

*for ordinary people. If you use AI to steal from rich and powerful people, expect the law to come down on you like a tonne of bricks. If you steal from authors, artists, and developers no worries.

blamestross

12 hours ago

Intellectual property laundering is the core and primary value of LLMs. Everything else is "bonus".

dspillett

10 hours ago

> Accepting AI-rewriting as relicensing could spell the end of Copyleft

The more restrictive licences perhaps, though only if the rewriter convinces everyone that they can properly maintain the result. For ancient projects that aren't actively maintained anyway (because they are essentially done at this point) this might make little difference, but for active projects any new features and fixes might result in either manual reimplementation in the rewritten version or the clean-room process being repeated completely for the whole project.

> chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API —

(from the github description)

The “same name” part to me feels somewhat disingenuous. It isn't the same thing so it should have a different name to avoid confusion, even if that name is something very similar to the original like chardet-ng or chardet-ai.

throwayaw84330

6 hours ago

This is super interesting. Exploring the basis for Free Software (the 4 liberties, Richard Stallman)... if AI-code is effectively under Public Domain, wouldn't that actually be even MORE defensive than relying on copyright to be able to generate copyleft? Wouldn't the rewrite of code (previously under any license, and maybe even unknown to the LLM) constitute a massive win for the population in general, because now their 4 liberties are more attainable through the extensive use of LLMs to generate code?

dspillett

5 hours ago

Many copyleft licences give more rights to the user of the software than being public domain would.

A bit of public domain code can be used in a hidden way in perpetuity.

A bit of code covered by AGPL3 (for instance) (and other GPLs depending on context) can be used for free too, but with the extra requirement that users be given a copy of the code, and derivative works, upon request.

This is why the corps like MIT and similar and won't touch anything remotely like GPL (even LGPL which only covers derivative works of the library not the wider project). The MIT licence can be largely treated as public domain.

conartist6

10 hours ago

Who cares if it can be maintained. The system now penalizes the original creator for creating it and gives thieves the ability to conduct legal theft at a gargantuan scale, the only limit being how creative the abuser is in making money.

With the incentives set up like that, the era of open software cooperation would be ended rapidly.

dspillett

5 hours ago

> Who cares if it can be maintained.

People who understand and care about the implications of https://xkcd.com/2347/

Which admittedly is not nearly enough of us…

tgma

10 hours ago

Isn't AFC test applicable here?

duskdozer

10 hours ago

This is such scummy behavior.

tokai

6 hours ago

"I am not a lawyer, nor am I an expert in copyright law or software licensing."

Why would anyone waste their time reading what they wrote then?

gspr

12 hours ago

> If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. Any developer could take a GPL-licensed project, feed it into an LLM with the prompt “Rewrite this in a different style,” and release it under MIT. The legal and ethical lines are still being drawn, and the chardet v7.0.0 case is one of the first real-world tests.

This isn't even limited to "the end of copyleft"; it's the end of all copyright! At least copyright protecting the little guy. If you have deep enough pockets to create LLMs, you can in this potential future use them to wash away anyone's copyright for any work. Why would the GPL be the only target? If it works for the GPL, it surely also works for your photographs, poetry – or hell even proprietary software?

b65e8bee43c2ed0

9 hours ago

at this point, every corporation in the world has AI slop in their software. any attempt to outlaw it would attract enough funding from the oligarchs for the opposition to dethrone any party. no attempts will be made in the next three years, obviously, and then it will be even more late than it is now.

and while particularly diehard believers in democracy may insist that if they kvetch hard enough they can get things they don't like regulated out of existence, they pointedly ignore the elephant in the room. they could succeed beyond their wildest dreams - get the West to implement a moratorium on AI, dismantle every FAGMAN, Mossad every researcher, send Yudkowskyjugend death squads to knock down doors to seize fully semiautomatic assault GPUs, and none of it will make any fucking difference, because China doesn't give a fuck.

MagicMoonlight

7 hours ago

Logically, feeding in the old code to generate the new would be banned, because it’s stealing the content.

But if that were true, every single LLM is illegal, because they’ve all stolen terabytes of books and code.

verdverm

15 hours ago

Interesting questions raised by recent SCOTUS refusal to hear appeals related to AI an copyright-ability, and how that may affect licensing in open source.

Hoping the HN community can bring more color to this, there are some members who know about these subjects.

andrewstuart

8 hours ago

Ai rewrites great.

But if it’s making the original author unhappy then why do it.

est

14 hours ago

Uh, patricide?

The key leap from gpt3 to gpt-3.5 (aka ChatGPT) was code-davinci-002, which is trained upon Github source code after OpenAI-Microsoft partnership.

Open source code contributed much to LLM's amazing CoT consistency. If there's no Open Source movement, LLM would be developed much later.

RcouF1uZ4gsC

10 hours ago

> The copyright vacuum: If AI-generated code cannot be copyrighted (as the courts suggest), then the maintainers may not even have the legal standing to license v7.0.0 under MIT or any license.

I believe this is a misunderstanding of the ruling. The code can’t be copyrighted by a LLM. However, the code could be copyrighted by the person running the LLM.

jacquesm

9 hours ago

If you don't understand the meaning of what a 'derived work' is then you should probably not be doing this kind of thing without a massive disclaimer and/or having your lawyer doing a review.

There is no such thing as the output of an LLM as a 'new' work for copyright purposes, if it were then it would be copyrightable and it is not. The term of art is 'original work' instead of 'new'.

The bigger issue will be using tools such as these and then humans passing off the results as their own because they believe that their contribution to the process whitewashes the AI contributions to the point that they rise to the status of original works. "The AI only did little bits" is not a very strong defense though.

If you really want to own the work-product simply don't use AI during the creation. You can use it for reviews, but even then you simply do not copy-and-paste from the AI window to the text you are creating (whether code or ordinary prose isn't really a difference).

I've seen a copyright case hinge on 10 lines of unique code that were enough of a fingerprint to clinch the 'derived work' assessment. Prize quote by the defendant: "We stole it, but not from them".

There is a very blurry line somewhere in the contents of any large LLM: would a model be able to spit out the code that it did if it did not have access to similar samples and to what degree does that output rely on one or more key examples without which it would not be able to solve the problem you've tasked it with?

The lower boundary would be the most minimal training set required to do the job, and then to analyze what the key corresponding bits were from the inputs that cause the output to be non-functional if they were dropped from the training set.

The upper boundary would be where completely non-related works and general information rather than other parties copyrighted works would be sufficient to do the creation.

The easiest way to loophole this is to copyright the prompt, not the work product of the AI, after all you should at least be able to write the prompt. Then others can re-create it too, but that's usually not the case with these AI products, they're made to be exact copies of something that already exists and the prompt will usually reflect that.

That's why I'm a big fan of mandatory disclosure of whether or not AI was used in the production of some piece of text, for one it helps to establish whether or not you should trust it, who is responsible for it and whether the person publishing it has the right to claim authorship.

Using AI as a 'copyright laundromat' is not going to end up well.

spwa4

13 hours ago

Can we do the same with universal music? Because that's easy and already possible. Or Microsoft Windows? Because we all know the answer: if it works, essentially any government will immediately call it illegal.

Because if this isn't allowed, that makes all of the AI models themselves illegal. They are very much the product of using others' copyrighted stuff and rewriting it.

But of course this will be allowed because copyright was never meant to protect anyone small. And that it's in direct contradiction with what applies to large companies? Courts won't care.

gspr

12 hours ago

The dark future possibility here is that the big guy is allowed to launder the intellectual property of the little guy, but not vice versa.

vetrom

11 hours ago

That dark future is now, look at case law as applied to the AI operators vs the 'little guys'.

spwa4

11 hours ago

Even big copyright firms. Disney especially is known for rehashing existing material and then not allowing anyone else to do the same with their stuff. Disney does not have a lot of original stories.

oytis

9 hours ago

Is it just me, or HN recently started picking up a social media dynamics with contributions reacting/responding to each other?

altairprime

9 hours ago

It’s always happened occasionally. Sometimes you’ll also see informative supporting links popup in the feed, though those generally get minimal traction.

himata4113

14 hours ago

I mean in my opinion GPL licensed code should just infect models forcing them to follow the license.

You can do this a lot by saying things like: complete the code "<snippet from gpl licensed code>".

And if now the models are GPL licensed the problem of relicensing is gone since the code produced by these models should in theory be also GPL licensed.

Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.

kshri24

14 hours ago

> Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.

Can you point to the clause? I have never seen it in any GPL license.

himata4113

13 hours ago

it's the general copyright protection 'law' fair use and all that, varies by country tho.

Cantinflas

11 hours ago

"If “AI-rewriting” is accepted as a valid way to change licenses, it represents the end of Copyleft. "

Software in the AI era is not that important.

Copyleft has already won, you can have new code in 40 seconds for $0.70 worth of tokens.