> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).
I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
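For the curious, a minimal sketch of what I mean by a two-way interoperation test: compress with one implementation, decompress with the other, in both directions, and require the payload to survive byte-for-byte. Here Python's zlib stands in for both sides so the sketch is self-contained; in the real test suites the "new" side is the reimplementation and the "original" side shells out to the stock tool.

```python
import zlib

def round_trip(data, compress, decompress):
    """Compress with one implementation, decompress with the other,
    and require the payload to survive byte-for-byte."""
    return decompress(compress(data)) == data

# Stand-ins for the two sides. In a real interop test, compress_new /
# decompress_new would call the reimplementation, and compress_orig /
# decompress_orig would invoke the original tooling (e.g. via subprocess).
compress_new, decompress_new = zlib.compress, zlib.decompress
compress_orig, decompress_orig = zlib.compress, zlib.decompress

payload = b"spec compliance is judged by the bytes on the wire"
assert round_trip(payload, compress_new, decompress_orig)   # new -> original
assert round_trip(payload, compress_orig, decompress_new)   # original -> new
```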
But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean room implementation since the LLM, one must assume, is trained on the code it's re-implementing.
But the LLM has seen millions (?) of other codebases too. If you give it a functional spec, it has no reason to prefer any one of those codebases in particular. Except perhaps if it has seen the original spec (if such can be read from public sources) associated with the old implementation, and the new spec is a copy of the old spec.
Yes, if you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.
Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
> that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
Kinda weird argument; in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people who can do so without LLMs; by this logic, everything they write is a copyright infringement of every book they can reproduce.
> Yes, if you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.
I think you're overestimating LLM ability to generalize.
The point about Harry Potter was just that the verbatim text for popular text in the training set is in there.
It’s the same as when you ask a model to generate an Italian plumber with overalls and it produces something close enough to Mario to be a copyright violation.
If you ask it to solve a very specific problem for which there is a solution well represented in its train set, you can definitely get back enough verbatim snippets to cause problems.
It’s also not a theoretical problem, you can Google for studies showing real world production of verbatim code with non-adversarial prompts.
This is not an argument against coding in a different language, though. It would be like having it restate Harry Potter in a different language with different main character names, and reshuffled plot points.
If you find a single paragraph that is a direct translation with different names that’s definitely enough for copyright infringement.
Reshuffling plot points is doing a lot of lifting here. Just looking at a specific chapter near the end of the book: if you change the order of the trials, change the names, and translate it into a different language, you're still going to have a very hard time arguing that what you've produced isn't a derivative work.
Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.
My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.
Surely if I took a program written in Python and translated it line for line into JavaScript, that wouldn't allow me to treat it as original work. I don't see how this solves the problem, except very incrementally.
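To make that concrete, here's a toy illustration (a made-up function, not taken from any project mentioned in this thread) of why a line-for-line translation stays derivative: the structure, names, and logic map one-to-one, and only the surface syntax changes.

```python
# Hypothetical original, in Python:
def checksum(data):
    total = 0
    for byte in data:
        total = (total + byte) % 256
    return total

# Its mechanical line-for-line JavaScript "translation" (shown as a comment):
# function checksum(data) {
#     let total = 0;
#     for (const byte of data) {
#         total = (total + byte) % 256;
#     }
#     return total;
# }

print(checksum(b"hello"))  # 20 -- identical behavior, identical structure
```

Nothing about the second version is original work; it encodes exactly the same expression of the same solution.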