> With LLMs, the probability is much higher (since in truth they are very much not a "clean room" at all).
I beg to differ. Please examine any of my recent codebases on github (same username); I have cleanroom-reimplemented par2 (par2z), bzip2 (bzip2z), rar (rarz), 7zip (z7z), so maybe I am a good test case for this (I haven't announced this anywhere until now, right here, so here we go...)
I was most particular about the 7zip reimplementation since it is the most likely to be contentious. Here is my repo with the full spec that was created by the "dirty team" and then worked off of by the LLM with zero access to the original source: https://github.com/pmarreck/7z-cleanroom-spec
Not only are they rewritten in a completely different language, but to my knowledge they are also completely different semantically except where they cannot be to comply with the specification. I invite you and anyone else to compare them to the original source and find overt similarities.
With all of these, I included two-way interoperation tests with the original tooling to ensure compatibility with the spec.
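For the curious, a minimal sketch of what I mean by a two-way interoperation test: compress with one implementation, decompress with the other, in both directions, and require the payload to survive byte-for-byte. Here Python's zlib stands in for both sides so the sketch is self-contained; in the real test suites the "new" side is the reimplementation and the "original" side shells out to the stock tool.

```python
import zlib

def round_trip(data, compress, decompress):
    """Compress with one implementation, decompress with the other,
    and require the payload to survive byte-for-byte."""
    return decompress(compress(data)) == data

# Stand-ins for the two sides. In a real interop test, compress_new /
# decompress_new would call the reimplementation, and compress_orig /
# decompress_orig would invoke the original tooling (e.g. via subprocess).
compress_new, decompress_new = zlib.compress, zlib.decompress
compress_orig, decompress_orig = zlib.compress, zlib.decompress

payload = b"spec compliance is judged by the bytes on the wire"
assert round_trip(payload, compress_new, decompress_orig)   # new -> original
assert round_trip(payload, compress_orig, decompress_new)   # original -> new
```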
But that's not really what danlitt said, right? They did not claim that it's impossible for an LLM to generate something different, merely that it's not a clean room implementation since the LLM, one must assume, is trained on the code it's re-implementing.
But the LLM has seen millions (?) of other codebases too. If you give it a functional spec, it has no reason to prefer any one of those codebases in particular. Except perhaps if it has seen the original spec (if such can be read from public sources) associated with the old implementation, and the new spec is a copy of the old spec.
Yes, if you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.
Researchers have shown that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
> that an LLM was able to reproduce the verbatim text of the first 4 Harry Potter books with 96% accuracy.
Kinda weird argument; in their research (https://forum.gnoppix.org/t/researchers-extract-up-to-96-of-...) the LLM was explicitly asked to reproduce the book. There are people who can do so without LLMs; by this logic, everything they write is a copyright infringement of every book they can reproduce.
> Yes, if you are solving the exact problem that the original code solved, and that original code was labeled as solving that exact problem, then that's a very good reason for the LLM to produce that code.
I think you're overestimating LLM ability to generalize.
The point about Harry Potter was just that the verbatim text for popular text in the training set is in there.
It’s the same as when you ask a model to generate an Italian plumber with overalls and it produces something close enough to Mario to be a copyright violation.
If you ask it to solve a very specific problem for which there is a solution well represented in its train set, you can definitely get back enough verbatim snippets to cause problems.
It’s also not a theoretical problem, you can Google for studies showing real world production of verbatim code with non-adversarial prompts.
This is not an argument against coding in a different language, though. It would be like having it restate Harry Potter in a different language with different main character names, and reshuffled plot points.
If you find a single paragraph that is a direct translation with different names that’s definitely enough for copyright infringement.
Reshuffling plot points is doing a lot of lifting here. Just looking at a specific chapter near the end of the book: if you change the order of the trials, change the names, and translate it into a different language, you're still going to have a very hard time arguing that what you've produced isn't a derivative work.
Exactly - it very likely was trained on it. I tried this with Opus 4.6. I turned off web searches and other tool calls, and asked it to list some filenames it remembers being in the 7-zip repo. It got dozens exactly right and only two incorrect (they were close but not exact matches). I then asked it to give me the source code of a function I picked randomly, and it got the signature spot on, but not the contents.
My understanding of cleanroom is that the person/team programming is supposed to have never seen any of the original code. The agent is more like someone who has read the original code line by line, but doesn't remember all the details - and isn't allowed to check.
Surely if I took a program written in Python and translated it line for line into JavaScript, that wouldn't allow me to treat it as original work. I don't see how this solves the problem, except very incrementally.
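To make that concrete, here's a toy illustration (a made-up function, not taken from any project mentioned in this thread) of why a line-for-line translation stays derivative: the structure, names, and logic map one-to-one, and only the surface syntax changes.

```python
# Hypothetical original, in Python:
def checksum(data):
    total = 0
    for byte in data:
        total = (total + byte) % 256
    return total

# Its mechanical line-for-line JavaScript "translation" (shown as a comment):
# function checksum(data) {
#     let total = 0;
#     for (const byte of data) {
#         total = (total + byte) % 256;
#     }
#     return total;
# }

print(checksum(b"hello"))  # 20 -- identical behavior, identical structure
```

Nothing about the second version is original work; it encodes exactly the same expression of the same solution.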