> the same way you can edit the binary of a proprietary game to disable DRMs, that doesn't make it open-source either
This is where I have to disagree. Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.
> Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.
In practice it's not (because LoRA) but that doesn't matter: continuing the training is just a patch on top of the initial training, it doesn't matter if this patch is applied through gradient descent as well, you are completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.
For instance, Meta could backdoor the model with specially crafted group of rare tokens to which the model would respond a pre-determined response (say “This is Llama 3 from Meta” as some kind of watermark), and you'd have no way to figure out and get rid of it during fine-tuning. This kind of things does not happen when you have access to the sources.
That's one of many techniques, and is popular because it's cheap to implement. The training of a full model can be continued with full updates, the same as the original training run.
> completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.
Not necessarily. You can even alter the architecture! There have been many papers about various approaches such as extending token window sizes, or adding additional skip connections, quantization, sparsity, or whatever.
> specially crafted group of rare tokens
The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvald's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...
That's not what open source is. The code is open, not the process that it took to get there.
Linux development may have used information from copyrighted textbooks. The source code doesn't contain the text of those textbooks, and in some sense could not be "reproduced" without the copyrighted text.
Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.
> Not necessarily. You can even alter the architecture!
You can alter the architecture, but you're still playing with an opaque blob of binary *you don't know what it's made of*.
> The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvald's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...
No, it is just a bad analogy. To be sure that there's no backdoor in the Linux kernel, the code itself suffice. That doesn't mean there can be no backdoor since it's complex enough to hide things in it, but it's not the same thing as a backdoor hidden in a binary blob you cannot inspect even if you had a trillion dollar to spend on a million of developers.
> The code is open, not the process that it took to get there.
The code is by definition a part of a process that gets you a piece of software (which is the actually useful binary), and it's the part of the process that contains most of the value. Model weights are binary, and they are akin to the compiled binary of the software (training from data being a compute-intensive like compilation from source code, but orders of magnitude more intensive).
> Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.
Court decisions are pending on the mere legality of such training, and it has nothing to do with being open-source, what's at stake is whether or not these models can be open-weight or if it is copyright infringement to publish the models.
This is where I have to disagree. Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.