
Is “reproducibility” actually the right term here?

It’s a bit like arguing that Linux is not open source because you don’t have every email Linus and the maintainers ever received. Or that you don’t know what lectures Linus attended or what books he’s read.

The weights “are the thing” in the same sense that the “code is the thing”. You can modify open code and recompile it. You can similarly modify weights with fine-tuning or even architectural changes. You don't need to go “back to the beginning”, in the same sense that Linux would continue to be open source even without the Git history and the LKML archives.



> It’s a bit like arguing that Linux is not open source because you don’t have every email Linus and the maintainers ever received. Or that you don’t know what lectures Linus attended or what books he’s read.

Linux is open source because you can actually compile it yourself! You don't need Linus's emails for that (and if you needed some secret cryptographic key on Linus's laptop to decrypt and compile the kernel, it wouldn't make sense to call it open source either).

A language model isn't a piece of code; it's a huge binary blob executed by a small piece of code that contains little of the added value. Everything that matters is in the blob. Sharing only the compiled blob and the code to run it makes it unsuitable for the “open source” qualifier. (It's much the same thing as proprietary Java code: the VM is open source but the bytecode you run on it isn't.)

And yes, you can fine-tune and change things in the model weights themselves, the same way you can edit the binary of a proprietary game to disable DRMs; that doesn't make it open source either. Fine-tuning doesn't give you the same level of control over the behavior of the model as the initial training does, just as binary hacking doesn't give you the same control as having the source code to edit and rebuild.


It's a blob that costs over $10,000,000 in electricity to compile. Even if they released everything, only the rich could push go.
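As a rough back-of-envelope check of why a training run lands in that price range (every number below is a hypothetical assumption for illustration, not a figure from any actual run):

```python
# Back-of-envelope electricity cost of a large training run.
# All inputs are assumed, illustrative values.
gpus = 24_000        # accelerators in the cluster (assumed)
gpu_kw = 0.7         # power draw per accelerator, kW (assumed)
pue = 1.5            # datacenter overhead factor (assumed)
days = 70            # length of the training run (assumed)
usd_per_kwh = 0.12   # industrial electricity price (assumed)

kwh = gpus * gpu_kw * pue * 24 * days
cost = kwh * usd_per_kwh
print(f"{kwh:,.0f} kWh -> ${cost:,.0f}")  # 42,336,000 kWh -> $5,080,320
```

Doubling any one of the assumed inputs puts the bill into the eight-figure range, which is the point: the cost scales linearly with cluster size, run length, and power price.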


There is an argument to be made about the importance of archeological preservation of the provenance of models, especially the first few important LLMs, for study by future generations.

In general, software rot is a huge issue, and many projects which may be of future archeological importance are increasingly non-reproducible, as dependencies are often not vendored and checked into source control but instead downloaded at compile time from servers that lack strong guarantees about future availability.


This comment is cooler than my Arctic Vault badge on GitHub.

Who were the countless unknown contemporaries of Giotto and Cimabue? Of Da Vinci and Michelangelo? Most of what we know about Renaissance art comes from one guy, Giorgio Vasari. We have more diverse information about ancient Egypt than the much more recent Italian Renaissance because of, essentially, better preservation techniques.

Compliance, interoperability, and publishing platforms for all this work (HuggingFace, Ollama, GitHub, HN) are our cathedrals and clay tablets. Who knows what works will fill the museums of tomorrow.


In today's Dwarkesh interview, Zuckerberg talks about energy becoming a limit for future models before cost or access to hardware does. Apparently the largest current datacenters consume about 100 MW, but Zuck is considering future ones consuming 1 GW, which is the output of a typical nuclear reactor!

So, yeah, unless you own your own world-class datacenter, complete with the nuclear reactor necessary to power the training run, then training is not an option.


On a sufficiently large time scale the real limit on everything is energy. “Cost” and “access to hardware” are mere proxies for energy available to you. This is the idea behind the Kardashev scale.


A bit odd to see this downvoted... I'm not exactly an HN newbie, but I still haven't fully grasped the reasons people often downvote here. Simply not liking something (regardless of relevance or correctness) often seems to be the case, and perhaps sometimes even pettier reasons.

I think Zuck's discussion of energy being the limiting factor was one of the more interesting and surprising things to come out of the Dwarkesh interview. We're used to discussion of $1B, $10B, $100B training runs becoming unsustainable, and of chip shortages as an issue, but (to me at least!) it was interesting to see Zuck say that energy usage will be a disruptor before those are (partly because of lead times and regulations in expanding power supply and bringing it into new data centers). The sheer magnitude of projected power consumption is also interesting.


There is an odd contingent, or set of contingents, on here that does seem to downvote by ideology rather than for lack of facts or lack of courtesy. It's a bit of a shame, but I'm not sure there's much to be done.


> the same way you can edit the binary of a proprietary game to disable DRMs, that doesn't make it open-source either

This is where I have to disagree. Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.


> Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.

In practice it's not (because LoRA), but that doesn't matter: continuing the training is just a patch on top of the initial training. It doesn't matter that the patch is applied through gradient descent as well; you are completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.

For instance, Meta could backdoor the model with a specially crafted group of rare tokens to which the model would give a predetermined response (say “This is Llama 3 from Meta” as some kind of watermark), and you'd have no way to find it and get rid of it during fine-tuning. This kind of thing cannot happen when you have access to the sources.
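The “patch on top of the initial training” point can be made concrete. A minimal sketch of the LoRA idea, with illustrative shapes (not from any real model): the pretrained weights stay frozen, and only a low-rank delta `A @ B` is trained on top of them.

```python
import numpy as np

# Minimal LoRA sketch: frozen base weights W plus a trainable
# low-rank patch A @ B. Shapes are illustrative only.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                  # layer dims and LoRA rank, r << min(d, k)

W = rng.standard_normal((d, k))      # frozen pretrained weights
A = rng.standard_normal((d, r))      # trainable down-projection
B = np.zeros((r, k))                 # trainable up-projection, initialized to 0

def forward(x):
    # Effective weights = base + low-rank delta; only A and B get gradients.
    return x @ (W + A @ B)

x = rng.standard_normal((1, d))
# With B initialized to zero, the patched model starts out identical
# to the base model:
assert np.allclose(forward(x), x @ W)

# However it is trained, the delta A @ B can never exceed rank r, one
# concrete sense in which the patch cannot overwrite arbitrary behavior
# baked into W.
B = rng.standard_normal((r, k))      # stand-in for a trained B
assert np.linalg.matrix_rank(A @ B) <= r
```

The rank bound is the whole trade-off: the patch is cheap precisely because it can only move the weights within an r-dimensional subspace per layer.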


> (because LoRA)

That's one of many techniques, and is popular because it's cheap to implement. The training of a full model can be continued with full updates, the same as the original training run.

> completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.

Not necessarily. You can even alter the architecture! There have been many papers on approaches such as extending context window sizes, adding skip connections, quantization, sparsity, and so on.
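Quantization is the simplest of those to sketch. A hedged illustration (toy shapes, symmetric int8 scheme, not any particular library's implementation): the blob's representation changes without any retraining.

```python
import numpy as np

# Symmetric int8 post-training quantization of one weight matrix.
# Toy example: shapes and values are illustrative only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)

scale = np.abs(W).max() / 127.0            # map the largest weight to +/-127
W_q = np.round(W / scale).astype(np.int8)  # the quantized blob (4x smaller)
W_dq = W_q.astype(np.float32) * scale      # dequantized approximation

# Rounding error is bounded by half a quantization step:
assert np.abs(W - W_dq).max() <= scale / 2 + 1e-6
```

Note this rather supports the parent's point too: you can transform the blob freely, but every such transformation treats the weights as given; none of them tells you what the weights encode.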

> specially crafted group of rare tokens

The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvalds's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...

That's not what open source is. The code is open, not the process that it took to get there.

Linux development may have used information from copyrighted textbooks. The source code doesn't contain the text of those textbooks, and in some sense could not be "reproduced" without the copyrighted text.

Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.


> Not necessarily. You can even alter the architecture!

You can alter the architecture, but you're still playing with an opaque binary blob *you don't know what it's made of*.

> The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvald's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...

No, it is just a bad analogy. To be sure that there's no backdoor in the Linux kernel, the code itself suffices. That doesn't mean there can be no backdoor, since the kernel is complex enough to hide things in, but that's not the same as a backdoor hidden in a binary blob you couldn't inspect even if you had a trillion dollars to spend on a million developers.

> The code is open, not the process that it took to get there.

The code is by definition part of a process that gets you a piece of software (the actually useful binary), and it's the part of the process that contains most of the value. Model weights are binary, and they are akin to the compiled binary of the software (training from data being a compute-intensive step like compilation from source code, but orders of magnitude more intensive).

> Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.

Court decisions are pending on the mere legality of such training, and that has nothing to do with being open source; what's at stake is whether these models can even be open-weight, or whether publishing them is copyright infringement.




