
Is “reproducibility” actually the right term here?

It’s a bit like arguing that Linux is not open source because you don’t have every email Linus and the maintainers ever received. Or that you don’t know what lectures Linus attended or what books he’s read.

The weights “are the thing” in the same sense that the “code is the thing”. You can modify open code and recompile it. You can similarly modify weights with fine-tuning or even architectural changes. You don't need to go “back to the beginning”, in the same sense that Linux would continue to be open source even without the Git history and the LKML archives.



> It’s a bit like arguing that Linux is not open source because you don’t have every email Linus and the maintainers ever received. Or that you don’t know what lectures Linus attended or what books he’s read.

Linux is open source because you can actually compile it yourself! You don't need Linus's emails for that (and if you needed some secret cryptographic key on Linus's laptop to decrypt and compile the kernel, it wouldn't make sense to call it open source either).

A language model isn't a piece of code; it's a huge binary blob executed by a small piece of code that contains little of the added value. Everything that matters is in the blob. Sharing only the compiled blob and the code to run it makes it unsuitable for the “open source” qualifier. (It's much the same thing as proprietary Java code: the VM is open source but the bytecode you run on it isn't.)

And yes, you can fine-tune and change things in the model weights themselves, the same way you can edit the binary of a proprietary game to disable DRMs; that doesn't make it open source either. Fine-tuning doesn't give you the same level of control over the behavior of the model as the initial training does, just as binary hacking doesn't give you the same control as having the source code to edit and rebuild.


It's a blob that costs over $10,000,000 in electricity to compile. Even if they released everything, only the rich could push go.
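As a rough back-of-envelope check of why a training run lands in that price range (every number below is a hypothetical assumption for illustration, not a figure from any actual run):

```python
# Back-of-envelope electricity cost of a large training run.
# All inputs are assumed, illustrative values.
gpus = 24_000        # accelerators in the cluster (assumed)
gpu_kw = 0.7         # power draw per accelerator, kW (assumed)
pue = 1.5            # datacenter overhead factor (assumed)
days = 70            # length of the training run (assumed)
usd_per_kwh = 0.12   # industrial electricity price (assumed)

kwh = gpus * gpu_kw * pue * 24 * days
cost = kwh * usd_per_kwh
print(f"{kwh:,.0f} kWh -> ${cost:,.0f}")  # 42,336,000 kWh -> $5,080,320
```

Doubling any one of the assumed inputs puts the bill into the eight-figure range, which is the point: the cost scales linearly with cluster size, run length, and power price.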


There is an argument to be made about the importance of archeological preservation of the provenance of models, especially the first few important LLMs, for study by future generations.

In general, software rot is a huge issue, and many projects which may be of future archeological importance are increasingly non-reproducible, as dependencies are often not vendored and checked into source control but instead downloaded at compile time from servers that lack strong guarantees about future availability.


This comment is cooler than my Arctic Vault badge on GitHub.

Who were the countless unknown contemporaries of Giotto and Cimabue? Of Da Vinci and Michelangelo? Most of what we know about Renaissance art comes from one guy, Giorgio Vasari. We have more diverse information about ancient Egypt than the much more recent Italian Renaissance because of, essentially, better preservation techniques.

Compliance, interoperability, and publishing platforms for all this work (HuggingFace, Ollama, GitHub, HN) are our cathedrals and clay tablets. Who knows what works will fill the museums of tomorrow.


In today's Dwarkesh interview, Zuckerberg talks about energy becoming a limit for future models before cost or access to hardware does. Apparently the largest current datacenters consume about 100 MW, but Zuck is considering future ones consuming 1 GW, which is the output of a typical nuclear reactor!

So, yeah, unless you own your own world-class datacenter, complete with the nuclear reactor necessary to power the training run, then training is not an option.


On a sufficiently large time scale the real limit on everything is energy. “Cost” and “access to hardware” are mere proxies for energy available to you. This is the idea behind the Kardashev scale.


A bit odd to see this downvoted... I'm not exactly an HN newbie, but I still haven't fully grasped the reasons people often downvote here. Simply not liking something (regardless of relevance or correctness) often seems to be the case, and perhaps sometimes even pettier reasons.

I think Zuck's discussion of energy being the limiting factor was one of the more interesting and surprising things to come out of the Dwarkesh interview. We're used to discussion of $1B, $10B, $100B training runs becoming unsustainable, and of chip shortages as an issue, but (to me at least!) it was interesting to see Zuck say that energy usage will be a disruptor before those are (partly because of lead times and regulations in expanding power supply and bringing it into new data centers). The sheer magnitude of projected power consumption is also interesting.


There is an odd contingent, or set of contingents, on here that does seem to downvote by ideology rather than for lack of facts or lack of courtesy. It's a bit of a shame, but I'm not sure there's much to be done.


> the same way you can edit the binary of a proprietary game to disable DRMs, that doesn't make it open-source either

This is where I have to disagree. Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.


> Continuing the training of an open model is the same process as the original training run. It's not a fundamentally different operation.

In practice it's not (because LoRA), but that doesn't matter: continuing the training is just a patch on top of the initial training. It doesn't matter that the patch is applied through gradient descent as well; you are completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.

For instance, Meta could backdoor the model with a specially crafted group of rare tokens to which the model would give a predetermined response (say “This is Llama 3 from Meta” as some kind of watermark), and you'd have no way to find it and get rid of it during fine-tuning. This kind of thing cannot happen when you have access to the sources.
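The “patch on top of the initial training” point can be made concrete. A minimal sketch of the LoRA idea, with illustrative shapes (not from any real model): the pretrained weights stay frozen, and only a low-rank delta `A @ B` is trained on top of them.

```python
import numpy as np

# Minimal LoRA sketch: frozen base weights W plus a trainable
# low-rank patch A @ B. Shapes are illustrative only.
rng = np.random.default_rng(0)
d, k, r = 64, 64, 4                  # layer dims and LoRA rank, r << min(d, k)

W = rng.standard_normal((d, k))      # frozen pretrained weights
A = rng.standard_normal((d, r))      # trainable down-projection
B = np.zeros((r, k))                 # trainable up-projection, initialized to 0

def forward(x):
    # Effective weights = base + low-rank delta; only A and B get gradients.
    return x @ (W + A @ B)

x = rng.standard_normal((1, d))
# With B initialized to zero, the patched model starts out identical
# to the base model:
assert np.allclose(forward(x), x @ W)

# However it is trained, the delta A @ B can never exceed rank r, one
# concrete sense in which the patch cannot overwrite arbitrary behavior
# baked into W.
B = rng.standard_normal((r, k))      # stand-in for a trained B
assert np.linalg.matrix_rank(A @ B) <= r
```

The rank bound is the whole trade-off: the patch is cheap precisely because it can only move the weights within an r-dimensional subspace per layer.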


> (because LoRA)

That's one of many techniques, and is popular because it's cheap to implement. The training of a full model can be continued with full updates, the same as the original training run.

> completely dependent on how the previous training was done, and your ability to overwrite the model's behavior is limited.

Not necessarily. You can even alter the architecture! There have been many papers on approaches such as extending context window sizes, adding skip connections, quantization, sparsity, and so on.
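Quantization is the simplest of those to sketch. A hedged illustration (toy shapes, symmetric int8 scheme, not any particular library's implementation): the blob's representation changes without any retraining.

```python
import numpy as np

# Symmetric int8 post-training quantization of one weight matrix.
# Toy example: shapes and values are illustrative only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float32)

scale = np.abs(W).max() / 127.0            # map the largest weight to +/-127
W_q = np.round(W / scale).astype(np.int8)  # the quantized blob (4x smaller)
W_dq = W_q.astype(np.float32) * scale      # dequantized approximation

# Rounding error is bounded by half a quantization step:
assert np.abs(W - W_dq).max() <= scale / 2 + 1e-6
```

Note this rather supports the parent's point too: you can transform the blob freely, but every such transformation treats the weights as given; none of them tells you what the weights encode.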

> specially crafted group of rare tokens

The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvalds's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...

That's not what open source is. The code is open, not the process that it took to get there.

Linux development may have used information from copyrighted textbooks. The source code doesn't contain the text of those textbooks, and in some sense could not be "reproduced" without the copyrighted text.

Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.


> Not necessarily. You can even alter the architecture!

You can alter the architecture, but you're still playing with an opaque binary blob *you don't know what it's made of*.

> The analogy here is that some Linux kernel developer could have left a back door in the Linux kernel source. You're arguing that Linux would only be open source if you could personally go back to the time when it was an empty folder on Linus Torvald's computer and then reproduce every step it took to get to today's tarball of the source, including every Google search done, every book referenced, every email read, etc...

No, it is just a bad analogy. To be sure that there's no backdoor in the Linux kernel, the code itself suffices. That doesn't mean there can be no backdoor, since the kernel is complex enough to hide things in, but that's not the same as a backdoor hidden in a binary blob you couldn't inspect even if you had a trillion dollars to spend on a million developers.

> The code is open, not the process that it took to get there.

The code is by definition part of a process that gets you a piece of software (the actually useful binary), and it's the part of the process that contains most of the value. Model weights are binary, and they are akin to the compiled binary of the software (training from data being a compute-intensive step like compilation from source code, but orders of magnitude more intensive).

> Similarly, AIs are often trained on copyrighted textbooks but the end result is open source.

Court decisions are pending on the mere legality of such training, and that has nothing to do with being open source; what's at stake is whether these models can even be open-weight, or whether publishing them is copyright infringement.




