But 80 Gbit/s (10 GB/s) is way slower than even regular dual-channel RAM, or am I missing something here? That would make the LLM excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
If I'm not mistaken, producing each token requires reading roughly the whole model from memory (MoE models being the exception). That's why memory bandwidth is so important in the first place, or not?
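A back-of-envelope sketch of that claim: if decode is bandwidth-bound and every token reads all the weights once, then tokens/sec is roughly bandwidth divided by model size. The model size and DDR5 figure below are illustrative assumptions, not measurements:

```python
def tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

model_gb = 14.0      # e.g. a 7B-parameter model in fp16 (assumed)
link_gb_s = 80 / 8   # 80 Gbit/s link = 10 GB/s
ddr5_gb_s = 64.0     # rough dual-channel DDR5 bandwidth (assumed)

print(f"over the 80 Gbit/s link: {tokens_per_sec(link_gb_s, model_gb):.1f} tok/s")
print(f"from local RAM:          {tokens_per_sec(ddr5_gb_s, model_gb):.1f} tok/s")
```

Under those assumptions the link caps you below one token per second, while local RAM manages a few, which is the gap the comment is pointing at.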
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However, I think you still have to transfer the activations and the KV cache while performing inference.
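For scale, the activation traffic is tiny compared to the weights. A sketch for a pipeline-style split (layers divided between nodes), assuming each node keeps the KV cache for its own layers so only one activation vector crosses the wire per hop; the hidden size is an assumed 7B-class figure:

```python
hidden = 4096        # hidden dimension of a 7B-class model (assumed)
bytes_per_val = 2    # fp16 activations
hops = 1             # node-to-node boundaries per token with 2 nodes

# One activation vector crosses the network at each layer-split boundary.
per_token_bytes = hidden * bytes_per_val * hops
print(f"{per_token_bytes / 1024:.0f} KiB per token")
```

That's on the order of kilobytes per token versus gigabytes of weights, so activation transfer is cheap even over a slow link; the latency per hop matters more than the bandwidth.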