But 80 Gbit/s (10 GB/s) is way slower than even regular dual-channel RAM, or am I missing something here? That would make the LLM excruciatingly slow. You could get an old EPYC for a fraction of that price and have more performance.
If I'm not mistaken, producing each token requires reading roughly the whole model from memory (MoE models being the exception). That's why memory bandwidth is so important in the first place, or not?
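A back-of-envelope sketch of that claim: if decode is bandwidth-bound and every token reads all the weights once, then tokens/sec is roughly bandwidth divided by model size. The model size and DDR5 figure below are illustrative assumptions, not measurements:

```python
def tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

model_gb = 14.0      # e.g. a 7B-parameter model in fp16 (assumed)
link_gb_s = 80 / 8   # 80 Gbit/s link = 10 GB/s
ddr5_gb_s = 64.0     # rough dual-channel DDR5 bandwidth (assumed)

print(f"over the 80 Gbit/s link: {tokens_per_sec(link_gb_s, model_gb):.1f} tok/s")
print(f"from local RAM:          {tokens_per_sec(ddr5_gb_s, model_gb):.1f} tok/s")
```

Under those assumptions the link caps you below one token per second, while local RAM manages a few, which is the gap the comment is pointing at.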
My understanding is that if you can store 1/Nth of the weights in RAM on each of the N nodes then there's no need to send the weights over the network.
You're correct about the weights: each machine could in fact store all of the weights. However, I think you still have to transfer the activations and the KV cache while performing inference.
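For scale, the activation traffic is tiny compared to the weights. A sketch for a pipeline-style split (layers divided between nodes), assuming each node keeps the KV cache for its own layers so only one activation vector crosses the wire per hop; the hidden size is an assumed 7B-class figure:

```python
hidden = 4096        # hidden dimension of a 7B-class model (assumed)
bytes_per_val = 2    # fp16 activations
hops = 1             # node-to-node boundaries per token with 2 nodes

# One activation vector crosses the network at each layer-split boundary.
per_token_bytes = hidden * bytes_per_val * hops
print(f"{per_token_bytes / 1024:.0f} KiB per token")
```

That's on the order of kilobytes per token versus gigabytes of weights, so activation transfer is cheap even over a slow link; the latency per hop matters more than the bandwidth.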