This is a bit of a misnomer. Each expert is a sub-network that specializes in some slice of the data, along dimensions we can't really characterize.
During training, the routing network is penalized (typically via an auxiliary load-balancing loss) if it does not distribute training tokens evenly across the experts. This prevents any one or two experts from becoming the primary networks.
The result is that each token has essentially even probability of being routed to any of the sub-networks, with the underlying logic of why a given expert handles a given token being beyond our understanding or description.
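The "punished for uneven routing" idea can be sketched in a few lines. This is a toy version in the style of a standard auxiliary load-balancing loss (the exact formula varies between models, so treat the details as illustrative):

```python
import numpy as np

def load_balance_loss(router_logits, num_experts):
    """Toy load-balancing loss: rewards even routing across experts."""
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    assigned = probs.argmax(axis=-1)
    f = np.bincount(assigned, minlength=num_experts) / len(assigned)
    # P_i: mean router probability mass on expert i.
    P = probs.mean(axis=0)
    # ~1.0 at perfect balance, grows toward num_experts as routing collapses.
    return num_experts * float(np.sum(f * P))
```

A router that dumps everything on one expert scores close to `num_experts`, while a balanced one scores close to 1, so adding this term to the training loss is what "punishes" the router for playing favorites.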
I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)
Edit: Apparently each expert can sit on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.
I imagine that may reduce quality somewhat, though? Forcing the router to distribute problems evenly across all experts works against reality, where you'd expect task types to follow a Pareto distribution.
Computational costs, yes. Processing the prompt still takes the same amount of time, but each token created through inference costs less computationally than if you were running it through _all_ the experts.
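As a back-of-the-envelope illustration of that saving, suppose (purely hypothetically) that 75% of a model's parameters sit in the expert FFNs and the rest (attention, embeddings) always runs:

```python
def active_fraction(num_experts, top_k, expert_frac=0.75):
    """Rough fraction of parameters touched per generated token.

    expert_frac: share of parameters living in the expert FFNs
    (an illustrative split, not taken from any real model).
    """
    return (1.0 - expert_frac) + expert_frac * top_k / num_experts

print(active_fraction(8, 2))  # 0.4375: ~44% of the weights per token
```

So with 8 experts and 2 active, each token touches well under half the weights, even though all of them still have to be held in memory.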
It should increase quality, since the experts can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.
We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".
It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity: you have this huge x-parameter model that performs like an x-parameter model, but you don't have to use all those parameters at once every time, so you save a lot on compute, both in training and inference.
Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of a bigger model (more storage) with the advantages of a smaller model at inference time (faster, less compute per token). When you do inference, tokens hit a small routing layer that load-balances across the experts and then activate only 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.
Maybe a real expert can confirm if this is correct :)
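That intuition (minus the real parameter counts) can be sketched as a toy top-2 forward pass. All the sizes below are tiny and purely illustrative, nothing like 8x22B:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts, top_k = 16, 8, 2   # toy sizes, for illustration only

# One tiny ReLU FFN per expert, plus a linear router.
experts = [(rng.standard_normal((d, 4 * d)) / d**0.5,
            rng.standard_normal((4 * d, d)) / (4 * d)**0.5)
           for _ in range(num_experts)]
W_router = rng.standard_normal((d, num_experts)) / d**0.5

def moe_forward(x):
    """Run one token through only its top-2 experts."""
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]      # indices of the best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # renormalized gate weights
    out = np.zeros(d)
    for g, i in zip(gates, top):
        w1, w2 = experts[i]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)  # weighted expert output
    return out

y = moe_forward(rng.standard_normal(d))
```

All 8 experts' weights are stored (the RAM cost the thread mentions), but only 2 FFN multiplications run per token (the compute saving).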
Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?