This is a bit of a misnomer. Each expert is a sub-network that specializes in some slice of the data, along dimensions we can't really characterize.
During training, the routing network is penalized (typically via an auxiliary load-balancing loss) if it does not distribute training tokens evenly across the experts. This prevents any one or two experts from becoming the primary networks.
The result is that each token has essentially even probability of being routed to any of the sub-networks, with the underlying logic of why a given expert handles a given token being beyond our understanding or description.
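The "punished for uneven routing" idea can be sketched in a few lines. This is a toy version in the style of a standard auxiliary load-balancing loss (the exact formula varies between models, so treat the details as illustrative):

```python
import numpy as np

def load_balance_loss(router_logits, num_experts):
    """Toy load-balancing loss: rewards even routing across experts."""
    # Softmax over experts for each token.
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # f_i: fraction of tokens whose top-1 choice is expert i.
    assigned = probs.argmax(axis=-1)
    f = np.bincount(assigned, minlength=num_experts) / len(assigned)
    # P_i: mean router probability mass on expert i.
    P = probs.mean(axis=0)
    # ~1.0 at perfect balance, grows toward num_experts as routing collapses.
    return num_experts * float(np.sum(f * P))
```

A router that dumps everything on one expert scores close to `num_experts`, while a balanced one scores close to 1, so adding this term to the training loss is what "punishes" the router for playing favorites.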
I heard MoE reduces inference costs. Is that true? Don't all the sub networks need to be kept in RAM the whole time? Or is the idea that it only needs to run compute on a small part of the total network, so it runs faster? (So you complete more requests per minute on same hardware.)
Edit: Apparently each expert can sit on a separate device. Fascinating! That would also explain why the routing network is trained to choose equally between experts.
I imagine that may reduce quality somewhat, though? Forcing the router to distribute problems evenly across all experts works against reality, where you'd expect task types to follow a Pareto distribution.
Computational costs, yes. Processing the prompt still takes the same amount of time, but each token created through inference costs less computationally than if you were running it through _all_ the experts.
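As a back-of-the-envelope illustration of that saving, suppose (purely hypothetically) that 75% of a model's parameters sit in the expert FFNs and the rest (attention, embeddings) always runs:

```python
def active_fraction(num_experts, top_k, expert_frac=0.75):
    """Rough fraction of parameters touched per generated token.

    expert_frac: share of parameters living in the expert FFNs
    (an illustrative split, not taken from any real model).
    """
    return (1.0 - expert_frac) + expert_frac * top_k / num_experts

print(active_fraction(8, 2))  # 0.4375: ~44% of the weights per token
```

So with 8 experts and 2 active, each token touches well under half the weights, even though all of them still have to be held in memory.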
It should increase quality, since the experts can specialize on subsets of the training data. This means that getting better in one domain won't make the model worse in all the others anymore.
We can't really tell what the router does. There have been experiments where the router in the early blocks was compromised, and quality only suffered moderately. In later layers, as the embeddings pick up more semantic information, it matters more and might approach our naive understanding of the term "expert".
It doesn't perform better, and until recently MoE models actually underperformed their dense counterparts. The real gain is sparsity: you have this huge x-parameter model that performs like an x-parameter model, but you don't have to use all those parameters at once every time, so you save a lot on compute, both in training and inference.
Here's my naive intuition: in general, bigger models can store more knowledge but take longer to do inference. MoE provides a way to blend the advantages of a bigger model (more storage) with the advantages of a smaller model at inference time (faster, less compute per token). When you do inference, tokens hit a small routing layer that load-balances across the experts and then activate only 1 or 2 of them. So you're storing roughly 8 x 22B "worth" of knowledge without having to run a model that big.
Maybe a real expert can confirm if this is correct :)
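That intuition (minus the real parameter counts) can be sketched as a toy top-2 forward pass. All the sizes below are tiny and purely illustrative, nothing like 8x22B:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts, top_k = 16, 8, 2   # toy sizes, for illustration only

# One tiny ReLU FFN per expert, plus a linear router.
experts = [(rng.standard_normal((d, 4 * d)) / d**0.5,
            rng.standard_normal((4 * d, d)) / (4 * d)**0.5)
           for _ in range(num_experts)]
W_router = rng.standard_normal((d, num_experts)) / d**0.5

def moe_forward(x):
    """Run one token through only its top-2 experts."""
    logits = x @ W_router
    top = np.argsort(logits)[-top_k:]      # indices of the best-scoring experts
    gates = np.exp(logits[top])
    gates /= gates.sum()                   # renormalized gate weights
    out = np.zeros(d)
    for g, i in zip(gates, top):
        w1, w2 = experts[i]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)  # weighted expert output
    return out

y = moe_forward(rng.standard_normal(d))
```

All 8 experts' weights are stored (the RAM cost the thread mentions), but only 2 FFN multiplications run per token (the compute saving).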
Would it be analogous to say instead of having a single Von Neumann who is a polymath, we’re posing the question to a pool of people who are good at their own thing, and one of them gets picked to answer?