A 19" server chassis is wide enough for 8 vertically mounted GPUs next to each other, with just enough space left for the power supplies. Consequently 8 GPUs is a common and cost efficient configuration in servers.
Everyone seems to put each expert on its own GPU for both training and inference, so that's how you get to 8 experts, or 7 if you want to give the router its own GPU too.
You could also do multiples of 8. But from my limited understanding, more experts don't seem to perform better. The main advantage of MoE is the ability to split the model into parts that don't talk to each other, and run those parts on different GPUs or even different machines.
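To make that placement concrete, here's a minimal sketch, assuming 8 single-GPU experts and a simple top-1 router. The names (`SimpleMoE`, `n_experts`, the dimensions) are illustrative, and I'm leaving out the gating weights and load balancing a real MoE layer would have; the point is just that the experts share no weights, and only the routed tokens ever cross devices.

```python
import torch
import torch.nn as nn

class SimpleMoE(nn.Module):
    """Illustrative sketch: one expert per GPU, top-1 routing.

    Assumes 8 GPUs are available; not a real MoE implementation
    (no gate scaling, no load balancing, per-expert Python loop).
    """

    def __init__(self, d_model=512, d_ff=2048, n_experts=8):
        super().__init__()
        # Router lives on GPU 0 (or its own GPU, leaving 7 for experts).
        self.router = nn.Linear(d_model, n_experts).to("cuda:0")
        # One feed-forward expert pinned to each GPU; they never
        # communicate with each other.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_ff),
                nn.ReLU(),
                nn.Linear(d_ff, d_model),
            ).to(f"cuda:{i}")
            for i in range(n_experts)
        ])

    def forward(self, x):  # x: (tokens, d_model) on cuda:0
        choice = self.router(x).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                # Only the selected tokens move across devices; the
                # expert's weights stay put on their own GPU.
                tokens = x[mask].to(f"cuda:{i}")
                out[mask] = expert(tokens).to("cuda:0")
        return out
```

Because only token activations cross device boundaries (and each token visits a single expert), the same layout scales from 8 GPUs in one chassis to experts spread across machines.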