Technology
Mixture of Experts (MoE)
MoE scales model capacity by activating only a fraction of total parameters per token via specialized sub-networks.
Mixture of Experts (MoE) replaces dense feed-forward layers with sparse expert layers, using a learned router to send each input token to the most relevant sub-networks. This architecture lets a model like Mixtral 8x7B run at roughly the inference cost of a ~13B-parameter dense model while drawing on about 47B total parameters. By decoupling total parameter count from per-token compute, MoE enables massive scaling (Google's Switch Transformer reached 1.6 trillion parameters) without a linear increase in FLOPs. It is also widely reported to be a key mechanism behind GPT-4's efficiency, balancing high-tier reasoning with manageable latency.
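The routing idea above can be sketched in a few lines. This is a minimal, illustrative top-k gating layer in NumPy, not any production implementation; the dimensions, expert count, and linear experts are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not Mixtral's real dimensions).
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" here is a single linear map; the router is another linear map.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ router                 # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]       # indices of the top-k experts
    weights = softmax(logits[top])          # gate weights renormalized over top-k
    # Only top_k of n_experts run per token: compute scales with k, not n.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)  # (8,)
```

The key point is in the last line of `moe_layer`: the sum iterates over only `top_k` experts, which is why total parameters (all experts) can grow without a matching growth in per-token FLOPs.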