Technology
Mixture of Experts (MoE)
MoE scales model capacity by activating only a fraction of total parameters per token via specialized sub-networks.
Mixture of Experts (MoE) replaces dense feed-forward layers with sparse expert layers, using a learned router to send each input token to the most relevant sub-networks. This architecture lets a model like Mixtral 8x7B run at roughly the inference cost of a ~13B-parameter dense model while drawing on about 47B total parameters. By decoupling total parameter count from per-token compute, MoE enables massive scaling (Google's Switch Transformer reached 1.6 trillion parameters) without a linear increase in FLOPs. It is also widely reported to be a key mechanism behind GPT-4's efficiency, balancing high-tier reasoning with manageable latency.
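The routing idea above can be sketched in a few lines. This is a minimal, illustrative top-k gating layer in NumPy, not any production implementation; the dimensions, expert count, and linear experts are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not Mixtral's real dimensions).
d_model, n_experts, top_k = 8, 4, 2

# Each "expert" here is a single linear map; the router is another linear map.
experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_layer(token):
    """Route one token to its top-k experts and mix their outputs."""
    logits = token @ router                 # router scores, shape (n_experts,)
    top = np.argsort(logits)[-top_k:]       # indices of the top-k experts
    weights = softmax(logits[top])          # gate weights renormalized over top-k
    # Only top_k of n_experts run per token: compute scales with k, not n.
    return sum(w * (token @ experts[i]) for w, i in zip(weights, top))

out = moe_layer(rng.standard_normal(d_model))
print(out.shape)  # (8,)
```

The key point is in the last line of `moe_layer`: the sum iterates over only `top_k` experts, which is why total parameters (all experts) can grow without a matching growth in per-token FLOPs.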