Mixture of Experts (MoE)

MoE scales model capacity by activating only a fraction of total parameters per token via specialized sub-networks.

Mixture of Experts (MoE) replaces dense feed-forward layers with sparse expert layers, using a learned router to direct each input token to the most relevant sub-networks. This architecture lets a model like Mixtral 8x7B run inference at roughly the cost of a ~13B-parameter dense model while drawing on the capacity of ~47B total parameters. By decoupling total parameter count from per-token compute, MoE enables massive scaling (Google's Switch Transformer reaches 1.6 trillion parameters) without a linear increase in FLOPs, and it is widely reported to be the mechanism behind GPT-4's balance of high-tier reasoning and manageable latency.

https://huggingface.co/blog/moe
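To make the routing idea concrete, here is a minimal, self-contained sketch of a top-k MoE layer in NumPy. All names and sizes (`W_router`, `moe_layer`, the toy dimensions) are illustrative assumptions, not any production implementation: real experts are full FFN blocks and real routers add load-balancing losses, but the core mechanism — score experts, keep the top k, run only those — is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_experts, top_k = 8, 4, 2  # toy sizes for illustration

# Router: a linear layer producing one logit per expert.
W_router = rng.normal(size=(d_model, n_experts))

# Each "expert" here is just a linear map; real experts are FFN blocks.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens):
    """Route each token to its top-k experts and mix their outputs."""
    logits = tokens @ W_router                    # (n_tokens, n_experts)
    probs = softmax(logits)
    top = np.argsort(-probs, axis=-1)[:, :top_k]  # top-k expert indices per token
    out = np.zeros_like(tokens)
    for t, token in enumerate(tokens):
        chosen = top[t]
        weights = probs[t, chosen] / probs[t, chosen].sum()  # renormalize over k
        for w, e in zip(weights, chosen):
            out[t] += w * (token @ experts[e])    # only k of n_experts run
    return out, top

tokens = rng.normal(size=(3, d_model))
out, routing = moe_layer(tokens)
print(out.shape, routing.shape)  # (3, 8) (3, 2)
```

The key point of the sketch: each token touches only `top_k` of the `n_experts` weight matrices, so per-token FLOPs scale with k while total parameters scale with n — exactly the decoupling described above.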