Muon optimizer
Muon is a matrix-aware neural network optimizer that leverages Newton–Schulz orthogonalization to deliver state-of-the-art training efficiency for large language models (LLMs), outperforming AdamW.
Muon is a high-efficiency alternative to AdamW, engineered specifically for the 2D weight matrices in neural network hidden layers. Its core mechanism applies Newton–Schulz orthogonalization to the momentum-accumulated gradient, conditioning the update direction to stabilize training and improve convergence. This matrix-aware approach is crucial for scaling, enabling automatic learning-rate transfer (muP-style scaling) across model widths. Muon has been validated by setting speed records for both NanoGPT and CIFAR-10 training, demonstrating superior sample efficiency over standard optimizers for large-scale AI workloads. Because orthogonalization applies only to matrices, Muon is typically used in conjunction with AdamW for non-2D parameters (embeddings, biases).
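The mechanism above can be sketched in a few lines of NumPy. This is a simplified illustration, not the reference implementation: it uses the classic cubic Newton–Schulz iteration (the public Muon code uses a tuned quintic variant), and the `muon_step` helper is a hypothetical name that omits the shape-dependent scale factor the real optimizer applies.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    """Approximate the nearest (semi-)orthogonal matrix to G.

    Uses the classic cubic Newton-Schulz iteration; Muon's reference
    implementation uses a tuned quintic polynomial instead.
    """
    # Normalize so every singular value lies in (0, 1]: the Frobenius
    # norm upper-bounds the spectral norm, which guarantees convergence.
    X = G / (np.linalg.norm(G) + 1e-7)
    for _ in range(steps):
        # Each step pushes every singular value toward 1 while leaving
        # the singular vectors unchanged.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

def muon_step(W, grad, buf, lr=0.02, momentum=0.95):
    """One simplified Muon-style update for a 2D weight matrix W.

    Accumulates momentum, orthogonalizes the buffer, and steps against
    the orthogonalized direction. (Illustrative sketch only.)
    """
    buf = momentum * buf + grad
    update = newton_schulz_orthogonalize(buf)
    return W - lr * update, buf
```

Orthogonalizing the update equalizes its singular values, so no single direction in the weight matrix dominates the step, which is one intuition for why the method conditions training so well.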