
Muon optimizer

Muon is a matrix-aware neural network optimizer that uses Newton–Schulz orthogonalization to achieve state-of-the-art training efficiency for large language models (LLMs), outperforming AdamW in sample efficiency on its reported benchmarks.

Muon is a high-efficiency alternative to AdamW, engineered specifically for the 2D weight matrices of neural network hidden layers. Its core mechanism applies Newton–Schulz orthogonalization to the momentum buffer, replacing each update with an approximately orthogonal matrix of the same shape; this conditions the update direction, stabilizing training and improving convergence. The matrix-aware design also helps with scaling, enabling learning-rate transfer across model widths in the style of muP. Muon has been validated by setting training speed records for both NanoGPT and CIFAR-10, demonstrating better sample efficiency than standard optimizers on large-scale AI workloads. Note: it is typically used in conjunction with AdamW for non-2D parameters (embeddings, biases).
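The update described above can be sketched in a few lines of NumPy. The quintic coefficients below are the tuned values reported in the Muon write-up; the `muon_update` wrapper and its hyperparameter defaults are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G (push its singular values toward 1)
    via the quintic Newton-Schulz iteration used by Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the Muon post
    # Frobenius normalization bounds the spectral norm by 1, so the
    # iteration is applied to a matrix whose singular values lie in (0, 1].
    X = G / (np.linalg.norm(G) + 1e-7)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the short-and-wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_update(W, grad, momentum, lr=0.02, beta=0.95):
    """Hypothetical single-step Muon update for one 2D weight matrix:
    accumulate momentum, then step along its orthogonalized direction."""
    momentum = beta * momentum + grad
    W = W - lr * newton_schulz_orthogonalize(momentum)
    return W, momentum
```

Note that the iteration does not converge singular values to exactly 1: the coefficients are tuned to spread them quickly into a band around 1 (roughly 0.7 to 1.3 after five steps), trading precision for wall-clock speed, which is reported to be sufficient in practice.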

https://kellerjordan.github.io/posts/muon/
