Technology

GRPO

Group Relative Policy Optimization (GRPO) is a reinforcement learning algorithm from DeepSeek that removes PPO's value model to cut memory use while improving LLM reasoning.

GRPO (Group Relative Policy Optimization) is a reinforcement learning (RL) algorithm introduced by the DeepSeek team in DeepSeekMath and later used to train DeepSeek-R1. It addresses the main resource bottleneck of Proximal Policy Optimization (PPO): the separate 'critic' (value function) model, which is typically about as large as the policy itself. Instead of training a critic, GRPO samples a 'group' of responses for each prompt (e.g., 64 samples), scores each one, and uses the group's mean reward as the baseline for advantage estimation, so a response is reinforced only to the extent that it beats its siblings. A KL divergence penalty against a reference model keeps the policy from drifting too far and stabilizes training. Dropping the critic reduces memory consumption and computational cost, which makes RL fine-tuning for complex reasoning tasks (such as the competition-level MATH benchmark) more practical for large language models.
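The group baseline and the KL term described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DeepSeek's implementation: the function names `group_relative_advantages` and `kl_penalty` are hypothetical, rewards are assumed to be one scalar per sampled response, and the optional division by the group standard deviation follows the outcome-reward normalization described in the DeepSeekMath paper.

```python
import math
from statistics import fmean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each response relative to its own group (hypothetical helper).

    rewards: one scalar reward per response sampled for the same prompt.
    The group mean replaces the learned value baseline that PPO would use.
    """
    baseline = fmean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    # Assumption: scale by the group's standard deviation, as in the
    # outcome-reward variant; eps guards against a group with identical rewards.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]


def kl_penalty(logp_policy, logp_ref):
    """Per-token KL estimate between the policy and a frozen reference model.

    Uses the non-negative estimator ratio - log(ratio) - 1,
    where ratio = pi_ref / pi_policy for the sampled token.
    """
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


# Four responses to one prompt, scored 1.0 (correct) or 0.0 (incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards, normalize=False))
print(kl_penalty(logp_policy=-1.0, logp_ref=-2.0))
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, without any value network in memory. In a full training loop these advantages would weight a PPO-style clipped objective, with the KL term subtracted using a tunable coefficient.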

https://arxiv.org/abs/2402.03300

1 project · 1 city

Related technologies

Nanotron 1

Recent Talks & Demos

Small reasoning model

Rome Mar 3

Nanotron GRPO