
Group Relative Policy Optimization (GRPO) is a reinforcement learning post-training algorithm for language models that evaluates groups of generated outputs relative to each other, rather than relying on a separate learned value (critic) model, to update the model's policy more efficiently.
First proposed in the DeepSeekMath paper (2024) and adopted by the Hugging Face TRL library in early 2025, GRPO has become a key alternative to PPO (Proximal Policy Optimization) for aligning language models, offering comparable performance at significantly lower computational cost.
Why It Matters
Traditional RLHF pipelines using PPO require training and maintaining a separate value (critic) model alongside the reward signal, which adds substantial memory overhead and training complexity. GRPO eliminates the critic by comparing outputs within a group against each other, using each output's quality relative to the group as the optimization signal. This makes reinforcement learning from human feedback accessible to teams with limited GPU resources and simplifies the training pipeline.
How It Works
Given a prompt, GRPO generates a group of candidate responses from the current policy. Each response is scored (by a rule-based verifier, human preference labels, or a lightweight scoring function). Instead of estimating an absolute value baseline, GRPO normalizes scores within the group, computing each response's advantage relative to the group mean. The policy is then updated to increase the probability of above-average responses and decrease the probability of below-average ones, using a clipped objective similar to PPO for stability. This group-relative signal is cheaper to compute than a learned value estimate and is empirically stable in practice.
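The two core computations above can be sketched in a few lines of plain Python. This is a minimal illustration, not the TRL implementation: `group_advantages` normalizes rewards within one group, and `clipped_objective` shows the PPO-style clipped surrogate applied per token or per sequence (the probability ratio and advantage values here are placeholders).

```python
import math

def group_advantages(rewards):
    """Group-relative advantages: (reward - group mean) / group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def clipped_objective(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate: min(ratio*A, clip(ratio, 1-eps, 1+eps)*A)."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)

# Two correct and two incorrect responses in a group of four:
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)  # above-average responses get positive advantage
```

Note that the advantages within a group always sum to zero, so the update pushes probability mass from below-average responses toward above-average ones rather than uniformly reinforcing everything.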
Example
A team fine-tuning an open-weight model for math reasoning generates 8 candidate solutions per problem. A rule-based verifier checks which solutions arrive at the correct answer. GRPO computes relative advantages within each group of 8 and updates the model to favor solution strategies that consistently produce correct answers—no separate reward model needed.
Related Concepts
- RLHF (Reinforcement Learning from Human Feedback)
- Fine-Tuning
- Large Language Model (LLM)