
What is Reinforcement Learning?
Reinforcement learning (RL) is a machine learning paradigm in which an agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones. The agent's goal is to maximize cumulative reward over time.
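"Cumulative reward" can be made concrete with a small calculation. The sketch below sums the rewards of one episode, optionally shrinking later rewards with a discount factor. The discount factor `gamma` and the function name `discounted_return` are assumptions added for illustration; the text above only says that rewards are accumulated over time.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum rewards, weighting the reward at step t by gamma**t.

    `gamma` is an illustrative discount factor; gamma=1.0 gives the
    plain (undiscounted) cumulative reward.
    """
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


episode_rewards = [1.0, 0.0, -1.0, 2.0]
plain_sum = discounted_return(episode_rewards, gamma=1.0)   # 2.0
discounted = discounted_return(episode_rewards, gamma=0.5)  # 1.0
```

With `gamma` below 1, rewards that arrive sooner count for more than rewards that arrive later, which is one common way to express a preference for reaching good outcomes quickly.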
Why It Matters
RL is the paradigm behind some of AI's most impressive achievements: AlphaGo defeating world champions at Go, robotic arms learning to manipulate objects, and, critically for LLMs, RLHF (Reinforcement Learning from Human Feedback), which is used to align language models toward helpful and safe behavior. RL is also widely used to train autonomous agents that take multi-step actions.
How It Works
An RL system consists of:
- Agent — the learner that takes actions.
- Environment — the world the agent operates in.
- State — the current situation.
- Action — what the agent can do.
- Reward — a scalar signal indicating how good the action was.
- Policy — the agent's strategy for choosing actions given states.
The agent follows a loop: observe state → choose action → receive reward → update policy. Over thousands or millions of episodes, it learns which actions lead to the highest long-term reward.
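The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `TwoArmedEnv`, `AveragingAgent`, and `run` are hypothetical names, the environment is a made-up toy with a single state and two actions, and the "policy update" is a simple running average of rewards per action rather than any specific published algorithm.

```python
import random


class TwoArmedEnv:
    """Hypothetical toy environment with one state and two actions.

    Action 1 pays +1 with probability 0.8, action 0 with probability 0.2;
    otherwise the reward is -1.
    """

    def __init__(self, episode_length=20, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # the single state

    def step(self, action):
        win_prob = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < win_prob else -1.0
        self.t += 1
        done = self.t >= self.episode_length
        return 0, reward, done


class AveragingAgent:
    """Tracks the average reward of each action and mostly picks the best."""

    def __init__(self, n_actions=2, epsilon=0.1, seed=1):
        self.values = [0.0] * n_actions
        self.counts = [0] * n_actions
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def choose_action(self, state):
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.values))  # explore
        # Exploit: pick the action with the highest average so far.
        return max(range(len(self.values)), key=lambda a: self.values[a])

    def update(self, state, action, reward):
        self.counts[action] += 1
        # Incremental mean of the rewards seen for this action.
        self.values[action] += (reward - self.values[action]) / self.counts[action]


def run(env, agent, episodes=200):
    returns = []
    for _ in range(episodes):
        state = env.reset()                              # observe state
        done, total = False, 0.0
        while not done:
            action = agent.choose_action(state)          # choose action
            next_state, reward, done = env.step(action)  # receive reward
            agent.update(state, action, reward)          # update policy
            state, total = next_state, total + reward
        returns.append(total)
    return returns


env, agent = TwoArmedEnv(), AveragingAgent()
returns = run(env, agent)
```

After training, `agent.values` holds the agent's estimate of each action's average reward, and `returns` holds the cumulative reward of every episode. The occasional random action (`epsilon`) is there so the agent keeps trying both actions instead of locking onto whichever one happened to pay off first.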
Key algorithms include:
- Q-learning — learn the value of state-action pairs
- Policy gradient — directly optimize the policy
- PPO (Proximal Policy Optimization) — a policy-gradient algorithm widely used in RLHF for LLMs
- Actor-critic — combine value estimation with policy optimization
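Of the algorithms listed above, Q-learning is the easiest to show in full. The sketch below is a tabular version on a hypothetical five-cell corridor where the agent starts at the left end and is rewarded only for reaching the right end. The environment, the function name `q_learning_chain`, and the hyperparameter values (`alpha`, `gamma`, `epsilon`) are all assumptions chosen for illustration.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical corridor of n_states cells.

    Actions: 0 = left, 1 = right. Reaching the rightmost cell gives +1 and
    ends the episode; every other step gives 0.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]
    goal = n_states - 1

    for _ in range(episodes):
        state = 0
        while state != goal:
            # Epsilon-greedy: explore sometimes, and break ties at random.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1

            # Deterministic move; stepping left from cell 0 stays in place.
            next_state = max(0, state - 1) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0

            # Q-learning target: reward plus discounted best next value.
            best_next = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next
                                         - q[state][action])
            state = next_state
    return q


q = q_learning_chain()
# Greedy action in every non-goal cell (1 means "move right").
greedy_policy = [0 if left > right else 1 for left, right in q[:-1]]
```

Each entry `q[state][action]` is the learned value of a state-action pair, and the policy is read off by taking the highest-valued action in each state. Policy gradient, PPO, and actor-critic methods instead adjust the parameters of the policy itself and usually rely on a neural network library, so they are not sketched here.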
Example
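DeepMind's AlphaGo was first trained on games played by human experts and then improved by playing millions of games against itself, receiving a reward of +1 for winning and -1 for losing. Through this self-play RL, it discovered strategies that surprised even expert human players. Its successor, AlphaGo Zero, later learned from self-play alone.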