Alignment is how you get a pretrained model to do what you want. The word is vague. The implementation is not.
There are three dominant methods right now. They are not interchangeable.
Reinforcement Learning from Human Feedback trains a separate reward model on human preference data, then uses Proximal Policy Optimization to update the policy. During training you hold four models in memory simultaneously: policy, reference policy, reward model, and value network.
That is the source of its cost and its instability. PPO is sensitive to hyperparameters, prone to reward hacking, and requires careful KL penalty tuning to prevent the model from drifting too far from the reference. The upside is real: because the reward model is learned explicitly, PPO can optimize for subtle, multi-dimensional human preferences.
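The KL penalty mentioned above can be sketched in a few lines. This is a minimal illustration of one common formulation, not any particular library's implementation: each token is penalized by the log-probability gap between policy and reference, and the reward model's scalar score is added on the final token. The function name, the `beta` default, and the input layout are all assumptions made for the example.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    reward_score: float,
    beta: float = 0.1,  # hypothetical KL coefficient; tuned per run in practice
) -> List[float]:
    """Per-token rewards: a KL penalty everywhere, plus the reward model score at the end."""
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob sequences of equal length")
    # Penalize every token where the policy assigns more probability than the reference.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # The reward model scores the whole completion, so its score lands on the last token.
    rewards[-1] += reward_score
    return rewards
```

A larger `beta` keeps the policy closer to the reference at the cost of slower reward improvement; a smaller one gives the policy more room to drift and more room to reward-hack.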
ChatGPT and early Claude were trained with this approach.
Direct Preference Optimization, introduced by Rafailov et al. (2023), eliminates the reward model entirely. It expresses the reward implicitly through the policy's log-probability ratio against a frozen reference model, and casts alignment as a supervised classification problem over preference pairs (chosen and rejected completions).
Two models in memory. SFT-like training loop. High stability.
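For a single preference pair, the DPO objective from Rafailov et al. (2023) reduces to a logistic loss on the difference of log-probability ratios. The sketch below assumes the four inputs are summed sequence log-probabilities already computed by the policy and the frozen reference; the function names and the `beta` default are illustrative choices, not values taken from the paper.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical default; controls deviation from the reference
) -> float:
    """DPO loss for one (chosen, rejected) pair of completions."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same quantity as softplus(-margin).
    return _softplus(-margin)
```

When the policy equals the reference the margin is zero and the loss is log 2. It falls as the policy shifts probability toward the chosen completion relative to the rejected one. No sampling and no reward model appear anywhere in the loop, which is why training looks like SFT.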
Zephyr-7B, trained with DPO, outperformed Llama 2-Chat-70B, an RLHF model ten times its size, on MT-Bench. Llama 3 used DPO in combination with PPO. Qwen-Chat models used DPO directly.
The tradeoff: DPO is offline. It optimizes against a fixed dataset of preference pairs. If your preference data is stale or narrow, the model learns a narrow preference function. Because the policy is never scored on its own sampled completions during training, there is no mechanism to correct its behavior on out-of-distribution prompts.
Group Relative Policy Optimization, introduced in DeepSeekMath (Shao et al., 2024) and later used to train DeepSeek-R1, removes the value network from the RL loop. Instead of estimating per-token advantage with a learned critic, it samples a group of completions for each prompt and computes each completion's advantage relative to the group's reward statistics.
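Three models in memory instead of four: policy, reference policy, and reward model. Where the reward is a programmatic check rather than a learned model, the count drops to two. Compute cost roughly 50% lower than standard PPO. Accessible enough to run on mid-tier GPU servers.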
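The group-relative advantage that replaces the critic can be sketched as a plain standardization of rewards within one prompt's group. This is a minimal illustration under assumptions: it uses the population standard deviation and a small `eps` to avoid division by zero, both of which are choices made here rather than details confirmed by the source.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize the rewards of completions sampled for the same prompt."""
    if len(rewards) < 2:
        raise ValueError("a group needs at least two completions")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    # Each completion is scored against its siblings, not against a learned value estimate.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every completion in a group receives the same reward, all advantages are zero and that prompt contributes no gradient signal. This is one reason group size and prompt difficulty matter in practice.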
The key insight from DeepSeek-R1: for tasks with verifiable rewards (math, code, formal reasoning), GRPO produces emergent chain-of-thought behavior without any human-annotated reasoning traces. The model learns to reflect and verify because correct answers are rewarded and wrong answers are not.
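A verifiable reward can be as simple as a string check against a known answer. The sketch below is a hypothetical example, not the reward used by DeepSeek: it assumes completions end with a line of the form `Answer: <value>`, a marker invented here for illustration, and returns a binary score.

```python
import re


def exact_match_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last 'Answer:' line matches the expected answer, else 0.0."""
    # The 'Answer:' marker is a hypothetical output convention for this sketch.
    matches = re.findall(r"Answer:\s*(.+)", completion)
    if not matches:
        return 0.0  # no parseable final answer counts as wrong
    predicted = matches[-1].strip()
    return 1.0 if predicted == expected.strip() else 0.0
```

Nothing in this function looks at the reasoning itself. Only the final answer is scored, so any reflection or self-checking the model develops is a side effect of what helps it reach correct answers.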
GRPO does not work well where rewards are non-verifiable. Human aesthetic preferences do not have ground truth. For those cases, you still need preference data, which means DPO or PPO.
| | PPO (RLHF) | DPO | GRPO |
|---|---|---|---|
| Models in memory | 4 | 2 | 3 |
| Compute cost | High | Low | Medium |
| Training stability | Low | High | Medium |
| Data required | Prompts + reward model | Preference pairs | Prompts + reward function |
| Best for | Rich human preference | Offline alignment, limited compute | Reasoning, math, code |
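
To make the "Models in memory" row concrete, here is a rough weights-only estimate. It is a back-of-the-envelope sketch under strong assumptions: every model is the same size, weights are stored in 2 bytes per parameter, and optimizer state, gradients, and activations are ignored. The 7B parameter count is a hypothetical example; only the model counts come from the table.

```python
def weights_memory_gb(num_models: int, params_per_model: float, bytes_per_param: int = 2) -> float:
    """Weights-only memory in decimal gigabytes, assuming equally sized models."""
    return num_models * params_per_model * bytes_per_param / 1e9


# Model counts are taken from the comparison table; the 7e9 size is illustrative.
MODEL_COUNTS = {"PPO (RLHF)": 4, "DPO": 2, "GRPO": 3}

if __name__ == "__main__":
    for method, count in MODEL_COUNTS.items():
        print(f"{method}: {weights_memory_gb(count, 7e9):.0f} GB of weights")
```

Real footprints are higher, since every trainable model also carries gradients and optimizer state, but the ratio between methods is what drives the hardware decision.
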
The choice of alignment algorithm is a systems decision, not a safety decision. It encodes assumptions about your data, your hardware, and what you are optimizing for.
RLHF is expensive and unstable, but expressive. DPO is cheap and stable, but static. GRPO is powerful for verifiable domains and a poor fit where ground truth does not exist.
None of them guarantee alignment with human values. All of them are approximations of a preference function you defined — through annotations, preference pairs, or reward functions you wrote.
The model does not know what you want. It learns a proxy for what you measured. The engineering decision is: which proxy, at what cost, with what failure mode.
1. Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023. https://arxiv.org/abs/2305.18290
2. Shao, Z., et al. (2024). DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300. https://arxiv.org/abs/2402.03300
3. Xu, S., et al. (2024). Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study. ICML 2024. https://proceedings.mlr.press/v235/xu24h.html
4. Zadorozhny, K. (2026). A Guide to Reinforcement Learning Post-Training for LLMs: PPO, DPO, GRPO, and Beyond. Hugging Face Blog. https://huggingface.co/blog/karina-zadorozhny/guide-to-llm-post-training-algorithms
5. Misar AI. (2026). DPO vs RLHF (PPO): Which Is Better in 2026? Misar Blog. https://www.misar.blog/compare/dpo-vs-rlhf-alignment
6. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022. https://arxiv.org/abs/2203.02155