#llm-training
1 paper
-
inspiration
Alignment Is an Engineering Decision
RLHF, DPO, and GRPO are not synonyms for 'making the model safe.' They are distinct training algorithms with different memory costs, stability profiles, and data requirements. Picking the wrong one does not just waste compute — it produces a model optimized for the wrong objective.