DPO vs RLHF

For a couple of years, teaching a model to prefer good answers over bad ones meant running three models at once and hoping they stayed friends.

That was RLHF — reinforcement learning from human feedback — and it's the technique that made the first wave of genuinely helpful chat models. It worked. It was also a beast to operate, and in 2023 a paper called Direct Preference Optimization showed that for a large class of cases you could get the same result without the reinforcement learning at all. DPO didn't beat RLHF on a leaderboard so much as make it unnecessary for most teams. Here's what each one actually does, and why the simpler one won the default.

What RLHF is doing

Both methods start from the same raw material: preference pairs. A prompt, two responses, and a human (or a model standing in for one) saying this one is better. Collect a pile of those and you've encoded a taste you want the model to learn.
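As a minimal sketch of what one of those pairs looks like as data: the field names below follow the prompt/chosen/rejected convention common in preference datasets, and the example text is made up purely for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator passed over


# Illustrative content only, not taken from any real dataset.
pairs = [
    PreferencePair(
        prompt="Explain what a KL penalty does in one sentence.",
        chosen="It penalizes the policy for drifting too far from a reference model.",
        rejected="KL is a kind of loss.",
    ),
]

# Many training libraries accept plain dict records with these three keys.
records = [asdict(p) for p in pairs]
```

Everything downstream, RLHF or DPO, consumes some version of this record.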

RLHF turns that taste into a model in three stages:

1. Supervised fine-tuning (SFT). Get a decent instruction-following model first. This is the starting point everything else builds on.
2. Train a reward model. Take the preference pairs and train a separate model to output a scalar score, higher for the preferred response. This reward model is a learned stand-in for human judgment.
3. Optimize the policy with RL. Now run reinforcement learning, usually PPO, where the model generates responses, the reward model scores them, and the policy updates to earn higher scores. A KL penalty against the original SFT model keeps it from drifting into gibberish that games the reward.
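The second and third stages can be sketched with two small functions. This assumes the common pairwise (Bradley-Terry style) formulation for the reward model and a simple per-sample KL estimate; the function names and the kl_coef value are hypothetical, and real pipelines work on batches of tensors rather than single floats.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                     kl_coef: float = 0.1) -> float:
    """Reward the RL step sees: the score minus a penalty for leaving the reference."""
    # logp_policy - logp_ref is a single-sample estimate of the KL term.
    return reward - kl_coef * (logp_policy - logp_ref)


# The loss is small when the reward model ranks the pair correctly...
print(reward_model_loss(2.0, 0.0))
# ...and larger when it ranks the pair the wrong way round.
print(reward_model_loss(0.0, 2.0))
```

The first function is what the reward model minimizes over the preference pairs; the second is how the KL penalty typically enters the signal the policy is optimized against.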

At training time, that means three models live in memory: the policy being trained, the reward model scoring it, and a frozen reference for the KL term. On top of that comes PPO's own machinery: sampling new generations every step, value estimates (which usually means a value head or a separate critic network to train as well), and the careful clipping that keeps updates stable. It's a full RL training loop bolted onto a language model, and RL loops are famously finicky to tune.
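To make "the careful clipping" concrete, here is a minimal per-sample sketch of PPO's clipped surrogate objective. The function name and the default clip_eps of 0.2 are illustrative choices, and a real implementation averages this over tokens and batches and gets the advantage from those value estimates.

```python
import math


def ppo_clipped_objective(logp_new: float, logp_old: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """Per-sample PPO surrogate (to be maximized)."""
    # Probability ratio between the updated policy and the one that sampled the data.
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio outside the clip range.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Once the ratio leaves the range [1 - clip_eps, 1 + clip_eps] in the direction the advantage favors, the objective stops rewarding further movement, which is what keeps a single update from going too far.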

What DPO is doing

DPO's insight is mathematical and, once you see it, slightly annoying in how clean it is. The whole reward-model-plus-PPO apparatus is solving an optimization problem, and that problem has a closed-form relationship between the optimal policy and the reward. Work the algebra and the reward model cancels out — you can express the thing you're optimizing directly in terms of the policy you're training and a frozen reference. No separate reward model. No sampling loop. No reinforcement learning.
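For readers who want the algebra, here is a compact version of that argument. The symbols are notation introduced for this sketch: x is a prompt, y_w and y_l are the chosen and rejected responses, r is the reward, pi_ref is the frozen reference, beta is the KL coefficient, sigma is the logistic function, and Z(x) is a normalizing constant. It also assumes preferences follow a Bradley-Terry model.

```latex
% KL-regularized objective that the RLHF pipeline optimizes
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Its optimum has a closed form
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solve for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Under Bradley-Terry only the reward difference matters, so Z(x) cancels
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)
```

The last line contains only the policy and the reference, which is why no separate reward model is needed.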

What's left is a single loss function over your preference pairs. For each pair, DPO pushes the policy to raise the log-probability of the chosen response relative to the rejected one, with both measured against the reference model, and a coefficient beta controlling how tightly the policy is held to that reference. It's structured like a classification loss. You can train it with the same ordinary supervised machinery you'd use for any fine-tune: forward pass, backward pass, done.
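As a rough sketch, the per-pair loss fits in a few lines of plain Python. The inputs are assumed to be summed log-probabilities of each full response under the policy and under the frozen reference; the function and argument names are illustrative and not taken from any library.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more (or less) likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # softplus(-logits) equals -log(sigmoid(logits)); this form avoids overflow.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When the policy is identical to the reference, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen response's log-probability lowers the loss.
print(dpo_loss(-8.0, -12.0, -10.0, -12.0))
```

Nothing here samples from the model or calls a reward model; the only inputs are log-probabilities from two forward passes.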

[Figure: RLHF three-stage pipeline versus DPO single-stage pipeline. RLHF trains a reward model and runs PPO; DPO optimizes preferences directly in one stage.]

In TRL the difference shows up exactly where you'd expect — you hand a DPOTrainer your pairs and a beta, and it looks almost like any other trainer:

from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model=model,                 # your SFT model; pass peft_config=... to train LoRA adapters
    ref_model=None,              # None: TRL sets up the frozen reference from the starting model
    args=DPOConfig(beta=0.1, learning_rate=5e-6),
    train_dataset=pairs,         # columns: prompt, chosen, rejected
    processing_class=tokenizer,
)
trainer.train()

That's it. No reward model to train and babysit, no RL hyperparameters, two models in memory instead of three — and DPO composes with LoRA, so the whole thing fits on modest hardware.

So why does RLHF still exist?

Because "simpler and usually as good" is not "always as good," and the gap shows up at the frontier.

RLHF's reward model is reusable, and its training loop is online. The reward model can score responses the model has never produced, including ones it generates fresh during training, which means RLHF can keep exploring and improving against a signal long after a fixed dataset would have been exhausted. Standard DPO learns from a frozen set of pairs and never samples from the model during training, so it's bounded by the preferences you collected. For large-scale alignment where you want to squeeze out the last increments of quality, the separate reward model and the exploration that comes with online RL still carry their weight, which is why the biggest labs didn't simply throw RLHF away.

DPO also has its own failure modes. It can be sensitive to how the SFT checkpoint and the preference data relate; push beta wrong and it either barely moves or overfits to the quirks of your pairs. It is not a magic "alignment" button. It's a cheaper, far more reliable way to do the common case.
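One way to see why beta matters, as a back-of-the-envelope sketch rather than a tuning recipe: the loss for a pair is -log(sigmoid(beta * h)), where h is how much further the policy has raised the chosen response over the rejected one than the reference did, measured in log-probability. The helper below (a hypothetical name) computes how large h must be before the loss treats a pair as mostly satisfied.

```python
import math


def margin_needed(beta: float, target_prob: float) -> float:
    """Log-ratio margin h at which sigmoid(beta * h) equals target_prob."""
    return math.log(target_prob / (1.0 - target_prob)) / beta


for beta in (0.01, 0.1, 0.5):
    # Margin (in nats) the policy must open up to reach 90% implied preference.
    print(beta, round(margin_needed(beta, 0.9), 2))
```

With beta at 0.5 the margin is about 4.4 nats, so the loss saturates after a small move away from the reference; with beta at 0.1 it is about 22 nats, which leaves far more room to drift toward the quirks of the pairs.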

The practical read

If you're a team that wants your model to be more helpful, less sycophantic, better-formatted, or tuned to a house style — and you've got preference data or can generate it — start with DPO. It's the right default now for a reason: one training stage, ordinary supervised tooling, plays nicely with LoRA, and skips the part of RLHF that historically ate the most engineering time. Reach for the full reward-model-plus-PPO pipeline when you're operating at a scale where online exploration measurably buys you something, and you have the team to run an RL loop without it running you.

The arc here isn't "RLHF was wrong." RLHF proved preference learning works at all. DPO just noticed that, for most people, two of the three stages were doing work the math could do for free.
