DPO answered the common case. This post is about the three objectives you reach for when the common case doesn't fit — and the thing they all answer to is the shape of your data.
That's the lens I'd use to keep these straight, because the names don't help. PPO, GRPO, KTO. They sound like a family, and in TRL they're all a few lines apart, but they assume different things about what you've collected. Get the data-shape question right and the choice mostly makes itself.
PPO: the original, and the price of it
PPO is the reinforcement-learning algorithm RLHF was built on, and it predates language models entirely — it's a general policy-gradient method from 2017. In the alignment setting it works like this: the model generates a response, a reward model scores it, and PPO updates the policy to chase higher scores, clipping each update so the policy never lurches too far in one step. That clipping is the "proximal" part, and it's what makes PPO stable enough to use at all.
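If the clipping is easier to see in code, here is a minimal sketch of the per-token clipped objective in plain Python. The function name, the `clip_eps=0.2` default, and the scalar inputs are my assumptions for illustration; a real trainer works on batched tensors and usually adds a KL penalty against the reference model.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """Per-token PPO clipped objective (a quantity to maximize).

    logp_new / logp_old: log-probability of the sampled token under the
    current policy and under the policy that generated the sample.
    advantage: how much better than expected this token turned out.
    """
    # Probability ratio between the updated policy and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Keep the ratio inside [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # Taking the minimum removes any incentive to push the ratio
    # past the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, the objective stops growing once the ratio passes `1 + clip_eps`, so one good sample can't drag the policy arbitrarily far in a single step.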
The hidden cost is the value model (a "critic"). PPO doesn't just need a reward; it needs an estimate of how good a partially-generated response is expected to become, to figure out which tokens deserve credit. That critic is typically another network the size of the policy. So a PPO run holds the policy, the reward model, a reference model, and the value model — four models, the heaviest of the preference-tuning setups. It's powerful and well-understood, and it's why people kept looking for something lighter.
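To make the critic's job concrete, here is a rough sketch of how per-token value estimates become per-token credit. It uses generalized advantage estimation (GAE), which is a common choice in PPO implementations but is my assumption here, not something every setup is required to use. The function name and the `gamma` and `lam` defaults are illustrative.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages from rewards and critic values (GAE sketch).

    rewards[t]: reward received after generating token t
                (often zero everywhere except the final token).
    values[t]:  critic's estimate of the eventual return before token t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each token can see what happened after it.
    for t in reversed(range(len(rewards))):
        # The value after the last token is taken to be zero.
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # TD error: how much better this step went than the critic expected.
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```

A critic that predicted the outcome perfectly leaves every advantage at zero; the learning signal comes from the places where the response did better or worse than the critic expected.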
GRPO: drop the critic, sample a group instead
GRPO — Group Relative Policy Optimization — came out of the DeepSeekMath work in 2024 and got famous as the engine behind reasoning-model training in early 2025. Its move is to delete the value model and recover the missing information from the group.
Here's the idea. For each prompt, you sample not one response but a group — say eight or sixteen. You score them all. Then instead of asking "how good is this response in absolute terms," you ask "how good is it relative to its siblings for the same prompt." Normalize each response's reward against the group's mean and spread, and that relative score becomes the advantage signal. No critic needed — the group of samples plays the role the value model used to.
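The normalization step is small enough to sketch in a few lines. This is an illustration under my own assumptions: the function name, the `eps` guard against a zero spread, and the use of population standard deviation are choices made here, and real implementations may differ on those details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Score each response relative to its siblings for the same prompt.

    rewards: one scalar reward per sampled response in the group.
    Returns one advantage per response, in the same order.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every response scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

One consequence worth noticing: if every response in a group gets the same reward (all pass or all fail), every advantage is zero and that prompt contributes no learning signal for the step.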
Why this caught on: it's a natural fit for tasks with a clean, automatic reward. Math problems and code have checkable answers — the reward is just "did it pass" — so you can run GRPO without any human-labeled preferences at all, generating and grading on the fly. Dropping the value model also cuts memory and removes one more finicky thing to tune. The trade is that you're sampling a whole group per prompt every step, which costs generation time; you've moved the expense from a model in memory to compute in the loop.
KTO: when you don't have pairs at all
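DPO needs pairs where A beat B, and a classic PPO run usually trains its reward model on the same kind of comparison. GRPO sidesteps that only when a verifier can score outputs automatically. KTO throws out the pairing assumption for human feedback, and it's the one I'd flag for teams sitting on the wrong kind of feedback.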
Real product feedback rarely arrives as clean pairs. It arrives as thumbs-up and thumbs-down, one signal per response, no matching counterpart. Turning that into preference pairs means manufacturing comparisons that were never actually made. KTO — Kahneman-Tversky Optimization, named for the prospect-theory work it borrows from — learns directly from those unpaired binary labels. Each example is just a response tagged desirable or undesirable, and the loss is shaped by a model of how humans weigh gains against losses, pushing up the good and down the bad without ever needing the two to be matched.
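As a sketch of what "use it as-is" looks like, here is one way to turn logged thumbs into the unpaired rows TRL's KTO trainer reads (`prompt`, `completion`, `label`). The input keys `prompt`, `response`, and `thumb`, and the function name, are hypothetical; substitute whatever your logging schema actually uses.

```python
def to_kto_rows(feedback_log):
    """Convert logged thumbs-up/down events into unpaired KTO examples.

    feedback_log: iterable of dicts with (hypothetical) keys
                  'prompt', 'response', and 'thumb' ('up' or 'down').
    """
    rows = []
    for event in feedback_log:
        if event.get("thumb") not in ("up", "down"):
            continue  # no usable signal on this event, so skip it
        rows.append({
            "prompt": event["prompt"],
            "completion": event["response"],
            "label": event["thumb"] == "up",  # True = desirable
        })
    return rows
```

No response is ever matched against another one here. The resulting list can be wrapped with `datasets.Dataset.from_list(rows)` before being handed to the trainer.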
The practical unlock is the data. If your feedback is naturally binary and unpaired — and most logged human feedback is — KTO lets you use it as-is instead of contorting it into pairs. It also tends to be forgiving when your positive and negative examples are imbalanced, which is the normal state of real logs.
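On the imbalance point: TRL's `KTOConfig` exposes `desirable_weight` and `undesirable_weight`, and as I understand the TRL documentation, the suggestion is to upweight the minority class so that (desirable weight × positives) to (undesirable weight × negatives) lands somewhere between 1:1 and 4:3. The helper below is my own sketch of that arithmetic, not a TRL function; the name and the `target_ratio` parameter are assumptions.

```python
def balance_weights(labels, target_ratio=1.0):
    """Suggest (desirable_weight, undesirable_weight) for imbalanced labels.

    labels: iterable of booleans, True = desirable.
    target_ratio: desired value of
        (desirable_weight * n_pos) / (undesirable_weight * n_neg).
    """
    labels = list(labels)
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one desirable and one undesirable example")
    if n_pos >= n_neg:
        # Negatives are the minority: leave positives at 1.0, scale negatives.
        return 1.0, n_pos / (target_ratio * n_neg)
    # Positives are the minority: scale positives, leave negatives at 1.0.
    return target_ratio * n_neg / n_pos, 1.0
```

The two numbers would then go into the config, for example `KTOConfig(desirable_weight=dw, undesirable_weight=uw)`. Treat the result as a starting point to tune rather than a rule.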
They're all a few lines apart in TRL
The reason this isn't an intimidating menu is that TRL exposes each one as a trainer with the same skeleton. You swap the trainer and the dataset shape; the surrounding code barely moves. KTO with unpaired data:
from trl import KTOConfig, KTOTrainer

# model, tokenizer, and binary_feedback are assumed to be loaded earlier
# dataset columns: prompt, completion, label (True = desirable)
trainer = KTOTrainer(
    model=model,
    args=KTOConfig(beta=0.1, learning_rate=5e-6),
    train_dataset=binary_feedback,
    processing_class=tokenizer,
)
trainer.train()
GRPO instead takes a reward function — often just code that checks the answer — and samples a group per prompt:
from trl import GRPOConfig, GRPOTrainer

# passes_tests is a placeholder for your own verifier; it is not part of TRL
def reward_correct(completions, **kwargs):
    # one score per sampled completion: 1.0 if it passes, else 0.0
    return [1.0 if passes_tests(c) else 0.0 for c in completions]

trainer = GRPOTrainer(
    model=model,
    reward_funcs=reward_correct,
    args=GRPOConfig(num_generations=8, learning_rate=1e-6),
    train_dataset=prompts,
    processing_class=tokenizer,
)
trainer.train()
Same shape, different assumption. KTO wants labeled completions; GRPO wants prompts and a scorer; PPO wants a reward model and brings the most machinery.
Picking one without overthinking it
Walk the data-shape question, not the hype. If your reward is automatic and checkable (math, code, anything with a verifier), GRPO is the natural fit and you may not need human labels at all. If your feedback is binary thumbs and won't pair up cleanly, KTO uses it without you faking comparisons. If you specifically need online exploration against a learned reward at scale, and you have the team to run it, PPO is still the most flexible option and the most demanding. And if you have clean preference pairs and just want the result cheaply, that was the last post; DPO is probably still your answer.
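For what it's worth, that walk fits in a small function. This is only my shorthand for the paragraph above; the flag names and the ordering of the checks are assumptions, and real projects will have constraints it ignores.

```python
def pick_objective(has_verifier, has_binary_labels, has_pairs,
                   needs_online_exploration=False, has_rl_team=False):
    """Map the shape of your data to a starting objective (a rough heuristic)."""
    if has_verifier:
        # Automatic, checkable reward: no human labels required.
        return "GRPO"
    if has_binary_labels and not has_pairs:
        # Unpaired thumbs-up/down: use them without faking comparisons.
        return "KTO"
    if has_pairs:
        # Pairs can train a reward model for PPO, or feed DPO directly.
        if needs_online_exploration and has_rl_team:
            return "PPO"
        return "DPO"
    return "collect feedback first"


print(pick_objective(has_verifier=False, has_binary_labels=True, has_pairs=False))
```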
None of these is an upgrade over the others in the abstract. They're answers to different questions about what's sitting in your dataset. Read the data first; the trainer is almost an afterthought.