You proved the task is solvable. A big model does it well in your eval, the demo lands, and then someone in finance pulls up the projected inference bill and the conversation changes. The model that works costs too much to run at the volume you're about to hit. This is the moment distillation is for.
Distillation is the act of teaching a small "student" model to imitate a large "teacher." It's the top rung of the ladder from the first post in this series — the one you climb after you've established that the capability exists, when the only thing left to fix is the price per call. And it has a wrinkle that's become the interesting part lately: with reasoning models, you're not just distilling answers. You're distilling the thinking.
The original trick: learn from the teacher's uncertainty
The 2015 idea behind distillation is subtler than "train the small model on the big model's outputs." The insight is about which outputs.
When a teacher classifies an image of a dog, it doesn't just output "dog." It outputs a full probability distribution: dog 0.9, wolf 0.06, cat 0.001, and so on. That distribution carries information a plain label can't: it says a dog looks much more like a wolf than like a cat. Hinton and his co-authors called these soft targets, and they're richer than the hard answer because they encode the teacher's whole sense of how the classes relate.
So classic knowledge distillation trains the student to match the teacher's distribution, not just its top pick. You raise a "temperature" in the softmax of both models, dividing the logits by it before normalizing, which softens the probabilities and exposes the small differences. Then you minimize the divergence between the teacher and student distributions, often alongside the ordinary loss on the true labels. The student learns not just the answer but the teacher's structured uncertainty around it, and that turns out to be a much stronger training signal than labels alone.
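To make the temperature trick concrete, here is a minimal sketch in plain Python. The logits, the temperature of 4.0, and the helper names softmax_t and kd_loss are illustrative assumptions, not values from any paper or library. The T * T scaling follows the convention in the original distillation paper; real training code would compute the same quantity on tensors with autograd.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax with temperature T; larger T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax_t(teacher_logits, T)
    q = softmax_t(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl * T * T

# Hypothetical logits for the classes (dog, wolf, cat).
teacher_logits = [6.0, 3.3, -0.8]
student_logits = [4.0, 0.5, 0.2]

hard = softmax_t(teacher_logits, T=1.0)      # sharp: almost all mass on "dog"
soft = softmax_t(teacher_logits, T=4.0)      # softened: "wolf" and "cat" become visible
loss = kd_loss(teacher_logits, student_logits)
```

Raising T moves probability mass away from "dog" and toward "wolf" and "cat", so the relative ordering of the wrong classes contributes to the loss instead of vanishing next to the top pick.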
From distributions to sequences to reasoning
For language models the picture stretches in two steps.
First, sequence-level distillation: rather than matching token-by-token distributions, you let the teacher generate full outputs and train the student on those generations. The teacher becomes a data factory, producing high-quality completions the student learns to reproduce. This is how a lot of "distilled" open models are actually made — a strong model writes the training set, a small model learns from it.
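The "data factory" step can be sketched as a loop over prompts. This is a minimal illustration, assuming the teacher is exposed as a plain Python callable; build_distilled_dataset and toy_teacher are hypothetical names, and the toy teacher stands in for a call to a large model.

```python
from typing import Callable, Dict, List

def build_distilled_dataset(
    prompts: List[str],
    teacher_generate: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run the teacher once per prompt and keep the non-empty completions."""
    rows = []
    for prompt in prompts:
        completion = teacher_generate(prompt).strip()
        if completion:                       # skip empty generations
            rows.append({"prompt": prompt, "completion": completion})
    return rows

# Stand-in teacher for illustration only; a real one would query a large model.
def toy_teacher(prompt: str) -> str:
    return "" if "unanswerable" in prompt else f"Answer to: {prompt}"

teacher_rows = build_distilled_dataset(
    ["What is 2 + 2?", "unanswerable question"], toy_teacher
)
```

The student never sees the teacher's probabilities here, only its finished text, which is why this variant works even when the teacher is reachable only through a generation API.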
Second, and this is the recent twist, chain-of-thought distillation. The 2023 Distilling Step-by-Step! work showed that if the teacher doesn't just give answers but also writes out its reasoning — the intermediate steps — and you train the student on the reasoning too, the student learns more from less. It can match or beat a model trained on far more answer-only data, because the rationales teach it how to get there, not just where to land. Each example carries more signal.
That idea went mainstream with the reasoning models of early 2025. When a large reasoning model thinks through a problem in long, explicit steps, those traces are a goldmine. Capture them, train a much smaller dense model on them, and the small model inherits a real chunk of the reasoning ability — distilled reasoners that fit on hardware the teacher never would. The teacher does the expensive thinking once, at training time; the student does a cheaper imitation of it forever after.
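What changes for chain-of-thought distillation is the shape of each training row: the completion carries the reasoning as well as the answer. A minimal sketch, assuming the same `<rationale> ... <answer> ...` layout this post uses in its training example; format_trace and extract_answer are hypothetical helpers, and the sample problem is made up.

```python
import re

def format_trace(rationale: str, answer: str) -> str:
    """Pack the teacher's reasoning and final answer into one completion string."""
    return f"<rationale> {rationale.strip()} <answer> {answer.strip()}"

def extract_answer(completion: str):
    """Pull the final answer back out of a completion, or None if the tag is missing."""
    match = re.search(r"<answer>\s*(.*)\s*$", completion, flags=re.DOTALL)
    return match.group(1).strip() if match else None

row = {
    "prompt": "A train travels 60 km in 1.5 hours. What is its average speed?",
    "completion": format_trace("Speed is distance over time: 60 / 1.5 = 40.", "40 km/h"),
}
```

Keeping the answer behind its own tag means you can still score the student on final answers alone at evaluation time, even though it was trained to write out the steps first.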
Distillation meets LoRA
There's a natural pairing worth knowing about. The student still has to be trained on all that teacher-generated data, and training a student from scratch, or fully fine-tuning it, brings back the cost you were trying to escape. So you fine-tune the student with LoRA on the distilled data instead. The 2024 KD-LoRA work formalized exactly this: combine knowledge distillation with a low-rank adapter so the student-building step is itself cheap. You get a small base model, a tiny trained adapter, and a training run that fits on modest hardware: the compression story and the parameter-efficiency story stacked on top of each other.
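A quick back-of-the-envelope calculation shows why the adapter is "tiny." LoRA replaces the update to a d_out x d_in weight matrix with two low-rank factors, so it trains r * (d_out + d_in) parameters for that matrix instead of d_out * d_in. The 4096 x 4096 projection below is a hypothetical size chosen for illustration, not a figure from any specific model.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in the two low-rank factors: B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

d_out = d_in = 4096              # hypothetical projection size
r = 16                           # adapter rank
full = d_out * d_in              # parameters a full fine-tune would update
lora = lora_trainable_params(d_out, d_in, r)
fraction = lora / full

print(f"full: {full:,}  lora: {lora:,}  fraction: {fraction:.2%}")
```

For this one matrix the adapter trains 131,072 parameters against 16,777,216, or 1/128 of the full update. The ratio across a whole model depends on which layers you attach adapters to.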
In practice the distilled-data fine-tune looks like an ordinary supervised run; the "distillation" lives in how the dataset was built, not in exotic training code:
# 1. Teacher generates reasoning traces for your prompts (offline, once).
#    Each row: {"prompt": ..., "completion": "<rationale> ... <answer> ..."}
# 2. Student is fine-tuned on those traces with a LoRA adapter.
#    small_student_model, teacher_traces and tokenizer are assumed to be loaded already.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=small_student_model,       # e.g. an 8B base
    args=SFTConfig(learning_rate=1e-4, num_train_epochs=2),
    train_dataset=teacher_traces,    # the distilled dataset
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
    processing_class=tokenizer,
)
trainer.train()
The teacher's effort is frozen into teacher_traces. Everything after is a cheap LoRA fine-tune.
The catch worth naming
Distillation has a ceiling, and it's an honest one: a student rarely exceeds its teacher on the teacher's own ground. You're copying a capability, and copies lose detail. On the narrow task you distilled, a good student gets remarkably close. Drift outside that task and the gap widens — the student didn't inherit the teacher's full generality, just the slice you trained on. There's also a quieter risk: if the teacher is confidently wrong somewhere, the student learns the mistake faithfully, soft targets and all. Garbage thinking distills just as cleanly as good thinking.
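One way to limit that last risk, where you have trusted reference answers for at least part of your prompt set, is to drop teacher traces whose final answer disagrees with the reference before training. This is a minimal sketch, not something the techniques above require: the row layout, the teacher_answer and reference fields, and the exact-match comparison are all assumptions for illustration.

```python
def keep_verified(rows, normalize=lambda s: s.strip().lower()):
    """Keep only rows whose teacher answer matches the trusted reference answer."""
    kept = []
    for row in rows:
        if normalize(row["teacher_answer"]) == normalize(row["reference"]):
            kept.append(row)
    return kept

# Hypothetical candidate rows: the second teacher answer is wrong.
candidate_rows = [
    {"prompt": "Capital of France?", "teacher_answer": "Paris ", "reference": "paris"},
    {"prompt": "7 * 8?", "teacher_answer": "54", "reference": "56"},
]
verified_rows = keep_verified(candidate_rows)
```

A filter like this only catches wrong final answers. Flawed reasoning that happens to land on the right answer still passes, so it narrows the problem rather than removing it.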
Which is why distillation belongs where this post started: at the end of the process, not the start. You distill a capability you've already validated, on a task you've already scoped, to hit a cost target you've already measured. Do it then and it's one of the highest-payoff moves in the whole stack: much the same behavior on the scoped task, from a fraction of the model, at a fraction of the bill. Do it before you know what you're copying and you've just made a small, cheap model that's confidently wrong at scale.