Data Preparation for Fine-Tuning

Nobody demos the data cleaning. The slides show the eval jumping ten points after fine-tuning; they never show the two weeks of staring at malformed rows that made the ten points possible. But ask anyone who's shipped a fine-tune what actually decided whether it worked, and they'll tell you the same thing: it was the data, and it was almost always the data being worse than they thought.

This post is about the unglamorous middle — the part between "we should fine-tune" and "let's hit train." Get it right and a mediocre method gives good results. Get it wrong and the fanciest training recipe in the world faithfully learns your garbage.

Less, but clean, beats more

The instinct is to collect as much data as possible. The evidence points the other way. The 2023 LIMA paper fine-tuned a 65B-parameter LLaMA base on just 1,000 carefully curated examples and got a model that competed with ones trained on vastly more. The argument is that a capable base model already knows most of what it needs, so fine-tuning teaches it which behavior to surface rather than stuffing in new knowledge. A small, clean, diverse set does that better than a large, noisy one.

So the goal of data prep isn't volume. It's three things: every example is correct, the set is diverse enough to cover the behavior you want, and the format exactly matches what training expects. Most failures are a breakdown in one of those three, and each one is avoidable.

Figure: data-prep pipeline state diagram, from collection to split. Data flows through cleaning, formatting, and a quality gate before the train/val split.

Clean: the boring work that pays

Deduplicate. Duplicate and near-duplicate examples don't just waste compute; they quietly reweight your data, teaching the model that whatever got copied a hundred times is what matters. Exact dedup is easy. Near-dup (paraphrases, boilerplate that repeats with small edits) needs something like MinHash or embedding similarity. The 2021 work on deduplicating pretraining data found that removing duplicates straight-up improved models, and the same reasoning applies at fine-tune scale, where each copy is a larger share of a much smaller set.
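Here's a minimal sketch of both passes, assuming your examples are plain strings. It uses exact Jaccard similarity over word 3-grams as a stand-in for MinHash (MinHash approximates the same similarity, just cheaply enough for large sets). The function names and the 0.8 threshold are illustrative assumptions, not a standard.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text: str, n: int = 3) -> set:
    """Set of word n-grams used to compare two texts."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedupe(texts, threshold: float = 0.8):
    """Keep the first copy of each exact or near-duplicate text.

    The near-dup pass compares against every kept text, so it is O(n^2):
    fine for a few thousand rows, too slow for millions.
    """
    seen_hashes = set()
    kept, kept_shingles = [], []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        current = shingles(text)
        if any(jaccard(current, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of something already kept
        seen_hashes.add(digest)
        kept.append(text)
        kept_shingles.append(current)
    return kept
```

Whatever threshold you pick, read a sample of what got dropped before trusting it. Too low and you delete legitimately similar examples; too high and the boilerplate survives.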

Decontaminate. This is the one that burns people. If any of your eval examples leaked into your training set, your eval number is a lie — the model memorized the answer, it didn't learn the skill. Before you trust a single metric, check that your test set shares nothing with your training set. Near-matches count.
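One common way to run that check is n-gram overlap: flag any training example that shares a long enough run of words with any eval example. A rough sketch, assuming both sets are lists of strings; the function names and the 8-word window are assumptions you should tune for your data.

```python
import re


def ngrams(text: str, n: int = 8) -> set:
    """Word n-grams of a text, ignoring case and punctuation."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_contaminated(train_texts, eval_texts, n: int = 8):
    """Return indices of training texts sharing any n-gram with the eval set.

    Texts shorter than n words produce no n-grams and are never flagged,
    so short examples need a separate exact-match check.
    """
    eval_grams = set()
    for text in eval_texts:
        eval_grams |= ngrams(text, n)
    return [i for i, text in enumerate(train_texts) if ngrams(text, n) & eval_grams]
```

This catches copies and light edits. It will not catch a full paraphrase, which is where embedding similarity earns its keep.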

Filter. Drop the truncated rows, the ones where the answer is empty, the ones where the assistant response is actually an apology or a refusal you didn't want, the outliers in length that will blow up your sequence budget. Eyeball the length distribution; surprises there usually mean a parsing bug upstream.
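A sketch of what that filter might look like, assuming each row is a dict with "prompt" and "response" keys. The field names, the refusal phrase list, and the character cap are all hypothetical; swap in whatever matches your data. Truncation detection is left out because it depends heavily on what a complete answer looks like in your domain.

```python
import statistics

# Hypothetical phrase list; extend it with the refusals you actually see.
REFUSAL_MARKERS = ("i'm sorry", "i am sorry", "i cannot", "i can't", "as an ai")


def keep_row(row: dict, max_chars: int = 8000) -> bool:
    """Return True if the row passes the basic quality checks."""
    prompt = (row.get("prompt") or "").strip()
    response = (row.get("response") or "").strip()
    if not prompt or not response:
        return False  # missing or empty side of the pair
    if len(prompt) + len(response) > max_chars:
        return False  # length outlier that would blow the sequence budget
    opening = response[:80].lower()
    if any(marker in opening for marker in REFUSAL_MARKERS):
        return False  # response opens with an apology or refusal
    return True


def length_summary(rows) -> dict:
    """Min, median, and max combined length in characters, for eyeballing."""
    lengths = sorted(len(r["prompt"]) + len(r["response"]) for r in rows)
    return {"min": lengths[0], "median": statistics.median(lengths), "max": lengths[-1]}
```

Character counts are a cheap proxy. Once you know which tokenizer you're training with, measure in tokens instead, since that's what the sequence budget is actually denominated in.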

Format: the failure that wastes a whole run

Here's the mistake that silently ruins more fine-tunes than any algorithm choice: formatting the data wrong for the model's chat template.

Every instruction-tuned model expects its conversations wrapped in a specific structure — special tokens marking where the system prompt, the user turn, and the assistant turn begin and end. These tokens are not decorative. They're how the model knows whose turn it is. If you train with the wrong template, or hand-roll the formatting and get a token subtly wrong, the model learns against a structure it will never see at inference, and the result is a model that's somehow worse than before you touched it — with no error message to tell you why.

The fix is to never hand-roll it. Use the tokenizer's own chat template, which encodes exactly the format that model was trained with:

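from transformers import AutoTokenizer

# Placeholder model id (an assumption): use the checkpoint you are actually fine-tuning.
tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")

messages = [
    {"role": "system", "content": "You are a terse SQL assistant."},
    {"role": "user", "content": "Top 5 customers by revenue?"},
    {"role": "assistant", "content": "SELECT ... LIMIT 5;"},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
# renders the model's exact special tokens; don't write these by hand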

The other half of formatting is loss masking. You usually want the model to learn the assistant's responses, not to learn to predict the user's prompts back. That means masking the prompt tokens out of the loss so gradients only flow from the parts you actually want the model to produce. Most fine-tuning frameworks handle this for you when you use a completion-only or chat data collator — but it's worth confirming it's on, because training on the prompt tokens dilutes the signal and is invisible unless you look.
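To make the mechanics concrete, here's a toy sketch of what masking does to one example, assuming you already have the prompt and response as separate lists of token ids. The ids and function names are made up. The one real convention is -100, which is the default ignore_index of PyTorch's cross-entropy loss, so positions labeled -100 contribute nothing to the gradient.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def trained_fraction(labels) -> float:
    """Share of positions that actually contribute to the loss."""
    if not labels:
        return 0.0
    return sum(1 for label in labels if label != IGNORE_INDEX) / len(labels)


# Made-up token ids, purely for illustration.
input_ids, labels = build_labels([101, 7, 8, 9], [42, 43, 2])
```

The second function is the cheap way to confirm masking is on: pull one batch from your collator and check its labels. If the trained fraction comes back as 1.0, nothing is masked and you're training on the prompts too.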

Diverse: cover the behavior, not just the easy cases

A clean dataset that only shows the model the happy path produces a model that only handles the happy path. If you're tuning a support agent, your data needs the angry customer, the ambiguous request, the question it should refuse or escalate — not fifty variations of the polite, answerable case. Coverage of edge behavior is what separates a fine-tune that demos well from one that survives contact with real traffic.
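If you can tag each example with the scenario it covers, a coverage check takes a few lines. This sketch assumes a "scenario" field on every row and a list of scenarios you've decided the model must handle. The field name, scenario names, and minimum count are all hypothetical.

```python
from collections import Counter


def coverage_gaps(rows, required, min_count: int = 20) -> dict:
    """Map each under-represented required scenario to its current count."""
    counts = Counter(row["scenario"] for row in rows)
    return {name: counts[name] for name in required if counts[name] < min_count}


rows = [
    {"scenario": "polite_answerable"},
    {"scenario": "polite_answerable"},
    {"scenario": "polite_answerable"},
    {"scenario": "angry_customer"},
]
gaps = coverage_gaps(rows, ["polite_answerable", "angry_customer", "escalate"], min_count=2)
# gaps == {"angry_customer": 1, "escalate": 0}
```

The counts matter less than the zeros. A scenario with no examples is one the fine-tune has no reason to handle well.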

When you can't collect enough natural diversity, synthetic generation helps — the Self-Instruct line of work showed you can bootstrap a varied instruction set by having a strong model generate and reformulate examples. It's a real tool, with one rule attached: synthetic data inherits the generator's blind spots and biases, so it needs the same cleaning and human spot-checking as anything else. Generated-and-never-reviewed is how subtle, systematic errors get baked in at volume.

Split, then resist the urge to peek

Hold out a validation set before you train, and a separate test set you don't look at until the end. Keep them genuinely separate from training — same decontamination rule as above. The validation set tells you when to stop; the untouched test set tells you whether you actually improved. If you tune your hyperparameters against your test set, you've turned it into a training set and you no longer have an honest measurement of anything.
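One way to keep the split honest is to make it deterministic: hash a stable key for each example and assign the split from the hash, so an example can't drift between sets when you regenerate the data. A sketch, assuming each example has a stable string key such as a conversation or customer id. The function name and the 10/10 percentages are assumptions.

```python
import hashlib


def assign_split(key: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a key to 'train', 'val', or 'test'."""
    # The same key always hashes to the same bucket in 0..99.
    bucket = int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"
```

Hashing on a group key rather than a row id also keeps related examples together. If the same customer's conversations land in both train and test, you've leaked, even if no single row is duplicated.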

The reason data prep feels thankless is that doing it well produces nothing visible — no clever architecture, no benchmark to brag about, just a quiet pile of correct, well-formatted, deduplicated examples. But that pile is the entire ballgame. Method choice — LoRA versus DoRA, DPO versus KTO — moves results by a little. The difference between careful data and careless data moves them by a lot. If you only have time to get one thing right before a fine-tune, get the data right. The training loop is the easy part.