Fine-tuning · Feb 23, 2026 · 5 min · Tags: finetuning, llm, deeplearning

The Ladder: Prompt, RAG, Fine-tune, Distill

Most fine-tuning projects should have stayed a prompt.

I've watched it happen enough times to call it a pattern. A model gives a slightly-off answer in a demo, someone says "we need to fine-tune it on our data," a GPU budget gets approved, and three weeks later the team has an adapter that's marginally better than the prompt they started with — and a new pipeline to maintain forever. The fix was usually two sentences in a system prompt.

So before any of the nine posts in this fine-tuning series get into LoRA ranks and 4-bit quantization, here's the part that matters more: knowing which rung of the ladder you actually need. There are four. Each one costs roughly ten times the effort of the one below it, and most teams should stop earlier than they think.

The four rungs

Prompt engineering. Few-shot examples, a clear system prompt, output schemas, maybe a chain-of-thought nudge. Iteration time is seconds. You change a string and re-run. Nothing to host, nothing to retrain.

Retrieval (RAG). You attach a search step so the model can pull in facts it was never trained on — your docs, your tickets, last week's pricing. The model's weights don't change; its context does. Iteration time is minutes to hours (build an index, tune retrieval).

Fine-tuning. Now you change the weights. Usually a thin LoRA or QLoRA adapter — a fraction of a percent of the parameters — trained on a few hundred to a few thousand examples of the behavior you want. Iteration time is hours, plus a GPU and a real eval set.

Distillation. You take a big model that already does the job and teach a small one to imitate it. This is the rung you climb after you've proven the task is solvable, when inference cost or latency is the thing keeping you up at night. Iteration time is days.

[Figure: A decision ladder from prompt to RAG to fine-tune to distill. Each rung costs roughly 10x the one below; climb only when the lower rung provably fails.]
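
To make the first two rungs concrete, here is a minimal sketch, not a real retrieval stack. `build_prompt`, `retrieve`, the few-shot pair, and the sample documents are all made up for illustration, and the word-overlap ranking is only a stand-in for whatever index you would actually use. The point is that rung one edits a string and rung two changes what gets put into it; no weights are touched in either case.

```python
# Hypothetical names and data throughout; this only assembles prompt text.
SYSTEM_PROMPT = "Answer in one sentence."

FEW_SHOT = [
    ("What is the refund window?", "Refunds are accepted within 30 days."),
]

def build_prompt(question, context_docs=()):
    """Assemble a plain-text prompt. Leave context_docs empty for prompt-only use."""
    parts = [SYSTEM_PROMPT, ""]
    for q, a in FEW_SHOT:
        parts += [f"Q: {q}", f"A: {a}", ""]
    if context_docs:
        parts.append("Context:")
        parts += [f"- {doc}" for doc in context_docs]
        parts.append("")
    parts += [f"Q: {question}", "A:"]
    return "\n".join(parts)

def retrieve(question, docs, k=2):
    """Rank docs by word overlap with the question (a toy stand-in for an index)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

DOCS = [
    "The Pro plan costs 49 dollars per month as of last week.",
    "Support hours are 9am to 5pm on weekdays.",
    "The Free plan allows three projects.",
]

question = "How much does the Pro plan cost?"
prompt_only = build_prompt(question)                             # rung one
with_rag = build_prompt(question, retrieve(question, DOCS, k=1))  # rung two
```

The model call itself is left out on purpose: whichever client you use, the only difference between the two rungs here is the text it receives.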

The rule for climbing

Climb only when the rung below you provably fails — and "provably" means you have an eval, not a vibe. If you can't measure the gap, you can't tell whether fine-tuning closed it, and you'll be back here in a month arguing about whether the new adapter is actually better.
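
An eval does not have to be elaborate to beat a vibe. The sketch below assumes each system is just a callable from prompt to string and scores by exact match, which is an assumption that only suits tasks with a single right answer; `exact_match_accuracy`, `gap_report`, and the tiny eval set are illustrative names and data, not a recommended benchmark.

```python
def exact_match_accuracy(system, eval_set):
    """Fraction of (prompt, expected) pairs the system answers exactly."""
    hits = sum(1 for prompt, expected in eval_set if system(prompt).strip() == expected)
    return hits / len(eval_set)

def gap_report(lower_rung, higher_rung, eval_set):
    """Score two systems on the same eval set and report the difference."""
    low = exact_match_accuracy(lower_rung, eval_set)
    high = exact_match_accuracy(higher_rung, eval_set)
    return {"lower": low, "higher": high, "gap": high - low}

# Stand-in "systems": canned answers instead of real model calls.
EVAL_SET = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("3*3", "9"),
    ("opposite of hot", "cold"),
]
canned_prompt_only = {"2+2": "4", "capital of France": "Paris", "3*3": "6", "opposite of hot": "warm"}
canned_with_adapter = {"2+2": "4", "capital of France": "Paris", "3*3": "9", "opposite of hot": "warm"}

report = gap_report(
    lambda p: canned_prompt_only.get(p, ""),
    lambda p: canned_with_adapter.get(p, ""),
    EVAL_SET,
)
print(report)  # {'lower': 0.5, 'higher': 0.75, 'gap': 0.25}
```

Four examples prove nothing, of course. A real eval set needs enough examples, and a metric that fits the task, for the gap to mean something.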

The single most common mistake has a clean diagnosis: people fine-tune to inject knowledge. They want the model to "know" their internal product catalog, so they fine-tune on the catalog. It mostly doesn't work. Fine-tuning is good at teaching a model how to behave — a format, a tone, a task structure, a domain's phrasing. It's bad at reliably teaching it new facts, and worse, the facts it does absorb go stale the moment the catalog changes, with no way to update short of retraining.

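Here's the split that resolves most arguments:

Knowledge problem (the model lacks the facts, or has stale ones): reach for retrieval.
Behavior problem (wrong format, tone, task structure, or phrasing): consider fine-tuning.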

When you genuinely can't tell which one you have, run a cheap test: stuff the relevant facts directly into the prompt. If the model nails the task with the facts in context, you have a knowledge problem — reach for retrieval, not training. If it still misbehaves with the facts in front of it, that's behavior, and a small adapter might earn its keep.
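
The facts-in-context test is easy to script. This is a sketch under assumptions: `model` is any callable from prompt to string, `is_correct` is whatever check your eval uses, and the two fake models exist only to show both outcomes.

```python
def diagnose(model, question, facts, is_correct):
    """Label a failure 'knowledge' or 'behavior' by putting the facts in the prompt."""
    if is_correct(model(question)):
        return "no problem"
    fact_lines = "\n".join(f"- {fact}" for fact in facts)
    stuffed_prompt = f"Facts:\n{fact_lines}\n\nQuestion: {question}"
    # Correct once the facts are visible -> the model was missing knowledge.
    # Still wrong with the facts in front of it -> the problem is behavior.
    return "knowledge" if is_correct(model(stuffed_prompt)) else "behavior"

# Fake models for illustration only.
def lacks_facts(prompt):
    return "49 dollars" if "49 dollars" in prompt else "I am not sure"

def ignores_format(prompt):
    return "Great question! Let me walk you through our pricing philosophy."

question = "How much is the Pro plan?"
facts = ["The Pro plan costs 49 dollars per month."]
is_correct = lambda answer: answer == "49 dollars"

print(diagnose(lacks_facts, question, facts, is_correct))     # knowledge
print(diagnose(ignores_format, question, facts, is_correct))  # behavior
```

In practice you would run this over a sample of failing cases rather than one question, and look at which label dominates.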

The combination nobody markets

The framing as a ladder makes it sound like you pick one rung and live there. The highest-ROI setups in production don't. They pair a thin fine-tuned adapter with retrieval: LoRA teaches the model the shape of the task — how to read your retrieved context, when to abstain, what the output should look like — and RAG supplies the live facts at inference time. The adapter handles behavior; retrieval handles knowledge. Each does the thing it's actually good at.
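
One way to picture what the adapter learns in that pairing is to look at a training example. The format below is purely hypothetical (the prompt layout, the JSON fields, and the `make_training_example` helper are assumptions made for illustration), but it shows the idea: the retrieved context is already in the prompt, and the target teaches output shape and when to abstain rather than any facts.

```python
import json

# Hypothetical abstention target for when the retrieved context lacks the answer.
ABSTAIN = {"answer": None, "status": "not_in_context"}

def make_training_example(question, retrieved_docs, gold_answer=None):
    """Build one prompt/completion pair; gold_answer=None yields an abstention target."""
    context = "\n".join(f"[{i}] {doc}" for i, doc in enumerate(retrieved_docs, start=1))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer (JSON):"
    target = ABSTAIN if gold_answer is None else {"answer": gold_answer, "status": "ok"}
    return {"prompt": prompt, "completion": json.dumps(target)}

answerable = make_training_example(
    "How much is the Pro plan?",
    ["The Pro plan costs 49 dollars per month."],
    gold_answer="49 dollars per month",
)
unanswerable = make_training_example(
    "How much is the Enterprise plan?",
    ["The Pro plan costs 49 dollars per month."],
)
```

Because the price lives in the context and not in the weights, changing it means re-indexing a document, not retraining the adapter.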

A LoRA adapter is small enough that this isn't extravagant. Configuring one with PEFT is a handful of lines:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,                 # rank — the size of the low-rank update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
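# base_model: an already-loaded causal LM, e.g. from transformers' AutoModelForCausalLM.from_pretrained(...)
model = get_peft_model(base_model, config)
model.print_trainable_parameters()
# prints something like: trainable params: ~4M  ||  all params: ~8B  ||  trainable%: 0.05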

A small fraction of a percent of the parameters: about 0.05% in the printout above. That's the whole trick that makes the third rung cheap enough to consider, and it's where the next post picks up.
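
For a sense of where a number that small comes from, the arithmetic can be done by hand. LoRA adds two thin matrices per adapted weight, so a weight of shape d_out x d_in gains rank * (d_in + d_out) trainable parameters. The model shape below (hidden size 4096, 32 layers, square q and v projections, 8B total parameters) is a hypothetical round-number example, not the base model in the snippet above; real architectures differ, and models with grouped-query attention have a smaller `v_proj`.

```python
def lora_params(rank, d_in, d_out):
    """Trainable parameters LoRA adds to one d_out x d_in weight:
    a rank x d_in matrix plus a d_out x rank matrix."""
    return rank * (d_in + d_out)

# Hypothetical model shape, chosen for round numbers.
HIDDEN = 4096
LAYERS = 32
TOTAL_PARAMS = 8_000_000_000

def adapter_size(rank, modules_per_layer=2):
    """Total adapter parameters when adapting square HIDDEN x HIDDEN projections."""
    return LAYERS * modules_per_layer * lora_params(rank, HIDDEN, HIDDEN)

for rank in (8, 16, 64):
    n = adapter_size(rank)
    print(f"r={rank}: {n:,} params, {100 * n / TOTAL_PARAMS:.3f}% of the model")
# r=8: 4,194,304 params, 0.052% of the model
# r=16: 8,388,608 params, 0.105% of the model
# r=64: 33,554,432 params, 0.419% of the model
```

Adapter size grows linearly with rank and with the number of adapted modules, which is why the choice of rank and `target_modules` is the main lever on how much you train.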

Where teams actually land

If I had to sketch the distribution from what I've seen ship: a large majority of "we need to fine-tune" tasks are solved at rungs one and two. Prompting plus retrieval covers the customer-support bots, the document Q&A, the internal search tools, and anything else where the bottleneck is which facts the model can see. Fine-tuning earns its place for a real minority: structured extraction at scale, strict-format generation, classification on private label spaces, voice-and-tone matching, and shrinking a prompt that's grown to thousands of tokens of instructions into learned behavior. Distillation is rarer still, and almost always a cost story: you already have something that works and you need it to work for a tenth of the price per call.

None of this is an argument against fine-tuning. The rest of this series is about doing it well, because when you do need it, the gap between a careless adapter and a careful one is enormous. It's an argument against fine-tuning first. The ladder is cheap at the bottom for a reason. Spend your effort proving you can't solve the problem with a better prompt and a search index before you reach for a GPU — and when the eval finally says you can't, climb with a clear head about which rung you're standing on and why.
