A 7-billion-parameter model has 7 billion knobs. Full fine-tuning turns all of them. LoRA's whole bet is that you don't have to — that the change you want to make is far smaller than the model itself, even when the model is enormous.
That bet has a name and a paper behind it. In 2020, Aghajanyan and colleagues showed that big pretrained models have a low "intrinsic dimension": you can adapt them to a new task by moving in a surprisingly tiny subspace of all those parameters. A year later, Hu et al. turned that observation into a method, LoRA, and for anyone not made of GPUs it has been the default way to fine-tune ever since.
The matrix that doesn't change
Inside a transformer, most of the heavy lifting is matrix multiplication. A weight matrix W takes an input x and produces Wx. Fine-tuning normally means nudging W itself: computing some update ΔW and setting the new weights to W + ΔW. The problem is that ΔW is the same size as W. For a single matrix that might be thousands by thousands of numbers, and a model has hundreds of those matrices. You're storing and updating a second full copy of the model.
LoRA's move is to not store ΔW directly. Instead it factors the update into two skinny matrices:
ΔW = B · A
If W is d × d, then A is r × d and B is d × r, where r — the rank — is small. Think 8, 16, 32. Their product B·A has the same shape as ΔW, but the two factors together hold only 2 · d · r numbers instead of d · d. With d = 4096 and r = 16, that's about 131 thousand parameters standing in for 16.8 million — roughly 0.8% of the full update, per matrix.
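The arithmetic is easy to check for yourself. A minimal sketch, using the same d and r as above (the function names are just illustrative):

```python
def full_update_params(d: int) -> int:
    """Entries in a dense d x d update matrix."""
    return d * d


def lora_update_params(d: int, r: int) -> int:
    """Entries in A (r x d) plus entries in B (d x r)."""
    return 2 * d * r


d, r = 4096, 16
full = full_update_params(d)
lora = lora_update_params(d, r)

print(f"dense update: {full:,}")      # dense update: 16,777,216
print(f"LoRA factors: {lora:,}")      # LoRA factors: 131,072
print(f"fraction: {lora / full:.2%}")  # fraction: 0.78%
```

Halving r halves the adapter's parameter count; doubling d doubles the adapter but quadruples the dense update, so the saving grows with model width.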
During training, W stays frozen. Only A and B learn. The forward pass becomes:
h = Wx + (B · A)x
The frozen path Wx carries everything the model already knew. The new path (B·A)x carries the correction you're teaching it. Add them and you get the adapted output.
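Here is a toy version of that forward pass in plain Python, as a sketch rather than how any library implements it. The tiny sizes and the helper names are made up for illustration. It starts B at zero and A at small random values, which is the initialization the LoRA paper describes, so the adapter contributes nothing until training moves B.

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x):
    """h = Wx + B(Ax): frozen path plus low-rank correction."""
    base = matvec(W, x)              # frozen path
    delta = matvec(B, matvec(A, x))  # down to r dims, then back up to d
    return [b + c for b, c in zip(base, delta)]


random.seed(0)
d, r = 6, 2  # toy sizes; real layers have d in the thousands
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # d x d, frozen
A = [[random.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]                                  # d x r, trainable
x = [random.gauss(0, 1) for _ in range(d)]

h = lora_forward(W, A, B, x)
print(h == matvec(W, x))  # True: with B at zero, the output is just Wx
```

Note that the sketch computes B(Ax) instead of forming the d × d product B·A first. Both give the same result, but the first never materializes a full-size matrix.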
Why this is such a good deal
Three consequences fall out of that picture, and each one is a reason LoRA took over.
Memory. You're not storing gradients or optimizer state for 7 billion parameters, just for the few million in A and B. Adam keeps two extra values per trainable parameter (momentum and variance), so cutting trainable params by 100× cuts optimizer memory by about the same factor. The frozen weights and the activations still have to fit, but a job that needed several GPUs can often squeeze onto one card.
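A back-of-the-envelope version of the optimizer-state saving, under loudly stated assumptions: Adam state kept in fp32 (4 bytes per value), and a hypothetical adapter covering four d × d attention projections in each of 32 layers, with d = 4096 and r = 16. Real models and setups will differ.

```python
BYTES_PER_VALUE = 4        # assumption: fp32 optimizer state
ADAM_VALUES_PER_PARAM = 2  # momentum and variance


def adam_state_bytes(trainable_params: int) -> int:
    """Memory for Adam's two running averages over the trainable parameters."""
    return trainable_params * ADAM_VALUES_PER_PARAM * BYTES_PER_VALUE


full_params = 7_000_000_000
# Hypothetical adapter: 32 layers x 4 projections x (2 * d * r) per matrix.
lora_params = 32 * 4 * (2 * 4096 * 16)

full_gb = adam_state_bytes(full_params) / 1e9
lora_mb = adam_state_bytes(lora_params) / 1e6

print(f"trainable params with LoRA: {lora_params:,}")  # 16,777,216
print(f"Adam state, full fine-tune: {full_gb:.1f} GB")  # 56.0 GB
print(f"Adam state, LoRA: {lora_mb:.1f} MB")            # 134.2 MB
```

This counts optimizer state only. The frozen base weights, the activations, and the gradients for A and B all sit on top of it.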
Storage and shipping. A finished LoRA adapter is typically a few megabytes to a few tens of megabytes, depending on the rank and how many matrices it covers. You keep one copy of the base model and a folder of tiny adapters: one per customer, per task, per tone. Swapping behavior means loading a different small file, not a different 14-gigabyte model. For anyone serving many specialized variants, this is the feature that pays the rent.
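The one-base-many-adapters pattern can be sketched in a few lines. `AdapterBank` is a hypothetical helper invented for this illustration, not part of any library, and the 2 × 2 matrices are toy values.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class AdapterBank:
    """One shared frozen matrix plus named low-rank adapters (hypothetical)."""

    def __init__(self, W):
        self.W = W          # stored once, shared by every adapter
        self.adapters = {}  # name -> (A, B)

    def add(self, name, A, B):
        self.adapters[name] = (A, B)

    def forward(self, x, name=None):
        h = matvec(self.W, x)
        if name is not None:
            A, B = self.adapters[name]
            delta = matvec(B, matvec(A, x))
            h = [a + b for a, b in zip(h, delta)]
        return h


bank = AdapterBank(W=[[1.0, 0.0], [0.0, 1.0]])
bank.add("formal", A=[[1.0, 0.0]], B=[[0.5], [0.0]])
bank.add("casual", A=[[0.0, 1.0]], B=[[0.0], [-0.5]])

x = [2.0, 4.0]
print(bank.forward(x))            # [2.0, 4.0]  base behavior
print(bank.forward(x, "formal"))  # [3.0, 4.0]
print(bank.forward(x, "casual"))  # [2.0, 2.0]
```

Switching behavior is a dictionary lookup; the base matrix is never copied or modified.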
No inference tax, if you want. Because B·A has the same shape as W, you can fold it back in once training is done: W_merged = W + B·A. The merged model has the same architecture and the same number of parameters as the base, with zero extra layers and zero extra latency. Or you keep the adapter separate and swap it at runtime. You get to choose per deployment.
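To see that merging changes nothing about the output, here is a small check with hand-picked numbers. The helpers are illustrative. The `scale` argument is an assumption added for setups that multiply the update by a factor before adding it; left at 1.0, it matches the formula above.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in X]


def merge(W, A, B, scale=1.0):
    """Return W + scale * (B @ A) as a new matrix, leaving W untouched."""
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, -1.0]]   # r x d with r = 1
B = [[0.5], [2.0]]  # d x r
x = [3.0, 1.0]

# Adapter kept separate: two paths, added at the end.
unmerged = [h + dl for h, dl in zip(matvec(W, x), matvec(B, matvec(A, x)))]
# Adapter folded in: one ordinary matrix multiply.
merged = matvec(merge(W, A, B), x)

print(unmerged)  # [6.0, 17.0]
print(merged)    # [6.0, 17.0]
```

With real floating-point weights the two paths agree up to rounding, not necessarily bit for bit.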
The two knobs that matter
LoRA has fewer dials than full fine-tuning, but two of them carry most of the weight.
Rank r. This is the capacity of the update. Too low and the adapter can't represent the change you need; too high and you're spending memory for nothing and inviting overfitting on a small dataset. For most instruction-style tuning, r between 8 and 32 is the working range. Bigger datasets and bigger behavior shifts justify higher ranks. The honest answer is that you sweep it against an eval — but if you need a default, 16 rarely embarrasses you.
lora_alpha. A scaling factor. The update is applied as (alpha / r) · B·A, so alpha controls how loudly the adapter speaks relative to the frozen base. A common convention is to set alpha to twice r and leave it. If your fine-tune is overpowering the base model's general ability, turn it down; if the adaptation isn't taking, turn it up.
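One consequence of the alpha / r formula is worth seeing in numbers: if alpha is always twice r, the multiplier stays at 2 no matter which rank you pick. A tiny sketch (`lora_scale` is an illustrative name, not a library function):

```python
def lora_scale(alpha: float, r: int) -> float:
    """Multiplier applied to B·A before it is added to the frozen path."""
    return alpha / r


for r, alpha in [(8, 16), (16, 32), (16, 16), (16, 64)]:
    print(r, alpha, lora_scale(alpha, r))
# 8 16 2.0
# 16 32 2.0
# 16 16 1.0
# 16 64 4.0
```

So changing r while holding alpha fixed also changes how loudly the adapter speaks, which is easy to miss when sweeping rank.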
The other choice is which matrices to adapt. The original paper put adapters only on the attention projections (q_proj, v_proj and friends) and got most of the benefit. Later practice often targets more modules — including the MLP layers — when there's data to support it. More targets means more capacity and more memory; it's the same trade as rank, in a different costume.
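from peft import LoraConfig

config = LoraConfig(
    r=16,                # rank of the update
    lora_alpha=32,       # alpha = 2 * r, a sane default
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections; names vary by model
    lora_dropout=0.05,   # dropout on the adapter path
    bias="none",         # leave bias terms frozen
    task_type="CAUSAL_LM",
)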
What LoRA is quietly assuming
The method works because the assumption underneath it usually holds: the behavior change you're after lives in a low-rank subspace. For adapting tone, format, task structure, or a domain's phrasing, that's true and LoRA shines. When the assumption strains — when you're trying to teach genuinely new capabilities, or cram a large amount of new factual content into the weights — a low-rank update is the wrong tool, and pushing r higher to compensate is a sign you've left LoRA's comfort zone. That's not a knock on the method. It's the same boundary every post in this series keeps running into: weights are for behavior, retrieval is for facts.
LoRA didn't make models smaller or training faster by magic. It noticed that the update was the bloated part, and shrank that instead. Everything downstream — QLoRA stacking quantization on top, the adapter zoo, the swap-at-runtime serving tricks — is built on that one observation. The next post takes the same idea and asks how cheap the frozen base itself can get.