QLoRA: Fine-Tuning on One GPU

Try to full-fine-tune an 8B model on a single 24 GB consumer card and you won't get to the first training step. The weights alone, in 16-bit, are about 16 GB. Then the gradients want another 16. Then Adam wants two more copies for its momentum and variance. You're asking for roughly 64 GB of memory on a card that has 24, and the run dies with an out-of-memory error before it computes a single loss.
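The arithmetic behind that 64 GB figure can be sketched in a few lines. This is a back-of-envelope estimate under the same assumption as the paragraph above (every tensor held at 16 bits, activations not counted); the function name is only for illustration.

```python
def full_finetune_memory_gb(params_billions, bytes_per_value=2):
    """Rough memory for full fine-tuning with Adam, ignoring activations.

    Assumes weights, gradients and both Adam moments are all stored at
    `bytes_per_value` bytes each (2 bytes = 16-bit).
    """
    weights = params_billions * bytes_per_value   # 1e9 params * bytes = GB
    gradients = weights                           # one gradient per weight
    adam_moments = 2 * weights                    # momentum + variance
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_moments": adam_moments,
        "total": weights + gradients + adam_moments,
    }


budget = full_finetune_memory_gb(8)
print(budget)
```

Setups that keep the Adam moments in 32 bits need more than this, so the estimate is on the generous side.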

LoRA already fixed most of that by freezing the base and training tiny adapters — no gradients or optimizer state for the big matrices. But the frozen weights themselves still sit in memory at 16 bits, and for the largest models that's still too much. QLoRA's contribution is to shrink that last stubborn block: keep the base model frozen and store it in 4 bits, while the LoRA adapters train in full precision on top. The result is the thing the title promises — fine-tuning a serious model on one GPU.
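To see why the frozen base is the block worth shrinking, here is a rough sketch of its resident size at 16 bits versus 4 bits. It counts only the weight values themselves (no quantization constants, adapters, or activations), and the model sizes in the loop are illustrative.

```python
def base_weights_gb(params_billions, bits_per_weight):
    """Approximate size of the frozen weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


for size in (8, 13, 70):
    print(f"{size}B: {base_weights_gb(size, 16):.1f} GB at 16-bit, "
          f"{base_weights_gb(size, 4):.1f} GB at 4-bit")
```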

Three ideas stacked

The 2023 QLoRA paper isn't one trick; it's three, and they compose. Each targets a different slice of the memory bill.

4-bit NormalFloat (NF4). Ordinary 4-bit quantization spaces its 16 possible values evenly. But neural network weights aren't spread evenly: they cluster around zero in a roughly bell-shaped curve. NF4 is a data type whose 16 levels are placed at quantiles of a normal distribution, so each level is expected to be used about equally often (this is what the paper means by information-theoretically optimal for normally distributed data) and more of the precision lands where the weights actually are. For a frozen base whose only job is to pass signal forward, this loses startlingly little.
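A small sketch can make the "levels where the weights are" idea concrete. It is a simplification, not the real NF4 table: the levels here are plain normal quantiles, whereas the actual data type is rescaled to [-1, 1], includes an exact zero, and ships as a fixed lookup table in bitsandbytes. All function names are hypothetical. The sketch compares how evenly the 16 codes would be used on standard-normal weights for quantile-placed levels versus evenly spaced ones over the same range.

```python
from math import log2
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def quantile_levels(k=16):
    """k levels, one at the median of each equal-probability slice of N(0, 1)."""
    return [STD_NORMAL.inv_cdf((i + 0.5) / k) for i in range(k)]


def uniform_levels(lo, hi, k=16):
    """k evenly spaced levels from lo to hi, as in plain 4-bit quantization."""
    step = (hi - lo) / (k - 1)
    return [lo + i * step for i in range(k)]


def code_usage(levels):
    """Probability that a N(0, 1) weight rounds to each level (nearest level wins)."""
    cuts = [(a + b) / 2 for a, b in zip(levels, levels[1:])]
    cdf = [0.0] + [STD_NORMAL.cdf(c) for c in cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def entropy_bits(probs):
    """Shannon entropy of the code distribution; 4.0 is the ceiling for 16 codes."""
    return -sum(p * log2(p) for p in probs if p > 0)


nf_like = quantile_levels()
even = uniform_levels(nf_like[0], nf_like[-1])

print("gap near zero:", nf_like[8] - nf_like[7])
print("gap at the tail:", nf_like[15] - nf_like[14])
print("entropy, quantile levels:", entropy_bits(code_usage(nf_like)))
print("entropy, even levels:", entropy_bits(code_usage(even)))
```

The quantile levels crowd together near zero and spread out in the tails, and their code usage sits close to the 4-bit ceiling, while the evenly spaced grid leaves its outer codes underused.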

Double quantization. Quantization needs scaling constants to map 4-bit codes back to real numbers, and those constants take memory too. Double quantization quantizes the quantization constants — a second, smaller pass that shaves off the overhead. It sounds like splitting hairs; across a whole model it's worth a fraction of a bit per parameter, which adds up to real gigabytes.
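The overhead can be worked out directly. The sketch below uses the settings reported in the QLoRA paper: blocks of 64 weights with one 32-bit constant each, and, with double quantization, 8-bit constants that are themselves grouped in blocks of 256 with one 32-bit constant per group. Treat these as the paper's defaults rather than something every loader is guaranteed to use; the function names are illustrative.

```python
def overhead_bits_per_param(block=64, const_bits=32):
    """Extra bits per weight spent on one scaling constant per block."""
    return const_bits / block


def double_quant_overhead_bits(block=64, inner_bits=8,
                               outer_block=256, outer_bits=32):
    """Overhead per weight when the per-block constants are quantized too."""
    first_level = inner_bits / block                    # 8-bit constant per 64 weights
    second_level = outer_bits / (block * outer_block)   # 32-bit constant per 256 constants
    return first_level + second_level


plain = overhead_bits_per_param()            # bits per weight without double quantization
doubled = double_quant_overhead_bits()       # bits per weight with it
saved_bits = plain - doubled                 # saving per weight
saved_gb_65b = saved_bits * 65e9 / 8 / 1e9   # saving on a 65B-parameter model, in GB

print(plain, doubled, saved_bits, saved_gb_65b)
```

Under these assumptions the saving is a little over a third of a bit per parameter, which comes to roughly 3 GB on a 65B model.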

Paged optimizers. Memory usage spikes — a long sequence in a batch can briefly blow past what the card has. Paged optimizers use the GPU's unified-memory plumbing to spill optimizer state to CPU RAM when a spike hits and page it back when the pressure drops, the same way an operating system pages to disk. It turns a hard OOM crash into a quiet slowdown.
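In the Hugging Face stack this is usually a one-line choice of optimizer. The fragment below is a config sketch, assuming a recent transformers release with bitsandbytes installed; the output directory and batch settings are placeholder values, not recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="qlora-out",           # placeholder path
    optim="paged_adamw_32bit",        # paged AdamW provided via bitsandbytes
    per_device_train_batch_size=1,    # placeholder; tune for your card
    gradient_accumulation_steps=16,   # placeholder
    bf16=True,                        # assumes a GPU with bf16 support
)
```

The paged variant only matters when memory spikes; on a run that never comes close to the limit it should behave like the ordinary optimizer.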

Figure: QLoRA memory layout on a single GPU (4-bit base, adapters, paged optimizer). The 4-bit base is dequantized on the fly; only the full-precision adapters get gradients.

The part that confuses people

If the weights are stored in 4 bits, how does the math work? Matrix multiplication doesn't run in NF4.

It doesn't have to. The weights are stored in 4 bits, but each block is dequantized back to a 16-bit float right before it's used in a matrix multiply, then thrown away. Compute happens in bf16; only storage is 4-bit. The gradients flow into the LoRA adapters — which were full precision the whole time — and the frozen 4-bit base never gets a gradient at all. So you pay a little extra compute to dequantize on the fly, and in exchange the resident memory of the base model drops by roughly 4×. On a memory-bound single-card job, that trade is almost always worth it.
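The store-small, compute-big pattern can be sketched in plain Python. This is a toy under loud assumptions: the codebook is evenly spaced instead of NF4, the block size is 4 instead of the paper's 64, the "layer" is a single output unit, and the adapter is rank 1. None of the names come from bitsandbytes.

```python
# Toy 16-level codebook on [-1, 1]. Evenly spaced here for brevity;
# NF4 would use normal-quantile levels instead.
CODEBOOK = [-1 + 2 * i / 15 for i in range(16)]
BLOCK = 4  # toy block size


def nearest_code(value):
    """Index (0..15) of the codebook level closest to `value`."""
    return min(range(16), key=lambda c: abs(CODEBOOK[c] - value))


def quantize(weights):
    """Store each block as one absmax scale plus a list of 4-bit codes."""
    blocks = []
    for start in range(0, len(weights), BLOCK):
        chunk = weights[start:start + BLOCK]
        scale = max(abs(w) for w in chunk) or 1.0   # avoid dividing by zero
        blocks.append((scale, [nearest_code(w / scale) for w in chunk]))
    return blocks


def dequantize(blocks):
    """Rebuild floats for immediate use; the caller does not keep them."""
    return [scale * CODEBOOK[code] for scale, codes in blocks for code in codes]


def forward(x, frozen_blocks, lora_a, lora_b):
    """One output unit: frozen dequantized weights plus a rank-1 adapter."""
    w = dequantize(frozen_blocks)                    # temporary higher-precision copy
    base = sum(wi * xi for wi, xi in zip(w, x))      # frozen path, no gradient needed
    adapter = lora_b * sum(ai * xi for ai, xi in zip(lora_a, x))  # trainable path
    return base + adapter


weights = [0.5, -0.25, 0.1, 1.0, -0.03, 0.02, 0.0, 0.04]
frozen = quantize(weights)
x = [1.0] * 8
# lora_b starts at zero, so the adapter contributes nothing before training.
y = forward(x, frozen, lora_a=[0.01] * 8, lora_b=0.0)
print(y)
```

Only `lora_a` and `lora_b` would ever receive gradients here; `frozen` stays as codes and scales for the whole run.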

This is also why QLoRA is slower per step than plain LoRA on a card that could fit both: dequantization isn't free. You're spending time to buy memory. When memory is the thing standing between you and training at all, that's a trade you take every time.

What it costs in quality

The paper's headline finding was that 4-bit QLoRA matched 16-bit full fine-tuning on its benchmarks — that the quantization, done with NF4 and the rest, was close to lossless for the purpose of training adapters. That held up well enough that QLoRA became the default for tuning large models on modest hardware, not a compromise you apologize for.

"Close to lossless" isn't "lossless," and the honest caveats are worth stating. The base model's raw quality does dip slightly at 4 bits; on tasks that lean hard on the base's most precise knowledge, a careful 16-bit LoRA can edge it out. And per-step speed is lower. If you have the VRAM for plain LoRA, use plain LoRA. QLoRA is for when you don't — and that "when you don't" covers most people fine-tuning anything above 13B on hardware they can actually rent or own.

Wiring it up

In practice QLoRA is a quantization config handed to the model loader, then ordinary LoRA on top. The whole thing is a few lines because bitsandbytes and PEFT do the heavy lifting:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat, not plain int4
    bnb_4bit_use_double_quant=True,       # quantize the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16 # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb,
    device_map="auto",
)
# then attach a LoRA config exactly as in the LoRA post

That BitsAndBytesConfig is the entire difference between LoRA and QLoRA at the code level. Everything else (the adapter setup, the trainer, the data) is identical. With it, an 8B base fine-tunes comfortably on a single 24 GB card, and a 70B-class model comes within reach of a single 48 GB one, the setup the QLoRA paper itself demonstrated with a 65B model, rather than needing a cluster over a weekend.
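For completeness, here is a sketch of the "attach a LoRA config" step using PEFT. It assumes `model` is the 4-bit model loaded above; the rank, alpha, dropout and target module names are illustrative values for a Llama-style architecture, not settings taken from this post. The call to `prepare_model_for_kbit_training` is a common optional helper that PEFT provides for quantized bases.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


def attach_lora(model):
    """Wrap a 4-bit base model with trainable LoRA adapters."""
    model = prepare_model_for_kbit_training(model)
    lora = LoraConfig(
        r=16,                     # illustrative rank
        lora_alpha=32,            # illustrative scaling
        lora_dropout=0.05,        # illustrative dropout
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed Llama-style names
        task_type="CAUSAL_LM",
    )
    return get_peft_model(model, lora)
```

Calling `attach_lora(model)` and then `print_trainable_parameters()` on the result is a quick way to confirm that only the adapters are trainable.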

The thing I find quietly remarkable about QLoRA isn't any one of its three ideas — it's that they're all storage tricks. None of them touch what's being learned. The adapter trains in full precision, on a frozen base that's been compressed underneath it without the training loop ever knowing. That separation is why it works, and it's the same theme the whole series keeps circling: the cheapest place to cut is the part that isn't changing.
