RAG · Jul 24, 2025 · 5 min · rag, llm

Reranking: The Cheap Accuracy Win

Most accuracy improvements in RAG cost you something painful — a new index, a bigger model, a re-architecture. Reranking is the rare one that's almost embarrassingly cheap: add one model call between retrieval and generation, eat 100–300 milliseconds, and watch a chunk of your retrieval errors disappear. If I could make a team adopt exactly one technique from this series, it'd be this one, and it'd take them an afternoon.

Here's why it works, and why it works only as a second stage.

The compression problem

Go back to what your embedding-based retriever actually does. At index time it squeezes each chunk into a single fixed vector, before it has any idea what you'll ask. At query time it squeezes your question into a vector too. Then it compares those two pre-computed points with cosine similarity.

That comparison never lets the query and the document interact. The chunk's vector was frozen weeks ago; it can't emphasize the part that's relevant to your specific question, because it didn't know the question existed. You're matching two summaries that were each written in ignorance of the other. This is the architecture's original sin, and it's the price of speed: because the document vectors are precomputed, you can search millions of them in milliseconds. The blurriness is what buys the speed.
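To make the "two frozen points" idea concrete, here is a minimal sketch of bi-encoder style retrieval. The `embed` function is a toy stand-in (a bag-of-words count vector), not a real embedding model, and the three chunks are invented for illustration. What matters is the shape: chunks are embedded once at index time, the query is embedded alone, and the only interaction between them is a cosine score.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Index time: every chunk is embedded once, before any query exists.
chunks = [
    "the 2024 policy raises the travel limit",
    "the cafeteria menu changes on monday",
    "travel requests need manager approval",
]
index = [embed(c) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[tuple[float, str]]:
    """Query time: embed the query alone, then compare precomputed points."""
    q = embed(query)
    scored = sorted(zip((cosine(q, v) for v in index), chunks), reverse=True)
    return scored[:k]

print(retrieve("travel limit policy", k=2))
```

Nothing in `retrieve` ever looks at the chunk text again; it only sees the vectors that were computed before the question arrived.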

A reranker refuses that bargain — for a few candidates.

Cross-encoders read both at once

A cross-encoder takes the query and one document together, as a single input, and runs them through the model jointly. Every word of the query can attend to every word of the document. The model isn't comparing two frozen points; it's reading the pair and scoring how well this document answers this query. It sees that the question's "it" refers to the document's "the 2024 policy," that the negation in the passage flips the meaning, that the number the user wants is right there in the table. The bi-encoder retriever can't see any of that. The cross-encoder is built to.

The catch, and the reason you can't just use it for everything: a cross-encoder has to run a full forward pass per document. You can't precompute anything, because the score only exists once query and document are paired. Run it over a corpus of millions and every search takes minutes at best, often far longer. Run it over ten candidates and it takes a fraction of a second.
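The arithmetic behind that is worth a quick back-of-the-envelope check. The sketch below assumes 10 ms per query-document pair with no batching; that figure is a hypothetical placeholder, and real numbers depend heavily on the model, the hardware, and how well you batch.

```python
def full_scan_seconds(num_docs: int, ms_per_pair: float = 10.0) -> float:
    """Time to score every (query, doc) pair, one forward pass each.

    ms_per_pair is an assumed figure for illustration, not a benchmark.
    """
    return num_docs * ms_per_pair / 1000.0

for n in (10, 50, 1_000_000):
    print(f"{n:>9,} docs -> {full_scan_seconds(n):,.1f} s")
```

Under that assumption, ten candidates cost about a tenth of a second, while a million documents cost roughly 10,000 seconds per query. The cost grows linearly with the number of pairs, which is why the pair count has to stay small.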

So the architecture writes itself. Cheap, fast, blurry retrieval narrows millions to a few dozen. Expensive, slow, precise reranking reorders those few dozen. Two stages, each doing the job it's actually good at.
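The funnel can be sketched in a few lines. Everything here is illustrative: `two_stage_search` is a hypothetical helper, and the two scorers are toy stand-ins (word overlap for the blurry stage, adjacent-word-pair overlap for the precise one). In a real pipeline the first would be a vector search and the second a cross-encoder or a hosted rerank call.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, str], float]

def word_overlap(query: str, doc: str) -> float:
    """Cheap-stage stand-in: counts shared words, blind to word order."""
    return float(len(set(query.split()) & set(doc.split())))

def bigram_overlap(query: str, doc: str) -> float:
    """Precise-stage stand-in: counts shared adjacent word pairs, so order matters."""
    def bigrams(text: str) -> set:
        words = text.split()
        return set(zip(words, words[1:]))
    return float(len(bigrams(query) & bigrams(doc)))

def two_stage_search(
    query: str,
    corpus: Sequence[str],
    cheap: Scorer,
    precise: Scorer,
    n_candidates: int = 50,
    n_keep: int = 5,
) -> list[str]:
    # Stage 1: the cheap scorer sees the whole corpus and keeps a shortlist.
    candidates = sorted(corpus, key=lambda d: cheap(query, d), reverse=True)[:n_candidates]
    # Stage 2: the precise scorer sees only the shortlist and reorders it.
    return sorted(candidates, key=lambda d: precise(query, d), reverse=True)[:n_keep]

corpus = ["man bites dog", "dog bites man in park", "cat sleeps", "a dog and a man"]
top = two_stage_search("dog bites man", corpus, word_overlap, bigram_overlap,
                       n_candidates=3, n_keep=1)
print(top)
```

In this toy corpus the order-blind scorer ties "man bites dog" with "dog bites man in park", and the order-aware second stage is what lifts the right one to the top. The precise scorer only ever runs on `n_candidates` pairs, never on the full corpus.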

Figure: A two-stage funnel in which bi-encoder retrieval narrows the corpus to candidates and a cross-encoder reorders them. Cheap retrieval narrows the field; expensive reranking lifts the best chunk to the top.

Why the order matters more than you'd think

This is where reranking quietly pays for the rest of your pipeline. Recall the "lost in the middle" finding from the naive-RAG post: models read the start and end of their context far more carefully than the middle. If your retriever put the perfect chunk at position 5 of 10, the model may glide right past it.

The reranker's job isn't only to include the right chunk — retrieval usually did that — it's to put it at position 1. Get the best evidence to the top of the prompt and you fix two problems at once: you can pass fewer chunks (less context, lower cost, less distraction) and the model attends to the good one because it's at the top where attention is strongest. Reranking isn't just better recall. It's better placement, and placement is half the battle with these models.

The menu

You've got real choices in late 2025, spanning hosted-and-easy to self-hosted-and-cheap.

Start with a hosted API to prove the lift exists on your data, then decide whether to bring it in-house. Don't agonize over which reranker first; the gap between "no reranker" and "any decent reranker" dwarfs the gap between two good ones.

The two knobs

Reranking has basically two settings, and both trade accuracy against latency and cost.

How many candidates you rerank. Pull the top 50 from retrieval and rerank to 5, and you've given the precise stage 50 shots at finding the answer instead of trusting the blurry stage's top 5. Push it to 100 and recall climbs, and so does latency. Past some point it stops helping, because the answer wasn't in the candidate set at all: the reranker can only reorder what retrieval handed it; it can't conjure a chunk that was never pulled. Garbage in, reranked garbage out.
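The ceiling is easy to demonstrate with a toy. In the sketch below the chunk ids, the retrieval order, and the `oracle` scorer (a stand-in for a perfect reranker that always recognizes the answer) are all made up for illustration. Even a perfect reranker fails when the candidate cut-off excludes the right chunk.

```python
def rerank(candidates: list[str], score) -> list[str]:
    """Reorder the candidates by score; it cannot add anything new."""
    return sorted(candidates, key=score, reverse=True)

# Hypothetical retrieval order: the right chunk came back in sixth place.
retrieval_order = ["c1", "c2", "c3", "c4", "c5", "c6"]
answer = "c6"

def oracle(chunk_id: str) -> float:
    """Stand-in for a perfect reranker: it always recognizes the answer."""
    return 1.0 if chunk_id == answer else 0.0

def top_after_rerank(n_candidates: int) -> str:
    return rerank(retrieval_order[:n_candidates], oracle)[0]

for n in (3, 6):
    print(n, top_after_rerank(n), top_after_rerank(n) == answer)
```

With three candidates the answer never reaches the reranker, so no amount of scoring quality can recover it. Widening the pool to six is what fixes it.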

How many you keep. Reranking lets you pass fewer chunks confidently, because the ones you keep are the genuinely best. Fewer chunks means cheaper generation and less room for the model to get distracted.

Just add it

The blunt version: if you've built hybrid retrieval and you're not reranking, you've done the hard 80% and skipped the easy 20% that makes it land. Add a reranker. Measure recall@5 before and after on your own questions — you'll usually see it jump. Then tune the two knobs against latency you can tolerate.
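A before-and-after measurement needs very little code. This sketch assumes you have, for each test question, the ranked chunk ids your pipeline returned and a hand-labeled set of relevant ids. The function name and the sample data are hypothetical and exist only to show the mechanics; they are not results.

```python
def recall_at_k(ranked: list[list[str]], relevant: list[set[str]], k: int = 5) -> float:
    """Mean over queries of: relevant chunks found in the top k / relevant chunks."""
    per_query = [
        len(set(ranking[:k]) & rel) / len(rel)
        for ranking, rel in zip(ranked, relevant)
    ]
    return sum(per_query) / len(per_query)

# Made-up example: two questions, one relevant chunk each.
relevant = [{"c3"}, {"c8"}]
before = [["c7", "c2", "c9", "c1", "c4", "c3"], ["c8", "c5", "c6", "c0", "c1", "c2"]]
after = [["c3", "c7", "c2", "c9", "c1", "c4"], ["c8", "c5", "c6", "c0", "c1", "c2"]]

print("recall@5 before:", recall_at_k(before, relevant, k=5))
print("recall@5 after: ", recall_at_k(after, relevant, k=5))
```

Run the same function over the retriever's raw order and over the reranked order for the same questions, and the difference between the two numbers is the lift you are paying the extra latency for.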

It is the highest ratio of accuracy-gained to effort-spent in the entire RAG stack. The reason it feels too easy is that the hard work happened upstream, in the retriever that got the right chunk into the candidate pile. Reranking is just the stage that finally puts it on top — which, as it turns out, is most of what the model needed.
