RAG · Aug 13, 2025 · 6 min · rag llm

Adaptive RAG: Matching Pipeline to Query

Eight posts in, we've built up an arsenal: hybrid search, reranking, HyDE, multi-query, the agentic loop, knowledge graphs. And here's the trap that arsenal sets — the temptation to bolt all of it onto one pipeline and run that maximal pipeline on every single query. Decompose the question, search three ways, fuse, rerank, grade, maybe traverse a graph, generate. It would be thorough. It would also be the wrong design, because you'd be spending a graph traversal's worth of compute to answer "what's your refund window?"

Adaptive RAG is the correction. The idea is almost insultingly simple once you've felt the pain it solves: don't pick a pipeline. Pick a pipeline per query. Look at the question first, judge how hard it is, and spend accordingly — nothing on the trivial ones, everything on the genuinely hard ones, and the right amount in between.

Why is one pipeline the wrong default?

Because query difficulty isn't uniform, and a fixed pipeline has to be tuned for the hardest query it might see. That leaves it permanently over-built for the easy majority and, often, still under-built for the rare monster.

Three queries, same system:

A 2022 paper put empirical weight under the intuition: large models already know high-popularity facts from pretraining, and retrieval mostly helps on the long-tail, less-popular ones. It can even hurt on the easy ones by injecting distracting context. Retrieval isn't free accuracy. On the easy slice it can be a net negative. So the right amount of retrieval is genuinely zero for some queries, and the only way to capture that is to decide per query.

How does the system know which query is which?

You put a cheap classifier out front whose only job is to triage the incoming query into a complexity tier, then route. This is the contribution of the 2024 Adaptive-RAG paper: train a small model to predict whether a question needs no retrieval, single-step retrieval, or multi-step retrieval, and send it down the matching path.
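The triage-then-route shape can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the keyword rules in `classify` are a hypothetical stand-in for a trained classifier, and the three handler functions are stubs for whatever pipelines you already have.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class Tier(Enum):
    NO_RETRIEVAL = "no_retrieval"
    SINGLE_STEP = "single_step"
    MULTI_STEP = "multi_step"


# Hypothetical heuristics standing in for a trained complexity classifier.
MULTI_HOP_MARKERS = ("compare", "versus", "difference between")
COMMON_KNOWLEDGE_PREFIXES = ("who wrote", "what is the capital of")


def classify(query: str) -> Tier:
    """Toy triage: a real router would be a small trained model."""
    q = query.lower().strip()
    if any(marker in q for marker in MULTI_HOP_MARKERS):
        return Tier.MULTI_STEP
    if q.startswith(COMMON_KNOWLEDGE_PREFIXES):
        return Tier.NO_RETRIEVAL
    # Default to the middle tier rather than skipping retrieval.
    return Tier.SINGLE_STEP


# Stub pipelines; each would wrap a real generation or retrieval path.
def answer_directly(query: str) -> str:
    return f"[LLM only] {query}"


def retrieve_once(query: str) -> str:
    return f"[retrieve -> generate] {query}"


def retrieve_iteratively(query: str) -> str:
    return f"[decompose -> retrieve xN -> generate] {query}"


HANDLERS: Dict[Tier, Callable[[str], str]] = {
    Tier.NO_RETRIEVAL: answer_directly,
    Tier.SINGLE_STEP: retrieve_once,
    Tier.MULTI_STEP: retrieve_iteratively,
}


def route(query: str, classifier: Callable[[str], Tier] = classify) -> Tuple[Tier, str]:
    """Classify first, then run only the matching pipeline."""
    tier = classifier(query)
    return tier, HANDLERS[tier](query)
```

Because `route` takes the classifier as a parameter, swapping the toy rules for a trained model does not touch the dispatch logic.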

[Figure: a complexity classifier routes each query to a no-retrieval, single-step, or multi-step path.]

The router itself can be built a few ways, cheapest first:

The router doesn't have to be smart. It has to be cheaper than the savings it produces. A classifier that costs a tenth of a retrieval and routes half your traffic to the no-retrieval or single-step path pays for itself immediately. That economic test — does the router cost less than what it saves — is the one that matters, not classifier accuracy in the abstract.
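That economic test is easy to write down. The sketch below compares the expected per-query cost of a routed system against a fixed pipeline that always runs the most expensive path. Every number in it is an invented illustration (arbitrary cost units, a made-up traffic mix), not a measurement.

```python
from typing import Dict

# Illustrative, made-up costs per query in arbitrary units.
PATH_COST: Dict[str, float] = {
    "no_retrieval": 1.0,   # generation only
    "single_step": 3.0,    # one retrieval plus generation
    "multi_step": 10.0,    # the maximal pipeline
}
ROUTER_COST = 0.2  # a tenth of one retrieval (single_step minus no_retrieval)

# Illustrative traffic mix: half the queries land on the two cheap paths.
MIX: Dict[str, float] = {
    "no_retrieval": 0.3,
    "single_step": 0.2,
    "multi_step": 0.5,
}


def routed_cost(mix: Dict[str, float], path_cost: Dict[str, float],
                router_cost: float) -> float:
    """Expected cost per query: every query pays the router, then its own path."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return router_cost + sum(share * path_cost[tier] for tier, share in mix.items())


def router_pays_off(mix: Dict[str, float], path_cost: Dict[str, float],
                    router_cost: float, fixed_tier: str = "multi_step") -> bool:
    """True when routing is cheaper than running the fixed pipeline on everything."""
    return routed_cost(mix, path_cost, router_cost) < path_cost[fixed_tier]
```

With these numbers the routed system costs 0.2 + 0.3 + 0.6 + 5.0 = 6.1 units per query against 10 for the fixed pipeline. If every query really were multi-step, the router would only add its own 0.2 on top, which is the case where it does not pay.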

What does routing actually save?

Two things, and they pull in the same direction for once.

Cost and latency, on the easy majority. Most real query distributions are lopsided — a fat head of simple, common questions and a thin tail of hard ones. Route the head to cheap paths and your average cost and latency drop sharply, because you stopped paying monster-query prices for kitten queries.

Accuracy, on both ends. This is the part people miss: routing doesn't just save money, it can raise quality. Easy queries get more accurate when you skip retrieval, because you stop injecting distracting chunks into a question the model could answer cleanly on its own. Hard queries get more accurate because they finally get the heavy machinery they needed. The fixed pipeline was compromising both; the router lets each tier get what it actually wants.

Where routing bites back

It's not free, and the failure mode is specific: misrouting. Send a hard query down the easy path and you get a confident wrong answer with no second chance — the cheap path doesn't know it should have escalated. Send an easy query down the hard path and you've burned money but at least the answer's probably fine. So the asymmetry says: when the router is unsure, route up, not down. Over-spending is a budget problem; under-retrieving is a correctness problem, and correctness is the one users feel.
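One way to encode "route up, not down" is to stop taking the classifier's top guess and instead pick the cheapest tier whose risk of under-routing is acceptable. The sketch below assumes the router exposes a probability per tier. The tier names and the 0.2 risk threshold are hypothetical choices for illustration.

```python
from typing import Dict, List

# Ordered from cheapest to most expensive.
TIERS: List[str] = ["no_retrieval", "single_step", "multi_step"]


def route_up_when_unsure(probs: Dict[str, float],
                         max_underroute_risk: float = 0.2) -> str:
    """Return the cheapest tier such that the probability the query
    actually needs a harder tier stays at or below max_underroute_risk."""
    for i, tier in enumerate(TIERS):
        # Probability mass the classifier puts on tiers harder than this one.
        harder_mass = sum(probs.get(t, 0.0) for t in TIERS[i + 1:])
        if harder_mass <= max_underroute_risk:
            return tier
    return TIERS[-1]  # unreachable in practice: the top tier has no harder mass
```

A plain argmax would send a query scored 0.55 / 0.35 / 0.10 down the no-retrieval path. This rule sees a 45% chance that the query needs more and escalates it to single-step. Lowering the threshold trades budget for correctness, which is the direction the asymmetry argues for.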

The other cost is operational, and it's the governance angle that decides whether this survives contact with production. You now run several pipelines instead of one. Each path needs its own evaluation — it's not enough to know the system is 80% accurate overall; you need to know the router's accuracy and each path's accuracy, because a great multi-hop path is worthless if the router never sends multi-hop queries to it. You need logging that records which route every query took, so when something's wrong you can ask "was this a bad answer, or a bad routing decision?" — a distinction that doesn't exist in a single-pipeline system. And you need to watch for query drift: the distribution the router was tuned on in December isn't the one it'll see in March, and a stale router silently mis-triages a growing share of traffic.
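The bookkeeping this implies is small. Here is a sketch, assuming each answered query leaves a log record with the route it took, whether the answer was judged correct, and (on a labeled evaluation set only) the tier it should have taken. The record fields and the total-variation drift measure are illustrative choices, not a standard.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RouteLog:
    query: str
    routed_tier: str
    answer_correct: bool
    gold_tier: Optional[str] = None  # known only for labeled eval queries


def router_accuracy(logs: List[RouteLog]) -> Optional[float]:
    """Share of labeled queries sent to the tier they should have taken."""
    labeled = [r for r in logs if r.gold_tier is not None]
    if not labeled:
        return None
    return sum(r.routed_tier == r.gold_tier for r in labeled) / len(labeled)


def per_path_accuracy(logs: List[RouteLog]) -> Dict[str, float]:
    """Answer accuracy broken out by the route each query actually took."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for r in logs:
        totals[r.routed_tier] += 1
        hits[r.routed_tier] += int(r.answer_correct)
    return {tier: hits[tier] / totals[tier] for tier in totals}


def route_mix(logs: List[RouteLog]) -> Dict[str, float]:
    """Fraction of traffic that went down each route."""
    counts = Counter(r.routed_tier for r in logs)
    n = sum(counts.values())
    return {tier: c / n for tier, c in counts.items()}


def mix_shift(baseline: Dict[str, float], current: Dict[str, float]) -> float:
    """Total variation distance between two route mixes (0 = same, 1 = disjoint)."""
    tiers = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(t, 0.0) - current.get(t, 0.0)) for t in tiers)


EXAMPLE_LOGS = [
    RouteLog("q1", "no_retrieval", True, "no_retrieval"),
    RouteLog("q2", "no_retrieval", False, "multi_step"),  # misrouted down
    RouteLog("q3", "single_step", True, "single_step"),
    RouteLog("q4", "multi_step", True, "multi_step"),
]
```

In the example logs, `q2` is a wrong answer that traces back to a routing decision rather than a weak path, which is exactly the question the per-route log lets you ask. A rising `mix_shift` between the December and March route mixes is a cheap first signal that the router is seeing traffic it was not tuned on. It does not prove mis-triage on its own, so it is a prompt to re-label a sample and recheck `router_accuracy`.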

That's the real shape of adaptive RAG. Not a cleverer retriever — a dispatcher sitting in front of all the retrievers you already built, deciding how much of the arsenal each question deserves. It's the design that makes everything in the previous eight posts affordable to keep around, because you stop running all of it all the time. And it quietly changes the central question of this series from "what's the best RAG pipeline?" to "what's the best pipeline for this query?" — which, once you've felt the cost of the alternative, is obviously the better question. The remaining trick is proving the router and its paths are actually working, and that means measurement. Which is next.
