Eight posts in, we've built up an arsenal: hybrid search, reranking, HyDE, multi-query, the agentic loop, knowledge graphs. And here's the trap that arsenal sets — the temptation to bolt all of it onto one pipeline and run that maximal pipeline on every single query. Decompose the question, search three ways, fuse, rerank, grade, maybe traverse a graph, generate. It would be thorough. It would also be the wrong design, because you'd be spending a graph traversal's worth of compute to answer "what's your refund window?"
Adaptive RAG is the correction. The idea is almost insultingly simple once you've felt the pain it solves: don't pick a pipeline. Pick a pipeline per query. Look at the question first, judge how hard it is, and spend accordingly — nothing on the trivial ones, everything on the genuinely hard ones, and the right amount in between.
Why is one pipeline the wrong default?
Because query difficulty isn't uniform, and a fixed pipeline has to be tuned for the hardest query it might see. That leaves it permanently over-built for the easy majority and, often, still under-built for the rare monster.
Three queries, same system:
- "What is your company's name?" — the model may already know this, or one chunk has it. Running multi-query decomposition and a reranker here is pure waste: latency and tokens spent to arrive at an answer a single lookup nailed.
- "What's the late-payment fee?" — one clean retrieval, one generation. The standard pipeline. Anything fancier is overkill.
- "How did our refund policy change between the 2023 and 2025 handbooks, and which customer segments did it affect?" — multi-hop, multi-document, comparative. This one needs decomposition, multiple retrievals, maybe the agentic loop. Give it a single top-k search and it fails.
A 2022 paper put empirical weight under the intuition: large models already know high-popularity facts from pretraining, and retrieval mostly helps on the long-tail, less-popular ones. It can even hurt on the easy ones by injecting distracting context. Retrieval isn't free accuracy. On the easy slice it can be a net negative. So the right amount of retrieval is genuinely zero for some queries, and the only way to capture that is to decide per query.
How does the system know which query is which?
You put a cheap classifier out front whose only job is to triage the incoming query into a complexity tier, then route. This is the contribution of the 2024 Adaptive-RAG paper: train a small model to predict whether a question needs no retrieval, single-step retrieval, or multi-step retrieval, and send it down the matching path.
The router itself can be built a few ways, each with a different cost profile:
- A small classifier model trained on labeled examples of each complexity tier — fast, cheap, deterministic, but you need training data and it needs retraining as query patterns drift.
- An LLM prompted to classify — "rate this query's complexity and return a route" — flexible and zero training data, but it adds a model call to every request and can be inconsistent.
- Rules and heuristics — query length, question words ("compare," "summarize," "relationship"), embedding-similarity to known query clusters. Crude, transparent, and a perfectly reasonable starting point before you reach for anything learned.
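To make the rules-and-heuristics option concrete, here is a minimal sketch of a router that uses only query length and cue words. Everything in it is an assumption for illustration: the `Route` names, the cue list, the word-count cutoffs, and the small-talk pattern are hypothetical and would need tuning against your own query logs.

```python
import re
from enum import Enum


class Route(Enum):
    NO_RETRIEVAL = "no_retrieval"
    SINGLE_STEP = "single_step"
    MULTI_STEP = "multi_step"


# Hypothetical cue words that suggest comparison or multi-hop reasoning.
MULTI_STEP_CUES = {"compare", "summarize", "relationship", "between", "change", "changed"}
LONG_QUERY_WORDS = 20   # assumed cutoff for a "long" query
SHORT_QUERY_WORDS = 6   # assumed cutoff for a "short" query
# Assumed pattern for small talk and questions about the assistant itself.
# Domain facts (names, fees, dates) still go through retrieval.
NO_RETRIEVAL_PATTERN = re.compile(
    r"^(hi|hello|thanks|thank you|who are you|what can you do)\b"
)


def route_query(query: str) -> Route:
    """Triage a query into a complexity tier using crude, transparent rules."""
    text = query.lower().strip()
    words = re.findall(r"[a-z0-9']+", text)
    # Cue words, long queries, or several questions at once all go to the heavy path.
    if set(words) & MULTI_STEP_CUES or len(words) > LONG_QUERY_WORDS or text.count("?") > 1:
        return Route.MULTI_STEP
    # Only short queries that match the small-talk pattern skip retrieval.
    if len(words) <= SHORT_QUERY_WORDS and NO_RETRIEVAL_PATTERN.match(text):
        return Route.NO_RETRIEVAL
    # Everything else defaults to the standard single-retrieval pipeline.
    return Route.SINGLE_STEP
```

Note the default: anything the rules don't recognize falls through to single-step retrieval rather than to the no-retrieval path, so an unfamiliar query still gets grounded.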
The router doesn't have to be smart. It has to be cheaper than the savings it produces. A classifier that costs a tenth of a retrieval and routes half your traffic to the no-retrieval or single-step path pays for itself immediately. That economic test — does the router cost less than what it saves — is the one that matters, not classifier accuracy in the abstract.
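That economic test is easy to write down. The sketch below compares the average per-query cost of a routed system against running one fixed pipeline on everything. The function name and all the numbers are made up for illustration; costs are in arbitrary units, and in practice you would plug in measured token or latency costs and your real traffic mix.

```python
def router_net_saving(router_cost, path_costs, traffic_share, fixed_pipeline_cost):
    """Average per-query saving from routing versus one fixed pipeline for all traffic.

    A positive result means the router pays for itself.
    """
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    # Every query pays for the router, then for whichever path it was sent down.
    routed_cost = router_cost + sum(
        traffic_share[route] * path_costs[route] for route in traffic_share
    )
    return fixed_pipeline_cost - routed_cost


# Illustrative numbers only (arbitrary cost units, not measurements).
saving = router_net_saving(
    router_cost=0.1,
    path_costs={"no_retrieval": 1.0, "single_step": 2.0, "multi_step": 10.0},
    traffic_share={"no_retrieval": 0.3, "single_step": 0.5, "multi_step": 0.2},
    fixed_pipeline_cost=10.0,  # the maximal pipeline run on every query
)
print(round(saving, 2))  # about 6.6 units saved per query under these assumptions
```

If the result goes negative, either the router is too expensive or too little traffic is landing on the cheap paths to justify it.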
What does routing actually save?
Two things, and they pull in the same direction for once.
Cost and latency, on the easy majority. Most real query distributions are lopsided — a fat head of simple, common questions and a thin tail of hard ones. Route the head to cheap paths and your average cost and latency drop sharply, because you stopped paying monster-query prices for kitten queries.
Accuracy, on both ends. This is the part people miss: routing doesn't just save money, it can raise quality. Easy queries can get more accurate when you skip retrieval, because you stop injecting distracting chunks into a question the model could answer cleanly on its own. Hard queries get more accurate because they finally get the heavy machinery they needed. The fixed pipeline was compromising both; the router lets each tier get what it actually wants.
Where routing bites back
It's not free, and the failure mode is specific: misrouting. Send a hard query down the easy path and you get a confident wrong answer with no second chance — the cheap path doesn't know it should have escalated. Send an easy query down the hard path and you've burned money but at least the answer's probably fine. So the asymmetry says: when the router is unsure, route up, not down. Over-spending is a budget problem; under-retrieving is a correctness problem, and correctness is the one users feel.
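One way to encode "route up, not down" is a confidence threshold on the router's output. The sketch below assumes the router returns a probability per tier; the tier names, the `choose_route` helper, and the 0.7 threshold are all hypothetical, and escalating by exactly one tier is just one reasonable policy among several.

```python
# Tiers ordered from cheapest to most expensive (names are assumptions).
TIERS = ["no_retrieval", "single_step", "multi_step"]


def choose_route(tier_probs, min_confidence=0.7):
    """Pick the most likely tier, but escalate one tier when the router is unsure."""
    best = max(TIERS, key=lambda tier: tier_probs.get(tier, 0.0))
    if tier_probs.get(best, 0.0) >= min_confidence:
        return best
    # Unsure: over-spending is a budget problem, under-retrieving is a
    # correctness problem, so move one tier up (capped at the top tier).
    index = TIERS.index(best)
    return TIERS[min(index + 1, len(TIERS) - 1)]


# Confident prediction: take the cheap path.
print(choose_route({"no_retrieval": 0.9, "single_step": 0.08, "multi_step": 0.02}))
# Uncertain prediction: escalate from no_retrieval to single_step.
print(choose_route({"no_retrieval": 0.5, "single_step": 0.4, "multi_step": 0.1}))
```

Raising `min_confidence` trades budget for safety: more queries escalate, fewer hard ones slip onto a path that cannot recover.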
The other cost is operational, and it's the governance angle that decides whether this survives contact with production. You now run several pipelines instead of one. Each path needs its own evaluation — it's not enough to know the system is 80% accurate overall; you need to know the router's accuracy and each path's accuracy, because a great multi-hop path is worthless if the router never sends multi-hop queries to it. You need logging that records which route every query took, so when something's wrong you can ask "was this a bad answer, or a bad routing decision?" — a distinction that doesn't exist in a single-pipeline system. And you need to watch for query drift: the distribution the router was tuned on in December isn't the one it'll see in March, and a stale router silently mis-triages a growing share of traffic.
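A minimal sketch of that bookkeeping, assuming each logged query carries the route the router chose, a reference route from a human or offline grader, and a pass/fail on the final answer. The `RouteLog` fields and the `breakdown` helper are hypothetical names, not part of any library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RouteLog:
    query: str
    predicted_route: str  # what the router chose
    correct_route: str    # reference label from a human or offline grader (assumed available)
    answer_correct: bool  # did the final answer pass evaluation?


def breakdown(logs):
    """Separate routing quality from path quality.

    Path accuracy is computed only on correctly routed queries, so a bad
    routing decision is not blamed on the path that received it.
    """
    if not logs:
        raise ValueError("no logs to evaluate")
    well_routed = [log for log in logs if log.predicted_route == log.correct_route]
    per_path = defaultdict(lambda: [0, 0])  # route -> [correct answers, total]
    for log in well_routed:
        per_path[log.correct_route][0] += int(log.answer_correct)
        per_path[log.correct_route][1] += 1
    return {
        "overall_accuracy": sum(log.answer_correct for log in logs) / len(logs),
        "router_accuracy": len(well_routed) / len(logs),
        "path_accuracy": {route: ok / total for route, (ok, total) in per_path.items()},
    }
```

Running the same breakdown on a recent window of traffic and on the window the router was tuned on is one cheap way to spot drift: if router accuracy falls while path accuracy holds, the router has gone stale, not the pipelines.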
That's the real shape of adaptive RAG. Not a cleverer retriever — a dispatcher sitting in front of all the retrievers you already built, deciding how much of the arsenal each question deserves. It's the design that makes everything in the previous eight posts affordable to keep around, because you stop running all of it all the time. And it quietly changes the central question of this series from "what's the best RAG pipeline?" to "what's the best pipeline for this query?" — which, once you've felt the cost of the alternative, is obviously the better question. The remaining trick is proving the router and its paths are actually working, and that means measurement. Which is next.