"why is it slow"
That's a real query a real user typed into a real RAG system. Three words, no subject, no context, and your retriever is supposed to embed it and find the right passage in a 50,000-document corpus. It can't, and it shouldn't be expected to. The embedding of "why is it slow" sits in a useless, generic neighborhood of vector space, equidistant from a thousand documents about slowness and close to none of them in particular.
Every technique so far in this series tried to retrieve better against a fixed query. Query transformation gives up that assumption. The premise: the user's question is often the worst possible search query, and the cheapest fix is to rewrite it — with an LLM — into something the retriever can actually work with, before you search.
The vocabulary mismatch nobody warns you about
Dense retrieval matches a question to a passage. But questions and answers don't look alike. A question is short, interrogative, and uses the asker's words. The passage that answers it is declarative, detailed, and uses the author's words. "How do I stop my plan auto-renewing?" and "Subscriptions renew on the billing anniversary unless cancelled 24 hours prior" are a perfect match in meaning and a weak match in vector space, because they share almost no surface form and sit in different regions the embedding model learned for "asking" versus "explaining."
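To make the surface-form half of that claim concrete, here is a minimal sketch that measures plain word overlap between the two example sentences. The `tokens` and `jaccard` helpers are illustrative names, not from any library. Word overlap is only a stand-in for surface similarity; it says nothing about how a particular embedding model scores the pair.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by total distinct words (0.0 means no overlap)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


question = "How do I stop my plan auto-renewing?"
passage = ("Subscriptions renew on the billing anniversary "
           "unless cancelled 24 hours prior")

overlap = jaccard(question, passage)
print(f"word overlap: {overlap:.2f}")  # prints "word overlap: 0.00"
```

The two sentences share no words at all ("renewing" and "renew" are different tokens), even though one answers the other.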
Query transformation closes that gap from the query side. Three flavors, each for a different kind of broken query.
HyDE: search with a fake answer
HyDE — Hypothetical Document Embeddings, from a 2022 paper — is the one that sounds like it shouldn't work. Instead of embedding the question, you ask an LLM to write a fake answer to it — a plausible paragraph, possibly with invented specifics — and embed that. Then you search with the fake answer's vector.
The reason it works is the mismatch above. A hypothetical answer looks like a real answer: same register, same vocabulary, same shape. Its embedding lands in the part of the space where real answers live, not where questions live. You're searching answer-to-answer instead of question-to-answer, which is the comparison embeddings are better at.
It does not matter that the fake answer is factually wrong — it might claim the threshold is 48 hours when it's really 24. You never show it to anyone; you only use its vector to find real documents, and the real documents supply the real facts. HyDE shines on short, vague, or jargon-light queries where the question alone gives the retriever nothing to grab. It costs one extra LLM call before retrieval, and it can backfire on genuinely novel questions where the model's hypothetical drifts somewhere no real document lives.
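The flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `generate` and `embed` are hypothetical callables standing in for whatever LLM and embedding model you use, the prompt wording is made up, and the index is a plain in-memory list of `(text, vector)` pairs scanned by brute force where a real system would query a vector store.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],    # assumption: prompt in, text out
    embed: Callable[[str], Vector],    # assumption: text in, vector out
    index: Sequence[tuple[str, Vector]],
    k: int = 3,
) -> list[str]:
    """Retrieve with the embedding of a hypothetical answer, not the question."""
    prompt = (
        "Write a short, plausible passage that answers the question.\n"
        f"Question: {question}\nPassage:"
    )
    # The hypothetical may be factually wrong; it is never shown to the user.
    hypothetical = generate(prompt)
    # Embed the fake answer so the query vector lands among real answers.
    query_vec = embed(hypothetical)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

Only the real documents in `index` are returned, so any invented specifics in the hypothetical never reach the answer.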
Multi-query: ask it five ways
A single query is a single shot into vector space. Phrase it slightly differently and you'd land in a slightly different neighborhood and pull slightly different chunks. Multi-query turns that fragility into coverage: ask an LLM to generate three to five paraphrases of the question, retrieve for each, and pool the results.
One variant phrases it formally, another casually, another with the likely technical term, and between them they cover several neighborhoods of the space instead of betting everything on one. You dedupe the union (the same strong chunk often surfaces for several variants, which is itself a confidence signal) and fuse the ranked lists with reciprocal rank fusion (RRF), the same rank fusion used in the hybrid-search post. The combination of multiple query variants, each maybe run as hybrid search, all fused, is what people sometimes label "RAG-Fusion." It's just multi-query plus the fusion you already know.
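Here is a minimal sketch of that paraphrase, retrieve, and fuse loop. `paraphrase` and `retrieve` are hypothetical callables for your LLM and your retriever, `retrieve` is assumed to return document IDs best-first with no repeats, and the constant `k=60` is a commonly used default rather than anything this series prescribes.

```python
from collections import defaultdict
from typing import Callable, Iterable, Sequence


def rrf_fuse(rankings: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))


def multi_query_search(
    question: str,
    paraphrase: Callable[[str, int], list[str]],  # assumption: LLM wrapper
    retrieve: Callable[[str], Sequence[str]],     # assumption: ranked doc IDs
    n_variants: int = 4,
    top_k: int = 5,
) -> list[str]:
    """Retrieve once per paraphrase and fuse the ranked lists with RRF."""
    variants = list(dict.fromkeys(paraphrase(question, n_variants)))
    rankings = [retrieve(variant) for variant in variants]
    return rrf_fuse(rankings)[:top_k]
```

Fusing by document ID does the deduplication for free: a chunk that surfaces for several variants appears once in the output, with a higher score for each list it appeared in.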
Decomposition and step-back: for questions with parts
Some queries fail not because they're vague but because they're compound. "Did revenue grow faster than headcount between the 2022 and 2024 reports?" is three questions wearing a trench coat: what revenue did, what headcount did, and how the two growth rates compare. Only the first two are retrievals; the comparison is a calculation over what they return. Embed the whole thing and you get a vector that's a muddy blend pointing nowhere precise.
Sub-question decomposition has the LLM break the question into atomic parts, retrieve for each, and assemble the answer from all the evidence. Multi-hop questions, where you need fact A to even know what fact B to look for, need this; a single retrieval cannot gather chained evidence, because the second query does not exist until the first one has been answered.
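A minimal sketch of the independent-parts case, assuming hypothetical `decompose` (an LLM wrapper that returns sub-questions) and `retrieve` (returns chunks best-first) callables; the prompt wording is illustrative. A chained, multi-hop version would instead run the steps one after another, feeding each intermediate answer into the next sub-question.

```python
from typing import Callable, Sequence


def gather_evidence(
    question: str,
    decompose: Callable[[str], list[str]],     # assumption: LLM wrapper
    retrieve: Callable[[str], Sequence[str]],  # assumption: ranked chunks
    per_question_k: int = 3,
) -> dict[str, list[str]]:
    """Retrieve separately for each sub-question, keeping evidence grouped."""
    evidence: dict[str, list[str]] = {}
    for sub_question in decompose(question):
        evidence[sub_question] = list(retrieve(sub_question)[:per_question_k])
    return evidence


def build_synthesis_prompt(question: str,
                           evidence: dict[str, list[str]]) -> str:
    """Lay out the grouped evidence for the final synthesis call."""
    sections = []
    for sub_question, chunks in evidence.items():
        bullets = "\n".join(f"- {chunk}" for chunk in chunks)
        sections.append(f"Sub-question: {sub_question}\n{bullets}")
    joined = "\n\n".join(sections)
    return (
        "Answer the question using only the evidence below.\n\n"
        f"{joined}\n\nQuestion: {question}\nAnswer:"
    )
```

Keeping the evidence keyed by sub-question lets the synthesis call see which chunk supports which part of the answer.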
Step-back prompting (the 2023 "Take a Step Back" paper) goes the other direction: instead of splitting down, it abstracts up. Before answering a narrow question like "what's the rate limit on the v3 batch endpoint for tier-2 accounts?", it first asks the broader "how does rate limiting work on this API?", retrieves the general context, and uses it to ground the specific answer. Good when the specific query is too narrow to match anything, but the principles live in a broader passage.
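As a rough sketch of the retrieval side of step-back: `abstract` is a hypothetical LLM wrapper that turns a narrow question into its broader parent, and `retrieve` is a hypothetical retriever returning chunks best-first. Retrieving for both the broad and the narrow question and merging them is a design choice made here for illustration, not something the text above requires.

```python
from typing import Callable, Sequence


def step_back_retrieve(
    question: str,
    abstract: Callable[[str], str],            # assumption: LLM wrapper
    retrieve: Callable[[str], Sequence[str]],  # assumption: ranked chunks
    k: int = 3,
) -> tuple[str, list[str]]:
    """Return the broader question and the merged context for answering."""
    broader = abstract(question)
    general_hits = list(retrieve(broader)[:k])    # principles and background
    specific_hits = list(retrieve(question)[:k])  # may well be empty
    # General context first, then specifics; drop repeats, keep order.
    merged = list(dict.fromkeys(general_hits + specific_hits))
    return broader, merged
```

If the narrow query matches nothing, `merged` still carries the general passages, which is the case step-back is meant to rescue.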
The bill comes due
Now the caution, because every one of these spends LLM calls to save retrieval misses, and that trade isn't always worth it.
HyDE adds one generation before every search. Multi-query multiplies your retrieval load by the number of variants and adds a generation to produce them. Decomposition can fan a single user question into half a dozen sub-retrievals plus a synthesis call. You've turned one cheap vector lookup into a small pipeline — more latency, more tokens, more cost, more things that can fail. On an easy query where plain retrieval already nails it, all of that machinery buys you nothing and just makes the answer slower and pricier.
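The bookkeeping in that paragraph can be written down as a small tally. This sketch only counts calls; it deliberately attaches no latency or price figures, since those depend on your models and infrastructure. The `Cost` type, the strategy names, and the choice to count decomposition as two extra generations (one to split, one to synthesize) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cost:
    extra_llm_calls: int  # generations on top of the final answer call
    retrievals: int       # vector searches issued


def pipeline_cost(strategy: str, n_variants: int = 4,
                  n_sub_questions: int = 3) -> Cost:
    """Count the extra work each strategy adds to one user question."""
    if strategy == "plain":
        return Cost(extra_llm_calls=0, retrievals=1)
    if strategy == "hyde":
        # One generation for the hypothetical answer, then one search.
        return Cost(extra_llm_calls=1, retrievals=1)
    if strategy == "multi_query":
        # One generation for the paraphrases, one search per variant.
        return Cost(extra_llm_calls=1, retrievals=n_variants)
    if strategy == "decomposition":
        # Assumed: one call to split, one to synthesize, one search per part.
        return Cost(extra_llm_calls=2, retrievals=n_sub_questions)
    raise ValueError(f"unknown strategy: {strategy!r}")


for name in ("plain", "hyde", "multi_query", "decomposition"):
    print(name, pipeline_cost(name))
```

Multiplying these counts by your own measured per-call latency and token prices gives the number to weigh against the retrieval misses you are trying to fix.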
Which is the real lesson, and it sets up the next two posts. Query transformation shouldn't run on every query — it should run when the query needs it. A clear, specific, well-phrased question wants none of this. A vague three-word fragment wants HyDE. A compound question wants decomposition. Deciding which treatment a query gets, on the fly, instead of applying one pipeline to everything — that's the jump from a static RAG pipeline to a system that reasons about its own retrieval. Which is exactly where agentic RAG comes in.
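Before anything agentic, the crudest version of that decision is a hand-written rule. The sketch below is deliberately naive and entirely hypothetical: the word-count threshold and the list of comparison words are made-up heuristics, and it never picks multi-query. It only shows the shape of routing a query to a treatment instead of running everything on every request.

```python
import re

# Assumption: words that hint at a compound, comparison-style question.
COMPOUND_HINTS = re.compile(r"\b(than|versus|vs|compared|between)\b",
                            re.IGNORECASE)


def choose_treatment(query: str, vague_max_words: int = 4) -> str:
    """Pick a query transformation with simple, hand-tuned rules."""
    words = query.split()
    if len(words) <= vague_max_words:
        return "hyde"            # too short to give the retriever a grip
    if COMPOUND_HINTS.search(query):
        return "decomposition"   # looks like several questions in one
    return "none"                # clear enough: plain retrieval


print(choose_treatment("why is it slow"))
print(choose_treatment(
    "Did revenue grow faster than headcount between the 2022 and 2024 reports?"))
print(choose_treatment("How do I stop my plan auto-renewing?"))
```

On the three example queries from this post, the rules return `hyde`, `decomposition`, and `none` respectively. Real traffic will break them quickly, which is the argument for letting a model make the call.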
For now: reach for query transformation when you've measured that retrieval is missing and you've ruled out the cheaper fixes. Add HyDE for vague queries, multi-query for recall, decomposition for multi-part questions. Just don't bolt all three onto every request and call it sophistication. It's mostly latency.