The first RAG demo always works. You point a script at a folder of PDFs, embed the chunks, wire up a vector store, and ask it a question you already know the answer to. The model answers. Everyone nods. Ship it.
Then a real user asks a real question, and the thing falls over.
I've watched this happen enough times to stop being surprised by it. The gap between the demo and the deployment isn't a bug you can patch — it's baked into what "naive RAG" actually does. So before we spend eleven more posts on rerankers and graph indexes and query rewriting, it's worth being honest about the baseline: the plain pipeline, why it works just often enough to be dangerous, and where exactly it leaks.
The pipeline in four boxes
Naive RAG — the term comes from a 2023 survey that split the field into naive, advanced, and modular flavors — is the version everyone builds first. It's three moves at index time and two at query time.
At index time: split documents into chunks, run each chunk through an embedding model to get a vector, store the vectors. At query time: retrieve, by embedding the question and grabbing the k nearest chunks by cosine similarity; then generate, by pasting those chunks into the prompt with an instruction like "answer using only this context" and letting the model write.
That's it. No reranking, no query rewriting, no metadata filters, no graph. A working version is maybe forty lines with LangChain or LlamaIndex. The simplicity is the whole appeal, and it's also the whole problem.
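To make those moves concrete, here is a minimal sketch of the whole pipeline in plain Python. It is an illustration under loud assumptions, not what LangChain or LlamaIndex do internally: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the "vector store" is a Python list, chunk sizes are counted in words rather than tokens, and the final call to a language model is left out.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size chunking by word count, with a small overlap between chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(docs: list[str]) -> list[tuple[str, Counter]]:
    """Index time: split, embed, store."""
    return [(c, toy_embed(c)) for d in docs for c in chunk(d)]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    """Query time, part one: embed the question and take the k nearest chunks."""
    q = toy_embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Query time, part two: paste the chunks into the prompt for the model."""
    context = "\n\n".join(chunks)
    return (
        "Answer using only this context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


docs = [
    "Refunds are issued within 14 days of a return request.",
    "Shipping takes three to five business days inside the country.",
]
index = build_index(docs)
question = "How many days do refunds take?"
top = retrieve(index, question, k=1)
prompt = build_prompt(question, top)  # this string is what you would send to the model
print(prompt)
```

The test question shares the words "refunds" and "days" with the right chunk, which is exactly the easy regime the demo lives in.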
Where it leaks
The failures aren't random. They cluster.
Retrieval is a similarity match, not an understanding. The vector store returns chunks whose embedding is close to the question's embedding. Close is not the same as answers. Ask "what changed in the refund policy last quarter?" and you'll happily get the three chunks that talk most about refund policy — none of which mention the change, because the chunk describing the change uses the word "amended" and never says "refund." Dense retrieval is fuzzy by construction. It rewards topical overlap and quietly punishes anything phrased in vocabulary the question didn't use.
The chunk boundary cuts through the answer. Fixed-size chunking — say 500 tokens with a bit of overlap — has no idea where ideas begin and end. The sentence that defines a term lands in chunk 7; the sentence that uses it lands in chunk 8. Retrieve one without the other and the model gets half a thought. You've handed it a paragraph that starts mid-argument and stops before the conclusion.
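A small sketch of that boundary problem, with assumptions flagged: sizes are counted in words rather than tokens, the window is shrunk to 8 words with 2 of overlap so the effect is visible on two sentences, and the sample text is invented for illustration.

```python
def fixed_size_chunks(text: str, size: int, overlap: int) -> list[str]:
    """Slide a fixed window over the words, ignoring sentence boundaries."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]


definition = "Churn is the share of customers who cancel in a given month."
usage = "Our churn target for next year is under two percent."

chunks = fixed_size_chunks(definition + " " + usage, size=8, overlap=2)
for i, c in enumerate(chunks):
    print(i, repr(c))
# 0 'Churn is the share of customers who cancel'
# 1 'who cancel in a given month. Our churn'
# 2 'Our churn target for next year is under'
# 3 'is under two percent.'

# No single chunk holds the complete definition sentence.
definition_intact = any(definition in c for c in chunks)
print(definition_intact)  # False
```

The overlap repeats a couple of words across the cut, but it does not restore the sentence: whichever chunk is retrieved, the model sees a fragment.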
Embeddings flatten meaning into a single point. A 1,000-token chunk that covers three subtopics gets one vector, which is roughly a blend of everything in it. If the user asks about subtopic two, that chunk's vector is a blurry compromise that may not surface at all, outcompeted by a chunk that's entirely about something topically adjacent but useless.
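A toy illustration of that dilution. The big assumption here is that a chunk's vector is literally the mean of its subtopic vectors; real embedding models are not that simple, and the three-dimensional vectors below are invented for the example.

```python
import math


def mean_vector(vectors: list[list[float]]) -> list[float]:
    """Component-wise average of equally weighted vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Three subtopics, one axis each (hypothetical directions in embedding space).
subtopic_1 = [1.0, 0.0, 0.0]
subtopic_2 = [0.0, 1.0, 0.0]
subtopic_3 = [0.0, 0.0, 1.0]

mixed_chunk = mean_vector([subtopic_1, subtopic_2, subtopic_3])
adjacent_chunk = [0.2, 0.9, 0.1]  # near subtopic two, but does not contain the answer
query = subtopic_2                # the user asks about subtopic two

cos_mixed = cosine(query, mixed_chunk)
cos_adjacent = cosine(query, adjacent_chunk)
print(round(cos_mixed, 2), round(cos_adjacent, 2))  # 0.58 0.97
```

The chunk that actually covers subtopic two scores about 0.58 against the query, while the adjacent-but-useless chunk scores about 0.97 and takes its place in the top-k.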
More context doesn't mean a better answer. Even when the right chunk is in the top-k, it might be sitting in position 6 of 8. Models don't read the middle of their context as carefully as the ends — the "lost in the middle" effect Liu and colleagues documented in 2023, where accuracy sags into a U-shape as the relevant passage drifts toward the center of a long prompt. So you can retrieve the correct passage and still watch the model talk past it.
And there's no one checking the work. Naive RAG retrieves once and generates once. If retrieval whiffs, the model doesn't notice — it just answers from whatever it got, padded out with parametric memory. That's the part users actually hate: not "I don't know," but a confident, fluent, wrong answer built on three irrelevant chunks.
A number worth sitting with
Here's the uncomfortable framing that runs through this whole series. On hard factual question-answering benchmarks, a careful, fully-built-out RAG stack (hybrid retrieval, reranking, the works) tops out somewhere in the low-to-mid sixties, measured as the percentage of questions answered correctly. The naive pipeline lands closer to the mid-forties.
Two readings of that gap, both true. The pessimist's: even the good version misses more than a third of the time, so calibrate your expectations and your UX accordingly. The optimist's, and the reason this series exists: there's a roughly twenty-point spread between the pipeline you build in an afternoon and the one you build on purpose. Almost every point of that spread comes from fixing the leaks above, one technique at a time.
Why the demo lied to you
The demo worked because you tested it in the easy regime. Your test question shared vocabulary with the source text. The answer lived inside a single chunk. There was one obviously relevant document, not forty near-duplicates. The corpus was small enough that top-k was basically the whole haystack.
Production violates every one of those. Vocabulary drifts — users ask in their words, not the document's. Answers span chunks, sections, sometimes whole documents. The corpus has fifty things that look relevant and three that are. And the moment the index grows, top-k stops being "the obvious matches" and becomes "five plausible-looking chunks, two of which are traps."
None of this means naive RAG is useless. For a small, clean, well-written corpus where users ask in the document's own language, it's genuinely fine — and you should not build a GraphRAG pipeline to search a 20-page handbook. Match the machinery to the problem. The mistake isn't using naive RAG; it's assuming the demo's behavior will survive contact with real questions.
What "fixing it" actually means
Every later post in this series is a patch on one of these leaks, so it's worth naming the map now:
- Retrieval misses on vocabulary → hybrid search (lexical BM25 alongside dense vectors) and query transformation (rewrite the question into the document's language before searching).
- Chunk boundaries cut answers → chunking strategies that respect structure, plus context-aware tricks like prepending a short document-level summary to each chunk.
- The right chunk ranks too low → reranking with a cross-encoder that actually reads query and chunk together instead of comparing two pre-computed points.
- No one checks the work → agentic and corrective RAG, where the system retrieves, grades what it got, and retrieves again if it's thin.
- "Is this even working?" → evaluation, because you can't improve a number you refuse to measure.
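As a taste of the first item on that map, here is one common way to merge a lexical ranking with a dense one: reciprocal rank fusion. This is a sketch under assumptions. The text above names hybrid search but not a fusion method, so RRF is my choice for illustration; the chunk ids and both rankings are made up, and `k=60` is the constant conventionally used with RRF rather than anything tuned.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Score each id by the sum of 1 / (k + rank) over every ranking it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results for the same question from two retrievers.
dense_ranking = ["c_overview", "c_faq", "c_change"]
lexical_ranking = ["c_change", "c_overview", "c_form"]

fused = reciprocal_rank_fusion([dense_ranking, lexical_ranking])
print(fused)  # ['c_overview', 'c_change', 'c_faq', 'c_form']
```

The chunk that dense retrieval ranked third moves up to second because the lexical ranking put it first. Fusion works on ranks, not raw scores, so the two retrievers never need comparable score scales.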
That's the arc. Naive RAG isn't the thing you ship — it's the thing you measure everything else against. Build it first, write down what it scores on your own questions, and treat every technique after this as a hypothesis you have to prove beats the baseline. Plenty of them won't, for your corpus. The only way to know is to have the baseline sitting there, disappointing you honestly.
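A minimal sketch of what writing down the baseline can look like, assuming you have a hand-labeled list of questions paired with the id of the chunk that answers each one. Everything named here is hypothetical: `hit_rate_at_k`, the `fake_retrieve` stub, and the canned results stand in for your own retriever and data. Note that this scores retrieval only, not whether the final answer is correct.

```python
from typing import Callable


def hit_rate_at_k(
    eval_set: list[tuple[str, str]],
    retrieve_fn: Callable[[str, int], list[str]],
    k: int = 5,
) -> float:
    """Fraction of questions whose expected chunk id appears in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one question")
    hits = sum(
        1 for question, expected_id in eval_set
        if expected_id in retrieve_fn(question, k)
    )
    return hits / len(eval_set)


# Canned rankings standing in for a real retriever's output.
canned = {
    "q1": ["c3", "c7"],
    "q2": ["c2", "c5"],
    "q3": ["c8", "c1"],
    "q4": ["c6", "c0"],
}


def fake_retrieve(question: str, k: int) -> list[str]:
    return canned[question][:k]


eval_set = [("q1", "c3"), ("q2", "c9"), ("q3", "c1"), ("q4", "c4")]

baseline_at_1 = hit_rate_at_k(eval_set, fake_retrieve, k=1)
baseline_at_2 = hit_rate_at_k(eval_set, fake_retrieve, k=2)
print(baseline_at_1, baseline_at_2)  # 0.25 0.5
```

Swap `fake_retrieve` for the naive retriever, record the number, and rerun the same function after each change to see whether the technique beats the baseline on your corpus.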
Start there. The disappointment is the most useful thing the simple version gives you.