RAG · Sep 2, 2025 · 6 min · Tags: rag, llm, production

RAG in Production

A support agent escalated a ticket because the assistant kept telling customers about a discount that ended three weeks ago. The retrieval was perfect. The reranker was working. The generation was faithful — faithfully citing a promotion document somebody had "deleted" from the CMS but never from the vector index. The system was doing everything this series taught it to do, and it was confidently, fluently wrong, because nobody had built the boring part: making the index forget.

That's production RAG. Every post before this one was about making retrieval smarter. This one is about the unglamorous machinery that decides whether a smart pipeline survives contact with a live corpus, real users, and an on-call rotation. None of it is clever. All of it is what actually breaks.

Freshness: your corpus is a moving target

The demo indexes a static folder once. Production indexes a corpus that changes every day — documents added, edited, deleted — and every change is a chance for the index to drift out of sync with reality.

Re-embedding everything nightly is the lazy answer, and it stops being viable the moment the corpus is large (the embedding bill) or the moment "yesterday's snapshot" isn't fresh enough (a policy changed at 9am; users get the old one until midnight). So you build incremental indexing: when a document changes, re-process only that document. Which sounds simple and hides three traps.

[Figure: Production RAG: incremental indexing and deletes, tenant-filtered retrieval, caching, and logging. Caption: The boring machinery (freshness, isolation, caching, observability) is the actual product.]
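
To make incremental indexing concrete, here is a minimal sketch, not a reference implementation. It assumes an in-memory stand-in for the vector store and a fixed-size character chunker; `InMemoryIndex`, `upsert_doc`, `delete_doc`, and `sync` are hypothetical names invented for illustration, and a real pipeline would compute embeddings where the comment says so. The idea it shows: keep a content hash per document, re-process only documents whose hash changed, and treat deletes as first-class events so a removed promotion actually leaves the index.

```python
import hashlib
from dataclasses import dataclass, field


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk(text: str, size: int = 200) -> list[str]:
    # Placeholder chunker (assumption): fixed-size character windows.
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]


@dataclass
class InMemoryIndex:
    # Stand-in for a vector store: chunk_id -> (doc_id, chunk_text).
    chunks: dict[str, tuple[str, str]] = field(default_factory=dict)
    # doc_id -> hash of the document version currently indexed.
    doc_hashes: dict[str, str] = field(default_factory=dict)

    def delete_doc(self, doc_id: str) -> None:
        stale = [cid for cid, (d, _) in self.chunks.items() if d == doc_id]
        for cid in stale:
            del self.chunks[cid]
        self.doc_hashes.pop(doc_id, None)

    def upsert_doc(self, doc_id: str, text: str) -> bool:
        """Re-index one document; return False if its content is unchanged."""
        h = content_hash(text)
        if self.doc_hashes.get(doc_id) == h:
            return False
        # Drop the old chunks first so a shorter edit leaves no orphans behind.
        self.delete_doc(doc_id)
        for n, piece in enumerate(chunk(text)):
            # A real pipeline would embed `piece` here before storing it.
            self.chunks[f"{doc_id}:{n}"] = (doc_id, piece)
        self.doc_hashes[doc_id] = h
        return True


def sync(index: InMemoryIndex, source: dict[str, str]) -> dict[str, list[str]]:
    """Reconcile the index with the source of truth (doc_id -> text)."""
    report: dict[str, list[str]] = {"upserted": [], "deleted": []}
    for doc_id, text in source.items():
        if index.upsert_doc(doc_id, text):
            report["upserted"].append(doc_id)
    # Anything indexed but no longer in the source was deleted upstream.
    for doc_id in list(index.doc_hashes):
        if doc_id not in source:
            index.delete_doc(doc_id)
            report["deleted"].append(doc_id)
    return report
```

Calling `sync` with the current CMS contents after a promotion document is removed reports it under `"deleted"` and purges its chunks, which is the step the stale-discount story was missing.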

Multi-tenancy: the leak that ends companies

If your system serves more than one customer, team, or permission level out of one index, the scariest bug isn't a wrong answer — it's a correct answer drawn from data the user wasn't allowed to see. RAG turns a retrieval bug into a data breach, because the model will happily, fluently summarize a document that should have been invisible to this user.

The naive fix — retrieve, then filter out forbidden results — is wrong in a way that's easy to miss. Filter after the ANN search and you can retrieve ten neighbors, discard nine for permissions, and return one weak result, or zero, while the user has no idea the good answer existed and was withheld. Worse, a careless implementation filters in the application layer after the vector store already returned the rows, which is one refactor away from a leak. Permissions have to be enforced inside retrieval — metadata filters applied during the search (tenant ID, ACL tags, security labels as first-class indexed fields), or separate indexes per tenant when isolation has to be airtight. Decide this on day one. Retrofitting access control onto a shared index after launch is how the embarrassing postmortems get written.
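
The difference between filtering after the search and filtering inside it is easy to see on a toy example. The sketch below uses exact brute-force cosine similarity rather than a real ANN index, and `Chunk`, `search_then_filter`, and `filtered_search` are hypothetical names; a production vector store would apply the tenant predicate through its own metadata-filter mechanism.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    tenant_id: str
    vector: tuple[float, ...]


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search_then_filter(chunks, query, tenant_id, k):
    # Anti-pattern: take the global top-k, then discard other tenants' rows.
    top = sorted(chunks, key=lambda c: cosine(c.vector, query), reverse=True)[:k]
    return [c.chunk_id for c in top if c.tenant_id == tenant_id]


def filtered_search(chunks, query, tenant_id, k):
    # The tenant predicate is part of the search: only permitted rows compete.
    allowed = (c for c in chunks if c.tenant_id == tenant_id)
    top = sorted(allowed, key=lambda c: cosine(c.vector, query), reverse=True)[:k]
    return [c.chunk_id for c in top]


CORPUS = [
    Chunk("a1", "acme", (1.0, 0.0)),
    Chunk("a2", "acme", (0.9, 0.1)),
    Chunk("a3", "acme", (0.8, 0.2)),
    Chunk("b1", "globex", (0.6, 0.4)),
    Chunk("b2", "globex", (0.5, 0.5)),
]
QUERY = (1.0, 0.0)

# Post-filtering: all three global neighbors belong to another tenant.
print(search_then_filter(CORPUS, QUERY, "globex", k=3))  # []
# Filtering inside the search: the tenant's own best matches come back.
print(filtered_search(CORPUS, QUERY, "globex", k=2))     # ['b1', 'b2']
```

In the post-filter version the other tenant's rows were also materialized in application memory before being discarded, which is the "one refactor away from a leak" risk described above.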

Caching: the same questions, over and over

Users are repetitive. The same handful of questions show up thousands of times, and re-running the full retrieve-rerank-generate pipeline for each identical query is money set on fire.

Two layers help. Prompt caching, offered by the major model providers by late 2025, caches the processing of a stable prompt prefix — your system instructions, a fixed set of context — so repeated calls that share it pay a fraction of the input cost. Semantic caching goes further: embed the incoming query, and if it's near-identical to one you've answered before, return the stored answer without retrieving or generating at all. Huge latency and cost win — and a loaded gun. A semantic cache that's too loose treats "what's the 2024 policy" and "what's the 2025 policy" as the same question and serves the wrong year. And a cache entry that outlives the document it was built from is exactly the stale-discount bug again. Cache invalidation has to be wired to the same change events that drive incremental indexing, or your cache becomes a museum of answers that used to be true.
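
A semantic cache with both safeguards might look roughly like this. It is a sketch under stated assumptions: query embeddings are passed in as plain tuples (no embedding model is called), the `0.95` threshold is an illustrative value rather than a recommendation, and `SemanticCache`, `CacheEntry`, and `invalidate_doc` are hypothetical names. Each entry remembers which documents its answer was built from, so the same change event that triggers re-indexing can also evict the answers that depended on that document.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class CacheEntry:
    query_vec: tuple[float, ...]
    answer: str
    source_doc_ids: frozenset[str]  # documents the answer was generated from


@dataclass
class SemanticCache:
    threshold: float = 0.95  # assumed value; tune it against real query pairs
    entries: list[CacheEntry] = field(default_factory=list)

    def get(self, query_vec):
        """Return a cached answer only if the closest stored query is near enough."""
        best, best_sim = None, 0.0
        for entry in self.entries:
            sim = cosine(entry.query_vec, query_vec)
            if sim > best_sim:
                best, best_sim = entry, sim
        if best is not None and best_sim >= self.threshold:
            return best.answer
        return None

    def put(self, query_vec, answer: str, source_doc_ids) -> None:
        self.entries.append(
            CacheEntry(tuple(query_vec), answer, frozenset(source_doc_ids))
        )

    def invalidate_doc(self, doc_id: str) -> int:
        """Evict every answer built from doc_id; return how many were dropped."""
        before = len(self.entries)
        self.entries = [e for e in self.entries if doc_id not in e.source_doc_ids]
        return before - len(self.entries)
```

Similarity alone will not reliably separate "the 2024 policy" from "the 2025 policy", so in practice the cache key probably also needs structured fields (year, tenant, product) checked for exact equality; that part is left out here.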

Observability: you can't fix what you can't see

When a user reports a bad answer in production, you need to reconstruct what happened, and "the LLM said something wrong" is not a debuggable statement. So you log the whole trace: the query, the retrieved chunk IDs and their scores, the order before and after reranking, the final prompt, the answer, the latency of each stage. With that, "bad answer" decomposes into an answerable question: was it retrieval (the right chunk never showed up), ranking (it showed up at position 9), or generation (it was right there and the model fumbled it)? That's the split from the evaluation post, now running live instead of offline.
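
As a rough illustration of that decomposition, the sketch below defines a trace record and a triage function. The field names, `Trace`, and `diagnose` are assumptions made for this example, not a standard schema. `diagnose` is meant to be called only on an answer already judged bad, with the ID of the chunk that should have answered the question.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    query: str
    retrieved: list[tuple[str, float]]  # (chunk_id, retriever score), retrieval order
    reranked: list[str]                 # chunk_ids in final order after reranking
    prompt: str
    answer: str
    latency_ms: dict[str, float] = field(default_factory=dict)  # per-stage timings

    def to_json(self) -> str:
        # One JSON object per request is easy to ship to any log pipeline.
        return json.dumps(asdict(self), sort_keys=True)


def diagnose(trace: Trace, gold_chunk_id: str, context_k: int) -> str:
    """Attribute a known-bad answer to the stage that lost the right chunk."""
    retrieved_ids = [cid for cid, _ in trace.retrieved]
    if gold_chunk_id not in retrieved_ids:
        return "retrieval"   # the right chunk never showed up
    if gold_chunk_id not in trace.reranked[:context_k]:
        return "ranking"     # it showed up but fell outside the prompt context
    return "generation"      # it was in the prompt and the model still missed it
```

Here `context_k` is the number of reranked chunks that actually made it into the prompt, so a gold chunk at position 9 with `context_k=5` is classified as a ranking failure.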

And the offline evaluation doesn't stop at launch — it becomes a regression gate. The discipline that ships and stays shipped: keep your golden question set, and run it on every change to the prompt, the model, the chunking, the retriever. Models get deprecated and swapped. A provider updates an embedding model. Someone "improves" a prompt. Any of these can silently regress quality, and without an eval that runs on every change you find out from users, which is the most expensive possible monitoring system.
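
A regression gate can be very small. The sketch below assumes the pipeline is any callable from question to answer string and that "correct" means an expected substring appears in the answer; that scoring rule is a deliberate simplification, and `pass_rate` and `regression_gate` are hypothetical names. A real gate would likely use the metrics from the evaluation post instead.

```python
from typing import Callable

# (question, substring expected in a good answer)
GoldenSet = list[tuple[str, str]]


def pass_rate(pipeline: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden questions whose answer contains the expected text."""
    if not golden:
        raise ValueError("golden set is empty")
    hits = sum(
        1 for question, expected in golden
        if expected.lower() in pipeline(question).lower()
    )
    return hits / len(golden)


def regression_gate(
    candidate: Callable[[str], str],
    golden: GoldenSet,
    baseline_rate: float,
    tolerance: float = 0.0,
) -> bool:
    """Return True if the candidate is no worse than the baseline (within tolerance)."""
    return pass_rate(candidate, golden) >= baseline_rate - tolerance
```

Wired into CI, a `False` result blocks the change to the prompt, model, chunking, or retriever until someone looks at which golden questions regressed.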

The part that isn't a checklist

I could keep listing — rate limits, fallback when the vector store is down, PII handling, the injection risks the OWASP LLM list catalogs, cost dashboards. The list doesn't really end, and that's the actual lesson, so I won't pretend it wraps up neatly.

Here's what twelve posts add up to. The retrieval techniques — hybrid search, reranking, query transformation, the agentic loop, graphs, adaptive routing — are necessary and they're the fun part and they are not what makes RAG hard in production. What makes it hard is that a RAG system is a living thing wired to a changing corpus serving real people with real permissions, and every one of those words hides an operational problem the demo never showed you. The stale document. The leaked tenant. The poisoned cache. The silent regression after a model swap.

The first post in this series said naive RAG disappoints because the demo tests the easy regime and production violates every assumption in it. Twelve posts later, that's still the whole story — it just has more failure modes now, and you've met them. The teams that run RAG well aren't the ones with the cleverest retriever. They're the ones who treated the boring machinery — freshness, isolation, caching, observability, regression evals — as the actual product, and the retrieval research as table stakes. Get the boring parts right and the clever parts finally get to matter. Skip them, and it doesn't matter how good your reranker is — you'll be the one explaining the three-week-old discount.
