Agentic RAG: When Retrieval Starts Thinking

Every pipeline in this series so far has been a conveyor belt. Query goes in one end, passes through retrieve, rerank, generate, and an answer drops out the other. The belt runs the same way every time. It doesn't look at what it retrieved and think "this is thin, let me try again." It can't. There's no it there to think — just stages firing in sequence.

Agentic RAG breaks the belt. The shift is small to describe and large in consequence: you give the model the retriever as a tool it decides when to use, instead of a step that always runs. Retrieval stops being something that happens to the query and becomes something the model does, on purpose, possibly more than once, possibly not at all. That single change — control flow handed to the model — is the whole idea. Everything else is mechanics.

From a pipeline to a loop

In static RAG, the order is fixed and the model is the last stage. In agentic RAG, the model is the controller, and it runs a loop: look at the question, decide whether and what to retrieve, look at what came back, judge whether it's enough, and either answer or go again.
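
To make the control flow concrete, here is a minimal sketch of that loop. It assumes each model decision is wrapped as a plain callable; the names `plan`, `retrieve`, `grade`, `reformulate`, and `generate` are hypothetical placeholders for your own LLM and retriever calls, not functions from any library.

```python
from typing import Callable, List, Optional

def agentic_answer(
    question: str,
    plan: Callable[[str], Optional[str]],        # returns a search query, or None to skip retrieval
    retrieve: Callable[[str], List[str]],        # returns passages for a query
    grade: Callable[[str, List[str]], bool],     # True if the evidence is good enough
    reformulate: Callable[[str, List[str]], str],
    generate: Callable[[str, List[str]], str],
    max_rounds: int = 3,                         # assumed cap; tune for your workload
) -> str:
    """Run plan -> retrieve -> grade -> answer, retrying with a new query on a miss."""
    query = plan(question)
    if query is None:
        # The model decided the corpus is not needed.
        return generate(question, [])

    for _ in range(max_rounds):
        evidence = retrieve(query)
        if grade(question, evidence):
            return generate(question, evidence)
        # Evidence judged insufficient: rewrite the query and go again.
        query = reformulate(question, evidence)

    # Out of budget: exit gracefully instead of answering from weak evidence.
    return "I couldn't find enough to answer."
```

In a real system each callable would be a model or retriever call; passing them in as arguments just keeps the loop itself easy to read and to test with stubs.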

[Figure: the agentic RAG loop: plan, retrieve, grade, then answer or reformulate and retry. The Grade state is the dividing line: a system that checks its own evidence.]

The new box, the one no earlier pipeline had, is Grade. Between retrieving and answering, the system stops and asks: is this evidence actually good enough to answer with? That question — self-assessment before generation — is the dividing line between a pipeline that hopes and a system that checks.

Grading what you got

Naive RAG's worst habit is answering confidently from bad context, because nothing ever inspects the context. Two research lines from 2023–2024 made the inspection explicit, and they're the backbone of most agentic RAG you'll build.

Self-RAG trains the model to emit little control tokens (the paper calls them reflection tokens) as it works: signals for "should I retrieve here?", "is this passage relevant?", and "is my sentence actually supported by it?". The model critiques its own retrieval and its own draft, inline. The deep insight isn't the specific tokens; it's that the model can be taught to reflect on whether retrieval helped, rather than blindly stuffing whatever came back into the answer.

Corrective RAG (CRAG) adds a lightweight grader that sorts retrieved documents into correct, ambiguous, or wrong. If the evidence looks solid, proceed. If it's weak or off-topic, don't answer from it; do something about it, most usefully by falling back to a web search or a broader retrieval to find better grounding. The principle that matters: a confident answer on bad evidence is worse than the effort of going to get good evidence.
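
The three-way sort can be sketched in a few lines. This is an illustration loosely modeled on the CRAG idea, not the paper's implementation: the relevance scores are assumed to come from some grader you already have, the thresholds `upper` and `lower` are made-up defaults, and the action names are hypothetical labels for whatever your system does next.

```python
from enum import Enum
from typing import List

class Grade(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    WRONG = "wrong"

def grade_documents(scores: List[float], upper: float = 0.7, lower: float = 0.3) -> Grade:
    """Collapse per-document relevance scores into a single verdict.

    Assumed rule: one strong document is enough to proceed; if every
    document is weak (or nothing came back), the retrieval is treated as wrong.
    """
    if not scores:
        return Grade.WRONG
    best = max(scores)
    if best >= upper:
        return Grade.CORRECT
    if best < lower:
        return Grade.WRONG
    return Grade.AMBIGUOUS

def next_action(grade: Grade) -> str:
    """Map the verdict to a (hypothetical) next step."""
    if grade is Grade.CORRECT:
        return "answer_from_retrieved"
    if grade is Grade.WRONG:
        return "fallback_search"
    # Middle ground: keep what was retrieved but also look further afield.
    return "combine_retrieved_and_fallback"
```

The thresholds are where the engineering lives: set `upper` too low and the grader waves through weak evidence, too high and every query pays for a fallback search.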

Strip the training details and both reduce to the same move — check the evidence before you trust it — which is exactly the Grade box above.

What the loop buys you

Because the model controls the loop, behaviors that previously needed a hand-built pipeline now come out of the control flow itself.

It retrieves only when needed. Ask an agentic system "what's 17% of 240?" or "rewrite this politely" and it just answers — no retrieval, because the model knows it doesn't need the corpus. Static RAG retrieves on every query, pays the latency every time, and pollutes simple requests with irrelevant chunks. Letting the model skip retrieval is underrated; a lot of "bad RAG answers" are really "retrieval that should never have run."
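
Here is a rough sketch of what "retriever as a tool" looks like from the application side. The tool description and the shape of the `decision` dict are invented for illustration and do not follow any particular vendor's tool-calling format; `search` stands in for your retriever.

```python
from typing import Any, Callable, Dict, List

# Hypothetical tool description handed to the model alongside the question.
SEARCH_TOOL: Dict[str, Any] = {
    "name": "search_corpus",
    "description": "Search the internal corpus. Call only when the question needs it.",
    "parameters": {"query": "string"},
}

def handle_turn(decision: Dict[str, Any], search: Callable[[str], List[str]]) -> Dict[str, Any]:
    """Act on the model's decision: answer directly, or run the retriever it asked for."""
    if decision["type"] == "answer":
        # No tool call was made, so no retrieval runs and no chunks pollute the prompt.
        return {"final": decision["text"], "retrieved": []}
    if decision["type"] == "tool_call" and decision["name"] == SEARCH_TOOL["name"]:
        chunks = search(decision["arguments"]["query"])
        # The chunks go back to the model for the next turn.
        return {"final": None, "retrieved": chunks}
    raise ValueError(f"unexpected decision: {decision!r}")
```

For "what's 17% of 240?" the model would return an `answer` decision and `search` is never called; for a question about the corpus it would return a `tool_call`, and only then does retrieval cost anything.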

It retrieves more than once, adaptively. Multi-hop questions — "which of our datacenters is in the region affected by the policy in doc X?" — need you to retrieve doc X, read it to learn the region, then retrieve again for datacenters in that region. A single shot can't do this; the second query doesn't exist until you've read the first result. The loop handles it naturally: retrieve, learn, retrieve again.
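
A toy version of that two-hop question shows why the second query can't be written up front. Everything here is invented for illustration: the three-document corpus, the keyword `search` function, and `extract_region`, which stands in for the model reading the first result.

```python
from typing import Dict, List

# Made-up corpus for the example.
CORPUS: Dict[str, str] = {
    "doc-x": "Policy X applies to the region eu-west.",
    "dc-1": "Datacenter Dublin is located in eu-west.",
    "dc-2": "Datacenter Ohio is located in us-east.",
}

def search(query: str) -> List[str]:
    """Toy retriever: return every text that contains all query terms."""
    terms = query.lower().split()
    return [text for text in CORPUS.values() if all(t in text.lower() for t in terms)]

def extract_region(passage: str) -> str:
    """Stand-in for the model reading hop one; here it just takes the last word."""
    return passage.rstrip(".").split()[-1]

def multi_hop() -> List[str]:
    first = search("policy x")                 # hop 1: find the policy document
    region = extract_region(first[0])          # learn the region from what came back
    return search(f"datacenter {region}")      # hop 2: a query that did not exist before hop 1
```

The second query string is built from the first result, which is exactly what a single-shot pipeline has no place to do.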

It reformulates after a miss. First search comes back thin? The model rewrites the query — applying exactly the HyDE and decomposition tricks from the last post, but now triggered by a judged failure instead of run blindly up front — and tries again. The query transformation that we said shouldn't run on every query? This is the system deciding, per query, when it should.

This is also where the lines from the calendar — "plan, route, act, verify, stop" — stop being a slogan and become the actual states of a running program. The model plans an approach, routes to the right source, acts by retrieving or calling a tool, verifies what it got, and stops when it's satisfied or out of budget.

The part the demos skip

Here's my skeptical note, and I'll be blunt because the hype deserves it. Agentic RAG is not strictly better than the pipeline. It trades determinism for adaptability, and that trade has a bill.

Latency and cost multiply. Each loop iteration is at least one LLM call, often several: plan, grade, reformulate, generate. A question that took one retrieval and one generation can now take five model calls and three retrievals. For a user waiting on an answer, "thinking" looks a lot like "slow." For your invoice, it looks like roughly 5× the model calls.
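
A back-of-the-envelope calculation makes the bill visible. The call counts come from the paragraph above (one model call and one retrieval versus five and three); the per-call latencies are made-up round numbers, and the sketch assumes every call runs sequentially.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    llm_calls: int
    retrievals: int

def latency_ms(w: Workload, llm_ms: int = 1200, retrieval_ms: int = 200) -> int:
    """Total latency if every call runs one after another (assumed, illustrative timings)."""
    return w.llm_calls * llm_ms + w.retrievals * retrieval_ms

static = Workload(llm_calls=1, retrievals=1)
agentic = Workload(llm_calls=5, retrievals=3)

static_ms = latency_ms(static)     # 1 * 1200 + 1 * 200 = 1400
agentic_ms = latency_ms(agentic)   # 5 * 1200 + 3 * 200 = 6600
call_ratio = agentic.llm_calls / static.llm_calls  # 5.0
```

With these assumed timings the agentic path is about 4.7 times slower end to end, and the model-call count, which is what usually drives the invoice, is 5 times higher. Plug in your own measurements before trusting either number.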

Loops don't always converge. A model that keeps grading its evidence as insufficient can retrieve, reformulate, retrieve, reformulate, and never decide it's done. You need a hard iteration cap and a graceful "I couldn't find enough to answer" exit, or the agent spins. The Stop state isn't optional polish; it's load-bearing.
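
One way to make the Stop state explicit is a small budget object the loop consults before every retry. This is a sketch under assumptions: the default cap of four is arbitrary, and treating a repeated query as a stop signal is an extra heuristic of mine, not something the techniques above prescribe.

```python
from dataclasses import dataclass, field
from typing import Set

@dataclass
class LoopBudget:
    max_iterations: int = 4                      # assumed default; pick your own
    iterations: int = 0
    seen_queries: Set[str] = field(default_factory=set)

    def should_stop(self, next_query: str) -> bool:
        """Stop when the cap is hit or the model proposes a query it already tried."""
        normalized = " ".join(next_query.lower().split())
        if self.iterations >= self.max_iterations or normalized in self.seen_queries:
            return True
        # Otherwise spend one iteration and remember the query.
        self.seen_queries.add(normalized)
        self.iterations += 1
        return False
```

When `should_stop` returns True, the caller returns the "I couldn't find enough to answer" message rather than generating from whatever evidence happens to be lying around.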

Failure gets harder to trace. When a conveyor belt produces a wrong answer, you can inspect each stage. When an agent that made eleven decisions produces a wrong answer, you're debugging a transcript, not a pipeline. The flexibility that helps the agent recover is the same flexibility that makes its mistakes weird and one-off.
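
If you are going to debug a transcript, it helps to make the transcript structured. A minimal sketch, assuming you record one event per decision; the `Step` and `Trace` names and the state labels are hypothetical, not part of any framework.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Step:
    state: str    # e.g. "plan", "retrieve", "grade", "reformulate", "answer"
    detail: str   # the query issued, the verdict reached, and so on

@dataclass
class Trace:
    steps: List[Step] = field(default_factory=list)

    def record(self, state: str, detail: str) -> None:
        """Append one decision to the trace."""
        self.steps.append(Step(state, detail))

    def count(self, state: str) -> int:
        """How many times the agent entered a given state."""
        return sum(1 for s in self.steps if s.state == state)

    def to_jsonl(self) -> str:
        """One JSON object per line, ready to ship to a log store."""
        return "\n".join(json.dumps(asdict(s)) for s in self.steps)
```

A trace like this lets you ask stage-level questions again ("how many retrievals did this answer take, and what did the grader say each time?") even though the stages no longer run in a fixed order.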

So don't reach for agentic RAG because it's the newest box on the diagram. Reach for it when your queries actually demand it: multi-hop questions, a mix of "needs the corpus" and "doesn't," cases where a wrong-but-confident answer is expensive enough to justify paying for the model to check its own work. For a corpus of simple factoid lookups, a well-built static pipeline with good retrieval and a reranker will be faster, cheaper, more predictable, and — often — just as accurate.

The honest framing: agentic RAG isn't an upgrade you apply to every system. It's a different point on the cost-versus-capability curve, and most of the engineering is in deciding which queries are worth moving up that curve. Which is the whole subject of the adaptive-RAG post a couple of stops from here — because the smartest agentic system is the one that knows when not to be agentic.
