There's a demo that always lands. You show a model producing a flawed answer, then ask it "are you sure?", and it catches its own mistake and fixes it. The room nods. Self-correcting AI. Magic.
Now read the TACL 2024 paper "When Can LLMs Actually Correct Their Own Mistakes?" and the magic gets a lot more conditional. The short version of that careful survey: prompted self-correction, where a model critiques itself with nothing but its own judgment, is mostly overstated. Sometimes it degrades the answer, because the model talks itself out of a correct response. Reflection works, but not because models are secretly good at grading themselves. It works when the feedback comes from somewhere the model can't just rationalize away.
So let's build the version that actually holds up.
The loop, and the lie inside it
Reflection is a feedback loop: produce a draft, critique it, refine it, repeat until it's good enough or you run out of patience. Self-Refine showed a single model doing all three roles in a loop with no extra training, and Reflexion added a memory twist — the agent writes its failures down as verbal lessons and carries them into the next attempt, hitting 91% pass@1 on HumanEval. Both are real results. Both are also where the naive reading goes wrong.
The lie is "the model improves itself." A model grading its own work brings the same blind spots to the critique that it brought to the draft. If it didn't know the fact the first time, it won't know it's wrong the second time. Self-critique without an external anchor is a confidence loop, not a correctness loop.
Two roles, not one mind wearing two hats
The first real improvement is structural: split the producer from the critic. The producer's only job is to do the task — write the code, draft the memo, build the plan. The critic is a separate call with a different system prompt and a different mandate: find what's wrong. "You are a meticulous senior reviewer. List concrete defects. Do not be agreeable."
This isn't a cosmetic trick. A fresh critique, unattached to the ego of having just written the thing, catches more than a model asked to second-guess itself in the same breath. You can push it further by making the critic a stronger model than the producer, or by giving the two genuinely different personas. The point is to manufacture the distance that an honest review needs.
def reflect(task, max_loops=3):
    draft = produce(task)
    for _ in range(max_loops):
        critique = criticize(task, draft)  # separate call, reviewer persona
        if critique.verdict == "pass":
            break
        draft = produce(task, feedback=critique.notes)
    return draft
Make the critique structured — a Pydantic object with a verdict and a list of specific notes, not a paragraph of vibes. A loop that branches on verdict == "pass" needs a field it can actually read, and forcing the critic to enumerate concrete defects stops it from hand-waving "looks good!" on a draft that isn't.
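One way to make the verdict machine-readable is sketched below. It uses a standard-library dataclass as a stand-in for the Pydantic model so the snippet has no dependencies. The JSON shape, the `Critique` fields, and the `parse_critique` helper are illustrative assumptions, not a fixed schema.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Critique:
    verdict: str                                   # "pass" or "fail"
    notes: list[str] = field(default_factory=list)  # concrete defects


def parse_critique(raw: str) -> Critique:
    """Parse the critic's JSON reply; anything malformed counts as a fail."""
    try:
        data = json.loads(raw)
        verdict = data["verdict"]
        notes = data.get("notes", [])
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return Critique("fail", ["critic reply was not valid structured output"])
    if verdict not in ("pass", "fail") or not isinstance(notes, list):
        return Critique("fail", ["critic reply had an unexpected shape"])
    if verdict == "fail" and not notes:
        # A bare "fail" gives the producer nothing to act on.
        return Critique("fail", ["critic failed the draft but listed no defects"])
    return Critique(verdict, [str(n) for n in notes])
```

Treating an unparseable reply as a fail is a deliberate choice in this sketch: a paragraph of "looks good!" never reaches the branch that ends the loop.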
Give the critic a tool, not just an opinion
Here's where reflection stops being theater. The CRITIC paper found that self-correction works dramatically better when the critic verifies against something external — running the code, searching the web, querying a calculator — than when it introspects. That lines up exactly with the skeptical TACL result: the loop pays off in proportion to how trustworthy the feedback is.
So the most valuable critic isn't a clever prompt. It's a test suite. Generate code, run it, feed the failures back. Generate a claim, search for it, feed the contradiction back. The 2025 SETS work pushes the test-time angle — combining sampling with self-verification and correction to beat plain repeated sampling — but the load-bearing idea is older and simpler: reflection is only as good as the signal it reflects on. A unit test is a better critic than a paragraph of self-doubt, every time.
Concretely, for code that means the critic isn't a second model with opinions about style — it's the interpreter. The producer writes a function, your harness runs the test suite, and the actual stack trace and failing assertions go back as the critique. The model isn't guessing whether its code is right; it's reading the exact line where it broke. That's the version of reflection that earned Reflexion its HumanEval numbers, and it's why "let the agent debug itself" works in coding loops far better than "ask the agent if its prose is good" works in writing loops. When you can't get an external signal — pure subjective writing quality, say — keep your expectations low and lean on the separate-critic split, because that's the only correction lever you've got left.
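A minimal sketch of the interpreter-as-critic idea follows. The `run_tests` helper and the toy `add` function are hypothetical, and running `exec` in-process assumes a throwaway sandbox; real harnesses would isolate generated code in a separate process or container.

```python
import traceback


def run_tests(source: str, tests: list[str]) -> list[str]:
    """Execute candidate code, then each test; return failure reports (empty means pass)."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # assumption: sandboxed; never exec untrusted code in-process
    except Exception:
        return ["code failed to load:\n" + traceback.format_exc()]
    failures = []
    for test in tests:
        try:
            exec(test, namespace)
        except Exception:
            # The failing test plus the real traceback becomes the critique.
            failures.append(f"{test}\n{traceback.format_exc()}")
    return failures


buggy = "def add(a, b):\n    return a - b\n"
tests = ["assert add(2, 3) == 5", "assert add(0, 0) == 0"]
feedback = run_tests(buggy, tests)  # one entry: the first assertion and its traceback
```

The returned strings are what goes back to the producer as the critique, so the model reads an actual failure rather than an opinion about one.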
Stop conditions, or it never stops
Two failure modes will bite you, and both are about knowing when to quit.
The obvious one: no stop condition is an infinite loop. A producer and a critic with no cap will refine forever, or worse, oscillate — fix A, which breaks B, fix B, which breaks A. Always cap iterations. Always define "good enough" as something the loop can check.
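Both guards can be sketched in a few lines. The function below is an illustration under assumptions: `check` and `revise` are hypothetical callables standing in for the critic and the producer, and "oscillation" is detected naively as an exact repeat of an earlier draft.

```python
from typing import Callable


def refine_until_stable(
    draft: str,
    check: Callable[[str], list[str]],
    revise: Callable[[str, list[str]], str],
    max_loops: int = 3,
) -> tuple[str, str]:
    """Return (draft, reason), where reason is "pass", "oscillation", or "cap"."""
    seen = {draft}
    for _ in range(max_loops):
        defects = check(draft)
        if not defects:                 # "good enough" is something the loop can check
            return draft, "pass"
        draft = revise(draft, defects)
        if draft in seen:               # we've been here before: A -> B -> A
            return draft, "oscillation"
        seen.add(draft)
    return draft, "cap"                 # budget spent; the last draft is unverified
```

Returning the reason alongside the draft lets the caller distinguish a verified result from one that merely ran out of budget.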
The subtle one: diminishing returns. The first critique usually catches the real problems. The second catches less. By the third or fourth, you're often paying two LLM calls per round to change a comma. Measure the lift. If iteration N+1 doesn't move your metric, the loop should have ended at N. The usual rule of thumb for reflection is that it's for when quality matters more than speed and cost, which means you owe it to yourself to check that the quality is still climbing before you keep buying it.
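"Measure the lift" can be as simple as the sketch below. It assumes you already have one metric value per round from an offline evaluation (round 0 is the first draft); the function name, the `min_gain` threshold, and the example scores are made up for illustration.

```python
def rounds_worth_paying_for(scores: list[float], min_gain: float = 0.01) -> int:
    """Count refinement rounds before the first one whose gain fell below min_gain.

    scores[0] is the metric for the first draft, scores[n] the metric after round n.
    """
    keep = 0
    for n in range(1, len(scores)):
        if scores[n] - scores[n - 1] < min_gain:
            break                       # round n did not move the metric enough
        keep = n
    return keep


# Made-up pass rates after 0, 1, 2, and 3 rounds of refinement.
cap = rounds_worth_paying_for([0.62, 0.78, 0.81, 0.81])  # returns 2
```

The result is a candidate value for the loop's iteration cap: with these illustrative numbers, the third round costs the same as the first two and changes nothing.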
Reflection is one of the few patterns that's genuinely powerful and genuinely overhyped at the same time. Give it a separate critic and an external check and it'll catch things a single pass never would. Hand it nothing but a model asking itself "are you sure?" and you've built an expensive way to feel reassured.