Learning and Adaptation

Here's a thing that sounds like heresy: most agents people call "self-improving" don't learn anything.

They get better within a run — they reflect, retry, fix the failing test — and then the process exits and every lesson goes with it. Run the same agent on the same task tomorrow and it makes the identical mistake, with the identical surprised tone. That's not learning. That's a goldfish with a good vocabulary.

Real learning means the next run starts smarter than this one did. Which is a higher bar than the demos suggest, and worth being precise about, because "learning" gets stretched to cover at least four different things that cost wildly different amounts.

Four things people call learning

In-context. You put examples in the prompt and the model adapts to them for that call. It's instant and it's powerful and it's also amnesia — nothing persists past the request. This is what 90% of "few-shot learning" actually is. Useful, but don't confuse it with the model changing.

Memory-based. The agent writes lessons to a store and reads them back on future runs. No retraining, no GPUs — just "last time the SQL tool needed the schema name qualified, so do that by default now." This is the cheapest durable learning, and for most agents it's the one that earns its keep.

Preference / reinforcement. You actually move the weights. DPO, PPO, GRPO — train the model on what good looks like. Expensive, slow, permanent. You do this when a behavior needs to live in the model, not in a prompt the model might ignore.

Self-modification. The agent edits its own code or toolset between runs. Rare, powerful, and the one that genuinely keeps me up: SICA, the self-improving coding agent from 2025, edited its own codebase between benchmark passes and climbed from 17% to 53% on a SWE-Bench Verified subset. It also needed a Docker sandbox and an overseer model watching it, for reasons that should be obvious.

The interesting design question is almost never "should the agent learn." It's at which of these four layers, because they have nothing in common except the marketing word.

The loop that does the work

The memory-based version is worth drawing out, because it's the one you can ship without a training cluster, and it borrows its engine from reflection.

Reflexion was the paper that made this concrete: an agent attempts a task, a critic evaluates the attempt, and instead of just retrying, the agent writes a verbal lesson to memory — "I assumed the file was JSON; it was YAML; check the extension first." That lesson rides along into the next attempt. The clever part is that the feedback is natural language, not a gradient, so there's no training step. The model improves by reading its own past notes.

An act, evaluate, reflect, store loop — The reflection loop becomes real learning only when the lesson outlives the run.

The trick that turns this from a within-run loop into actual learning is the Store step writing to a store that outlives the run. Same diagram, but if the arrow from Store lands in long-term memory instead of a local variable, tomorrow's first attempt already knows about the YAML thing. That one change is the whole difference between a goldfish and an agent that grows.

# After a failed run, distill and persist the lesson.
lesson = reflect(task, attempt, error)        # "qualify the schema name"
mem.add(lesson, user_id="agent:sql", metadata={"tool": "query"})

# At the start of the next run, load relevant lessons into the prompt.
priors = mem.search(task_description, user_id="agent:sql", limit=5)
system = base_prompt + "\nLessons from past runs:\n" + format(priors)

When you actually move the weights

Memory gets you a long way. It also has a ceiling: it can only nudge a model that's already capable of the behavior when reminded. If the base model fundamentally won't follow your tone, or keeps choosing the wrong tool no matter how you prompt, no amount of stored lessons fixes that. The behavior has to go into the model.

That's preference learning, and the 2025 default is DPO. The older RLHF recipe trained a separate reward model and then optimized against it — two models, and a reward model that could be gamed. DPO collapsed that: the paper's title literally says the language model is secretly its own reward model, so you optimize preferences directly on pairs of "better answer / worse answer." Fewer moving parts, less to hack. GRPO took it further for reasoning, dropping even the value critic and using a group of samples' mean reward as the baseline — that's the recipe behind the open reasoning models.

But notice the cost asymmetry. Memory learning is a prompt change you ship in an afternoon. Weight learning is a dataset, a training run, an eval to prove you didn't regress, and a rollback plan. Reach for it when the behavior must be reliable and intrinsic. Reach for memory when "usually, with a reminder" is good enough — which is more often than ego wants to admit.

The part that's still unsolved

Self-modifying agents are the seductive end of this. An agent that rewrites its own tools, evolves its own strategies, gets better unsupervised — AlphaEvolve did real work here, using an evolutionary loop over LLM-generated code to discover a faster 4×4 matrix multiplication and shave real cycles off datacenter infrastructure.

Read those results carefully and the pattern is narrow: a tight, automatic evaluator. AlphaEvolve worked because correctness and speed of a matmul are machine-checkable in a loop. Take away the cheap oracle and open-ended self-improvement stalls — the agent has no ground truth to climb toward, so it drifts or games whatever proxy you gave it. Sustained, general self-improvement is a research problem, not a feature you turn on.

So here's the position I'll defend: for almost everything you're building right now, "learning" should mean memory, and you should be suspicious of anything fancier until you've got an evaluator sharp enough to trust an agent editing itself against it. The goldfish problem is real, but the fix is usually a database write, not a training cluster — and definitely not letting the thing rewrite its own source.

Four things people call learning

The loop that does the work

When you actually move the weights

The part that's still unsolved

Leave a Reply