Exploration and Discovery

Most agents take the obvious path every time. That's usually what you want — and it's exactly why they never find anything you didn't already know was there.

An agent optimizing for a goal does the sensible thing: it exploits what works. Tried-and-true tool, known-good approach, shortest route to the reward. Reliable. Also a cage. An agent that only ever exploits will solve the problems you anticipated and walk straight past the better solution sitting one unexplored step to the side, because checking it would mean not taking the move it already knows pays. Discovery requires doing something that might not work. And "might not work" is precisely what a goal-maximizing agent is built to avoid.

This is the oldest tension in decision-making, and it doesn't go away: exploit what you know, or explore what you don't.

The trap of exploiting too well

Picture an agent that found one decent strategy early. It works, so the agent uses it, so it gets reinforced, so the agent uses it more. The strategy calcifies. Meanwhile a far better approach goes undiscovered forever, because reaching it required a stretch of looking worse first — and the agent, always taking the locally-best move, never paid that toll.
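Here is a minimal sketch of that lock-in, under made-up assumptions: two strategies with fixed payoffs (the names and numbers in `PAYOFFS` are hypothetical), an agent that keeps a running average for each, and an `epsilon` parameter for how often it tries something at random. It simplifies the story above, since the better strategy here is merely untried rather than hidden behind a stretch of worse results, but the greedy agent fails in the same way.

```python
import random

# Hypothetical, fixed payoffs. The agent does not know these in advance.
PAYOFFS = {"decent": 0.5, "better": 1.0}

def run(epsilon, steps=1000, seed=0):
    """Return how often each strategy was used by an epsilon-greedy agent."""
    rng = random.Random(seed)
    names = list(PAYOFFS)
    estimates = {name: 0.0 for name in names}  # running mean payoff per strategy
    pulls = {name: 0 for name in names}
    for _ in range(steps):
        if rng.random() < epsilon:
            choice = rng.choice(names)              # explore: try any strategy
        else:
            choice = max(names, key=estimates.get)  # exploit: best estimate so far
        reward = PAYOFFS[choice]
        pulls[choice] += 1
        # Incremental update of the running mean for the chosen strategy.
        estimates[choice] += (reward - estimates[choice]) / pulls[choice]
    return pulls

greedy_pulls = run(epsilon=0.0)   # {'decent': 1000, 'better': 0}
curious_pulls = run(epsilon=0.1)  # finds 'better' and then mostly uses it
```

With `epsilon=0.0` the agent tries `decent` once, sees a positive payoff, and never touches `better`. A small `epsilon` is the crudest possible fix: it pays a little on every run in exchange for eventually finding what the greedy agent never looks at.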

This is a local optimum, and it's the central failure exploration exists to fix. The unsettling lesson from the research is that aiming straight at the goal can be the thing that stops you reaching it. Lehman and Stanley made the point sharp in their 2011 novelty-search work: in deceptive problems, an agent rewarded purely for getting closer to the objective gets stuck, while one rewarded simply for doing new things — ignoring the objective entirely — stumbles into the solution more reliably. Sometimes the way forward is to stop optimizing and start wandering.

What you can actually reward instead

If "wander productively" sounds too vague to build, the research turned it into concrete signals. There's more than one way to make novelty pay.

[Figure: a mind map of exploration strategies (curiosity, novelty search, Go-Explore, and skill libraries), framed as structured discovery that accumulates.]

Curiosity turns surprise into a reward. The Intrinsic Curiosity Module work from 2017 (Pathak et al.) had an agent predict the consequences of its own actions and rewarded it for being wrong: high prediction error means "I don't understand this yet," which means it's worth investigating. The agent seeks out the parts of its world it can't yet predict, and can learn to explore even with no external reward at all. For an LLM agent, the analogue is leaning toward the tool, the query, or the branch whose outcome it's most uncertain about, with uncertainty as a compass.
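A rough sketch of the idea, not the 2017 architecture: the class below is a hypothetical tabular stand-in that predicts a single number per action (the real work used learned neural models over features). The names `CuriousChooser`, `observe`, and `choose` are mine, and the tool names and outcomes are invented. The intrinsic reward is the squared prediction error, and it shrinks as the prediction improves.

```python
class CuriousChooser:
    """Toy curiosity signal: reward equals squared prediction error."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.predicted = {}   # action -> predicted scalar outcome
        self.last_error = {}  # action -> most recent squared prediction error

    def choose(self, actions):
        # Untried actions count as maximally uncertain.
        return max(actions, key=lambda a: self.last_error.get(a, float("inf")))

    def observe(self, action, outcome):
        guess = self.predicted.get(action, 0.0)
        error = (outcome - guess) ** 2  # intrinsic reward for this step
        # Move the prediction toward what actually happened.
        self.predicted[action] = guess + self.learning_rate * (outcome - guess)
        self.last_error[action] = error
        return error

chooser = CuriousChooser()
# The same outcome becomes less surprising each time it is seen.
rewards = [chooser.observe("search", 1.0) for _ in range(3)]  # [1.0, 0.25, 0.0625]
next_action = chooser.choose(["search", "calculator"])        # "calculator"
```

The reward for `search` decays because the agent has learned what it does, so `choose` drifts toward the tool it has never tried. That is the "uncertainty as a compass" behavior in its smallest form.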

Novelty search rewards being different from everything tried so far, full stop. Not better — different. It keeps an archive of past behaviors and pushes toward whatever's unlike them, which is how it sidesteps the local-optimum trap that pure goal-chasing falls into.
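As a sketch of how that archive can score candidates: novelty search commonly measures novelty as the average distance to the k nearest behaviors already seen. The code below assumes a behavior can be summarized as a point (here, a hypothetical 2D final position); the function names and the sample data are illustrative.

```python
import math

def novelty(behavior, archive, k=3):
    """Mean distance from a behavior to its k nearest neighbors in the archive."""
    if not archive:
        return float("inf")  # nothing to compare against, so anything is new
    distances = sorted(math.dist(behavior, past) for past in archive)
    nearest = distances[:k]
    return sum(nearest) / len(nearest)

def pick_most_novel(candidates, archive, k=3):
    """Choose the candidate least like anything tried so far."""
    return max(candidates, key=lambda b: novelty(b, archive, k))

# Hypothetical behaviors: where each past attempt ended up.
archive = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
candidates = [(0.5, 0.5), (3.0, 4.0)]

chosen = pick_most_novel(candidates, archive)  # (3.0, 4.0)
archive.append(chosen)  # the archive grows, so repeating this is no longer novel
```

Note what is missing: nothing in `novelty` asks whether a behavior is good. The only pressure is away from the archive, which is what lets the search leave a local optimum that a goal-based score would hold it in.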

Go-Explore added a memory twist that matters for agents: remember the promising states you reached, and instead of always exploring from scratch, return to a promising one and explore onward from there. Don't restart the search every time — build on the frontier you already found.
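The select, return, explore loop can be sketched in a few lines. This is a heavily simplified version under stated assumptions: the world is deterministic, every raw state is its own archive entry, and the selection rule is "least-returned-to state first". The published method groups states into coarser cells and uses weighted selection; `go_explore` and `corridor_step` here are hypothetical names for a toy setup.

```python
import random

def go_explore(start, step, actions, iterations=500, explore_steps=5, seed=0):
    rng = random.Random(seed)
    archive = {start: []}  # state -> shortest known action sequence reaching it
    visits = {start: 0}    # state -> how many times we have returned to it
    for _ in range(iterations):
        # Select: go back to the archived state we have explored from least.
        state = min(archive, key=visits.get)
        visits[state] += 1
        # Return: the world is deterministic, so resuming from the stored
        # state is equivalent to replaying its stored path.
        path = list(archive[state])
        # Explore: take a few random actions onward from there.
        for _ in range(explore_steps):
            action = rng.choice(actions)
            state = step(state, action)
            path.append(action)
            if state not in archive or len(path) < len(archive[state]):
                archive[state] = list(path)
                visits.setdefault(state, 0)
    return archive

def corridor_step(position, action):
    """Toy world: positions 0..10 on a line, moves clamped at both ends."""
    return max(0, min(10, position + action))

archive = go_explore(start=0, step=corridor_step, actions=[-1, 1])
```

Each archive entry keeps the action sequence that reached it, so every discovered state stays reachable later. Starting each burst of exploration from the least-visited known state, rather than from position 0, is the "build on the frontier" part.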

The version that looks like an agent

The example that made this click for me is Voyager — a 2023 agent dropped into Minecraft with no goal beyond "explore and do interesting things." What made it more than a curiosity was the skill library. When Voyager figured out how to do something — mine stone, craft a tool, fight a mob — it wrote that skill as code and saved it. Next time, it had the skill ready and could compose it into something more complex. So its exploration compounded: every discovery became a building block for the next, and the agent's capabilities grew open-endedly without anyone scripting a curriculum.

That's the shape of exploration that's genuinely useful for the agents we build. Not random thrashing — structured discovery that accumulates. An agent that tries a new approach, and when it works, keeps it, names it, and reuses it. Exploration without that retention is just expensive noise; you pay for the wandering and keep none of what it found. Exploration with memory is how an agent gets capabilities its designer never wrote.
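A toy sketch of "keeps it, names it, and reuses it", assuming skills are plain functions over an inventory dictionary. This is not Voyager's actual code: the class, the skill names, and the recipe are invented for illustration, and a real system would need a far more careful check than a single trial run.

```python
class SkillLibrary:
    def __init__(self):
        self._skills = {}  # name -> callable(inventory, library)

    def keep_if_it_works(self, name, skill, trial_inventory, succeeded):
        """Try a candidate skill on a copy of the inventory; store it only on success."""
        try:
            result = skill(dict(trial_inventory), self)
        except Exception:
            return False  # the attempt failed outright, keep nothing
        if not succeeded(result):
            return False
        self._skills[name] = skill
        return True

    def run(self, name, inventory):
        return self._skills[name](inventory, self)

    def names(self):
        return sorted(self._skills)

def mine_stone(inventory, library):
    inventory["stone"] = inventory.get("stone", 0) + 1
    return inventory

def craft_pickaxe(inventory, library):
    # Composes an earlier discovery instead of re-deriving it.
    while inventory.get("stone", 0) < 3:  # made-up recipe: 3 stone per pickaxe
        library.run("mine_stone", inventory)
    inventory["stone"] -= 3
    inventory["pickaxe"] = inventory.get("pickaxe", 0) + 1
    return inventory

def broken_craft(inventory, library):
    inventory["stone"] -= 3  # fails on an empty inventory
    inventory["pickaxe"] = 1
    return inventory

has_pickaxe = lambda inv: inv.get("pickaxe", 0) >= 1

library = SkillLibrary()
library.keep_if_it_works("mine_stone", mine_stone, {}, lambda inv: inv.get("stone", 0) >= 1)
library.keep_if_it_works("craft_pickaxe", craft_pickaxe, {}, has_pickaxe)
kept_broken = library.keep_if_it_works("broken_craft", broken_craft, {}, has_pickaxe)

inventory = library.run("craft_pickaxe", {})  # {'stone': 0, 'pickaxe': 1}
```

The failed attempt costs one trial and leaves no trace, while the successful ones become named building blocks. `craft_pickaxe` only works because `mine_stone` was kept first, which is the compounding in miniature.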

Why this is the hard, unfinished one

I'll be honest about where this sits: of the patterns in this whole track, exploration is the least solved and the easiest to get wrong.

Exploration is expensive — you spend real time and tokens on things that won't pan out, by definition, because if you knew they'd pan out it wouldn't be exploration. It's risky — an exploring agent does unexpected things, which is the entire point and also a safety problem, which is why the serious open-ended-discovery work runs inside sandboxes with an overseer watching. And the deepest version — an agent that keeps generating genuinely novel, useful behavior indefinitely, the open-endedness that natural evolution manages and our systems don't — is an open research question, not a feature with a config flag. Most "exploration" in shipped agents is a tuned temperature setting and a hope.
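For reference, this is roughly all the temperature knob does. The sketch below applies temperature-scaled softmax to some hypothetical action scores (the scores are invented, and the function name is mine): a low temperature puts nearly all probability on the top-scoring option, a high one spreads it toward uniform.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

scores = [2.0, 1.0, 0.0]  # hypothetical preferences for three actions

cold = softmax_with_temperature(scores, temperature=0.1)  # almost always action 0
hot = softmax_with_temperature(scores, temperature=10.0)  # close to uniform
```

Raising the temperature makes the agent try lower-scored options more often, but it does so blindly. Nothing here tracks what has been tried, what was surprising, or what is worth returning to, which is why it amounts to a hope rather than a strategy.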

So I won't tie a bow on it, because there isn't one. The agents that matter most might be the ones that surprise us — that find the approach no one specified, the way Voyager found tools no one told it to want. We are not very good at building those yet. We're good at building agents that do the expected thing well, and there's a quiet irony in that: the more reliably an agent hits the target you set, the less likely it is to ever find the one you didn't. Whether that trade is worth loosening — how much surprise you can afford in exchange for discovery — is, for now, still a question each system has to answer for itself.
