Memory Management

You tell the agent your name on Monday. By Wednesday it asks who you are.

Nothing broke. The conversation just ran long enough that your name scrolled past the edge of the context window, and a context window has no opinion about what mattered. It holds the last few thousand tokens and lets the rest fall off the back. The agent didn't forget you the way a person forgets a name at a party. It never had anywhere to keep you in the first place.

That gap is the whole topic. And the most common fix people reach for — a bigger context window — is the wrong tool wearing the right tool's clothes.

A bigger window is not a memory

Long-context models are genuinely useful, and they solve a real problem: fitting more into one shot. A 200k-token window means you can drop a whole codebase or a long transcript into a single request. But the window empties when the request ends. Next session, blank slate. Stretching it just means you forget later and pay more to do it.

There's a second tax. Stuff everything into the window and attention gets diluted, the one fact you needed ends up buried in the middle where models attend to it least, and every token is billed again on every turn. "Just use more context" gets more expensive exactly as the agent gets more useful. Memory is a different machine, on a different clock.

Two clocks

Split agent memory into two timescales and most of the confusion clears up.

Working memory is what's in the window right now — recent turns, the last tool result, the scratchpad the model is reasoning over. Fast, free to read (it's already there), gone at session end. You manage it by editing: summarize old turns, prune stale tool dumps, keep the goal pinned near the top.

Long-term memory is whatever the agent chose to keep — stored outside the model, usually in a vector store, fetched by similarity when it's relevant again. It survives restarts. It's what lets the agent recall on Wednesday what you said on Monday. The price: every recall is an extra retrieval call plus more tokens pushed back into the window.

The model I keep is the one from the MemGPT paper: treat the context like an OS treats RAM. A small core stays resident, the bulk lives in cheaper storage, the agent pages things in and out on demand. MemGPT (now shipping as Letta) made the agent itself responsible for that paging — deciding what to write down and what to fetch back.

Three flavors of "long-term"

"Save it for later" hides three different things, and conflating them is why a lot of memory layers feel dumb.

Episodic — what happened. "On Jan 12 the user asked for a refund and I escalated." A timestamped log of events.
Semantic — settled facts. "Prefers metric units. Account tier: Pro." Distilled, not narrated.
Procedural — how to do a thing here. "When this user reports a bug, ask for the build number first." Closer to a learned habit than a fact.

A naive system dumps raw chat logs into a vector store and calls it memory. Then it retrieves the transcript of a refund request when the agent actually needed the fact "this user has refund history." Good memory consolidates: repeated episodes get promoted into durable semantic facts, stale ones expire. The Stanford generative-agents work leaned hard on this — its simulated townspeople ran a reflection step that turned streams of observations into higher-level conclusions, and that's the move that made them behave coherently over days instead of minutes.

Writing and reading, concretely

Here's the shape with a managed memory layer (Mem0), which handles the extract-and-store step so you're not hand-rolling it:

from mem0 import Memory

mem = Memory()

# After a turn: write the salient bits, scoped per user.
mem.add(
    "Prefers metric units. Account tier: Pro. Asked about a refund on Jan 12.",
    user_id="alice",
)

# Next session: pull only what's relevant to the new query.
hits = mem.search("what plan is this user on?", user_id="alice", limit=3)
context = "\n".join(h["memory"] for h in hits)

reply = llm.chat(
    model="claude-sonnet-4",
    system=f"Known about this user:\n{context}",
    messages=[{"role": "user", "content": "Can I add three seats?"}],
)

Two things matter more than the library. First: scope by user_id (or tenant, or thread). A shared memory pool across users is a data leak with a friendly API. Second: you retrieve a few memories, not all of them — which makes retrieval quality load-bearing, and retrieval quality is a RAG problem in a memory costume.

For the working-memory side you usually don't touch a vector store at all. A LangGraph checkpointer persists thread state between turns, so one conversation stays coherent without re-sending the whole history every call.

Working memory and a long-term store exchanging facts — Working memory edits the window; long-term memory persists facts and recalls them later.

The failure modes nobody puts in the demo

Demos show the happy path: agent remembers your coffee order, everyone claps. Production shows the rest.

Memory bloat. Unbounded long-term memory rots. After a few thousand entries you've got three contradictory "facts" about the same user, and retrieval surfaces whichever one embeds closest to the query — which is to say, at random. Forgetting is a feature. Timestamp facts, expire stale ones, and when two conflict, prefer the recent one. Zep's Graphiti makes this explicit by giving every fact a validity window, so the agent knows when something was true, not just that it once was.

Retrieval misses. Mediocre embeddings, or a write step that stored a wall of transcript instead of a clean fact, and the right memory sits in the store and never comes back. The agent looks like it forgot. It didn't — it failed to find. Debug memory bugs as retrieval bugs first.

It's an attack surface. Long-term memory holds PII and it accepts writes based on conversation content. So it can be poisoned: a user, or a tool returning attacker-controlled text, plants a "fact" the agent later treats as ground truth. Scope per user, sanitize what you write, and never let a memory write carry instructions.

What I'd actually build

Start with none. A single-shot task with no continuity does not need a memory layer, and adding one is just a new thing that breaks.

The moment the agent has to hold context across steps, add a checkpointer — cheap, durable working memory for the active thread. Only when it has to recall across sessions do you reach for the external store, and when you do, treat it as three jobs: write clean facts (not transcripts), retrieve a few (not all), forget on a schedule.

Memory is what turns a tool that answers into an assistant that knows you. It's also the part most likely to quietly hand someone else's data to the wrong person. Build it like it's both.