MCP · Feb 10, 2026 · 5 min · mcp, llm, engineering

Scaling MCP

A server that runs as a subprocess on your laptop never has a scaling problem. There's one client, one process, one user. The scaling questions only show up the moment you decide a server should be a shared service — one endpoint that hundreds of agents hit at once. That's the jump from "MCP as a developer toy" to "MCP as infrastructure," and it's where most of the interesting engineering lives.

The whole thing turns on one word: state.

Why state is the enemy of horizontal scale

The standard recipe for scaling a web service is dull and reliable: run many identical instances behind a load balancer, let the balancer spray requests across them, add instances when traffic grows. It works because the instances are interchangeable. Any one of them can handle any request, because they don't remember anything between requests.

The moment a server holds per-connection state, that breaks. If instance A remembers something about your session that instance B doesn't, the balancer can't freely route your next request; it has to send you back to A. Now you need sticky sessions (the balancer pins each client to one instance) or a shared session store (every instance reads session state from a common Redis-like backend). Both work. Both carry operational weight: sticky sessions skew load distribution and complicate failover; a shared store adds a network hop and one more component that can go down.
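To make the failure concrete, here is a minimal sketch of two instances that keep session data in process memory, fronted by a round-robin balancer that knows nothing about sessions. The class and function names (`StatefulInstance`, `round_robin`) and the `open`/`next` operations are invented for illustration and are not part of MCP or any SDK.

```python
from itertools import cycle


class StatefulInstance:
    """A server instance that keeps session data in its own memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.sessions: dict[str, dict] = {}  # lives and dies with this process

    def handle(self, session_id: str, request: dict) -> str:
        if request["op"] == "open":
            self.sessions[session_id] = {"cursor": 0}
            return f"{self.name}: opened {session_id}"
        if session_id not in self.sessions:
            return f"{self.name}: unknown session {session_id}"
        self.sessions[session_id]["cursor"] += 1
        return f"{self.name}: cursor={self.sessions[session_id]['cursor']}"


def round_robin(instances):
    """A balancer that ignores sessions and simply takes the next instance."""
    ring = cycle(instances)
    return lambda: next(ring)


instances = [StatefulInstance("A"), StatefulInstance("B")]
pick = round_robin(instances)

first = pick().handle("s1", {"op": "open"})   # lands on A
second = pick().handle("s1", {"op": "next"})  # lands on B, which never saw s1
print(first)
print(second)
```

The second request fails not because B is broken but because the memory it needs lives in A. Sticky sessions and shared stores are two ways of papering over exactly this.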

So the scaling game for MCP is really: how little state can a server hold?

Figure: agents hitting a load balancer that distributes requests across stateless server instances. Any instance can answer any request.

When instances are stateless, that diagram is trivial — any arrow can hit any box. When they're stateful, every arrow secretly needs to keep hitting the same box, and the picture stops being a fan-out and starts being a routing puzzle.

Streamable HTTP is what makes remote scaling possible at all

stdio servers don't scale horizontally — they're one process per client by construction. The transport that opened the door to real deployment was Streamable HTTP, which became the recommended remote transport in the 2025-03-26 spec, replacing the older SSE-based approach. It puts an MCP server behind an ordinary HTTP endpoint, which means it can sit behind an ordinary load balancer with ordinary autoscaling — the boring, battle-tested machinery the rest of the web already runs on.

Crucially, Streamable HTTP can operate statelessly. A server can be built so each request carries everything needed to handle it, no session memory required, and then any instance answers any request. That's the configuration you want if you can get it: deploy N copies, point a vanilla load balancer at them, scale on CPU, go home.
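As a rough illustration of "each request carries everything needed," the sketch below handles a JSON-RPC-style message with a pure function. It uses only the standard library rather than a real MCP SDK; the `add` tool is hypothetical, and the result shape is simplified and is not MCP's actual tool-result format.

```python
import json


def handle_request(raw: str) -> str:
    """Answer one JSON-RPC-style request using only what the request carries."""
    req = json.loads(raw)
    if req["method"] == "tools/call" and req["params"]["name"] == "add":
        args = req["params"]["arguments"]
        result = {"sum": args["a"] + args["b"]}  # simplified result shape
        return json.dumps({"jsonrpc": "2.0", "id": req["id"], "result": result})
    error = {"code": -32601, "message": "Method not found"}
    return json.dumps({"jsonrpc": "2.0", "id": req["id"], "error": error})


# Three "instances" are three references to the same pure function:
# nothing distinguishes them, so a balancer may pick any of them.
instances = [handle_request, handle_request, handle_request]

request = json.dumps({
    "jsonrpc": "2.0", "id": 1, "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 2, "b": 3}},
})
responses = [instance(request) for instance in instances]
print(responses[0])
```

Because the handler reads nothing but its argument, every instance returns an identical response, which is the property a vanilla load balancer relies on.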

When you genuinely need a session

Stateless is the goal, not always the reality. The HTTP transport supports sessions via an Mcp-Session-Id header precisely because some workloads are inherently stateful — a long-running task that holds intermediate results, a connection that streams a sequence of related operations, a server that maintains an expensive open handle to a backend.

When you do need state, the choices are the usual ones, ranked by how much they hurt scaling:

  1. Externalize it. Push session state into a shared store so the instances themselves stay interchangeable. This is the most scale-friendly answer — you've moved the state out of the boxes you want to be disposable.
  2. Sticky sessions. Pin each session to an instance at the balancer. Simple to turn on, but load no longer spreads evenly, and a dead instance takes its sessions down with it.
  3. Keep state in-process and accept the limits. Fine for low scale, a trap at high scale.
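The first option can be sketched in a few lines. This is a toy, assuming a plain dict as a stand-in for a shared backend such as Redis; `SessionStore`, `Instance`, and `next_page` are hypothetical names chosen for the example.

```python
class SessionStore:
    """Stand-in for a shared backend such as Redis; here just a dict."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def load(self, session_id: str) -> dict:
        # Return a copy so callers must save explicitly, as with a remote store.
        return dict(self._data.get(session_id, {"cursor": 0}))

    def save(self, session_id: str, state: dict) -> None:
        self._data[session_id] = dict(state)


class Instance:
    """Holds no session data of its own; everything goes through the store."""

    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def next_page(self, session_id: str) -> int:
        state = self.store.load(session_id)
        state["cursor"] += 1
        self.store.save(session_id, state)
        return state["cursor"]


store = SessionStore()
a, b = Instance("A", store), Instance("B", store)

cursors = [a.next_page("s1"), b.next_page("s1")]
del a  # instance A is recycled; the session lives on in the store
cursors.append(b.next_page("s1"))
print(cursors)
```

The session advances across both instances and survives losing one of them. Note that the load-modify-save sequence here is not atomic; against a real shared store, concurrent requests on the same session would need an atomic update or a lock.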

The honest rule: hold session state only for the operations that truly can't be stateless, and externalize even that when you can. Every byte of per-connection memory is a constraint on how freely the load balancer can do its job.

Where the protocol is heading

The direction of travel through late 2025 and into 2026 is unmistakable: toward less state, not more. The protocol's own roadmap discussions have been explicit that statelessness is a first-class goal — a stateless protocol that scales by default, while still offering session features for the workloads that need them. The 2025-11-25 revision's experimental tasks point the same way: a model for durable, long-running work that you poll for results rather than holding a connection open the whole time, which is exactly the pattern that lets a long job survive an instance recycling under it.
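The create-then-poll shape is easy to sketch. What follows is an illustration of the pattern only, not the wire format of the spec's tasks feature: `TaskStore`, `Worker`, the status strings, and the summing "job" are all assumptions made for the example, and a dict stands in for durable shared storage.

```python
import uuid


class TaskStore:
    """Stand-in for durable shared storage; a dict here."""

    def __init__(self) -> None:
        self.tasks: dict[str, dict] = {}


class Worker:
    """A server instance that keeps all task progress in the shared store."""

    def __init__(self, store: TaskStore) -> None:
        self.store = store

    def create_task(self, numbers: list[int]) -> str:
        task_id = str(uuid.uuid4())
        self.store.tasks[task_id] = {
            "status": "working", "remaining": list(numbers), "total": 0,
        }
        return task_id

    def step(self, task_id: str) -> None:
        """Do one unit of work; progress is recorded in the store."""
        task = self.store.tasks[task_id]
        if task["status"] != "working":
            return
        if task["remaining"]:
            task["total"] += task["remaining"].pop(0)
        if not task["remaining"]:
            task["status"] = "completed"

    def poll(self, task_id: str) -> dict:
        task = self.store.tasks[task_id]
        return {"status": task["status"], "total": task["total"]}


store = TaskStore()
first = Worker(store)
task_id = first.create_task([1, 2, 3])
first.step(task_id)
midway = first.poll(task_id)

del first               # the instance is recycled mid-job
second = Worker(store)  # a replacement picks up from the stored progress
second.step(task_id)
second.step(task_id)
result = second.poll(task_id)
print(midway, result)
```

The client only ever holds a task ID, so it does not matter which instance answers a given poll or which one finishes the work.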

I'd plan for that future now. Build new servers stateless-first. Treat any session state as a liability you have to justify, not a convenience you reach for. The teams that designed their servers around fat, sticky sessions are the ones who'll be refactoring; the teams who kept requests self-contained will mostly just turn up the replica count.

The cheaper scaling wins people skip

Before you reach for more instances, two things buy more than horizontal scale usually does:

Cache aggressively. A lot of MCP traffic is reads — list calls, resource fetches, the same lookups over and over. Caching tool results and resource contents (with sane invalidation) cuts load before it ever reaches the question of how many instances you need.
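A minimal sketch of that idea, assuming time-based expiry plus manual invalidation is "sane" enough for the data in question. `TTLCache`, `read_resource`, and the `docs://readme` key are hypothetical; the clock is injectable only so the example can simulate time passing.

```python
import time
from typing import Any, Callable


class TTLCache:
    """Caches values for ttl seconds; supports explicit invalidation too."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl
        self.clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def get_or_fetch(self, key: str, fetch: Callable[[], Any]) -> Any:
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: the backend is not touched
        value = fetch()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: str) -> None:
        self._entries.pop(key, None)


backend_calls = {"n": 0}


def read_resource() -> str:
    """Pretend backend read; counts how often it is actually reached."""
    backend_calls["n"] += 1
    return "contents of docs://readme"


fake_now = [0.0]
cache = TTLCache(ttl=30.0, clock=lambda: fake_now[0])

for _ in range(100):
    cache.get_or_fetch("docs://readme", read_resource)
calls_while_fresh = backend_calls["n"]

fake_now[0] = 31.0  # simulate the entry going stale
cache.get_or_fetch("docs://readme", read_resource)
calls_after_expiry = backend_calls["n"]
print(calls_while_fresh, calls_after_expiry)
```

A hundred identical reads reach the backend once, and again only after the entry expires. An in-process cache like this is per-instance; that is usually acceptable for read-only data, since a miss on another instance costs one extra fetch rather than a wrong answer.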

Don't over-connect. This one's free. A host wired to fifteen servers makes more connections, more discovery calls, and more confused tool selection than one wired to the three it needs. Half of "scaling MCP" is just not generating load you didn't have to.

Scaling MCP, in the end, isn't exotic. It's the same statelessness-and-load-balancers playbook every web service has used for twenty years, applied to a protocol that — once you're on Streamable HTTP — was built to let you use it. The trick is refusing to accumulate the state that would take that playbook away from you.
