Last year we tried to scale our API by adding more servers behind a load balancer. Traffic was fine at 1,000 concurrent users. At 5,000, everything broke.
The load balancer was routing requests randomly. But our sessions lived in server memory. User A logs in on server 1, next request goes to server 2, suddenly they're logged out. Shopping carts disappeared mid-checkout.
The fix wasn't more servers. It was removing state from our servers entirely.
A stateless service doesn't remember anything between requests. Every request contains everything needed to process it. You could kill the server mid-request, restart it, send the same request again, and get the same result.
A stateful service holds data between requests—sessions in memory, files on local disk, locks in local variables. That data ties users to specific servers.
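Here's the difference in code. A minimal sketch; the class and names are mine, not from our codebase:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class Handlers {
    // Stateful: this map lives in one server's memory. Route the next
    // request to a different server and the "session" is gone.
    private final Map<String, String> sessions = new ConcurrentHashMap<>();

    String statefulWhoAmI(String sessionId) {
        return sessions.get(sessionId); // null on every other server
    }

    // Stateless: the request carries everything needed. Any server,
    // any restart, same answer.
    String statelessWhoAmI(String verifiedUserIdFromRequest) {
        return verifiedUserIdFromRequest;
    }
}
```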
Think of it like a coffee shop chain. A stateless shop: you walk in, order, pay, get coffee. Any location works the same. A stateful shop: they keep your loyalty card on-site. Now you're stuck going back to that one location.
Most apps track users across requests. You log in once, browse around, add items to cart. HTTP is stateless, so we use session cookies to fake continuity.
The question is: where does session data live?
In server memory: Fast, but now users are pinned to servers. Can't scale horizontally, can't restart servers without logging everyone out.
In cookies: Simple and truly stateless. Works great for small data—user ID, a token, preferences. But cookies travel with every single request, and browsers cap each one at around 4KB. A full 4KB of session data means 4KB of overhead on every HTTP call. Mobile users feel this.
In external storage: Redis or a database holds the sessions; the cookie carries only a session ID. Any server can look up the session and handle any request. This is what most production systems do.
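In code, that's roughly this. A sketch assuming the Jedis client; the key prefix and 30-minute TTL are illustrative:

```java
import redis.clients.jedis.JedisPooled;
import java.util.UUID;

class SessionStore {
    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    // On login: create the session server-side, hand back only the ID
    // (it goes into a Set-Cookie header).
    String create(String userId) {
        String sessionId = UUID.randomUUID().toString();
        redis.setex("session:" + sessionId, 1800, userId); // 30-minute TTL
        return sessionId;
    }

    // On any request, on any server: resolve the cookie's ID to a user.
    String lookup(String sessionId) {
        String userId = redis.get("session:" + sessionId);
        if (userId != null) {
            redis.expire("session:" + sessionId, 1800); // sliding expiration
        }
        return userId; // null means not logged in
    }
}
```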
Sessions are obvious. Files are sneaky.
User uploads a profile photo. It lands on server 2's disk. Next request goes to server 1. Profile photo not found.
Object storage (S3, GCS, MinIO): Store files externally. All servers read/write to the same bucket. This is the standard answer.
CDN for reads: Serve static content from edge nodes. Faster for users, less load on your servers.
Generated files—PDFs, reports, exports—same problem. Generate them, push to object storage, return a signed URL. Don't store them locally.
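A sketch of that flow with the AWS SDK for Java v2. The bucket name and key layout are assumptions; real code would reuse the clients and handle errors:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import java.nio.file.Path;
import java.time.Duration;

class ExportStore {
    private final S3Client s3 = S3Client.create();
    private final S3Presigner presigner = S3Presigner.create();
    private static final String BUCKET = "app-exports"; // assumed name

    // Generate locally, push to the shared bucket, return a link.
    // The local file is disposable after this; no server owns it.
    String publish(Path localPdf, String key) {
        s3.putObject(
            PutObjectRequest.builder().bucket(BUCKET).key(key).build(),
            RequestBody.fromFile(localPdf));

        // Time-limited download URL: clients fetch straight from S3,
        // not from whichever server happened to generate the file.
        GetObjectPresignRequest presign = GetObjectPresignRequest.builder()
            .signatureDuration(Duration.ofMinutes(15))
            .getObjectRequest(GetObjectRequest.builder()
                .bucket(BUCKET).key(key).build())
            .build();
        return presigner.presignGetObject(presign).url().toString();
    }
}
```

The same pattern covers user uploads: put the photo in the bucket, store the key, serve it via CDN or signed URL.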
Caches: Local in-memory caches seem harmless. User updates their profile on server 1, cache on server 2 still has old data. Solution: centralized cache (Redis again) or cache invalidation strategies.
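A minimal cache-aside sketch against shared Redis (Jedis again; key names, TTL, and the database stand-ins are illustrative). Reads fill the cache; writes delete the cached copy so no server keeps serving stale data:

```java
import redis.clients.jedis.JedisPooled;

class ProfileCache {
    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    // Cache-aside read: every server consults the same cache.
    String getProfile(String userId) {
        String cached = redis.get("profile:" + userId);
        if (cached != null) return cached;
        String fresh = loadFromDatabase(userId);
        redis.setex("profile:" + userId, 300, fresh); // 5-minute TTL
        return fresh;
    }

    // Update: write the database, then invalidate. The next read,
    // from any server, repopulates with fresh data.
    void updateProfile(String userId, String json) {
        saveToDatabase(userId, json);
        redis.del("profile:" + userId);
    }

    private String loadFromDatabase(String userId) { return "{}"; } // stand-in
    private void saveToDatabase(String userId, String json) {}      // stand-in
}
```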
Scheduled jobs: Cron jobs on specific servers create implicit state. That server becomes special. Move to distributed job queues—SQS, RabbitMQ, or database-backed queues.
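Here's what the database-backed option can look like. A sketch assuming a jobs table with id, payload, and status columns; FOR UPDATE SKIP LOCKED is PostgreSQL syntax (MySQL 8 has it too). Every worker is identical, so no server is special:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

class JobWorker {
    // Claim one pending job atomically. SKIP LOCKED means concurrent
    // workers polling the same table never grab the same row.
    void claimAndRunOne(Connection db) throws Exception {
        db.setAutoCommit(false); // claim + completion in one transaction
        try (Statement st = db.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT id, payload FROM jobs WHERE status = 'pending' " +
                 "ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED")) {
            if (rs.next()) {
                long id = rs.getLong("id");
                run(rs.getString("payload"));
                try (Statement done = st.getConnection().createStatement()) {
                    done.executeUpdate(
                        "UPDATE jobs SET status = 'done' WHERE id = " + id);
                }
            }
        }
        db.commit(); // crash before commit? The row unlocks and another
                     // worker picks the job up
    }

    private void run(String payload) { /* the actual work */ }
}
```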
WebSocket connections: Long-lived connections are inherently stateful. User connects to server 1, and you can't just route their messages through server 2. Solutions: sticky sessions (a compromise), or a pub/sub layer where any server can publish to any connection.
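A rough sketch of the pub/sub route with Redis: any server publishes to a per-user channel; whichever server actually holds the socket is subscribed and delivers. Channel naming is illustrative:

```java
import redis.clients.jedis.JedisPooled;
import redis.clients.jedis.JedisPubSub;
import java.util.function.Consumer;

class WsBridge {
    private final JedisPooled redis = new JedisPooled("localhost", 6379);

    // Any server can call this, no matter where the user is connected.
    void send(String userId, String message) {
        redis.publish("ws:" + userId, message);
    }

    // The server holding the WebSocket subscribes for its connected
    // users. subscribe() blocks, so run it on a dedicated thread.
    void listen(String userId, Consumer<String> deliverToSocket) {
        redis.subscribe(new JedisPubSub() {
            @Override
            public void onMessage(String channel, String message) {
                deliverToSocket.accept(message); // write to the local socket
            }
        }, "ws:" + userId);
    }
}
```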
Stateless isn't free. External session storage adds latency—network hop to Redis on every request. Object storage is slower than local disk. Distributed caches are more complex than Map<String, Object>.
But the alternative is being unable to scale. Being unable to deploy without kicking users off. Being unable to recover from server failures gracefully.
For most web applications, the tradeoff is worth it. Make your servers interchangeable clones. Keep state in purpose-built systems designed for it.
If your server can't die without consequences, you can't scale.
— blanho