We had 15 microservices. Each one implemented its own authentication. Each one had its own retry logic. Each one logged differently. When something broke, we spent hours correlating logs across services.
Then one slow service caused a cascade. Other services retried aggressively. The slow service got slower. Everything fell over.
We were missing two things: an API gateway at the edge, and a service mesh inside.
Think of the gateway as a bouncer at a club. Everyone from the internet hits the gateway first. It checks their ID, makes sure they're not causing trouble, and only then lets them through to your actual services.
Authentication happens once, at the edge. Internal services don't need to re-check tokens. If a request made it past the gateway, it's authenticated. This alone eliminates a ton of duplicate code.
Rate limiting stops abusive clients before they touch your services. Some IP hammering you with 10K requests per minute? Blocked at the edge. Your services never even see the traffic.
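The standard mechanism here is a token bucket: each client's bucket refills at a steady rate, and requests that arrive with an empty bucket get rejected. A minimal sketch (class and method names are illustrative, not any gateway's real API):

```python
# Token-bucket rate limiter: one bucket per client IP, refilled over time.
# Requests over the budget are dropped at the edge, per the caller's clock.

class RateLimiter:
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.burst = burst
        self.buckets = {}  # client_ip -> (tokens, last_seen_time)

    def allow(self, client_ip: str, now: float) -> bool:
        tokens, last = self.buckets.get(client_ip, (float(self.burst), now))
        tokens = min(self.burst, tokens + (now - last) * self.rate)  # refill
        if tokens >= 1.0:
            self.buckets[client_ip] = (tokens - 1.0, now)
            return True
        self.buckets[client_ip] = (tokens, now)
        return False
```

The abusive IP burns through its burst and gets nothing but rejections; everyone else's buckets are untouched.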
Request routing becomes trivial. Want to version your API differently? Route /api/v1/users to the old service and /api/v2/users to the new one. Blue-green deployments? Just flip the routing config.
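Under the hood this is just a longest-prefix match against a routing table. A sketch, with made-up upstream hostnames:

```python
# Prefix-routing sketch: the gateway maps a path prefix to an upstream.
# Flipping a blue-green deployment is just editing this table.

ROUTES = {
    "/api/v1/users": "http://users-v1.internal:8080",
    "/api/v2/users": "http://users-v2.internal:8080",
    "/api/v1":       "http://legacy.internal:8080",
}

def route(path: str):
    """Pick the upstream with the longest matching prefix, or None."""
    best = None
    for prefix, upstream in ROUTES.items():
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, upstream)
    return best[1] if best else None
```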
SSL terminates at the edge. Handle HTTPS once, and internal traffic can be plain HTTP—faster and simpler. The gateway holds all the certificates.
Without a gateway, every service becomes an attack surface. Every service needs auth logic. Every service handles its own rate limiting. You end up with 15 slightly different implementations of the same thing, and they all have different bugs.
Kong, AWS API Gateway, Nginx—pick one. They all solve the same problem.
The gateway handles external traffic. But what about when Service A calls Service B, which calls Service C?
Here's what happens without a mesh: C gets slow. B is waiting on C, so B's threads start piling up. B can't respond to A. A starts timing out. One slow service just took down three.
With a mesh, the story changes. When C gets slow, a circuit breaker opens. B immediately returns a fallback response instead of waiting forever. A stays healthy. The blast radius is contained to the one service that's actually broken.
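The breaker itself is simple state machine logic, which is exactly why the mesh can do it generically in a proxy. A sketch of the pattern, assuming the caller supplies a clock and a fallback (all names here are illustrative):

```python
# Circuit-breaker sketch: after enough consecutive failures the breaker
# "opens" and calls fail fast with the fallback instead of waiting on C.
# After a cool-down, one probe is allowed through; a success closes it again.

class CircuitBreaker:
    def __init__(self, failure_threshold: int, cooldown: float):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback, now: float):
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            return fallback()          # open: fail fast, don't pile onto C
        try:
            result = fn()              # closed (or half-open probe): try the call
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now   # trip the breaker
            return fallback()
        self.failures = 0              # success closes the breaker
        self.opened_at = None
        return result
```

While the breaker is open, B spends zero threads waiting on C. That's the contained blast radius.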
The mesh gives you automatic retries with budgets—retry failed requests, but cap the total retries so you don't create a retry storm. You get circuit breakers that stop traffic to failing services, letting them recover instead of piling on more load.
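The budget part is the piece people miss: it caps retries as a fraction of real traffic, so a failing dependency can't amplify load. A simplified sketch of the idea (meshes like Linkerd enforce something along these lines; the class here is illustrative):

```python
# Retry-budget sketch: retries are allowed only while total retries stay
# under a fixed fraction of total requests. When the dependency is healthy,
# occasional retries sail through; when it's failing, retries get cut off
# instead of multiplying the load into a retry storm.

class RetryBudget:
    def __init__(self, ratio: float):
        self.ratio = ratio    # e.g. 0.2 means at most 1 retry per 5 requests
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def can_retry(self) -> bool:
        if self.retries + 1 > self.requests * self.ratio:
            return False      # budget spent: fail the request instead of storming
        self.retries += 1
        return True
```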
There's mTLS everywhere, encrypting all internal traffic. Zero trust inside your own network. No service can impersonate another.
Observability comes built-in. Every call is traced. Request IDs flow through the entire chain. When latency spikes, you can see exactly which hop is slow.
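The mechanical core of that is request-ID propagation: stamp the ID once at the edge, and every hop copies it onto outgoing calls and log lines. A sketch (the `X-Request-ID` header name is a common convention, not universal):

```python
# Request-ID propagation sketch: one ID per request, attached at the edge,
# carried through every hop, printed on every log line. One grep then
# reconstructs the whole chain.

import uuid

def ensure_request_id(headers: dict) -> dict:
    """At the edge: attach an ID if the request doesn't already have one."""
    out = dict(headers)
    out.setdefault("X-Request-ID", str(uuid.uuid4()))
    return out

def log_line(headers: dict, service: str, message: str) -> str:
    """Inside a service: every log line carries the same ID."""
    return f"[{headers['X-Request-ID']}] {service}: {message}"
```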
And traffic shifting makes deployments less scary. Deploy a new version to 5% of traffic. Watch the metrics. Roll forward if it's good, roll back if it's not.
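One common way to implement the split is to hash a stable key, like the user ID, so the same user always lands on the same version while roughly 5% of users hit the canary. A sketch (a mesh does this in the sidecar; the version labels here are made up):

```python
# Canary traffic-split sketch: bucket users deterministically by a stable
# hash (not Python's randomized hash()) and send a small slice to the canary.

import hashlib

def pick_version(user_id: str, canary_percent: int) -> str:
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] % 100  # roughly uniform bucket in [0, 100)
    return "v2-canary" if bucket < canary_percent else "v1-stable"
```

Determinism matters: a user flapping between versions mid-session would produce exactly the kind of confusing bug report a canary is supposed to catch cleanly.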
Istio, Linkerd, AWS App Mesh—they work by injecting sidecar proxies next to each service. The proxy handles all the networking logic. Your application code stays simple.
The gateway handles what's known as north-south traffic—requests coming in from the outside world. It deals with authentication, rate limiting, API versioning, SSL, and request transformation.
The mesh handles east-west traffic—service-to-service calls inside your infrastructure. It manages mTLS, retry budgets, circuit breaking, distributed tracing, and traffic splitting.
The gateway protects your system from the outside world. The mesh protects your system from itself.
Both necessary. Different problems.
If you're running a monolith, you probably don't need either. Maybe a gateway if you want rate limiting, but definitely not a mesh.
With 3-5 services, a gateway starts making sense. Centralized auth is worth it. But you can still manage retries manually without too much pain.
Once you hit 10+ services, you need both. The operational complexity without a mesh gets brutal. Every incident turns into hours of log correlation.
At 50+ services, the mesh isn't optional—it's mandatory. Incidents become effectively undebuggable without distributed tracing. You'll spend more time figuring out what happened than actually fixing it.
These aren't luxury components. They're what make microservices survivable at scale.
Microservices without a mesh is just distributed debugging.
— blanho