Monolith: it works or it doesn't.
Distributed system: half of it works, the other half might be working, and you can't tell which is which.
This is the fundamental problem. In a monolith, failure is total. Process dies, everything stops. You know it's down.
In distributed systems, one service fails while nine others work. The failing service might be slow, not dead—so timeouts don't trigger. Requests pile up. Circuit breakers don't trip because technically nothing is failing, just taking 30 seconds.
```
User → API Gateway → Service A → Service B → ???
                                     ↓
                              Service C (slow)
                                     ↓
                            Timeout... eventually
```
What do you tell the user? You don't know what happened.
Networks partition. Packets drop. Latency spikes. You can't tell the difference between "slow" and "down."
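The only defense is to stop waiting. Here's a minimal Go sketch, assuming a hypothetical `service-c.internal` endpoint and an arbitrary 2-second budget; after the deadline, slow *is* down, by decree.

```go
// Minimal sketch: you can't distinguish "slow" from "down", so you pick a
// deadline and enforce it. The URL and the 2-second budget are illustrative.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func callServiceC(ctx context.Context) error {
	// Give Service C a hard deadline. Past this point, slow counts as down.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://service-c.internal/health", nil)
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("service C unreachable or too slow: %w", err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := callServiceC(context.Background()); err != nil {
		fmt.Println("treat as down:", err)
	}
}
```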
Before: function call → result (0.001ms)
After: network call → maybe result (5-5000ms)
Every network boundary is a new failure mode.
You need retry logic, idempotency keys, circuit breakers, health checks. For a function call that used to be one line.
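Here's what that one line turns into. A minimal Go sketch, assuming a hypothetical `payments.internal` endpoint that honors an `Idempotency-Key` header; the endpoint, header name, and retry budget are illustrative, not gospel.

```go
// Sketch of retry + idempotency key. One key is reused across all attempts so
// the server can deduplicate: three attempts must never mean three charges.
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)

func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b) // error ignored for brevity in a sketch
	return hex.EncodeToString(b)
}

func chargeWithRetry(body []byte) error {
	key := newIdempotencyKey()
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error

	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequest(http.MethodPost,
			"http://payments.internal/charges", bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Idempotency-Key", key)

		resp, err := client.Do(req)
		switch {
		case err != nil:
			lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
		case resp.StatusCode >= 200 && resp.StatusCode < 300:
			resp.Body.Close()
			return nil
		default:
			resp.Body.Close()
			lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)
		}
		time.Sleep(time.Duration(attempt) * 200 * time.Millisecond) // crude backoff
	}
	return lastErr
}

func main() {
	fmt.Println(chargeWithRetry([]byte(`{"amount": 100}`)))
}
```

Reusing one key across attempts is the whole trick: a retry after a lost response doesn't double-charge, because the server has seen that key before.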
You have three servers. Each has a clock. None of them agree.
Server A: 10:00:00.000
Server B: 10:00:00.123
Server C: 09:59:59.998
Now try to order events across these servers. "This write happened before that read"—how do you know? You don't. NTP shrinks the skew; it can't eliminate it.
This is why distributed systems use logical clocks, vector clocks, or just avoid depending on time ordering entirely.
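Here's what a logical clock buys you. A minimal Lamport-clock sketch in Go; it never reads wall-clock time, so the servers' disagreement above simply doesn't matter.

```go
// Lamport clock sketch: ordering by causality, not by wall clock.
// It only gives a partial "happened-before" ordering.
package main

import "fmt"

type LamportClock struct {
	counter uint64
}

// Tick advances the clock for a local event and returns its timestamp.
func (c *LamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Observe merges a timestamp received from another node: the local clock
// jumps past whatever the sender had seen, preserving causality.
func (c *LamportClock) Observe(remote uint64) uint64 {
	if remote > c.counter {
		c.counter = remote
	}
	c.counter++
	return c.counter
}

func main() {
	var a, b LamportClock

	t1 := a.Tick()      // A writes locally            -> 1
	t2 := a.Tick()      // A sends a message, stamped  -> 2
	t3 := b.Observe(t2) // B receives A's message      -> 3
	t4 := b.Tick()      // B's follow-up write         -> 4

	fmt.Println(t1, t2, t3, t4) // 1 2 3 4: causal order, no clocks consulted
}
```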
You've heard the elevator pitch: consistency, availability, partition tolerance—pick two.
Reality: partition tolerance isn't something you pick. Partitions happen. Networks fail. Your only real choice is what to do during a partition: return stale data (availability) or return an error (consistency)?
Most systems choose availability. Users hate errors more than slightly stale data.
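That choice is small enough to write down. A Go sketch, assuming a replica that keeps a local copy and has some way of noticing the leader is unreachable; both are illustrative stand-ins, not any particular database's API.

```go
// Sketch of the actual CAP decision a replica makes during a partition.
package main

import (
	"errors"
	"fmt"
)

type Replica struct {
	lastKnown          string // possibly stale local copy
	leaderReachable    bool   // in reality: a failed quorum read, a timeout
	preferAvailability bool   // the one decision you actually get to make
}

// Read: during a partition you either answer with what you have (AP)
// or refuse to answer (CP). There is no third option.
func (r *Replica) Read() (string, error) {
	if r.leaderReachable {
		return r.lastKnown, nil // no partition, no dilemma
	}
	if r.preferAvailability {
		return r.lastKnown, nil // AP: possibly stale, but the user gets an answer
	}
	return "", errors.New("cannot confirm freshness during partition") // CP
}

func main() {
	r := Replica{lastKnown: "v41", leaderReachable: false, preferAvailability: true}
	fmt.Println(r.Read())
}
```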
Leader receives write. Replicates to followers. Leader dies before replication completes.
Did the write happen? Depends on who you ask.
Client: "Write confirmed!"
Follower 1: "Never got it"
Follower 2: "Got it"
New leader: Follower 1
User data: Gone
This is why distributed databases have replication acknowledgment settings. Synchronous replication: slow but safe. Asynchronous: fast but lossy.
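The whole tradeoff is one parameter: how many acknowledgments before you tell the client "confirmed." A toy Go sketch with in-process goroutines standing in for followers; the delays and follower names are invented.

```go
// Sketch of the replication-acknowledgment knob.
package main

import (
	"fmt"
	"time"
)

// replicate pretends to ship the write to one follower and reports back.
func replicate(follower string, acks chan<- string) {
	time.Sleep(50 * time.Millisecond) // stand-in for network + disk
	acks <- follower
}

// write confirms to the client only after `required` followers acknowledge.
// required = len(followers): synchronous, slow but safe.
// required = 0: fire-and-forget, fast, and exactly how the write above got lost.
func write(followers []string, required int) {
	acks := make(chan string, len(followers))
	for _, f := range followers {
		go replicate(f, acks)
	}
	for i := 0; i < required; i++ {
		fmt.Println("ack from", <-acks)
	}
	fmt.Printf("confirmed to client after %d/%d acks\n", required, len(followers))
}

func main() {
	write([]string{"follower-1", "follower-2"}, 2) // sync: both must have it
	write([]string{"follower-1", "follower-2"}, 0) // async: confirm now, hope later
	// In the async case the program can exit before replication finishes --
	// which is the data-loss scenario in miniature.
}
```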
Getting multiple nodes to agree on something requires multiple round trips. Raft, Paxos, whatever—you're paying latency.
This is fine for infrequent operations. For hot paths, consensus becomes the bottleneck. That's why distributed caches exist: trade consistency for speed.
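That trade looks like this in practice. A minimal cache-aside sketch in Go with an invented 5-second TTL and a `loadFromDatabase` stand-in: within the TTL you skip the expensive consistent read entirely and accept that the answer may be stale.

```go
// Cache-aside sketch: fast and possibly stale, or slow and fresh.
package main

import (
	"fmt"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

type Cache struct {
	ttl   time.Duration
	items map[string]entry
}

func loadFromDatabase(key string) string {
	time.Sleep(20 * time.Millisecond) // stand-in for the quorum/consensus read
	return "value-of-" + key
}

// Get serves from memory while the entry is younger than the TTL; otherwise
// it pays for the slow, consistent read and refreshes the local copy.
func (c *Cache) Get(key string) string {
	if e, ok := c.items[key]; ok && time.Since(e.fetched) < c.ttl {
		return e.value // fast, possibly stale
	}
	v := loadFromDatabase(key)
	c.items[key] = entry{value: v, fetched: time.Now()}
	return v // slow, fresh
}

func main() {
	c := &Cache{ttl: 5 * time.Second, items: map[string]entry{}}
	fmt.Println(c.Get("user:42")) // miss: pays the latency
	fmt.Println(c.Get("user:42")) // hit: fast, but could be stale for up to 5s
}
```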
Stop thinking about distributed systems as "multiple computers running my code."
Think: "Multiple computers running my code, none of which trust each other, all of which might be lying, connected by links that fail at random."
Design for failure. Assume everything breaks. Test what happens when it does.
A distributed system is one where a computer you've never heard of can crash your application.
— blanho