Monolith: it works or it doesn't.
Distributed system: half of it works, the other half might be working, and you can't tell which is which.
This is the fundamental problem. In a monolith, failure is total. Process dies, everything stops. You know it's down.
In distributed systems, one service fails while nine others work. The failing service might be slow, not dead—so timeouts don't trigger. Requests pile up. Circuit breakers don't trip because technically nothing is failing, just taking 30 seconds.
```
User → API Gateway → Service A → Service B → ???
                                     ↓
                              Service C (slow)
                                     ↓
                            Timeout... eventually
```
What do you tell the user? You don't know what happened.
Networks partition. Packets drop. Latency spikes. You can't tell the difference between "slow" and "down."
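The only defense is to stop waiting. Here's a minimal Go sketch, assuming a hypothetical `service-c.internal` endpoint and an arbitrary 2-second budget; after the deadline, slow *is* down, by decree.

```go
// Minimal sketch: you can't distinguish "slow" from "down", so you pick a
// deadline and enforce it. The URL and the 2-second budget are illustrative.
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func callServiceC(ctx context.Context) error {
	// Give Service C a hard deadline. Past this point, slow counts as down.
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet,
		"http://service-c.internal/health", nil)
	if err != nil {
		return err
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return fmt.Errorf("service C unreachable or too slow: %w", err)
	}
	defer resp.Body.Close()
	return nil
}

func main() {
	if err := callServiceC(context.Background()); err != nil {
		fmt.Println("treat as down:", err)
	}
}
```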
Before: function call → result (0.001ms)
After: network call → maybe result (5-5000ms)
Every network boundary is a new failure mode.
You need retry logic, idempotency keys, circuit breakers, health checks. For a function call that used to be one line.
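Here's what that one line turns into. A minimal Go sketch, assuming a hypothetical `payments.internal` endpoint that honors an `Idempotency-Key` header; the endpoint, header name, and retry budget are illustrative, not gospel.

```go
// Sketch of retry + idempotency key. One key is reused across all attempts so
// the server can deduplicate: three attempts must never mean three charges.
package main

import (
	"bytes"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"time"
)

func newIdempotencyKey() string {
	b := make([]byte, 16)
	rand.Read(b) // error ignored for brevity in a sketch
	return hex.EncodeToString(b)
}

func chargeWithRetry(body []byte) error {
	key := newIdempotencyKey()
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error

	for attempt := 1; attempt <= 3; attempt++ {
		req, err := http.NewRequest(http.MethodPost,
			"http://payments.internal/charges", bytes.NewReader(body))
		if err != nil {
			return err
		}
		req.Header.Set("Idempotency-Key", key)

		resp, err := client.Do(req)
		switch {
		case err != nil:
			lastErr = fmt.Errorf("attempt %d: %w", attempt, err)
		case resp.StatusCode >= 200 && resp.StatusCode < 300:
			resp.Body.Close()
			return nil
		default:
			resp.Body.Close()
			lastErr = fmt.Errorf("attempt %d: status %d", attempt, resp.StatusCode)
		}
		time.Sleep(time.Duration(attempt) * 200 * time.Millisecond) // crude backoff
	}
	return lastErr
}

func main() {
	fmt.Println(chargeWithRetry([]byte(`{"amount": 100}`)))
}
```

Reusing one key across attempts is the whole trick: a retry after a lost response doesn't double-charge, because the server has seen that key before.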
You have three servers. Each has a clock. None of them agree.
Server A: 10:00:00.000
Server B: 10:00:00.123
Server C: 09:59:59.998
Now try to order events across these servers. "This write happened before that read"—how do you know? You don't. NTP shrinks the skew; it can't eliminate it.
This is why distributed systems use logical clocks, vector clocks, or just avoid depending on time ordering entirely.
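Here's what a logical clock buys you. A minimal Lamport-clock sketch in Go; it never reads wall-clock time, so the servers' disagreement above simply doesn't matter.

```go
// Lamport clock sketch: ordering by causality, not by wall clock.
// It only gives a partial "happened-before" ordering.
package main

import "fmt"

type LamportClock struct {
	counter uint64
}

// Tick advances the clock for a local event and returns its timestamp.
func (c *LamportClock) Tick() uint64 {
	c.counter++
	return c.counter
}

// Observe merges a timestamp received from another node: the local clock
// jumps past whatever the sender had seen, preserving causality.
func (c *LamportClock) Observe(remote uint64) uint64 {
	if remote > c.counter {
		c.counter = remote
	}
	c.counter++
	return c.counter
}

func main() {
	var a, b LamportClock

	t1 := a.Tick()      // A writes locally            -> 1
	t2 := a.Tick()      // A sends a message, stamped  -> 2
	t3 := b.Observe(t2) // B receives A's message      -> 3
	t4 := b.Tick()      // B's follow-up write         -> 4

	fmt.Println(t1, t2, t3, t4) // 1 2 3 4: causal order, no clocks consulted
}
```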
You've heard the elevator pitch: consistency, availability, partition tolerance—pick two.
Reality: partition tolerance isn't something you pick. Partitions happen. Networks fail. Your only real choice is what to do during a partition: return stale data (availability) or return an error (consistency)?
Most systems choose availability. Users hate errors more than slightly stale data.
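That choice is small enough to write down. A Go sketch, assuming a replica that keeps a local copy and has some way of noticing the leader is unreachable; both are illustrative stand-ins, not any particular database's API.

```go
// Sketch of the actual CAP decision a replica makes during a partition.
package main

import (
	"errors"
	"fmt"
)

type Replica struct {
	lastKnown          string // possibly stale local copy
	leaderReachable    bool   // in reality: a failed quorum read, a timeout
	preferAvailability bool   // the one decision you actually get to make
}

// Read: during a partition you either answer with what you have (AP)
// or refuse to answer (CP). There is no third option.
func (r *Replica) Read() (string, error) {
	if r.leaderReachable {
		return r.lastKnown, nil // no partition, no dilemma
	}
	if r.preferAvailability {
		return r.lastKnown, nil // AP: possibly stale, but the user gets an answer
	}
	return "", errors.New("cannot confirm freshness during partition") // CP
}

func main() {
	r := Replica{lastKnown: "v41", leaderReachable: false, preferAvailability: true}
	fmt.Println(r.Read())
}
```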
Leader receives write. Replicates to followers. Leader dies before replication completes.
Did the write happen? Depends on who you ask.
Client: "Write confirmed!"
Follower 1: "Never got it"
Follower 2: "Got it"
New leader: Follower 1
User data: Gone
This is why distributed databases have replication acknowledgment settings. Synchronous replication: slow but safe. Asynchronous: fast but lossy.
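The whole tradeoff is one parameter: how many acknowledgments before you tell the client "confirmed." A toy Go sketch with in-process goroutines standing in for followers; the delays and follower names are invented.

```go
// Sketch of the replication-acknowledgment knob.
package main

import (
	"fmt"
	"time"
)

// replicate pretends to ship the write to one follower and reports back.
func replicate(follower string, acks chan<- string) {
	time.Sleep(50 * time.Millisecond) // stand-in for network + disk
	acks <- follower
}

// write confirms to the client only after `required` followers acknowledge.
// required = len(followers): synchronous, slow but safe.
// required = 0: fire-and-forget, fast, and exactly how the write above got lost.
func write(followers []string, required int) {
	acks := make(chan string, len(followers))
	for _, f := range followers {
		go replicate(f, acks)
	}
	for i := 0; i < required; i++ {
		fmt.Println("ack from", <-acks)
	}
	fmt.Printf("confirmed to client after %d/%d acks\n", required, len(followers))
}

func main() {
	write([]string{"follower-1", "follower-2"}, 2) // sync: both must have it
	write([]string{"follower-1", "follower-2"}, 0) // async: confirm now, hope later
	// In the async case the program can exit before replication finishes --
	// which is the data-loss scenario in miniature.
}
```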
Getting multiple nodes to agree on something requires multiple round trips. Raft, Paxos, whatever—you're paying latency.
This is fine for infrequent operations. For hot paths, consensus becomes the bottleneck. That's why distributed caches exist: trade consistency for speed.
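That trade looks like this in practice. A minimal cache-aside sketch in Go with an invented 5-second TTL and a `loadFromDatabase` stand-in: within the TTL you skip the expensive consistent read entirely and accept that the answer may be stale.

```go
// Cache-aside sketch: fast and possibly stale, or slow and fresh.
package main

import (
	"fmt"
	"time"
)

type entry struct {
	value   string
	fetched time.Time
}

type Cache struct {
	ttl   time.Duration
	items map[string]entry
}

func loadFromDatabase(key string) string {
	time.Sleep(20 * time.Millisecond) // stand-in for the quorum/consensus read
	return "value-of-" + key
}

// Get serves from memory while the entry is younger than the TTL; otherwise
// it pays for the slow, consistent read and refreshes the local copy.
func (c *Cache) Get(key string) string {
	if e, ok := c.items[key]; ok && time.Since(e.fetched) < c.ttl {
		return e.value // fast, possibly stale
	}
	v := loadFromDatabase(key)
	c.items[key] = entry{value: v, fetched: time.Now()}
	return v // slow, fresh
}

func main() {
	c := &Cache{ttl: 5 * time.Second, items: map[string]entry{}}
	fmt.Println(c.Get("user:42")) // miss: pays the latency
	fmt.Println(c.Get("user:42")) // hit: fast, but could be stale for up to 5s
}
```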
Stop thinking about distributed systems as "multiple computers running my code."
Think: "Multiple computers running my code, none of which trust each other, all of which might be lying, connected by links that fail at random."
Design for failure. Assume everything breaks. Test what happens when it does.
A distributed system is one where a computer you've never heard of can crash your application.
— blanho