We optimized our API to handle 10x more requests per second. Throughput graphs looked amazing. Then users started complaining the app felt sluggish.
How do you make a system faster and slower at the same time? By confusing throughput with latency.
Latency is how long one request takes. User clicks, user waits, user sees result. Measured in milliseconds.
Throughput is how many requests the system completes per second. System capacity, not user experience. Measured in RPS.
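If you want to feel the difference, measure both from the same run. A minimal sketch, assuming a hypothetical `handle_request()` standing in for a real API call:

```python
import time

def handle_request():
    # Hypothetical work; stands in for a real API call.
    time.sleep(0.01)

n = 100
latencies = []
start = time.perf_counter()
for _ in range(n):
    t0 = time.perf_counter()
    handle_request()
    latencies.append(time.perf_counter() - t0)
elapsed = time.perf_counter() - start

print(f"avg latency: {1000 * sum(latencies) / n:.1f} ms per request")
print(f"throughput:  {n / elapsed:.0f} requests per second")
```

Same run, two numbers, two different stories.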
The trap: they often move in opposite directions.
Imagine a restaurant with one chef. Each dish takes 10 minutes. Latency: 10 minutes. Throughput: 6 dishes per hour.
Now add a queue system. Customers order ahead, the chef batches similar dishes, cooks five steaks at once instead of one at a time. Throughput jumps to 20 dishes per hour.
But your steak? It sat in queue for 25 minutes before cooking started. Latency went from 10 minutes to 35 minutes.
Higher throughput. Worse latency. Same kitchen.
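The arithmetic behind that is the one formula this whole trade-off rests on: once there's a queue, latency is wait time plus service time, while throughput only counts what leaves the kitchen per hour. Same numbers as above:

```python
cook_min = 10

# One dish at a time: no queue.
solo_latency = cook_min            # 10 minutes from order to plate
solo_throughput = 60 / cook_min    # 6 dishes per hour

# Batched: more dishes leave the kitchen per hour,
# but your order waits for its batch before cooking starts.
queue_wait_min = 25
batched_latency = queue_wait_min + cook_min   # 35 minutes
batched_throughput = 20                       # dishes per hour, per the example
```

The software versions of that kitchen: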
Batch processing with Kafka or SQS: Higher throughput, but each message waits for the batch to fill.
Connection pooling: Handle more total requests, but individual requests queue for an available connection.
Regional load balancing: Better global distribution, but some requests route to distant servers.
Write coalescing: Fewer disk operations, but each write waits for the buffer to flush.
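All four are the same move: make individual requests wait so the system can do shared work. A minimal sketch of the last one, write coalescing, with a hypothetical flush callback standing in for the real disk write:

```python
import time

class CoalescingWriter:
    """Buffers records and writes them out in one batch.
    Fewer flushes (throughput win), but a record isn't durable
    until its whole batch goes out (latency cost)."""

    def __init__(self, flush, batch_size=64, max_wait_s=0.1):
        self.flush = flush              # e.g. one fsync'd write for many records
        self.batch_size = batch_size
        self.max_wait_s = max_wait_s
        self.buffer = []
        self.oldest = 0.0

    def write(self, record):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(record)
        # Flush when the batch is full or the oldest record has waited long enough.
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.oldest >= self.max_wait_s):
            self.flush(self.buffer)
            self.buffer = []

writer = CoalescingWriter(flush=lambda batch: print(f"flushed {len(batch)} records"))
for i in range(200):
    writer.write(i)   # the first record in each batch waits the longest
```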
I've seen teams celebrate throughput improvements while users silently leave.
The dashboard looks great: average latency at 150ms, 10,000 requests per second, 0.1% error rate. Ship it.
But the p99 latency—the experience for the slowest 1% of users—is 8 seconds. Eight seconds. Why? Queuing under load. The median user is fine. The tail user waited in queue for 7.8 seconds.
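That gap is exactly what queuing produces. A toy single-worker simulation shows the shape (illustrative numbers, not the dashboard above):

```python
import random, statistics

random.seed(42)
service_ms = 100.0        # fixed work per request
arrival = 0.0
worker_free = 0.0
latencies = []

for _ in range(100_000):
    # Mostly steady traffic; 2% of requests arrive in a pile-up (gap of 0 ms).
    arrival += 105.0 if random.random() > 0.02 else 0.0
    start = max(arrival, worker_free)          # wait in queue if the worker is busy
    worker_free = start + service_ms
    latencies.append(worker_free - arrival)    # queue wait + service time

latencies.sort()
print(f"mean={statistics.mean(latencies):.0f}ms  "
      f"p50={latencies[len(latencies) // 2]:.0f}ms  "
      f"p99={latencies[int(len(latencies) * 0.99)]:.0f}ms")
```

The median stays close to the 100 ms of actual work. The p99 is dominated by time spent waiting in the queue.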
Most users don't feel averages. They feel the worst case. One bad experience and they're gone.
Redis is optimized for latency. Simple operations, in-memory, single-threaded so there are no locks to wait on. Sub-millisecond responses.
Kafka is optimized for throughput. It batches messages, does sequential disk writes, and happily trades latency for volume. 100ms batch waits are normal and expected.
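You can see that trade spelled out in producer settings. A sketch using the kafka-python client (broker address and topic are placeholders):

```python
from kafka import KafkaProducer

# Throughput-oriented settings: wait up to 100 ms for a batch to fill and
# allow large batches. Every message may sit in the buffer for that long,
# so per-message latency rises by design.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    linger_ms=100,           # how long to wait for more records before sending
    batch_size=64 * 1024,    # max bytes per partition batch
    acks="all",
)

producer.send("events", b"payload")
producer.flush()
```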
Payment systems are optimized for correctness. Neither throughput nor latency is the priority—not double-charging people is. Accept slowness for safety.
Before optimizing, ask: what does the user feel?
For a checkout page, the user is waiting with credit card in hand. Every second costs conversions. Optimize for latency. Target p99 under 500ms.
For background job processing, the user uploaded a file and went to lunch. They're not watching a spinner. Optimize for throughput. Target 10K files per hour.
For an analytics dashboard, someone is staring at the screen waiting for numbers. Optimize for latency. Target p99 under 2 seconds.
For log ingestion, no user is waiting at all. Optimize for throughput. Target 1M events per second.
The answer is never "optimize both equally." You pick one, and you accept the trade-off on the other.
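Written down, the decisions above might look like this (names and targets are just the examples from this post):

```python
# Each workload commits to one primary dimension and states its target;
# the other dimension is explicitly the accepted trade-off.
SLOS = {
    "checkout":        {"optimize": "latency",    "target": "p99 < 500 ms"},
    "background_jobs": {"optimize": "throughput", "target": ">= 10,000 files/hour"},
    "dashboard":       {"optimize": "latency",    "target": "p99 < 2 s"},
    "log_ingestion":   {"optimize": "throughput", "target": ">= 1,000,000 events/s"},
}
```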
Some metrics lie. "Average latency is 150ms" hides the p99 of 8 seconds. "We handle 10K RPS" doesn't say at what latency. "System is fast" raises the obvious question: fast for whom?
Metrics that matter look like: "p50 = 150ms, p99 = 800ms, p99.9 = 2s." Or "10K RPS at p99 under 500ms." Or "Checkout latency p99 under SLA."
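Percentiles are cheap to compute once you keep raw samples instead of a running average. A sketch with synthetic latencies (real ones come from your metrics pipeline):

```python
import random

def percentile(sorted_ms, q):
    """Nearest-rank percentile from an already-sorted list."""
    return sorted_ms[min(len(sorted_ms) - 1, int(len(sorted_ms) * q))]

# Synthetic, long-tailed latency samples in milliseconds.
samples = sorted(random.lognormvariate(5.0, 0.6) for _ in range(100_000))

print(f"p50 = {percentile(samples, 0.50):.0f}ms, "
      f"p99 = {percentile(samples, 0.99):.0f}ms, "
      f"p99.9 = {percentile(samples, 0.999):.0f}ms")
```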
Always measure both. Report both. Optimize for the one that matters to your users.
Fast for the machine and fast for the user are different problems.
— blanho