Distributed-systems realities & reliability

The last lesson made scaling the stateless tier look easy. This one is about everything that's hard: the moment you have many machines talking over a network, a set of stubborn truths kicks in that no tool can wish away. Then we turn those truths into the reliability patterns that keep real systems up — redundancy, graceful degradation, and a tested disaster-recovery plan. This is the senior-engineer core of the chapter; the concepts here are about as durable as anything in computing.

The network is not your friend

In a single program, a function call is effectively instant and reliable. Across a network, every call can be slow, can fail, can arrive twice, or can succeed while the reply gets lost so you never find out. New engineers fall for the fallacies of distributed computing: assuming the network is reliable, fast, and free. It is none of those. Distributed-systems engineering is the discipline of building something dependable out of unreliable, independently failing parts. Everything below follows from taking that seriously.

CAP: you can't have it all

The most famous distributed-systems result is the CAP theorem. It says that when your data is spread across multiple machines and a network partition happens — some machines can't talk to others — you can only keep one of two properties, not both:

  • Consistency (C) — every read sees the latest write; all nodes agree on the current value.
  • Availability (A) — every request still gets an answer, even if some nodes are unreachable.

(The "P", partition tolerance, isn't optional in the real world — networks will partition — so CAP really forces a choice between C and A during a partition.)

The trade-off is concrete. If a partition splits your database cluster:

  • A CP system (choose consistency) refuses to answer on the side that might be stale, rather than return a possibly-wrong value. Correct, but unavailable. (A bank balance: better to error than to show a wrong number.)
  • An AP system (choose availability) keeps answering with whatever it has, accepting that two sides may briefly disagree and reconcile later. Available, but possibly stale. (A social media like-count: better to show a slightly-off number than an error page.)

This connects to consistency models you'll hear about: strong consistency (always the latest value, the CP instinct) vs eventual consistency (replicas converge eventually, the AP instinct — fast and available, but a read might see old data for a moment). There's no universally right answer — only the right one for that data. The durable skill is asking, per piece of data: would I rather be wrong-but-up, or correct-but-down?
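The CP/AP choice above can be made concrete with a toy model. This is a minimal sketch, not a real database: `Replica`, `Unavailable`, and the `mode` labels are hypothetical names invented for illustration, and the "partition" is just a flag.

```python
from __future__ import annotations

from dataclasses import dataclass


class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk a stale read."""


@dataclass
class Replica:
    """Toy replica holding one value; `mode` is "CP" or "AP" (illustrative labels)."""
    mode: str
    value: int = 0
    partitioned: bool = False  # True when this node cannot reach the rest of the cluster

    def read(self) -> int:
        if self.partitioned and self.mode == "CP":
            # Consistency first: we cannot prove our value is current, so refuse.
            raise Unavailable("cannot confirm latest value during partition")
        # AP (or a healthy CP node): answer with whatever we have locally.
        return self.value


def demo() -> tuple[str, str]:
    cp = Replica(mode="CP", value=100)
    ap = Replica(mode="AP", value=100)
    # A partition begins; writes accepted elsewhere never reach these replicas.
    cp.partitioned = ap.partitioned = True
    try:
        cp_result = str(cp.read())
    except Unavailable:
        cp_result = "error"  # correct-but-down
    ap_result = str(ap.read())  # wrong-but-up: possibly stale, still answers
    return cp_result, ap_result
```

Calling `demo()` shows the two instincts side by side: the CP replica errors, while the AP replica returns its last known value.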

Why state is the hard part: scaling databases

Last lesson, statelessness let us clone app servers almost at will. The database can't be cloned that way, because every copy must agree on the data. This is why state is the hard part of scaling. Beyond caching reads, there are two core techniques, usually layered: replication and sharding.

Replication — keep multiple copies of the database. One node is the primary (handles writes); one or more replicas copy its data. This buys two things: redundancy (a replica can be promoted if the primary dies) and read scaling via read replicas — send the many read queries to the replicas and reserve the primary for writes. Most workloads are read-heavy, so this alone goes a long way. The catch is exactly CAP: replicas lag slightly behind the primary (replication lag), so a read from a replica may be a moment stale — eventual consistency in practice.
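As a rough illustration of read/write splitting and replication lag, here is a minimal in-memory sketch. `ReplicatedStore` and its methods are hypothetical; real replication is asynchronous and continuous, whereas here it only happens when `replicate()` is called, which makes the lag easy to see.

```python
import itertools


class ReplicatedStore:
    """Toy primary/replica store; replicas only catch up when replicate() runs."""

    def __init__(self, replica_count: int = 2) -> None:
        self.primary: dict[str, str] = {}
        self.replicas: list[dict[str, str]] = [{} for _ in range(replica_count)]
        self._next_replica = itertools.cycle(range(replica_count))

    def write(self, key: str, value: str) -> None:
        # All writes go to the primary.
        self.primary[key] = value

    def read(self, key: str, *, from_primary: bool = False):
        # Reads default to a replica (round-robin) to offload the primary.
        if from_primary:
            return self.primary.get(key)
        return self.replicas[next(self._next_replica)].get(key)

    def replicate(self) -> None:
        # Stand-in for asynchronous replication: copy primary state to replicas.
        for replica in self.replicas:
            replica.clear()
            replica.update(self.primary)


store = ReplicatedStore()
store.write("name", "Ada")
stale = store.read("name")                      # replica has not caught up yet
fresh = store.read("name", from_primary=True)   # the primary always has the write
store.replicate()
caught_up = store.read("name")                  # replica now agrees
```

The `from_primary` flag hints at one common design choice: route the few reads that must see the latest write to the primary, and everything else to replicas.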

Sharding — when even one primary can't hold all the writes or all the data, split the data across multiple independent databases by a shard key (e.g. users A–M on shard 1, N–Z on shard 2). Now writes spread across shards. Sharding is powerful but genuinely hard: cross-shard queries and transactions become painful, and choosing a bad shard key creates hot spots. It's a last resort, not a starting point.
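To show how a shard key turns into a routing decision, the sketch below implements the A–M / N–Z range split from the example, plus a hash-based alternative. The hash variant is not from the text; it is included only as a contrast, and both function names are hypothetical.

```python
import hashlib


def range_shard(username: str) -> int:
    """Range sharding as in the example: A-M -> shard 1, N-Z -> shard 2.

    Names that do not start with a letter in A-M fall through to shard 2.
    """
    first = username[0].upper()
    return 1 if "A" <= first <= "M" else 2


def hash_shard(username: str, shard_count: int) -> int:
    """Hash sharding: a stable hash spreads keys regardless of spelling.

    Uses SHA-256 rather than Python's built-in hash(), which is randomized
    per process and so would route the same user differently after a restart.
    """
    digest = hashlib.sha256(username.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# A skewed population shows how a poor shard key creates a hot spot:
names = ["adam", "alice", "amy", "bob"]
range_placement = [range_shard(n) for n in names]  # every name lands on shard 1
```

Hashing usually evens out the load, but it gives up range queries over the key, and with this simple modulo scheme changing `shard_count` remaps most keys. That is part of why sharding is hard to undo.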

Figure: app servers (stateless, easy to clone) send writes to the primary DB and reads to two read replicas.

Durable order of operations: cache reads → add read replicas → only then shard. Reach for sharding last; it's the most complex and the hardest to undo.

Idempotency: making "try again" safe

The last lesson's queues retry failed work, and networks lose replies — so the same operation may run more than once. If "charge the customer $50" runs twice, you've double-charged someone. The defense is idempotency: designing an operation so that doing it multiple times has the same effect as doing it once.

The standard technique is an idempotency key — the caller attaches a unique ID to the request; the server records which IDs it has already processed and silently ignores duplicates. "Charge order #12345" with the same key, received twice, charges once. Idempotency is what makes retries safe, and retries are what make distributed systems reliable — so idempotency is foundational, not optional.
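A minimal in-memory sketch of the idempotency-key technique follows. `PaymentService` and its fields are hypothetical, and this version returns the stored result for a duplicate rather than ignoring it outright, which is one common variant. A real service would keep the processed keys in durable storage and make the check-and-record step atomic; this sketch assumes a single process and a single thread.

```python
class PaymentService:
    """Toy payment service that processes each idempotency key at most once."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}         # idempotency key -> stored result
        self.charges: list[tuple[str, int]] = []   # side effects actually performed

    def charge(self, idempotency_key: str, customer: str, amount_cents: int) -> str:
        if idempotency_key in self._results:
            # Duplicate request: skip the side effect, return the original result.
            return self._results[idempotency_key]
        self.charges.append((customer, amount_cents))  # the real charge happens once
        receipt = f"receipt-{len(self.charges)}"
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
first = service.charge("order-12345", "customer-1", 5000)
retry = service.charge("order-12345", "customer-1", 5000)  # e.g. the reply was lost
```

Here `first` and `retry` are the same receipt, and `service.charges` holds a single entry: the retry was safe.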

Retries done right: backoff, jitter, and circuit breakers

When a call fails, retrying is reasonable. But naïve retries cause two classic disasters:

  • Retry storms / thundering herd. A service hiccups; thousands of clients all retry immediately and in lockstep, hammering the struggling service even harder and turning a blip into an outage. The fix is exponential backoff (wait longer after each failure: 1s, 2s, 4s…) plus jitter (add a small random delay so clients don't all retry at the same instant). Backoff calms the load; jitter de-synchronizes the herd. Both together.
  • Hammering something that's already down. If a dependency is hard-down, every retry just wastes time and resources and makes things worse. A circuit breaker wraps the call: after too many consecutive failures it "trips open" and immediately fails fast for a cooldown period instead of trying — giving the dependency room to recover — then cautiously tests whether it's back before closing again. (The name is the household electrical breaker: it cuts the circuit to prevent a cascade.)
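The backoff-plus-jitter idea from the first bullet can be sketched in a few lines. `call_with_retries` and its parameters are hypothetical names; the defaults (1s base, 30s cap, up to 0.5s of jitter) are illustrative assumptions, not recommendations. The `sleep` and `rng` parameters exist only so the behavior can be tested without real waiting.

```python
import random
import time


def call_with_retries(operation, max_attempts=5, base=1.0, cap=30.0,
                      jitter=0.5, sleep=time.sleep, rng=random.random):
    """Call operation(); on failure wait base * 2**attempt (capped) plus jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            backoff = min(cap, base * 2 ** attempt)  # 1s, 2s, 4s, ... up to cap
            sleep(backoff + rng() * jitter)  # random extra delay in [0, jitter)
```

In practice you would likely catch only errors that are worth retrying (timeouts, connection resets) rather than every `Exception`, and retry only operations that are idempotent.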

:::tip Three patterns that go together
Idempotency + backoff-with-jitter + circuit breakers are the standard toolkit for surviving the unreliable network. Idempotency makes retries safe; backoff-and-jitter makes them gentle; circuit breakers make them stop when stopping is wiser. Designs that ignore these are where 3 a.m. cascading outages come from. This is durable and vendor-neutral knowledge.
:::
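The circuit breaker described above can be sketched as a small wrapper class. This is a simplified, single-threaded illustration under stated assumptions: `CircuitBreaker`, `CircuitOpenError`, and the threshold and cooldown defaults are hypothetical, and the `clock` parameter exists so the cooldown can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_seconds:
                # Open: fail fast without touching the dependency.
                raise CircuitOpenError("failing fast during cooldown")
            # Cooldown elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip open, or re-open after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success: reset the failure count and close the circuit.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each outbound request, for example `breaker.call(lambda: fetch_profile(user_id))` where `fetch_profile` is a stand-in for the real call, and treat `CircuitOpenError` as a signal to fall back or degrade.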

Reliability patterns: surviving failure on purpose

Now we build up from those truths to keep whole systems alive.

Redundancy — have more than one of everything that matters (servers, databases, availability zones), so the loss of any one isn't fatal. No single points of failure. This is the foundation; everything else refines it.

Multi-AZ vs multi-region. Recall from Chapter 1 that a cloud region is a geographic area containing multiple isolated availability zones (AZs): each is one or more data centers with independent power and networking, typically some miles from the others.

  • Multi-AZ means running across several AZs in one region. It protects you from a single data center failing and is cheap, low-latency, and the sensible default for most production systems. Do this.
  • Multi-region means running across geographically distant regions. It protects you from an entire region going down and can put data closer to global users — but it's dramatically more complex and expensive, because now your data must be consistent across thousands of miles (hello again, CAP and latency). It's justified for true global scale or strict availability requirements, and is overkill for most. (This is the same "match complexity to need" theme; lesson 11.3 makes it a rule.)

Graceful degradation — when a non-critical dependency fails, the system sheds that feature and keeps the core working rather than crashing entirely. If the "recommended products" service is down, the store still sells products — it just hides recommendations. A degraded experience beats an error page. This is designing failure to be partial, not total.
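In code, graceful degradation often comes down to catching the failure of a non-critical call and substituting a safe default. The sketch below uses the store example from the text; `fetch_recommendations` and `render_product_page` are hypothetical, and the recommendation service is hard-coded to fail so the fallback path is visible.

```python
def fetch_recommendations(product_id):
    """Hypothetical non-critical dependency; here it simulates an outage."""
    raise TimeoutError("recommendation service unavailable")


def render_product_page(product_id, recommender=fetch_recommendations):
    """Build the page; the core survives even if recommendations fail."""
    page = {"product_id": product_id, "buy_button": True}  # core functionality
    try:
        page["recommendations"] = recommender(product_id)
    except Exception:
        # Non-critical feature failed: hide it instead of failing the page.
        page["recommendations"] = []
    return page


page = render_product_page("sku-1")
```

The page still has its buy button and simply shows no recommendations. The same fallback should not be applied to critical calls such as payment, where hiding a failure would be worse than reporting it.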

Disaster recovery: RTO, RPO, and the untested backup

Redundancy handles routine failures. Disaster recovery (DR) is your plan for the big one — a region-wide outage, a catastrophic bug that corrupts data, an accidental mass-deletion. DR is defined by two numbers every engineer must know:

  • RTO — Recovery Time Objective: how long can you be down? "We must be back within 1 hour." It sets how fast your recovery must be (and therefore how much you invest in standby capacity).
  • RPO — Recovery Point Objective: how much data can you afford to lose, measured in time? "We can lose at most 5 minutes of data." It sets how often you back up or replicate.
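
Figure: recovery timeline running from the last good backup, through the DISASTER, to service restored. RPO = data you might lose (the span from the last good backup to the disaster); RTO = time to be back up (the span from the disaster to service restored).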

Setting RTO/RPO is a business decision (tighter numbers cost more), and a system with no defined RTO/RPO has, in effect, decided it's fine to be down forever and lose everything — usually by accident.
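As a worked example of the two numbers, the sketch below checks one incident against the targets quoted above (RPO of 5 minutes, RTO of 1 hour). `meets_objectives` is a hypothetical helper and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def meets_objectives(last_good_backup, disaster, service_restored, rpo, rto):
    """Compare an incident's actual data loss and downtime against RPO and RTO."""
    data_loss_window = disaster - last_good_backup  # writes after the backup are lost
    downtime = service_restored - disaster          # how long users were affected
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


backup_time = datetime(2024, 1, 1, 12, 0)
report = meets_objectives(
    last_good_backup=backup_time,
    disaster=backup_time + timedelta(minutes=4),
    service_restored=backup_time + timedelta(minutes=50),
    rpo=timedelta(minutes=5),
    rto=timedelta(hours=1),
)
```

Here 4 minutes of data are lost and the outage lasts 46 minutes, so both objectives are met. Note the implication for backup frequency: the worst-case loss is the full interval between backups, so a 5-minute RPO needs backups or replication at least every 5 minutes.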

And the single most important DR truth, the one that burns real companies: a backup you have never restored is not a backup — it's a hope. Backups silently fail, get misconfigured, or turn out to be unrecoverable exactly when you need them. The only proof a backup works is restoring from it on a schedule and confirming the data is intact. Test your backups by actually restoring them, regularly. An untested backup is one of the most common and most devastating gaps in real systems.
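A restore test can be as simple as restoring into a scratch location and comparing checksums against what was recorded at backup time. The sketch below does this for a directory of files using only the standard library. The function names are hypothetical, and a real database backup would be restored with that database's own tooling and checked with queries rather than file hashes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def backup(source: Path, archive_base: Path) -> tuple[Path, dict[str, str]]:
    """Create a .tar.gz of `source` and record a manifest of checksums.

    `archive_base` should live outside `source` so the archive
    does not end up containing itself.
    """
    archive = shutil.make_archive(str(archive_base), "gztar", root_dir=source)
    return Path(archive), checksums(source)


def restore_and_verify(archive: Path, manifest: dict[str, str]) -> bool:
    """Restore into a scratch directory and compare against the manifest."""
    with tempfile.TemporaryDirectory() as scratch:
        shutil.unpack_archive(str(archive), scratch)
        return checksums(Path(scratch)) == manifest
```

Running `restore_and_verify` on a schedule, and alerting when it returns `False` or raises, turns "we have backups" into evidence that they restore.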

Why it matters

Once you have many machines on a network, hard truths take over: the network is unreliable, and CAP forces a choice between consistency and availability during partitions — so you pick strong or eventual consistency per piece of data. State is the hard part: scale databases with replication and read replicas first, sharding only as a last resort. Because retries and lost replies make operations run more than once, design them to be idempotent, retry with exponential backoff + jitter to avoid thundering herds, and wrap fragile calls in circuit breakers. Build reliability from redundancy (no single points of failure), default to multi-AZ (multi-region only when truly justified), and use graceful degradation so failures stay partial. Finally, define your RTO/RPO and — the part everyone skips — test your backups by restoring them. With the mechanics of scale and reliability in hand, the next lesson covers how much to provision and the discipline of not over-building it.

Next: 11.3 Capacity, autoscaling & avoiding over-engineering →