
Capacity, autoscaling & avoiding over-engineering

You now know how systems scale. This lesson is about the two judgment questions wrapped around that knowledge: how much capacity do you actually need (and how do you find out before users do), and the harder discipline — how do you stop yourself from building far more than the problem requires? The first half is capacity planning, load testing, and autoscaling. The second half is the most valuable engineering instinct there is: not over-engineering. Both are durable; both separate effective engineers from resume-driven ones.

Capacity planning: meet load before it meets you

Capacity planning is figuring out how much compute, memory, storage, and database throughput your system needs to handle expected load — before that load arrives. The goal is the narrow band between two failures: provision too little and you fall over under traffic; provision too much and you burn money on idle machines (the FinOps concern from Chapter 9).

You don't need heavy math to start. The durable approach is: estimate peak load (requests per second, concurrent users, data growth), measure how much one unit of your system can handle, divide the first by the second, and add headroom (a safety margin, commonly 20–50%) for spikes and the unexpected. The crucial word is measure: guessing how much a server can handle is how outages happen. Measuring that number is exactly what load testing is for.
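As a rough illustration of that estimate, measure, divide, add headroom arithmetic, here is a minimal sketch. The function name and the example numbers (2,000 requests per second at peak, 250 per instance, 30% headroom) are hypothetical, chosen only to show the calculation.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak load plus a safety margin.

    peak_rps:          estimated peak requests per second
    per_instance_rps:  measured (not guessed) capacity of one instance
    headroom:          extra margin as a fraction of peak (0.3 means 30%)
    """
    if per_instance_rps <= 0:
        raise ValueError("per_instance_rps must be positive")
    required_rps = peak_rps * (1 + headroom)
    # Round up: a fraction of an instance still means one more instance.
    return math.ceil(required_rps / per_instance_rps)


# Hypothetical numbers: 2,000 rps at peak, 250 rps per instance, 30% headroom.
print(instances_needed(peak_rps=2000, per_instance_rps=250, headroom=0.3))  # 11
```

The result is only as good as the per-instance number fed into it, which is why that input should come from a measurement rather than a guess.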

Load testing: find the ceiling on purpose

Load testing is deliberately throwing simulated traffic at your system in a safe environment to discover where it breaks before real users do. You ramp up virtual users until latency climbs or errors appear — and now you know your real ceiling instead of guessing it. Two open, durable, code-first tools dominate:

  • k6 — load tests are written in JavaScript; modern, developer-friendly, and easy to run in CI. You script a "virtual user" journey and tell it to run thousands in parallel.
  • Locust — load tests are written in Python; great when your team already thinks in Python, with a live web dashboard of the run.

[Diagram: load tool (k6 / Locust) → your system (staging copy) → watch latency, error rate, saturation → known ceiling and where it breaks first. Arrow labels: "ramp virtual users 100 → 1k → 10k" and "find the knee".]

The output isn't just a number — it's which component breaks first (often the database, per the last lesson's "state is the hard part"). That tells you where to add caching, replicas, or capacity. Load testing turns capacity planning from guesswork into evidence.
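One way to turn a ramped load-test run into a "known ceiling" is to scan the results for the last step that still met your limits. The sketch below assumes you have already exported per-step results from your load tool; the `Step` record, the thresholds (500 ms p95, 1% errors), and the sample numbers are all hypothetical and are not the output format of k6 or Locust.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Step:
    virtual_users: int
    p95_latency_ms: float
    error_rate: float  # fraction of failed requests, 0.0 to 1.0


def find_ceiling(
    steps: Iterable[Step],
    max_p95_ms: float = 500.0,
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Highest virtual-user count that met both limits before the first breach.

    Returns None if even the smallest step breached a limit.
    """
    last_ok = None
    for step in sorted(steps, key=lambda s: s.virtual_users):
        if step.p95_latency_ms > max_p95_ms or step.error_rate > max_error_rate:
            break  # first breach: everything beyond this is past the knee
        last_ok = step.virtual_users
    return last_ok


# Made-up results from a hypothetical ramp.
results = [
    Step(100, 120.0, 0.0),
    Step(1000, 180.0, 0.001),
    Step(5000, 420.0, 0.004),
    Step(10000, 1900.0, 0.08),
]
print(find_ceiling(results))  # 5000
```

This finds the ceiling but not the cause. To learn which component broke first, you still need to look at per-component saturation (database connections, CPU, queue depth) at the step where the breach happened.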

Autoscaling: let the system size itself

Static capacity wastes money: traffic ebbs and flows (low at 3 a.m., high at noon), so a fleet sized for the peak sits idle most of the day. Autoscaling automatically adds capacity when load rises and removes it when load falls — you pay for what you use and still survive spikes. This is one of the cloud's signature advantages over owning fixed hardware.

There are distinct kinds of autoscaling, and in Kubernetes (Chapter 4) they have specific names worth knowing:

  • Horizontal Pod Autoscaler (HPA) — adds/removes pods (copies of your app) based on a metric like CPU or request rate. This is horizontal scaling (lesson 11.1) automated, and the most common one. It only works well if your app is stateless — another payoff of lesson 11.1.
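  • Vertical Pod Autoscaler (VPA) — adjusts the CPU/memory requested by each pod (right-sizing one pod). This is vertical scaling automated. HPA and VPA can conflict when both react to the same metric (for example, both acting on CPU), so avoid pointing them at the same resource for the same workload.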
  • Cluster Autoscaler — adds/removes the underlying nodes (machines) when there aren't enough to fit the pods HPA wants. HPA asks for more pods; if there's no room, the cluster autoscaler buys more machines.
  • Karpenter — a newer, faster node-provisioning autoscaler (originally from AWS) that picks right-sized instances quickly and flexibly. The tool name is a point-in-time detail that may date; the durable idea is "automatically provision the right nodes for pending pods."

Durable layering: HPA scales the pods; the cluster autoscaler (or Karpenter) scales the machines to fit them. App-level and infrastructure-level autoscaling work together.
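To make the two layers concrete, the sketch below follows the replica calculation described in the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), with a tolerance band (0.1 by default) inside which no scaling happens. The node calculation is a deliberate oversimplification: `pods_per_node` is a hypothetical stand-in for what the scheduler really does, which is fit pods to nodes by their resource requests. The real controller also handles stabilization windows, missing metrics, and not-ready pods, none of which appear here.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Pod layer: replica count from the documented HPA ratio formula."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


def nodes_needed(pods: int, pods_per_node: int) -> int:
    """Node layer (simplified assumption): machines needed to fit the pods."""
    return math.ceil(pods / pods_per_node)


# 4 pods averaging 90% CPU against a 60% target.
pods = hpa_desired_replicas(current_replicas=4, current_metric=90, target_metric=60)
print(pods)                                  # 6
print(nodes_needed(pods, pods_per_node=4))   # 2
```

If the cluster currently has one node that fits four pods, the two extra pods stay pending until the node-level autoscaler adds a second machine.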

A caution: autoscaling is not magic. It has a reaction lag (scaling up takes seconds-to-minutes, so very sudden spikes still need headroom or queue buffering), and it can scale your bill as fast as your capacity — so pair it with budgets and limits (Chapter 9). And it cannot rescue a design with a hard bottleneck: ten app servers all hammering one un-scaled database just overload the database faster.
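The hard-bottleneck point can be shown with a toy model in which end-to-end throughput is capped by the slowest shared component. The function and all numbers below are hypothetical, and the model ignores the fact that an overloaded database often degrades rather than simply plateauing.

```python
def effective_throughput(app_servers: int, rps_per_app_server: float, database_max_rps: float) -> float:
    """Requests per second the whole system can serve in this toy model.

    Every request is assumed to hit the single shared database, so the
    database's ceiling caps the total no matter how many app servers exist.
    """
    app_capacity = app_servers * rps_per_app_server
    return min(app_capacity, database_max_rps)


# Hypothetical: each app server handles 200 rps, the database tops out at 800 rps.
for servers in (2, 5, 10):
    print(servers, effective_throughput(servers, 200, 800))
# 2 400
# 5 800
# 10 800
```

Past four app servers in this model, autoscaling adds cost but no capacity. The fix belongs at the database (caching, replicas, or a bigger instance), not in the pod count.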

The other side: don't over-engineer

Here is the most valuable instinct in this entire chapter, and the gap most guides miss. Everything above describes real scale. The trap is reaching for those heavy solutions before you have the problem they solve. The dominant failure mode in modern cloud engineering is not under-building — it's over-engineering: building for a scale, a team size, and a failure mode you don't have and may never have.

The classic over-engineering moves, and the honest question for each:

  • Microservices before you need them. Splitting an app into dozens of independently-deployed services adds network calls, distributed-systems failure modes (everything in the last lesson!), and heavy operational overhead. For a small team and modest scale, a well-structured monolith (one deployable app) is simpler, faster to build, and easier to operate. Microservices solve an organizational scaling problem (many teams shipping independently), not just a technical one — so adopt them when team coordination, not traffic, is the pain.
  • Kubernetes before you need it. As Chapter 4 warned: powerful, complex, and frequently over-adopted. A few containers do not need an orchestrator. Reach for it when you genuinely have many services and real orchestration needs.
  • Multi-cloud / multi-region by default. Doubling your operational surface for resilience you don't yet require. Multi-AZ on one cloud handles almost everyone (last lesson). (Lesson 11.4 dismantles "multi-cloud is best practice" directly.)
  • Premature optimization. Hand-tuning performance, sharding, or exotic caching for a load you don't have, before measuring, is wasted effort that adds complexity and bugs. The durable rule: measure first, optimize the proven bottleneck, and only that.

:::tip The simplest-thing-that-works principle
Default to the simplest architecture that solves your actual current problem, design so you could grow into more, and add complexity only when a real, measured need forces it. Every layer of sophistication has an ongoing operational cost: more to run, debug, secure, and pay for. "We might need it someday" is the most expensive phrase in architecture. This single instinct, match complexity to need, is the throughline of the whole chapter and arguably of senior engineering itself.
:::

This is not an argument for sloppiness or ignoring scale. It's about appropriate engineering: you learned the heavy machinery so you'd recognize when it's genuinely warranted — and have the confidence not to deploy it when it isn't. Knowing when not to use Kubernetes is as much a mark of skill as knowing how to.

Why it matters

Capacity planning sizes a system to expected load with headroom, and load testing (k6 in JavaScript, Locust in Python) replaces guesswork with evidence about where you actually break — usually the stateful database first. Autoscaling sizes the system automatically: HPA adds pods (horizontal), VPA right-sizes them (vertical), and the cluster autoscaler / Karpenter add the machines underneath — pods and nodes scaling together — but it lags, costs money fast, and can't fix a hard bottleneck. Wrapped around all of it is the chapter's central discipline: don't over-engineer. Microservices, Kubernetes, multi-cloud, and premature optimization are traps when adopted before the problem exists. Default to the simplest thing that works, measure before optimizing, and add complexity only when a real need forces it. With how much and how simple settled, the next lesson is about deciding — and writing those decisions down.

Next: 11.4 Decisions, written down: ADRs, RFCs & DORA →