Rightsizing & eliminating waste
In the pricing-models lesson you discounted the compute you run. This lesson tackles a deeper question: should you be running that much in the first place? Industry surveys year after year find that a large share of cloud spend — often a third or more — is simply waste: capacity that's allocated but barely used, or running and used by nobody at all. Cutting waste is the highest-return move in FinOps because, unlike a discount, it removes the meter entirely. And crucially, you must rightsize before you commit — committing to an over-provisioned fleet just locks in the waste.
Two kinds of waste: zombies and overprovisioning
Money leaks out of a cloud account in two broad ways, and they need different fixes.
- Zombie / idle resources — things that are running but doing nothing useful: a VM nobody has logged into for months, a test environment left on over the weekend, an orphaned disk from a deleted VM, an unattached IP address, an old snapshot. The fix is simple: find it and turn it off. The hard part is finding it, which is why visibility comes first.
- Overprovisioning — things that are used, but allocated far more than they need: an 8-CPU VM averaging 5% CPU, a database sized for a peak that never comes, a container that reserves 4 GB of memory and touches 400 MB. The fix is rightsizing: shrink the allocation to fit real demand plus a sensible buffer.
Rightsizing means adjusting a resource's allocated capacity to match its actual measured usage. You can't do it from intuition — you do it from utilization data: the actual CPU, memory, disk, and network a resource consumed over time. The recurring rule of this whole chapter applies here: you cannot optimize what you cannot see.
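The split between the two kinds of waste can be sketched as a small triage over utilization data. This is only an illustration: the `Resource` fields, the thresholds, and the fleet entries are all hypothetical, and real tooling would pull these numbers from your cloud provider's metrics.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    attached: bool          # False for an orphaned disk or unattached IP
    peak_cpu_pct: float     # highest CPU utilization seen in the window
    days_since_access: int  # days since anyone logged in or sent traffic


# Hypothetical thresholds; tune them to your own workloads.
IDLE_DAYS = 30
IDLE_PEAK_PCT = 2.0
OVERSIZED_PEAK_PCT = 40.0


def classify(r: Resource) -> str:
    """Return 'zombie', 'overprovisioned', or 'ok' for one resource."""
    if not r.attached:
        return "zombie"  # nothing uses it: turn it off
    if r.days_since_access >= IDLE_DAYS and r.peak_cpu_pct <= IDLE_PEAK_PCT:
        return "zombie"
    if r.peak_cpu_pct < OVERSIZED_PEAK_PCT:
        return "overprovisioned"  # used, but far below its size: rightsize
    return "ok"


fleet = [
    Resource("orphaned-disk", attached=False, peak_cpu_pct=0.0, days_since_access=120),
    Resource("forgotten-test-vm", attached=True, peak_cpu_pct=1.0, days_since_access=90),
    Resource("api-vm", attached=True, peak_cpu_pct=15.0, days_since_access=0),
    Resource("batch-vm", attached=True, peak_cpu_pct=85.0, days_since_access=0),
]
report = {r.name: classify(r) for r in fleet}
```

The two labels map onto the two fixes: zombies get switched off, overprovisioned resources get resized.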
The chronic overprovisioning problem
Why is overprovisioning so universal? Because the incentives all point one way. An engineer sizing a service asks "what if it gets busy?" and pads the number — there's no penalty for being too big (it just works), but a 3 a.m. page for being too small. Multiply that "just to be safe" instinct across every service, every team, every year, and you get fleets running at single-digit CPU utilization. The padding feels responsible to each engineer and is collectively enormous waste.
The cure isn't "guess smaller" — it's measure, then size to the measurement (plus headroom), and ideally let automation do it continuously so it can't drift back. Reactive over-padding is replaced by data-driven sizing.
Rightsizing VMs
For virtual machines, the loop is: pull utilization metrics (CPU, memory, network) over a representative window (weeks, to catch weekly peaks), find instances whose peak — not just average — sits well below their size, and drop them to a smaller instance type. Cloud-native advisors automate the analysis: AWS Compute Optimizer, GCP's and Azure's rightsizing recommendations, and third parties read your metrics and suggest a smaller shape. The key judgment is sizing to the peak with headroom, not the average — a service averaging 10% but spiking to 70% at noon needs to survive noon.
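As a minimal sketch of "size to the peak with headroom", assuming a made-up instance catalog and a 30% headroom figure (neither comes from any provider), the selection step might look like this:

```python
# Hypothetical instance catalog: (name, vCPUs), smallest first.
CATALOG = [("small", 2), ("medium", 4), ("large", 8), ("xlarge", 16)]


def recommend_size(current_vcpus: int, cpu_pct_samples: list[float],
                   headroom: float = 0.3) -> str:
    """Pick the smallest catalog size that covers peak usage plus headroom."""
    # Convert the highest observed utilization into vCPUs actually consumed.
    peak_vcpus = max(cpu_pct_samples) / 100.0 * current_vcpus
    needed = peak_vcpus * (1.0 + headroom)
    for name, vcpus in CATALOG:
        if vcpus >= needed:
            return name
    return CATALOG[-1][0]  # nothing is big enough: stay on the largest


# An 8-vCPU VM that idles most of the day but spikes to 70% at noon.
spiky = recommend_size(8, [5.0, 8.0, 10.0, 70.0])
# An 8-vCPU VM whose peak never exceeds 12%.
quiet = recommend_size(8, [3.0, 5.0, 4.0, 12.0])
```

Here `spiky` stays on the 8-vCPU size because its noon peak needs it, while `quiet` drops to the smallest size. Sizing the spiky VM to its average instead would have picked a shape that cannot survive noon.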
Rightsizing Kubernetes: requests, limits, and the autoscalers
Kubernetes (Chapter 4) is where overprovisioning gets most expensive and most fixable, because it's explicit in the config. Recall two settings on every container:
- Requests — what a pod reserves. The scheduler carves this much off a node for the pod whether or not it's used. Requests are what you effectively pay for — reserved capacity is unavailable to anything else.
- Limits — the ceiling a pod may burst to before it's throttled (CPU) or killed (memory, an "OOMKill").
The chronic problem: teams set requests far above real usage "to be safe." Every pod reserving 4× what it uses means the cluster needs ~4× the nodes — you're paying for a cluster mostly full of reserved-but-idle space. Fixing requests to match real usage is often the single biggest line-item cut available in a Kubernetes bill.
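The link between padded requests and node count can be sketched with a rough lower bound. The pod count and node size below are hypothetical, and the estimate ignores fragmentation, DaemonSets, and system reservations, so a real cluster would need somewhat more nodes than this.

```python
import math


def nodes_needed(pods: int, cpu_request: float, mem_request_gb: float,
                 node_cpu: float, node_mem_gb: float) -> int:
    """Lower bound on node count: the scheduler packs by requests, not usage."""
    by_cpu = math.ceil(pods * cpu_request / node_cpu)
    by_mem = math.ceil(pods * mem_request_gb / node_mem_gb)
    return max(by_cpu, by_mem)


# 48 pods on hypothetical nodes with 8 vCPU / 32 GB allocatable.
padded = nodes_needed(48, cpu_request=2.0, mem_request_gb=4.0,
                      node_cpu=8, node_mem_gb=32)
fitted = nodes_needed(48, cpu_request=0.5, mem_request_gb=1.0,
                      node_cpu=8, node_mem_gb=32)
```

With requests padded 4× the same pods need 12 nodes instead of 3, even though their real usage has not changed.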
Three autoscalers do this work, and confusing them is a common interview and on-call mistake:
| Autoscaler | What it scales | Cost effect |
|---|---|---|
| HPA — Horizontal Pod Autoscaler | Number of pods, up/down on load (CPU, memory, custom metrics) | Adds capacity only when needed; sheds it when idle |
| VPA — Vertical Pod Autoscaler | The requests/limits of each pod to fit actual usage | Directly attacks overprovisioned requests |
| Cluster Autoscaler / Karpenter | Number of nodes in the cluster | Removes empty nodes; Karpenter also picks cheaper/spot instance types and bin-packs pods |
The durable pattern: VPA rightsizes the pods, HPA scales the count, and Karpenter/Cluster Autoscaler then provisions exactly the nodes those pods need — bin-packing them tightly and tearing down nodes that empty out. Goldilocks is a popular tool that runs VPA in recommendation mode and shows you, per workload, what the requests should be — a gentle on-ramp to rightsizing without auto-applying changes. Platforms like Cast AI and ScaleOps automate the whole loop continuously.
:::warning HPA and VPA on the same metric fight each other
Don't point HPA and VPA at the same resource metric (e.g. both reacting to CPU): VPA raises the per-pod request while HPA adds pods, and they oscillate. The common safe combo: HPA on a load signal (requests-per-second or CPU) for count, VPA in recommendation mode for sizing, applied carefully. This pairing trips up a lot of teams.
:::
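To see why the two loops interfere, recall that HPA's CPU utilization is measured relative to the pod's request, and that the Kubernetes documentation gives its core rule as `desiredReplicas = ceil(currentReplicas × currentMetric / desiredMetric)`. The sketch below is a simplified illustration of that rule only: it ignores HPA's tolerance band and stabilization windows, and the function name and numbers are hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, usage_millicores: int,
                         request_millicores: int, target_pct: int) -> int:
    """Simplified HPA rule for a CPU utilization target."""
    # Utilization is usage as a percentage of the *request*, not of the node.
    utilization_pct = 100 * usage_millicores / request_millicores
    return math.ceil(current_replicas * utilization_pct / target_pct)


# 10 pods each using 800m CPU, with a 50% utilization target.
before_vpa = hpa_desired_replicas(10, usage_millicores=800,
                                  request_millicores=1000, target_pct=50)
# Same load, but VPA has just raised the request to 2000m.
after_vpa = hpa_desired_replicas(10, usage_millicores=800,
                                 request_millicores=2000, target_pct=50)
```

The load is identical in both calls, yet HPA wants 16 replicas before the VPA change and 8 after it. Each controller moves the other's input, which is the interaction the warning describes.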
Autoscaling to zero
The cheapest resource is one that isn't running (the idle-is-the-enemy rule from lesson 9.1). Serverless scales to zero by design. For containers and VMs, deliberately scaling non-production environments to zero off-hours (shutting down dev/staging on nights and weekends) routinely cuts those environments' compute cost by ~70%, because they're only needed ~1/3 of the week. Scheduled scale-down is the laziest, highest-return waste fix there is.
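The ~70% figure follows from simple arithmetic on the hours in a week. A minimal sketch, assuming cost is proportional to running hours (persistent storage keeps billing while instances are off) and a hypothetical 10-hour, 5-day working schedule:

```python
HOURS_PER_WEEK = 7 * 24  # 168


def off_hours_savings(hours_per_day: int, days_per_week: int) -> float:
    """Fraction of always-on cost avoided by running only during working hours."""
    running = hours_per_day * days_per_week
    return 1.0 - running / HOURS_PER_WEEK


# Dev/staging kept up 10 hours a day, Monday to Friday: 50 of 168 hours.
savings = off_hours_savings(hours_per_day=10, days_per_week=5)
```

Fifty running hours out of 168 is about 30% of the week, so `savings` comes out at roughly 0.70.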
A worked example: tracing a bloated cluster
A team's Kubernetes bill is $20k/month and rising. Instead of treating it as one number, they pull utilization:
- Cluster CPU utilization sits at 12%. Huge red flag — they're paying for ~8× the compute they use.
- VPA/Goldilocks shows the main service requests 2 vCPU and 4 GB per pod but actually uses ~0.3 vCPU and 700 MB. Requests are ~6× too high.
- They lower requests to 0.5 vCPU / 1 GB (real usage + headroom). Suddenly the same pods fit on far fewer nodes.
- Karpenter bin-packs the now-smaller pods and tears down the emptied nodes, also moving the fault-tolerant ones to spot.
- They scale dev/staging to zero overnight.
Result: the cluster shrinks from, say, 20 nodes to 6, and the bill falls to roughly a third, without touching a line of application code or losing any capacity at peak. The whole win came from replacing "to be safe" guesses with measured requests. This is rightsizing in one trace: measure utilization → fix requests → let the node autoscaler shrink the cluster.
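The numbers in this trace can be checked with a few lines of arithmetic. The sketch assumes decimal units (4 GB = 4000 MB) and that the bill scales linearly with node count; it leaves out the extra savings from spot and from scaling dev/staging to zero.

```python
bill = 20_000            # USD per month, from the example
cpu_utilization = 0.12   # cluster-wide CPU utilization

# Paying for 1 / utilization times the compute actually used.
overpay_factor = 1 / cpu_utilization

# How far requests sit above measured usage for the main service.
cpu_padding = 2.0 / 0.3    # requested vCPU / used vCPU
mem_padding = 4000 / 700   # requested MB / used MB

# Assumption: cost is proportional to node count (20 nodes down to 6).
new_bill = bill * 6 / 20
```

These give roughly 8.3× overpayment, requests about 6.7× (CPU) and 5.7× (memory) above usage, and a new bill near $6,000, which matches the "~8×", "~6×", and "roughly a third" figures in the trace.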
Common pitfalls
- Rightsizing to the average, not the peak. Shrink too far and you throttle or OOMKill at the daily spike. Size to peak + headroom.
- Forgetting zombies. Chasing efficient sizing while orphaned disks, idle test VMs, and old snapshots quietly bill on. Hunt idle resources too.
- Committing before rightsizing. Buying commitments for an over-provisioned fleet locks in the waste at a discount — rightsize first, then commit the smaller floor.
- Pointing HPA and VPA at the same metric. They oscillate. Separate the signals.
- One-time cleanup. Rightsizing once and watching requests drift back up. It has to be continuous — which is the FinOps loop.
Why it matters
A large fraction of cloud spend is waste — zombie/idle resources (running, used by nobody: turn them off) and overprovisioning (used, but allocated far more than needed: rightsize to measured usage + headroom). In Kubernetes the waste lives in requests set "to be safe" far above real usage; VPA rightsizes the pods, HPA scales their count, and Karpenter/Cluster Autoscaler then provisions exactly the nodes needed — with Goldilocks to show recommended requests. Always rightsize before committing, size to the peak not the average, and scale non-prod to zero off-hours. But all of this only works if you can attribute cost to a team and resource in the first place — which is the foundation we build next.
Where this leads: rightsizing needs you to know which team owns which over-provisioned resource — that attribution is tagging & cost allocation, and in shared clusters it's a hard problem of its own.