
SLIs, SLOs & error budgets

So far you can see your system. Now comes the question that defines Site Reliability Engineering: how reliable should it even be, and how do you manage that on purpose instead of by vibes? The answer is a tight little vocabulary — SLI, SLO, SLA — and one beautiful idea that falls out of it: the error budget. Get this lesson and you have the conceptual core of SRE. We'll build it from first principles and do the small bit of arithmetic that makes it real.

Start with the heresy: 100% reliability is the wrong target

The instinct is "I want my service to never fail — 100% uptime." SRE says: don't. Chasing 100% is the wrong goal, for three hard reasons:

  1. It's impossibly expensive. Each extra nine of reliability (99% → 99.9% → 99.99%) costs dramatically more — more redundancy, more engineering, more caution. The cost curve goes vertical near 100%.
  2. Your users can't tell the difference. Above some point, the user's own network, their wifi, their ISP, their phone, the public internet all fail more often than your service does. Driving from 99.99% to 99.999% is invisible to a user whose home wifi drops more than that.
  3. It freezes you. The only way to never break anything is to never change anything. But shipping features requires change, and change carries risk. A 100% reliability target is a 0% change target — it's anti-shipping.

So the real question isn't "how do we never fail?" It's "how reliable is reliable enough?" — reliable enough that users are happy, and no more, so you can spend the leftover risk on shipping. To answer that quantitatively, we need three defined terms.

SLI — the measured number

A Service Level Indicator (SLI) is a carefully chosen, measured number that reflects the health your users actually experience. Not "CPU usage" — users don't feel CPU. The SLI is something user-facing, usually expressed as a ratio of good events to total events:

  • Availability SLI = (successful requests ÷ total requests). "99.95% of requests returned without a server error."
  • Latency SLI = (requests faster than 300 ms ÷ total requests). "99% of requests were faster than 300 ms."
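As a minimal sketch of the good-events-over-total-events ratio, assume each request is logged as a status code plus a latency in milliseconds. The `Request` record, the "status below 500 counts as good" rule, and the function names are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One served request (hypothetical record shape, for illustration)."""
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to respond, in milliseconds


def availability_sli(requests: list[Request]) -> float:
    """Good events / total events, where 'good' means no 5xx server error."""
    if not requests:
        raise ValueError("no requests in the window, so the SLI is undefined")
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float = 300.0) -> float:
    """Fraction of requests that completed faster than the threshold."""
    if not requests:
        raise ValueError("no requests in the window, so the SLI is undefined")
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


sample = [
    Request(200, 120.0),
    Request(200, 450.0),
    Request(503, 80.0),
    Request(200, 299.0),
]
print(availability_sli(sample))  # 3 of 4 requests had no server error
print(latency_sli(sample))       # 3 of 4 requests were under 300 ms
```

Note that the two SLIs can disagree about the same request: the 503 above was fast but failed, and the 450 ms response succeeded but was slow, which is why they are tracked separately.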

:::tip Tie SLIs to real user journeys, not arbitrary internals

The single biggest SLI mistake is measuring something that isn't what users feel. "Average latency across all endpoints" can look healthy while your checkout flow is on fire. Good SLIs track the critical user journeys (can users log in, search, check out), measured as close to the user as practical. An SLI that doesn't map to a user journey is a number that can be green while customers are furious.

:::

SLO — the target

A Service Level Objective (SLO) is the target you set for an SLI, over a window. It's the numeric definition of "reliable enough":

SLO: 99.9% of checkout requests succeed, measured over a rolling 28 days.

That's an internal goal your team commits to. Notice it's a deliberate choice — you picked 99.9%, not 100% (impossible) and not 99% (too loose for checkout). The number should come from what users actually need for this journey.
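One way to picture an SLO is as a small record (journey, target, window) plus a check against the measured SLI. The sketch below assumes the good and total counts for the window are already available; the `SLO` class, its field names, and the "no traffic counts as met" rule are assumptions made for illustration:

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class SLO:
    """A target for one SLI over a window (field names are illustrative)."""
    journey: str
    target: Fraction   # e.g. Fraction("0.999") for 99.9%
    window_days: int


def is_met(slo: SLO, good: int, total: int) -> bool:
    """True if the measured good/total ratio over the window meets the target."""
    if total == 0:
        return True  # assumption: a window with no traffic counts as meeting the SLO
    return Fraction(good, total) >= slo.target


checkout = SLO(journey="checkout", target=Fraction("0.999"), window_days=28)
print(is_met(checkout, good=9_992_000, total=10_000_000))  # 99.92% measured
print(is_met(checkout, good=9_985_000, total=10_000_000))  # 99.85% measured
```

`Fraction` is used instead of floats so that a comparison right at the 99.9% boundary is exact.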

SLA — the contractual promise

A Service Level Agreement (SLA) is a contract with your customers promising a level of service, with consequences if you miss it — refunds, credits, penalties. Two things to internalize:

  • An SLA is a business/legal promise; an SLO is your internal engineering target.
  • Your SLO should be stricter than your SLA. If you promise customers 99.9% (SLA), target 99.95% internally (SLO), so your own alarms fire and you fix things before you breach the contract and owe refunds. The SLO is the early-warning line inside the SLA.

[Diagram: SLI, the measured number (e.g. % requests OK); SLO, your internal target (e.g. 99.9%); SLA, the contractual promise (e.g. 99.5%, looser).]
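The "early-warning line inside the SLA" can be sketched as a three-way classification of the measured SLI. The thresholds reuse the 99.9% SLA and 99.95% SLO example from the bullet above; the function name and the status labels are hypothetical:

```python
from fractions import Fraction

SLA_TARGET = Fraction("0.999")   # contractual promise (example value from the text)
SLO_TARGET = Fraction("0.9995")  # stricter internal target (example value from the text)


def classify(good: int, total: int) -> str:
    """Place the measured SLI relative to the SLO and SLA lines."""
    if total <= 0:
        raise ValueError("total must be positive")
    measured = Fraction(good, total)
    if measured >= SLO_TARGET:
        return "ok"
    if measured >= SLA_TARGET:
        # Internal target missed, contract still intact: time to fix things.
        return "slo_breached"
    return "sla_breached"


print(classify(99_960, 100_000))  # 99.96% measured
print(classify(99_920, 100_000))  # 99.92% measured
print(classify(99_800, 100_000))  # 99.80% measured
```

The middle state is the useful one: it only exists because the SLO is stricter than the SLA, and it is where your own alarms fire before any refund is owed.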

The error budget: the idea everything hinges on

Here's the move that makes SRE click. If your SLO is 99.9%, then you are explicitly allowing 0.1% of requests to fail. That allowed failure is not a regret — it's a budget.

Error budget = 100% − SLO. With a 99.9% SLO, your error budget is 0.1% of requests (or of time) over the window.

Let's make it concrete with arithmetic. Suppose checkout serves 10,000,000 requests over the 28-day window, SLO 99.9%:

  • Allowed failures = 0.1% × 10,000,000 = 10,000 failed requests.
  • That 10,000 is your error budget for the window — a quantity you are permitted to spend.

Or in time terms, "99.9% available over 28 days" allows about 40 minutes of total downtime in that window (0.1% of the window's 40,320 minutes). Each extra nine is 10× stricter: over a 30-day month, 99.9% allows about 43 minutes of downtime and 99.99% about 4.3 minutes.
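The arithmetic above can be checked with a few lines of Python. This is a sketch with hypothetical helper names; the SLO is passed as a string so that `Fraction` can represent 99.9% exactly rather than as a rounded float:

```python
from fractions import Fraction


def error_budget(slo_percent: str) -> Fraction:
    """Error budget as a fraction of events (or time): 100% minus the SLO."""
    return 1 - Fraction(slo_percent) / 100


def allowed_failures(slo_percent: str, total_requests: int) -> int:
    """Whole number of requests that may fail within the window."""
    return int(error_budget(slo_percent) * total_requests)


def allowed_downtime_minutes(slo_percent: str, window_days: int) -> float:
    """Minutes of full downtime the budget permits over the window."""
    return float(error_budget(slo_percent) * window_days * 24 * 60)


print(allowed_failures("99.9", 10_000_000))   # 10000
print(allowed_downtime_minutes("99.9", 28))   # 40.32
print(allowed_downtime_minutes("99.9", 30))   # 43.2
print(allowed_downtime_minutes("99.99", 30))  # 4.32
```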

The reframe is profound: reliability stops being a binary "up/down" and becomes a budget you can spend deliberately. You haven't "failed" until the budget is exhausted. Every failed request, every risky deploy, every experiment spends error budget — and that's allowed, by design.

Spending the budget: balancing reliability against shipping speed

Why is "a budget you can spend" so powerful? Because it dissolves the eternal fight between the team that wants to ship fast and the team that wants stability. Instead of arguing by opinion, you look at the budget:

  • Budget healthy (plenty left)? You're more reliable than you need to be — go faster. Ship features, take calculated risks, run experiments. Spending budget is the point; an untouched budget means you're being too cautious and under-shipping.
  • Budget exhausted (SLO breached)? Stop spending. Freeze risky changes and redirect engineering to reliability work until you're back in budget.

This rule — written down — is the error-budget policy: an agreed, automatic answer to "do we ship or do we stabilize?" that both product and engineering signed off on in advance. It turns a political argument into a data-driven decision. (We'll see in lesson 6.5 how the burn rate — how fast you're spending the budget — drives the alerts, and the error-budget policy returns alongside toil in lesson 6.6's SRE discipline.)
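Written as code, the simplest form of such a policy might look like the sketch below. It assumes the budget is tracked in failed requests; the function names and the two-state "ship" / "stabilize" outcome are illustrative, and a real policy would carry whatever thresholds product and engineering actually agreed on:

```python
def budget_remaining(allowed_failures: int, observed_failures: int) -> float:
    """Fraction of the error budget still unspent (zero or negative once exhausted)."""
    if allowed_failures <= 0:
        raise ValueError("allowed_failures must be positive")
    return 1 - observed_failures / allowed_failures


def policy_decision(remaining: float) -> str:
    """Pre-agreed answer to 'do we ship or do we stabilize?'."""
    if remaining > 0:
        return "ship"       # budget left: keep releasing and taking calculated risks
    return "stabilize"      # budget exhausted: freeze risky changes, fix reliability


# With the 10,000-failure budget from the checkout example:
print(policy_decision(budget_remaining(10_000, 2_500)))   # 75% of the budget left
print(policy_decision(budget_remaining(10_000, 12_000)))  # budget overspent
```

The value of the function is less the logic than the fact that both sides agreed to its output before the budget ran out.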

:::note Error budgets gate releases

A common, powerful integration: wire the error budget into your delivery pipeline (Chapter 5). If the budget is healthy, canary releases roll forward freely; if it's exhausted, the pipeline blocks risky deploys automatically. This is how "balance reliability against velocity" stops being a slogan and becomes a guardrail in the system itself.

:::
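A pipeline gate of this kind could be as small as the following sketch. The `Deploy` record and its `risky` flag are hypothetical; how a change gets labeled risky, and where the remaining-budget figure comes from, are left as assumptions about your own tooling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deploy:
    """A change waiting to go out (hypothetical shape, for illustration)."""
    name: str
    risky: bool  # assumption: set by the team's own review or labeling process


def gate(deploy: Deploy, budget_remaining: float) -> bool:
    """Return True if the pipeline may roll this deploy forward."""
    if budget_remaining > 0:
        return True          # healthy budget: everything rolls forward
    return not deploy.risky  # exhausted budget: only non-risky changes pass


feature = Deploy(name="new-checkout-flow", risky=True)
fix = Deploy(name="retry-timeout-fix", risky=False)

print(gate(feature, budget_remaining=0.40))   # healthy budget
print(gate(feature, budget_remaining=-0.05))  # exhausted budget, risky change
print(gate(fix, budget_remaining=-0.05))      # exhausted budget, reliability fix
```

In a CI job, a step like this would typically exit non-zero when `gate` returns `False`, so the block is enforced by the pipeline rather than by whoever argues loudest.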

Common pitfalls

  • Targeting 100%. Impossible, ruinously expensive, and anti-shipping — leaves you no error budget to spend.
  • SLIs that don't track user journeys. A green "average latency" dashboard while checkout is broken.
  • An SLA with no SLO or error budget behind it. A promise to customers with no internal early-warning line or burn-rate alert defending it (the gap that leads straight to surprise breaches; see lesson 6.5).
  • Treating the error budget as a failure to be avoided entirely. An always-full budget means you're under-shipping; the budget exists to be spent.
  • No written error-budget policy. Without the pre-agreed "ship vs stabilize" rule, every budget breach reopens the same political fight.

Why it matters

The foundational SRE insight is that 100% reliability is the wrong target — too costly, invisible to users, and incompatible with shipping. So you define "reliable enough" numerically with three terms: an SLI (a measured, user-facing number), an SLO (your internal target for it, tied to real user journeys), and an SLA (a looser contractual promise, with your SLO sitting stricter inside it as an early warning). The magic falls out as the error budget = 100% − SLO: the allowed failure becomes a quantity you can spend. A healthy budget means ship faster; an exhausted one means stabilize — codified in an error-budget policy that ends the velocity-vs-reliability war with data instead of opinions. Next: how the rate you're burning that budget drives alerts that don't wake people up for nothing.
