
Alerting that doesn't burn people out

You can see your system (the pillars), emit telemetry vendor-neutrally (OpenTelemetry), and define "healthy" numerically (SLOs and the error budget). Now the operational moment: when should a human be woken up? Get this wrong and you produce alert fatigue — so many noisy, useless pages that people stop trusting alerts and miss the real one. Get it right and alerts are rare, meaningful, and actionable. This lesson is about designing alerts that respect the human on the other end of the pager, and the on-call rotation that human lives in.

The cardinal rule: alert on symptoms, not causes

The most important principle in alerting is short and constantly violated:

:::tip
Alert on symptoms users feel, not internal causes. A symptom is something the user experiences: requests failing, pages loading slowly, checkout erroring. A cause is an internal condition that might lead to a symptom: CPU at 90%, a disk 80% full, memory climbing. Alert on symptoms; investigate causes. Paging on "CPU is high" is the classic mistake: high CPU might be totally fine (a healthy service under healthy load), and you've just woken someone for nothing. If users are happy, there is no incident, no matter what the internal gauges say.
:::

Why cause-based alerts are a trap:

  • They're often false alarms. High CPU, high memory, a full-ish queue — these are frequently normal. Each one is a page that didn't need to happen.
  • They miss real problems. Your CPU can be perfectly normal while checkout is totally broken for a reason CPU never reflected. Cause-alerts watch the wrong thing.
  • They multiply. A system has hundreds of internal gauges. Alert on all of them and you drown in pages, none of which directly means "users are hurting."

The discipline: the things that page a human should be symptoms — ideally tied straight to your SLOs (is the error/latency SLI breaching?). Internal causes belong on dashboards you consult during an investigation, not on the pager. This is also exactly the monitoring-vs-observability split from lesson 6.1: symptom alerts (monitoring) tell you that users hurt; you then use observability to find the cause.

Designing what to measure: RED and USE

Two simple, durable recipes tell you which signals to put on dashboards and symptom-alerts, depending on whether you're looking at a service or a resource.

RED — for request-driven services

For anything that serves requests (an API, a web service), track three things — R, E, D:

  • Rate — requests per second. How much traffic?
  • Errors — how many of those requests are failing?
  • Duration — the distribution of response times (latency), watched at percentiles like p50/p95/p99, not the average.

RED maps almost perfectly onto user-felt symptoms — Errors and Duration are what users experience — which is why RED metrics make excellent SLIs and the right basis for symptom alerts.
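
As a rough illustration, the three RED numbers can be derived from one window of request records. The sketch below assumes a hypothetical `Request` record (status code plus duration) and treats HTTP 5xx responses as errors; both are illustrative choices, not a prescribed schema, and the sample data is made up.

```python
import math
from dataclasses import dataclass

@dataclass
class Request:
    # Hypothetical record shape, for illustration only.
    status: int         # HTTP status code
    duration_ms: float  # time taken to serve the request

def red_summary(requests, window_seconds):
    """Summarise Rate, Errors and Duration for one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx counts as a failure
    durations = sorted(r.duration_ms for r in requests)

    def pct(p):
        # Nearest-rank percentile over the sorted durations.
        if total == 0:
            return None
        return durations[max(1, math.ceil(p * total / 100)) - 1]

    return {
        "rate_rps": total / window_seconds,                  # Rate
        "error_ratio": errors / total if total else 0.0,     # Errors
        "p50_ms": pct(50),                                   # Duration, typical request
        "p99_ms": pct(99),                                   # Duration, the tail
    }

# Made-up 60-second window: 588 fast successes and 12 slow failures.
window = [Request(200, 40.0)] * 588 + [Request(503, 900.0)] * 12
red = red_summary(window, window_seconds=60)
print(red)
```

In this made-up window the median looks healthy (40 ms) while the error ratio (2%) and the p99 (900 ms) show what the unlucky users experienced.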

:::note Why percentiles, never averages, for latency
An average latency hides the users who are suffering. If 99 requests take 50 ms and one takes 5 seconds, the average is a comfortable ~100 ms, while one user waited 5 seconds. Percentiles expose that: p99 is "the slowest 1% of requests took at least this long." Alert and set SLOs on high percentiles (p95/p99), because the tail is the unhappy users. (This is where exemplars from lesson 6.2 shine: click the p99 spike on the graph and jump straight to a trace of one of those slow requests.)
:::
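
The gap between the average and the tail is easy to check by hand. The sketch below uses a nearest-rank percentile (one of several common definitions) and a slightly larger made-up sample than the 99-and-1 example: 980 fast requests and 20 slow ones, so that the slow requests sit clearly inside the slowest 1% rather than exactly on its boundary.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up data: 980 requests at 50 ms, 20 requests at 5 seconds.
latencies_ms = [50] * 980 + [5000] * 20

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 149.0, looks fine
p50_ms = percentile(latencies_ms, 50)            # 50, the typical request
p95_ms = percentile(latencies_ms, 95)            # 50, still hides the problem
p99_ms = percentile(latencies_ms, 99)            # 5000, the suffering users

print(mean_ms, p50_ms, p95_ms, p99_ms)
```

Note that p95 still reads 50 ms here: how far into the tail you need to look depends on how many users are affected, which is one reason to watch more than one percentile.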

USE — for resources

For a resource (CPU, memory, disk, a connection pool, a queue), track U, S, E:

  • Utilization — how busy/used it is (e.g. % CPU, % disk).
  • Saturation — how much extra work is queued and waiting (the run queue, pending connections) — often the real early warning.
  • Errors — error counts for that resource (e.g. failed disk writes).

USE is for investigation — the dashboards you open to find the cause once a symptom alert has fired. The pairing is the whole philosophy in two acronyms: page on RED symptoms; debug with USE causes.
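
To make the three USE numbers concrete, here is a minimal sketch for a single resource, a connection pool. The `PoolSnapshot` fields are hypothetical; real pools and exporters expose different names and units.

```python
from dataclasses import dataclass

@dataclass
class PoolSnapshot:
    # Hypothetical fields for a connection pool, for illustration only.
    in_use: int            # connections currently checked out
    capacity: int          # maximum connections the pool allows
    waiting: int           # callers queued for a free connection
    failed_checkouts: int  # checkout attempts that errored in this interval

def use_summary(snap):
    """Return Utilization, Saturation and Errors for one resource snapshot."""
    return {
        "utilization": snap.in_use / snap.capacity,  # how busy the resource is (0.0 to 1.0)
        "saturation": snap.waiting,                  # extra work queued and waiting
        "errors": snap.failed_checkouts,             # resource-level error count
    }

snap = PoolSnapshot(in_use=20, capacity=20, waiting=7, failed_checkouts=0)
use = use_summary(snap)
print(use)
```

Utilization tops out at 100%, so on its own it cannot say whether the pool is merely full or badly overloaded; the seven waiting callers in the saturation figure are what tell you work is backing up.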

Burn-rate alerting: the modern, SLO-native way to page

The best symptom alert isn't a fixed threshold ("error rate > 1%") — those are arbitrary and either too jumpy or too slow. The modern approach alerts on how fast you're consuming your error budget: the burn rate.

Recall from lesson 6.4 that the error budget = 100% − SLO. Burn rate is how fast you're spending that budget, expressed as a multiple of the pace that would use it up exactly at the end of the SLO window:

  • Burn rate 1× = you'll spend exactly the whole budget over the full window. Sustainable.
  • Burn rate 10× = you're spending it ten times too fast — at this pace the entire budget is gone in a fraction of the window. That's an emergency.
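
The arithmetic behind those multiples is small enough to sketch. The example assumes a 99.9% SLO and a 30-day window; both are illustrative values, as are the function names.

```python
def burn_rate(error_ratio, slo):
    """Multiple of the sustainable pace at which the error budget is being spent."""
    budget = 1.0 - slo  # error budget = 100% - SLO, as a fraction
    return error_ratio / budget

def days_to_exhaustion(rate, window_days):
    """Days until the whole budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")  # nothing is failing, so the budget is never spent
    return window_days / rate

slo = 0.999       # assumed target: 99.9% of requests succeed
window_days = 30  # assumed SLO window

sustainable = burn_rate(error_ratio=0.001, slo=slo)  # about 1x: budget lasts the full window
emergency = burn_rate(error_ratio=0.010, slo=slo)    # about 10x: budget gone in about 3 days

print(round(sustainable, 2), round(days_to_exhaustion(sustainable, window_days), 1))
print(round(emergency, 2), round(days_to_exhaustion(emergency, window_days), 1))
```

With a 0.1% budget, a 0.1% error ratio burns at roughly 1× and a 1% error ratio at roughly 10×, which would empty a 30-day budget in about three days.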

So you alert on burn rate, with severity scaled to speed:

  • Fast burn (e.g. 14× over the last hour; at that pace a 30-day budget would be gone in about 2 days): page someone now. Something is actively, seriously broken.
  • Slow burn (e.g. 2× over the last 6 hours, a steady leak): open a ticket, not a 3 a.m. page. It needs attention, but not right now.

[Figure: decision flow. Error budget (100% − SLO) → "How fast is it burning?" → Fast (e.g. 14× / 1h): PAGE now, active outage · Slow (e.g. 2× / 6h): Ticket, attend soon, don't page · At or below 1×: No alert, within budget.]
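
The fast/slow split can be written as a small decision function. This is a minimal sketch using the example thresholds from the text (14× over one hour, 2× over six hours); the function and constant names are hypothetical, and treating everything below the slow threshold as "no alert" is a simplification.

```python
FAST_BURN = 14.0  # example threshold from the text, measured over the last hour
SLOW_BURN = 2.0   # example threshold from the text, measured over the last 6 hours

def alert_action(burn_1h, burn_6h):
    """Map recent burn rates to 'page', 'ticket' or 'none'."""
    if burn_1h >= FAST_BURN:
        return "page"    # active, serious breakage: wake someone now
    if burn_6h >= SLOW_BURN:
        return "ticket"  # steady leak: handle it soon, in working hours
    return "none"        # within budget: nobody is notified

print(alert_action(burn_1h=20.0, burn_6h=5.0))  # page
print(alert_action(burn_1h=3.0, burn_6h=2.5))   # ticket
print(alert_action(burn_1h=0.5, burn_6h=0.8))   # none
```

Checking the fast condition first means a severe outage always pages, even if the slower window would only have justified a ticket.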

Why this is better than fixed thresholds: it fires on how much the user is actually being hurt and how urgently, automatically tuning sensitivity to your real reliability target. A brief blip that barely dents the budget doesn't wake anyone; a sustained outage burning the budget fast pages immediately. This is the alerting that "doesn't burn people out" — literally driven by the budget burn.

Alert fatigue: the failure mode to design against

Alert fatigue is what happens when people get too many noisy, non-actionable alerts: they start ignoring them, silencing them, or simply missing the one that mattered in the flood. It's a genuine safety and human problem — burned-out on-call engineers and overlooked outages. The antidotes are everything above plus one test:

:::danger Every page must be actionable
Before any alert is allowed to page a human, it must pass one question: "When this fires, is there a clear, urgent action a human must take right now?" If the honest answer is "not really" (it's informational, or self-resolving, or just a number being high), it must not page. Demote it to a dashboard or a ticket. A page that the responder can't or needn't act on is pure fatigue, and it erodes trust in the pages that are real.
:::

The usual tooling: Prometheus + Alertmanager (define alert rules, route/group/dedupe them), Grafana (dashboards and alerts), and PagerDuty / Opsgenie (manage the human side: who gets paged, escalation, schedules). SLOs and their burn-rate alerts can be generated from declarative definitions with SLO-as-code tooling such as Sloth, the OpenSLO specification, and Nobl9, so you don't hand-write the math.

On-call: the humane side

An alert has to reach a person — that's on-call: a rotation where one engineer is responsible for responding during their shift, with a defined escalation policy (who gets paged next, and after how long, if the primary doesn't acknowledge). On-call is necessary, but it's also where engineering burnout is manufactured, so healthy practices matter:

  • A sane rotation. Spread on-call across enough people that nobody's life is dominated by it (a common shape is one week every several weeks). One person perpetually on-call is a resignation in progress.
  • Compensate and protect it. On-call is real work and disruption; treat it as such (time off in lieu, pay, or reduced project load during the shift).
  • A quiet pager is the goal. The whole point of symptom-based, actionable, burn-rate alerts is that a well-run on-call shift is mostly quiet. If on-call is constantly paging, that's a bug in your alerting (or your reliability), not a fact of life — fix the alerts.
  • Hand off cleanly. Each shift hands the next a short summary of what's ongoing, so context isn't lost between people.
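
The escalation policy mentioned above (who gets paged next, and after how long) is essentially a small ordered table. A minimal sketch, with hypothetical responder names and timings that are placeholders rather than recommendations:

```python
from dataclasses import dataclass

@dataclass
class EscalationStep:
    responder: str      # who is paged at this step
    after_minutes: int  # minutes without acknowledgement before this step starts

# Hypothetical policy; names and timings are placeholders.
POLICY = [
    EscalationStep("primary on-call", 0),
    EscalationStep("secondary on-call", 15),
    EscalationStep("engineering manager", 30),
]

def paged_so_far(policy, minutes_unacknowledged):
    """Everyone who should have been paged by now for an unacknowledged alert."""
    return [s.responder for s in policy if minutes_unacknowledged >= s.after_minutes]

print(paged_so_far(POLICY, 5))   # ['primary on-call']
print(paged_so_far(POLICY, 20))  # ['primary on-call', 'secondary on-call']
```

In practice this table lives in the paging tool rather than in application code; the point is only that escalation is explicit and time-bound, so an unanswered page never silently stalls.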

A healthy on-call rotation is a load-bearing part of reliability and of retaining the humans who provide it. When a page does fire, it kicks off the incident response process — the subject of the next lesson.

Common pitfalls

  • Alerting on causes (high CPU) instead of symptoms (errors/latency). The number-one source of noisy, non-actionable pages.
  • Averaging latency instead of using p95/p99. Hides the tail of users who are actually suffering.
  • Fixed-threshold alerts instead of burn-rate alerts. Arbitrary numbers that are either too jumpy or too slow; burn rate scales to real user impact.
  • Pages that aren't actionable. Anything self-resolving or informational that pages a human is pure alert fatigue — demote it.
  • A brutal on-call rotation. Too few people, no compensation, a constantly screaming pager — a direct path to burnout and missed incidents.

Why it matters

Good alerting respects the human on the pager. The cardinal rule is alert on symptoms users feel — not internal causes: page on errors and latency (use RED for services), and keep cause-signals (USE for resources) on dashboards for investigation, not on the pager. Watch latency at percentiles, never averages, because the tail is the unhappy users. The modern, SLO-native page is burn-rate alerting — fire on how fast the error budget is being consumed, with fast burns paging immediately and slow burns becoming tickets — which automatically scales urgency to real user impact and is the antidote to alert fatigue. Every page must be actionable, or it gets demoted. And all of it lands on a person, so a humane on-call rotation — adequately staffed, compensated, and mostly quiet — is itself part of reliability. When a real page fires, what happens next is incident response.
