Monitoring vs observability

You've built a system, provisioned it as code, and shipped it through a pipeline. Now it's running and serving real users, and a new, unending job begins: knowing whether it's actually healthy, and figuring out why when it isn't. The two words you'll hear constantly here are monitoring and observability. People often use them interchangeably, but they name two different things, and confusing them is a common and expensive mistake. This lesson draws the line clearly, because every later lesson stands on it.

Monitoring: watching for the failures you predicted

Monitoring means watching a system against questions you decided to ask in advance. You think about how your system might fail, and you set up dashboards and alerts for those scenarios: "graph the CPU," "alert me if the error rate goes above 1%," "page someone if the disk fills up." When one of those predefined conditions trips, you find out.
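To make "questions you decided to ask in advance" concrete, here is a minimal sketch of threshold-style monitoring in plain Python. The rule names, metric names, and the 95% disk threshold are all assumptions made up for illustration; only the 1% error-rate figure comes from the example above. Real alerting systems are far richer, but the shape is the same: a fixed list of conditions checked against the current numbers.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str         # human-readable description of the predicted failure
    metric: str       # which number in the snapshot to check
    threshold: float  # the rule fires when the metric exceeds this value
    action: str       # "alert" or "page" (hypothetical action labels)


# Hypothetical rules: each one encodes a failure somebody predicted.
RULES = [
    AlertRule("error rate above 1%", "error_rate", 0.01, "alert"),
    AlertRule("disk nearly full", "disk_used_fraction", 0.95, "page"),
]


def evaluate(rules, snapshot):
    """Return the rules whose metric exceeds its threshold in this snapshot."""
    fired = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append(rule)
    return fired


# A made-up snapshot of current metric values.
snapshot = {"error_rate": 0.03, "disk_used_fraction": 0.40, "cpu": 0.55}
fired = evaluate(RULES, snapshot)
for rule in fired:
    print(f"{rule.action.upper()}: {rule.name}")
```

Notice that `cpu` is in the snapshot but no rule looks at it. A condition nobody wrote down can never fire, which is the hidden assumption the next paragraph picks up.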

Monitoring is essential and you will always do it. But notice its hidden assumption: you already knew what to look for. It answers questions you had the foresight to ask. The industry's term for these is known-unknowns — things you knew could go wrong (a known category of problem), even if you didn't know whether they would (the unknown part). Disk filling up, CPU spiking, a dependency timing out — all anticipated. You built a watch for each one.

The problem: the failures you didn't predict

Modern systems break in ways nobody anticipated. Picture a typical setup: a request flows through a dozen small services, three databases, a cache, a message queue, and two third-party APIs, all running as containers that move between machines. Now production reports: "checkout is slow, but only for users in Brazil, only on Android, only when they have more than five items in their cart."

No dashboard you built in advance covers that. You never imagined that exact combination — it's an unknown-unknown, a failure from a category you didn't even know existed. And here's the crucial part: you can't pre-build a dashboard for a question you didn't know you'd need to ask. Monitoring, by its nature, can only show you the things you thought of ahead of time.

To debug that Brazil/Android/big-cart problem, you don't need another pre-made graph. You need to interrogate your system live — to slice the data by country, then by device, then by cart size, following the clues wherever they lead, asking questions you're inventing on the spot. That capability is observability.

Observability: asking new questions of a running system

Observability is the ability to understand a system's internal state from the outside, well enough to ask and answer new questions you didn't anticipate — without shipping new code to do it. The word is borrowed from control theory, where a system is "observable" if you can infer everything about its internal state from its outputs.

The practical difference comes down to the shape of the data you collect:

Monitoring tends to rely on pre-aggregated numbers — "total error rate," "average latency." Cheap and great for known questions, but the detail that would let you slice by country-and-device-and-cart-size has already been averaged away.
Observability relies on keeping data high-dimensional and high-cardinality — events rich enough that you can group and filter by many attributes (country, device, cart size, build version, customer tier) after the fact. The questions don't have to be decided in advance, because the raw detail is still there to query.
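The contrast between the two data shapes can be sketched in a few lines of Python. Everything below is hypothetical: the events, the field names, and the latency values are invented to mirror the Brazil/Android/big-cart story, and a real system would query an event store rather than a list in memory. The point is only that the raw events can answer a question the overall average cannot.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical checkout events: every field name and value is invented.
EVENTS = [
    {"country": "BR", "device": "android", "cart_items": 7, "latency_ms": 4200},
    {"country": "BR", "device": "android", "cart_items": 6, "latency_ms": 3900},
    {"country": "BR", "device": "android", "cart_items": 2, "latency_ms": 310},
    {"country": "BR", "device": "ios", "cart_items": 8, "latency_ms": 290},
    {"country": "US", "device": "android", "cart_items": 9, "latency_ms": 300},
    {"country": "US", "device": "ios", "cart_items": 1, "latency_ms": 280},
    {"country": "DE", "device": "android", "cart_items": 3, "latency_ms": 320},
    {"country": "DE", "device": "ios", "cart_items": 6, "latency_ms": 300},
]


def mean_latency_by(events, key_fn):
    """Group raw events by key_fn(event) and return the mean latency per group."""
    groups = defaultdict(list)
    for event in events:
        groups[key_fn(event)].append(event["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}


# The pre-aggregated view: one number, with the detail averaged away.
overall = mean(event["latency_ms"] for event in EVENTS)

# A question invented after the fact: country x device x "more than five items".
by_slice = mean_latency_by(
    EVENTS,
    lambda e: (e["country"], e["device"], e["cart_items"] > 5),
)
worst = max(by_slice, key=by_slice.get)

print(f"overall mean latency: {overall} ms")
print(f"slowest slice: {worst} at {by_slice[worst]} ms")
```

If only `overall` had been stored, the slow slice would be unrecoverable. Because the events kept their attributes, the grouping key could be chosen long after the data was written.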

:::note Cardinality: a word you'll meet again
Cardinality is the number of distinct values an attribute can take. `country` has ~200 values (low cardinality); `user_id` has millions (high cardinality). Observability wants high-cardinality detail so you can slice finely, but as you'll see in lesson 6.2 and again with metrics, that same high cardinality is exactly what can blow up your storage bill. Holding both truths at once is a core skill of this chapter.
:::
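A rough sketch of why cardinality cuts both ways, in plain Python. The sample events and the assumed cardinalities for `device` and `customer_tier` are hypothetical, and the product below is only a worst-case upper bound on distinct attribute combinations, not a prediction of what any particular storage backend would actually create.

```python
from math import prod


def cardinality(events, attribute):
    """Count the distinct values one attribute takes across the events."""
    return len({event[attribute] for event in events})


def combinations_upper_bound(cardinalities):
    """Worst case: every value of every attribute appears with every other."""
    return prod(cardinalities.values())


# Hypothetical sample: two countries, four users (one user appears twice).
EVENTS = [
    {"country": "BR", "user_id": "u-001"},
    {"country": "BR", "user_id": "u-002"},
    {"country": "US", "user_id": "u-003"},
    {"country": "US", "user_id": "u-004"},
    {"country": "BR", "user_id": "u-001"},
]
print(cardinality(EVENTS, "country"), cardinality(EVENTS, "user_id"))

# Assumed cardinalities for illustration only.
low_only = {"country": 200, "device": 3, "customer_tier": 4}
with_user = {**low_only, "user_id": 2_000_000}

print(combinations_upper_bound(low_only))
print(combinations_upper_bound(with_user))
```

Adding one high-cardinality attribute multiplies the worst-case number of combinations by its cardinality, which is why the same detail that makes fine slicing possible can also drive up cost.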

They're partners, not rivals

This is not "observability good, monitoring bad." You need both, and they work together:

Monitoring tells you that something is wrong, fast — a symptom-based alert fires (the subject of lesson 6.5) and wakes someone up.
Observability lets that someone figure out why, by exploring the rich telemetry and asking novel questions until the cause emerges.

A healthy operation has tight, meaningful monitoring on top of deeply observable systems. The alert is the smoke detector; observability is being able to walk through the whole house and find the fire.

Why this distinction is worth caring about

A lot of teams believe they "have observability" because they bought a dashboarding tool and made some graphs. That's monitoring: valuable, but it only ever answers the questions they already thought to ask. The moment a novel failure hits (and in a distributed system, many of the serious ones are novel), pre-built graphs leave them blind, squinting at averages that hide the very detail they need.

The fix is to design for explorability from the start: emit telemetry rich and high-dimensional enough that future-you can ask questions present-you hasn't imagined yet. The rest of this chapter is largely about how — the kinds of telemetry to emit (the pillars), the standard way to emit it (OpenTelemetry), and how to turn it into reliability you can manage (SLOs and the SRE discipline).
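As a small taste of what "rich and high-dimensional" can look like, here is a sketch that builds one wide, structured event per request using only the Python standard library. This is not the OpenTelemetry API, which later lessons cover; the function name `build_event`, the field names, and the sample values are all assumptions chosen to match the attributes mentioned earlier in this lesson.

```python
import json
import time


def build_event(name, duration_ms, **attributes):
    """Build one wide event: a few fixed fields plus any number of attributes."""
    event = {
        "name": name,
        "timestamp": time.time(),
        "duration_ms": duration_ms,
    }
    # Attributes are kept as-is rather than aggregated, so they stay queryable.
    event.update(attributes)
    return event


# Hypothetical request: every value here is invented for illustration.
event = build_event(
    "checkout",
    412.0,
    country="BR",
    device="android",
    cart_items=7,
    build_version="hypothetical-1.2.3",
    customer_tier="free",
)

# One JSON object per line is a common way to ship events to a store.
line = json.dumps(event, sort_keys=True)
print(line)
```

The design choice worth noticing is that nothing is summed or averaged at emit time. Each attribute travels with the event, so deciding which ones to group by can wait until someone is debugging.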

Why it matters

Monitoring watches a system against failure modes you predicted in advance — it catches known-unknowns via pre-built dashboards and threshold alerts, and you'll always need it. But modern distributed systems mostly fail in ways nobody anticipated — unknown-unknowns — and you can't pre-build a dashboard for a question you didn't know to ask. Observability closes that gap: by keeping telemetry rich and high-dimensional, it lets you ask and answer new questions about a running system on the spot, which is what real debugging of distributed systems requires. They're partners: monitoring tells you that something's wrong; observability lets you discover why. Next, we look at the raw materials of observability — the pillars of telemetry.

