Chapter 6 checkpoint

You can now see into a running system and operate it reliably by design. Review the throughline below, then take the quiz.

The throughline

  • Monitoring watches for failures you predicted (known-unknowns) via pre-built dashboards/alerts; observability keeps telemetry rich and high-dimensional so you can ask new questions about unknown-unknowns. You need both.
  • Pillars: metrics (cheap numbers over time — keep labels low-cardinality), logs (timestamped events — structure them + add a correlation ID), traces (one request across all services, made of spans), and continuous profiling (which function burns resources). Correlate across them via shared IDs.
  • Cardinality is the cost that connects it all: high-cardinality values (user_id) in metric labels explode the TSDB — push that detail into logs/traces.
  • OpenTelemetry = instrument once against an open standard, swap backends freely. Three pieces: SDK/API, OTLP (wire protocol), Collector (the swappable seam). Traces need context propagation; control cost with sampling — prefer tail (keep all errors/outliers) over blind head sampling.
  • SLI (measured number) → SLO (internal target, tied to user journeys) → SLA (contractual promise, set looser than the SLO). 100% is the wrong target. Error budget = 100% − SLO: the amount of unreliability you are allowed to spend (a 99.9% SLO leaves 0.1%). The error-budget policy balances shipping vs stability.
  • Alert on symptoms, not causes. Use RED (Rate/Errors/Duration) as SLIs for services and USE (Utilization/Saturation/Errors) for debugging resources; watch latency at percentiles, not averages. Page on burn rate (how fast the error budget is being spent); the enemy is alert fatigue, so every page must be actionable. On-call should be humane and mostly quiet.
  • Incidents: detect → triage (severity) → single Incident Commander coordinates + comms → mitigate before diagnosing → resolve → blameless postmortem (systems not scapegoats; contributing factors not one root cause) → owned action items + runbooks. SRE caps toil and governs velocity-vs-reliability with the error budget.
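
The error-budget and burn-rate bullets above reduce to a few lines of arithmetic, so a small worked sketch may help before the quiz. It is illustrative only: the function names are hypothetical, the 30-day window is an assumption, and the 14.4 paging threshold is a commonly cited value (spending 2% of a 30-day budget within one hour) rather than something this chapter prescribes.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to fail: 1 - SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget expressed as minutes of full outage in the window.

    window_days=30 is an assumed SLO window, not a value from the chapter.
    """
    return error_budget(slo) * window_days * MINUTES_PER_DAY


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring.

    1.0 means the budget runs out exactly at the end of the window.
    """
    return observed_error_ratio / error_budget(slo)


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both a short and a long window burn fast.

    Requiring both windows keeps brief blips from paging anyone.
    threshold=14.4 is an assumed default (2% of a 30-day budget in 1 hour).
    """
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)


if __name__ == "__main__":
    slo = 0.999
    # A 99.9% SLO over 30 days allows roughly 43.2 minutes of full outage.
    print(round(allowed_downtime_minutes(slo), 1))
    # 2% of requests failing in both windows is a burn rate of about 20: page.
    print(should_page(0.02, 0.02, slo))
    # A short spike with a healthy long window does not page.
    print(should_page(0.02, 0.0005, slo))
```

Checking two windows is one way to honor "every page must be actionable": the long window shows the problem is sustained, and the short window shows it is still happening.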

Quiz

Chapter 6 — Observability & SRE

You can now see into a running system and operate it reliably — defining what "healthy" means numerically, instrumenting vendor-neutrally to observe it, alerting on what users feel, and learning from failure without blame. The next chapter takes the instinct to make good practice the default and turns it into a product: an internal platform that gives every app team this (and more) for free.

Next: Chapter 7: Platform Engineering & IDPs →