The pillars: metrics, logs, traces (and profiling)
Observability needs raw material — the data a system emits about itself. That data is called telemetry, and it comes in a few distinct shapes, traditionally called the three pillars: metrics, logs, and traces. A fourth — continuous profiling — has joined them. Each answers a different question and has a different cost profile, and a real skill of this field is reaching for the right one. This lesson defines all four from scratch, shows how they interlock, and introduces the cost that connects them: cardinality.
Metrics: numbers over time
A metric is a numeric measurement of some aspect of your system, sampled repeatedly over time — a number with a timestamp, recorded again and again. "Requests per second," "error count," "p99 latency in milliseconds," "memory used." Plot any of them and you get the familiar wiggly line on a dashboard.
Metrics are stored in a time-series database (TSDB), a database specialized for "value at timestamp" data. They're wonderfully cheap and compact: a number recorded every few seconds compresses to very little, so you can keep months of history and compute averages and percentiles quickly. That makes them ideal for:
- Dashboards and trends — is latency creeping up over the week?
- Alerting — fire when the error rate crosses a threshold (or, better, when an SLO is burning — lesson 6.5).
Their limitation is the flip side of their cheapness: a metric is just a number. "Errors jumped to 500/sec" tells you that something's wrong, but contains zero detail about which requests or why. For that you need the other pillars.
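To make "a number with a timestamp" concrete, here is a minimal sketch that derives a per-second error rate from two samples of a cumulative counter and compares it with a threshold. The `Sample` type, the sample values, and the 100 errors/sec threshold are all invented for illustration; a real TSDB and alerting system do this with their own query language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    timestamp: float  # seconds since epoch
    value: float      # cumulative error count at that time


def rate_per_second(older: Sample, newer: Sample) -> float:
    """Average increase per second between two counter samples."""
    elapsed = newer.timestamp - older.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (newer.value - older.value) / elapsed


# Hypothetical samples taken 15 seconds apart.
samples = [Sample(1000.0, 12000.0), Sample(1015.0, 19500.0)]
error_rate = rate_per_second(samples[0], samples[1])

ALERT_THRESHOLD = 100.0  # errors/sec, an assumed value
should_alert = error_rate > ALERT_THRESHOLD
print(f"{error_rate:.0f} errors/sec, alert={should_alert}")
```

Note what the result does and does not contain: it says the rate is 500 errors/sec, and nothing about which requests failed.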
Labels and the cardinality trap
Metrics carry labels (also called dimensions or tags) — key-value attributes that let you slice them: http_requests_total{method="GET", status="200", route="/checkout"}. Each unique combination of label values is a separate time series that the TSDB must store.
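One way to picture "each unique combination is a separate time series" is a counter keyed by metric name plus label set. This is a toy sketch, not how any particular TSDB or client library is implemented; the `inc` helper is hypothetical.

```python
from collections import Counter

# Each key is one time series: the metric name plus a sorted tuple of labels.
series = Counter()


def inc(name: str, **labels: str) -> None:
    """Increment the series identified by this exact label combination."""
    key = (name, tuple(sorted(labels.items())))
    series[key] += 1


inc("http_requests_total", method="GET", status="200", route="/checkout")
inc("http_requests_total", method="GET", status="200", route="/checkout")
inc("http_requests_total", method="GET", status="500", route="/checkout")
inc("http_requests_total", method="POST", status="200", route="/checkout")

# Four increments, but only three distinct label combinations were seen.
print(len(series))
```

The two identical GET/200 increments land in the same series; changing any one label value starts a new one.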
Here is the single most important cost rule in all of metrics:
:::danger Never put high-cardinality values in metric labels
Every distinct combination of label values creates a new time series. Add a label like user_id or request_id and you don't get one extra series — you get one per user or one per request, potentially millions. This is cardinality explosion, and it's the classic way to blow up your time-series database's storage and your bill (a "cardinality bomb"). Keep metric labels low-cardinality (method, status code, route template, region). When you need per-user or per-request detail, that belongs in logs or traces — not metric labels.
:::
This is the cardinality tension from lesson 6.1 made concrete: observability wants fine detail, but metrics are exactly the pillar where fine detail is ruinously expensive. The art is putting low-cardinality counters in metrics and pushing high-cardinality detail into the pillars built for it.
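The arithmetic behind the explosion is a product. The sketch below computes an upper bound on series count from the number of distinct values per label; the cardinalities are assumed numbers for a hypothetical service, and real counts are usually lower because not every combination occurs.

```python
from math import prod


def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series count: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Assumed cardinalities for a hypothetical service.
low = {"method": 5, "status": 10, "route": 40, "region": 4}
high = {**low, "user_id": 1_000_000}

print(max_series(low))   # 8000
print(max_series(high))  # 8000000000
```

Adding one label with a million values multiplies the bound by a million, which is why a single `user_id` label is enough to cause trouble.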
Logs: timestamped event records
A log is a timestamped, immutable record of a discrete event — "this happened, at this time." 2026-06-24T10:03:11Z ERROR payment failed for order 8841: card declined. Logs are the oldest and most intuitive form of telemetry; every program can print one.
The make-or-break choice with logs is structure:
- Unstructured logs are free-form text lines. Human-readable, but a nightmare to query at scale — finding "all declined-card errors for orders over $100" means fragile text-parsing.
- Structured logs are emitted as machine-parseable key-value data, usually JSON:
{"level":"error","event":"payment_failed","order_id":8841,"reason":"card_declined","amount":142.00,"trace_id":"a1b2c3"}. Now you can filter and aggregate by any field precisely.
:::tip Always emit structured logs with a correlation ID
Two habits separate logs that help from logs that don't. First, structure them (JSON, consistent field names) so they're queryable. Second, include a correlation ID — the same trace_id on every log line for a given request — so you can pull all logs across all services for one user's failed checkout. Without correlation IDs, logs from a dozen services are an un-joinable pile. With them, logs become a pillar that snaps cleanly onto traces.
:::
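Both habits can be sketched with Python's standard `logging` module. The `JsonFormatter` class, the `fields` attribute name, and the service names are assumptions made for this example; production setups typically use a dedicated structured-logging library.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "service": record.name,
            "event": record.getMessage(),
        }
        # `fields` is a hypothetical attribute name, supplied through `extra=`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())


def service_logger(name: str) -> logging.Logger:
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger


checkout = service_logger("checkout")
payments = service_logger("payments")

checkout.info("checkout_started", extra={"fields": {"order_id": 8841, "trace_id": "a1b2c3"}})
checkout.info("checkout_started", extra={"fields": {"order_id": 8842, "trace_id": "d4e5f6"}})
payments.error("payment_failed", extra={"fields": {
    "order_id": 8841, "reason": "card_declined", "amount": 142.00, "trace_id": "a1b2c3"}})

# Each line is JSON, so filtering is a field comparison rather than text parsing.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
one_request = [r for r in records if r["trace_id"] == "a1b2c3"]
for r in one_request:
    print(r["service"], r["level"], r["event"])
```

Filtering on `trace_id` pulls the checkout and payments lines for one request and leaves the unrelated order out, which is the join that unstructured text cannot give you.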
Logs carry rich, high-cardinality detail (full order IDs, error messages, payloads), which is exactly what metrics can't afford. Their cost is the opposite: logging everything at high volume is expensive in storage and ingestion. The discipline is to log meaningful events, structured and at sensible levels, rather than firehosing every line of execution.
Traces: the path of one request
In a system of one service, a log timeline tells the story. But modern requests fan out across many services — checkout calls auth, calls inventory, calls payments, calls a database — and the slow or failing piece could be any of them. A trace is the end-to-end record of one request's journey across all the services it touches.
A trace is made of spans. A span is a single unit of work (one service handling its part of the request) with a start time, an end time, attributes, and a parent. The first span (the overall request) is the root; each downstream call becomes a child span. Stitched together by a shared trace ID, the spans form a tree that shows where the time went and where it broke.
Picture the trace of one slow checkout. At a glance: the request took 1.4s, and almost all of it was the payment service waiting on a slow database query. That's the power of tracing: it pinpoints which service and which operation is responsible, something neither a metric (just a number) nor scattered logs (no cross-service shape) shows you directly. How spans get linked across service boundaries (context propagation) and how you avoid storing every trace (sampling) are covered in the next lesson.
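A span tree like the one described can be sketched with a small data structure. The `Span` class, the span names, and every timing below are illustrative assumptions chosen to match the 1.4s checkout example; they are not output from a real tracing system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Hypothetical spans that would share one trace ID.
spans = [
    Span("s1", None, "checkout POST /checkout", 0, 1400),
    Span("s2", "s1", "auth verify", 10, 60),
    Span("s3", "s1", "inventory reserve", 70, 150),
    Span("s4", "s1", "payments charge", 160, 1390),
    Span("s5", "s4", "db query", 180, 1380),
]


def render(parent_id: Optional[str] = None, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, children ordered by start time."""
    lines = []
    children = sorted((s for s in spans if s.parent_id == parent_id),
                      key=lambda s: s.start_ms)
    for s in children:
        lines.append(f"{'  ' * depth}{s.name} ({s.duration_ms} ms)")
        lines.extend(render(s.span_id, depth + 1))
    return lines


tree = render()
print("\n".join(tree))

# The slowest span with no children is often a good first place to look.
parent_ids = {s.parent_id for s in spans}
slowest_leaf = max((s for s in spans if s.span_id not in parent_ids),
                   key=lambda s: s.duration_ms)
print(slowest_leaf.name)
```

Printed as an indented tree, the database query under the payments span accounts for 1200 of the 1400 ms, which is the kind of answer the tree shape makes visible.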
The fourth pillar: continuous profiling
Tracing tells you which service, and which operation in it, is slow. Continuous profiling answers the next question: which line of code is responsible. A profiler samples, many times a second, exactly where a running program is spending CPU (or allocating memory), that is, which functions are on the stack. Continuous profiling does this always, in production, at low overhead, so when a service is burning CPU you can see the exact function responsible, not in a lab but on the real running system. Tools like Pyroscope (now part of Grafana) and Parca popularized it; it's increasingly counted as the fourth pillar precisely because traces stop at the span, a named operation, while profiling goes inside the code.
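What a profiler does with its samples can be sketched as simple counting. The stack samples and function names below are made up, and this shows only the aggregation step, not how a real sampler captures stacks from a running process.

```python
from collections import Counter

# Hypothetical stack samples, outermost frame first.
samples = [
    ["main", "handle_request", "charge_card", "retry_loop"],
    ["main", "handle_request", "charge_card", "retry_loop"],
    ["main", "handle_request", "charge_card", "retry_loop"],
    ["main", "handle_request", "render_receipt"],
    ["main", "handle_request", "charge_card", "serialize"],
]

# "Self" count: the function actually executing when the sample was taken.
self_counts = Counter(stack[-1] for stack in samples)
# "Total" count: the function was somewhere on the stack.
total_counts = Counter(fn for stack in samples for fn in set(stack))

hottest, hits = self_counts.most_common(1)[0]
print(f"{hottest}: {hits / len(samples):.0%} of samples")
```

With enough samples, the share of samples in which a function appears approximates the share of CPU time it consumed; flame graphs are a visualization of counts like these.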
When to reach for which
The pillars aren't competitors; they answer different questions, and you correlate across them:
| Question | Pillar |
|---|---|
| Is the system healthy? Is a number trending badly? | Metrics (cheap, alertable) |
| What exactly happened in this event? What was the error? | Logs (rich, structured, per-event) |
| Where, across all my services, did this request slow down or fail? | Traces (cross-service path) |
| Which function/line is burning CPU or memory? | Profiling (inside one service) |
The real workflow chains them: a metric alert fires (latency SLO burning) → you open a trace of a slow request and see the payment service is the culprit → you read the logs for that trace's trace_id and find "card processor timeout" → a profile confirms the retry loop is pegging CPU. Each pillar hands off to the next via shared IDs. Making that hand-off seamless — one standard that emits all of them with consistent context — is the job of OpenTelemetry, next.
Common pitfalls
- Putting high-cardinality values in metric labels. user_id/request_id as labels is the cardinality bomb; that detail belongs in logs/traces. (The single most common metrics mistake.)
- Unstructured logs with no correlation ID. Free-form text from a dozen services that can't be joined; you can't reconstruct one request's story.
- No tracing in a microservices system. Without traces, "checkout is slow" is an unsolvable whodunit across a dozen services.
- Treating the pillars as either/or. They're complementary; the value is in correlating them, not picking one.
Why it matters
Telemetry comes in distinct shapes. Metrics are cheap numbers over time — perfect for dashboards and alerts, but you must keep their labels low-cardinality or you'll explode your TSDB. Logs are timestamped event records — richest when structured (JSON) and stamped with a correlation ID so they join across services. Traces follow one request across every service it touches, made of spans, pinpointing where time was spent or lost. Continuous profiling goes one level deeper, showing which function burns resources in production. None replaces the others; the skill is reaching for the right pillar and correlating across them via shared IDs. Next: how to emit all of them, vendor-neutrally, with one standard.