OpenTelemetry: instrument once, send anywhere
You know the pillars — metrics, logs, traces, profiles. The next question is brutally practical: how do you actually produce them, and how do you avoid chaining yourself to one vendor forever? For years the answer was "install your observability vendor's proprietary agent everywhere," which meant that switching vendors required re-instrumenting every service. The durable answer to that problem is OpenTelemetry. This lesson explains what it is, the three pieces you'll touch, and the cross-cutting concepts — context propagation and sampling — that make distributed tracing actually work.
The problem OpenTelemetry solves: instrumentation lock-in
To get telemetry out of your app, you have to instrument it — add code (or an agent) that emits spans, metrics, and logs. Historically every vendor had their own SDK and agent. Instrument with Vendor A's library, and your telemetry speaks Vendor A's dialect, hard-wired into every service. Want to switch to Vendor B because they're cheaper or better? Rip out and replace instrumentation across your entire codebase. That's vendor lock-in at the instrumentation layer, and it's a trap teams fall into constantly.
OpenTelemetry (OTel) breaks it. It's an open, vendor-neutral standard — a set of APIs, SDKs, and a wire protocol — for generating and exporting telemetry. The core promise:
:::tip The durable idea: separate producing telemetry from where it goes
With OpenTelemetry you instrument your code once, against an open standard, and decide separately and later which backend to send the data to: Datadog, Grafana, Honeycomb, New Relic, an open-source stack, or several at once. Producing telemetry is decoupled from consuming it. The backend becomes a swappable choice, not a rewrite. Instrumenting with OTel instead of a proprietary agent is how you avoid lock-in, and it's now the de facto industry standard: OpenTelemetry is widely reported as the second most active CNCF project, behind only Kubernetes.
:::
The three pieces you'll touch
1. The SDK (and API) — generating telemetry in your app
The OpenTelemetry API is the vendor-neutral interface your code calls to create spans, record metrics, and emit logs. The SDK is the implementation that does the work and exports the data. There are SDKs for every major language (Go, Java, Python, JavaScript, .NET, Rust…). You use them two ways:
- Manual instrumentation: you write the tracing code yourself (`tracer.start_span("charge_card")`, add attributes, record durations). Maximum control, more effort.
- Auto-instrumentation: a library or agent automatically wraps common frameworks (your web server, HTTP client, database driver) and emits spans without you writing tracing code. It's the fast on-ramp: you get traces of your inbound requests and outbound calls for free.
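The difference between the two styles can be sketched with a toy tracer. This is a standard-library stand-in, not the OpenTelemetry API: `start_span`, `auto_instrument`, `FINISHED_SPANS`, and the business rule inside `charge_card` are all hypothetical names invented for illustration.

```python
import functools
import time
from contextlib import contextmanager

# Toy stand-in for a tracer: finished spans are collected in a list
# instead of being exported. This is NOT the OpenTelemetry API.
FINISHED_SPANS = []


@contextmanager
def start_span(name, **attributes):
    """Time the enclosed block and record it as a span dict."""
    span = {"name": name, "attributes": dict(attributes)}
    start = time.perf_counter()
    try:
        yield span
    finally:
        span["duration_s"] = time.perf_counter() - start
        FINISHED_SPANS.append(span)


# Manual instrumentation: the business code names the span and attaches
# the attributes it cares about.
def charge_card(amount_cents):
    with start_span("charge_card", amount_cents=amount_cents) as span:
        approved = amount_cents < 100_000  # hypothetical business rule
        span["attributes"]["approved"] = approved
        return approved


# Auto-instrumentation: a wrapper adds a span around existing code without
# editing its body, much as an agent wraps framework entry points.
def auto_instrument(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with start_span(func.__name__):
            return func(*args, **kwargs)
    return wrapper


def fetch_user(user_id):  # untouched application code
    return {"id": user_id}


fetch_user = auto_instrument(fetch_user)

charge_card(2_500)
fetch_user(42)
print([span["name"] for span in FINISHED_SPANS])
```

The manual span carries business attributes (`amount_cents`, `approved`) that the wrapper cannot know about, which is the usual argument for mixing both styles.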
2. OTLP — the standard wire protocol
OTLP (OpenTelemetry Protocol) is the standard format and protocol OTel uses to transmit telemetry. Because it's standardized, your app, the Collector, and backends all speak the same language. "Emit OTLP" is the lingua franca that makes the swap-the-backend promise real.
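To make the wire format a little more concrete, here is a minimal sketch of a single-span trace export request in OTLP's JSON encoding, built with only the standard library. The service, scope, and span names are made up for illustration, and real SDKs build and send this for you (usually as protobuf over gRPC or HTTP), so treat the field layout as something to check against the OTLP specification, not as a reference.

```python
import json
import secrets
import time


def otlp_span_payload(service_name, span_name, duration_ms):
    """Build a one-span trace export request in OTLP's JSON encoding."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            # The resource describes who produced the telemetry.
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes as 32 hex chars
                    "spanId": secrets.token_hex(8),    # 8 bytes as 16 hex chars
                    "name": span_name,
                    "kind": 2,  # SPAN_KIND_SERVER in the OTLP enum
                    # 64-bit integers travel as strings in JSON.
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


payload = otlp_span_payload("checkout", "charge_card", duration_ms=12)
print(json.dumps(payload, indent=2))
```

A payload like this would typically be POSTed to a Collector's OTLP/HTTP traces endpoint; the sketch deliberately stops short of sending anything.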
3. The Collector — the swappable seam in the middle
The OpenTelemetry Collector is a standalone service that sits between your apps and your backends. Your apps send telemetry (over OTLP) to the Collector; the Collector receives, processes, and exports it onward. It's the piece that makes vendor-neutrality operationally real:
- Receivers take telemetry in (OTLP and many other formats).
- Processors transform it in flight — batch it, drop noisy attributes, sample traces, and crucially strip high-cardinality labels before they hit a metrics backend (the cardinality control from lesson 6.2, enforced centrally).
- Exporters send it out to one or more backends.
Want to change backends? Reconfigure the Collector's exporter — one config change — and not a line of application code moves. That's the seam doing its job.
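The receive, process, export flow can be modeled in a few lines of plain Python. This is only a conceptual sketch: the real Collector is configured declaratively, and `drop_attributes`, `list_exporter`, `run_pipeline`, and the two in-memory "backends" are hypothetical names used for illustration.

```python
def drop_attributes(*keys):
    """Processor: strip the named (for example high-cardinality) attributes."""
    def process(batch):
        return [
            {**span, "attributes": {k: v for k, v in span["attributes"].items()
                                    if k not in keys}}
            for span in batch
        ]
    return process


def list_exporter(sink):
    """Exporter: 'send' a batch to a backend, modeled here as a list."""
    def export(batch):
        sink.extend(batch)
    return export


def run_pipeline(received, processors, exporters):
    """Pass a received batch through each processor, then to every exporter."""
    batch = list(received)
    for process in processors:
        batch = process(batch)
    for export in exporters:
        export(batch)
    return batch


# What a receiver might hand over: two spans carrying a per-user attribute.
received = [
    {"name": "GET /cart", "attributes": {"http.status_code": 200, "user_id": "u-981"}},
    {"name": "GET /cart", "attributes": {"http.status_code": 500, "user_id": "u-412"}},
]

backend_a, backend_b = [], []
config = {
    "processors": [drop_attributes("user_id")],
    "exporters": [list_exporter(backend_a)],
}
run_pipeline(received, **config)

# Adding a second backend touches only the exporter list, not the "apps".
config["exporters"] = [list_exporter(backend_a), list_exporter(backend_b)]
run_pipeline(received, **config)

print(len(backend_a), len(backend_b))
```

The design point is that the code producing `received` never changes: the choice of backends, and the cardinality policy, live entirely in `config`.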
:::note Where the named tools fit: one sentence each
You'll hear a zoo of names; here's the map, with a note on when and why you'd reach for each rather than a bare list.
Metrics stores: Prometheus is the de facto standard time-series database and scraper for cloud-native metrics, so start here. VictoriaMetrics is a Prometheus-compatible drop-in replacement, generally pitched as faster and more storage-efficient, and often chosen when Prometheus's resource use hurts.
Logs: Loki is the lightweight, Prometheus-style log store (indexes labels, not full text) you pick when you want cheap logs that join cleanly to your metrics; Elastic/ELK is the heavyweight full-text search stack for when you need rich log querying and don't mind the cost.
Traces: Tempo is the cheap, object-storage-backed trace store designed to pair with Grafana; Jaeger is the long-standing CNCF tracing backend with its own UI, a common standalone choice.
Profiles: Pyroscope (Grafana) and Parca store the continuous-profiling data from lesson 6.2.
Zero-code collection: eBPF/Pixie (and Grafana Beyla) sit at the kernel to produce metrics and traces with no app changes — reach for them for instant, broad coverage you don't have to instrument.
Query / pane of glass: Grafana is the open-source dashboarding layer that unifies metrics, logs, and traces; Kibana is the equivalent over the Elastic stack.
All-in-one commercial (buy, don't assemble): Datadog and New Relic are full suites covering all pillars; Honeycomb specializes in high-cardinality, wide-event debugging and exploratory querying; Sentry focuses on error/exception tracking. You pay to skip operating the stack yourself.
The point of OpenTelemetry is that your instrumentation doesn't care which of these you pick: you emit OTLP and route it wherever.
:::
The scaling tier: long-term storage for metrics at scale
There's a gap the list above glosses over, and it bites every team that grows. A single Prometheus server is brilliant but has two hard limits: it's a single node (one machine's CPU, memory, and disk cap how many series it can hold), and it keeps only local, short-term retention (weeks, not years — its storage isn't built to be your durable archive). So the moment you have many clusters, high cardinality, or a need to query last year's data for capacity planning or an SLO audit, a lone Prometheus simply can't hold it.
That specific problem — scale Prometheus horizontally and store its metrics durably for the long term — is what a whole tier of tools exists to solve. They sit behind Prometheus, ingesting its data and fronting it with a single, scalable, long-retention query layer backed by cheap object storage (S3/GCS):
- Thanos: bolts onto existing Prometheus servers (via a sidecar), giving you a global query view across many Prometheus instances plus effectively unlimited retention in object storage. Pick it when you already run Prometheus everywhere and want to federate and archive those servers.
- Mimir (Grafana) — a from-the-ground-up, horizontally-scalable Prometheus-compatible backend built for very high cardinality and huge series counts; pick it for a single massively-scalable metrics store.
- Cortex — the older CNCF multi-tenant, horizontally-scalable Prometheus backend that Mimir descends from; still seen in multi-tenant platform setups.
- VictoriaMetrics — doubles as this tier: its clustered mode is a scalable, long-retention metrics store as well as a Prometheus replacement.
:::tip The durable idea, not the tool
The lesson that outlives all four names: Prometheus is single-node with short retention by design, so at scale you put a horizontally-scalable, object-storage-backed long-term store behind it. Whether that store is Thanos, Mimir, Cortex, or VictoriaMetrics is a dated, swappable choice; needing a scaling-and-retention tier once one Prometheus stops fitting is the durable architectural fact.
:::
Making distributed tracing real: context propagation
A trace spans many services (lesson 6.2). For their spans to be stitched into one tree, every service must know which trace it is part of and which span called it. That's trace context propagation: when service A calls service B, it passes the trace ID and its current span ID along, by convention in HTTP request headers (the W3C `traceparent` header is the standard). Service B reads those headers, makes its spans children of A's span, and passes the context on to whatever it calls. Lose propagation at any hop (a service that doesn't forward the header, for example) and the trace breaks into disconnected fragments. OTel's auto-instrumentation handles propagation for common frameworks automatically, which is a big reason to use it.
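As a rough illustration of what travels in that header, the sketch below builds and parses W3C `traceparent` values, which have the shape `version-traceid-parentid-flags` in lowercase hex. The function names are hypothetical, and in practice the OTel SDK's propagators do this for you; the example header is the one used in the W3C Trace Context specification.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled=True):
    """Start a brand-new trace with a random trace ID and span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"version": version, "trace_id": trace_id,
            "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Header for an outbound call: same trace ID, this service's new span ID."""
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()  # propagation was lost, so a new trace starts
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(parse_traceparent(incoming)["trace_id"] == parse_traceparent(outgoing)["trace_id"])
```

The fallback branch in `child_traceparent` is the "broken trace" case from the paragraph above: when the header is missing or invalid, the downstream spans end up in a separate trace.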
Not storing everything: sampling
A busy system produces millions of traces. Storing them all is expensive and mostly pointless, because most requests are boring successes. Sampling keeps a representative or interesting subset. There are two fundamentally different strategies, and choosing the wrong one is a classic mistake:
- Head sampling — decide at the start of a request whether to keep its trace, before you know how it turns out (e.g. "keep 1%"). Cheap and simple, but blind: it can't keep the rare error trace because it decided before the error happened. Set the rate low and you'll routinely drop the very traces you needed.
- Tail sampling — buffer the trace until it completes, then decide using the outcome: keep all errors and slow requests, sample down the boring fast successes. Far smarter — you keep what matters — but it requires holding traces in memory until they finish (often done in the Collector), which costs more resources.
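The contrast between the two strategies shows up clearly in a small simulation. This is a sketch under made-up assumptions: the 1% error rate, the latency distribution, the 500 ms "slow" threshold, and the function names are all invented, and real head samplers usually decide from the trace ID instead of a fresh random draw.

```python
import random


def make_traces(n, error_rate, rng):
    """Synthetic completed traces with an error flag and a duration."""
    return [
        {"id": i,
         "error": rng.random() < error_rate,
         "duration_ms": rng.expovariate(1 / 80)}  # mean of roughly 80 ms
        for i in range(n)
    ]


def head_sample(traces, rate, rng):
    """Decide up front, without looking at how the request turned out."""
    return [t for t in traces if rng.random() < rate]


def tail_sample(traces, baseline_rate, slow_ms, rng):
    """Decide after completion: keep errors and slow traces, sample the rest."""
    kept = []
    for t in traces:
        interesting = t["error"] or t["duration_ms"] >= slow_ms
        if interesting or rng.random() < baseline_rate:
            kept.append(t)
    return kept


def count_errors(traces):
    return sum(t["error"] for t in traces)


rng = random.Random(7)  # fixed seed so runs are repeatable
traces = make_traces(10_000, error_rate=0.01, rng=rng)
head = head_sample(traces, rate=0.01, rng=rng)
tail = tail_sample(traces, baseline_rate=0.01, slow_ms=500, rng=rng)

print("errors produced:      ", count_errors(traces))
print("errors kept (head 1%):", count_errors(head))
print("errors kept (tail):   ", count_errors(tail))
```

By construction the tail sampler keeps every error trace, while the head sampler keeps an error only when its up-front coin flip happens to land on one. The cost, not modeled here, is that the tail sampler needs every trace held in memory until it completes.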
:::danger Decide your sampling strategy on purpose
The two failure modes are symmetric. Sample everything and your tracing bill is enormous. Sample naively at the head and you drop the error and outlier traces, so when an incident hits, the trace you desperately want was thrown away. The durable answer for most teams is tail-based sampling that always keeps errors and high-latency traces and downsamples the rest. "No sampling strategy" is not a neutral default: it silently lands you in one of the two ditches.
:::
Zero-code observability: eBPF auto-instrumentation
Even auto-instrumentation libraries require adding a dependency or agent to each app. A newer approach goes further: eBPF. eBPF is a Linux kernel technology that safely runs sandboxed programs inside the kernel, where it can observe network calls and function timing from outside the application process entirely. Tools like Grafana Beyla and Pixie use eBPF to produce metrics and traces with zero code changes and no redeploy — you don't touch the app at all. The trade-off: it sees the system from the kernel's vantage point (great for RED-style request metrics and service maps) but has less insight into in-app business context than SDK instrumentation. In practice teams combine them: eBPF for broad zero-effort coverage, OTel SDK for the deep, business-aware spans that matter.
Observability for AI/LLM workloads
A fast-growing 2026 frontier: applications built on large language models need their own telemetry. Beyond ordinary latency and errors, you want to track tokens consumed (input/output), cost per request (tokens × price), time-to-first-token, model and prompt version, cache hit rates, and quality signals. OpenTelemetry has emerging semantic conventions for generative-AI so these become standardized span attributes rather than ad-hoc fields — meaning you instrument LLM calls the same vendor-neutral way you instrument a database call. If you're operating anything LLM-powered, treat tokens-latency-cost as first-class telemetry, not an afterthought. (The economics side of this returns in Chapter 10, MLOps/LLMOps.)
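As a sketch of what tokens-latency-cost telemetry might look like on a single LLM call, the function below assembles span attributes and derives cost as tokens × price. The `gen_ai.*` keys are modeled on OpenTelemetry's still-evolving generative-AI semantic conventions and should be checked against the current spec; the `app.llm.*` keys, the model name, and the prices are hypothetical.

```python
# Hypothetical price table, in USD per one million tokens.
PRICE_PER_1M_TOKENS_USD = {
    "example-model": {"input": 3.00, "output": 15.00},
}


def llm_span_attributes(model, input_tokens, output_tokens,
                        time_to_first_token_s, prompt_version):
    """Attributes to attach to the span that wraps one LLM request."""
    price = PRICE_PER_1M_TOKENS_USD[model]
    # cost = tokens x price, computed separately for input and output tokens
    cost_usd = (input_tokens * price["input"]
                + output_tokens * price["output"]) / 1_000_000
    return {
        # Names modeled on the draft gen-AI semantic conventions.
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Hypothetical application-specific attributes.
        "app.llm.time_to_first_token_s": time_to_first_token_s,
        "app.llm.prompt_version": prompt_version,
        "app.llm.cost_usd": round(cost_usd, 6),
    }


attrs = llm_span_attributes("example-model", input_tokens=1200, output_tokens=300,
                            time_to_first_token_s=0.42, prompt_version="v7")
for key, value in attrs.items():
    print(f"{key} = {value}")
```

Recording cost as a span attribute, next to the token counts that produced it, lets you slice spend by model or prompt version with the same queries you already use for latency.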
Why it matters
Historically, instrumenting with a vendor's proprietary agent locked your telemetry to that vendor — switching meant re-instrumenting everything. OpenTelemetry ends that: instrument once against an open standard, then choose your backend separately and swap it freely. You'll touch three pieces — the SDK/API (generate telemetry, manually or via auto-instrumentation), OTLP (the standard wire protocol), and the Collector (the swappable seam that receives, processes, samples, strips cardinality, and exports). Distributed tracing only works if you propagate trace context across every hop, and you keep costs sane with a deliberate sampling strategy — ideally tail-based, always keeping errors and outliers. eBPF tools push instrumentation to zero code, and LLM workloads get first-class tokens-latency-cost telemetry. With telemetry flowing, the next question is what to do with it — starting with defining what "reliable enough" even means.