OpenTelemetry is the boring tool I love

Why I install it on day one, every time.

There are tools you pick after a long comparison and tools you pick without thinking. OpenTelemetry is in the second bucket for me.

It isn't exciting, and nobody wants to demo it. Even so, every service I've owned has had it wired in by the end of the first week, and I've never regretted that choice.

What it is

OpenTelemetry is a CNCF project that standardizes how applications emit telemetry (traces, metrics, logs) in a vendor-neutral format. Two parts are worth caring about: an API and SDK you use in your code to create spans and record metrics, and a separate collector process that receives the data and forwards it to whatever backend you use.

The wire format is OTLP, which most observability backends accept directly: Honeycomb, Grafana Tempo, Datadog, New Relic, Jaeger. If one of them turns out to be the wrong choice for you later, you swap the collector's exporter and keep every line of instrumentation you've already written. That property is the reason to use it.

Why day one

Retrofitting observability is expensive, not in CPU but in attention. By the time you need traces you're usually in an incident, and the parts of the code you need to instrument are the parts you're trying hardest not to touch.

Installing OpenTelemetry while the service is still small changes that math. Auto-instrumentation covers the common cases for free: the Python, Java, and Node ecosystems ship instrumentation libraries that wrap HTTP clients, database drivers, and web frameworks, so most useful signals show up without writing a single span by hand. The spans you do write are cheap. A with tracer.start_as_current_span("validate_payload"): block is one line and zero cognitive overhead at the time you add it. Starting early also avoids the "we'll add telemetry later" pattern, which in practice means "we added it during the first bad incident, poorly."

On services I work on today, traces routinely catch problems I would never have guessed at. A third-party gRPC client doing a synchronous DNS lookup on every call, a Redis read that turned into a per-request N+1 after some harmless-looking refactor, a JSON encode blocking the event loop on a hot path. None of it was code I'd written recently, and all of it was visible within ten minutes of reading a flame graph.

A minimal setup

For a Python service, a working install is roughly four things:

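Add the SDK, the OTLP exporter, and instrumentation packages for your framework and drivers (opentelemetry-sdk, opentelemetry-exporter-otlp, opentelemetry-instrumentation-fastapi, plus the instrumentation packages for the client libraries you use).
Configure the OTLP exporter to point at a local collector (OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317; 4317 is the collector's default OTLP/gRPC port, and OTLP over HTTP uses 4318).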
Run the OpenTelemetry Collector next to the service: a sidecar, a DaemonSet, or a plain process on the host.
Point the collector at the backend of your choice.

That's the whole setup. No vendor agent and no SaaS account required to start.

The collector is the part most teams under-invest in. It's where you sample, redact sensitive attributes, batch exports, and hedge against backend outages. Running it as a separate process also means the service isn't shipping telemetry directly over the internet, which matters for reliability and for compliance.

The parts that matter

The semantic conventions are the obvious one. Standard attribute names for HTTP, databases, messaging, and cloud providers (http.request.method, db.system.name, messaging.destination.name; older releases of the conventions used http.method, db.system, messaging.destination) mean the same query works across every service in the fleet. That's the payoff when an organization adopts OpenTelemetry as a whole rather than each team inventing its own field names.
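
To make the "same query everywhere" point concrete, here is a toy sketch in plain Python. The span dictionaries, service names, and the spans_with_method helper are all hypothetical; a real backend would run the equivalent filter in its own query language.

```python
from typing import Dict, List

# Attribute key from the HTTP semantic conventions.
# Older releases of the conventions used "http.method".
HTTP_METHOD = "http.request.method"

# Hypothetical exported spans from three different services.
SPANS: List[Dict] = [
    {"service": "checkout", "name": "POST /orders", "attributes": {HTTP_METHOD: "POST"}},
    {"service": "billing", "name": "POST /charges", "attributes": {HTTP_METHOD: "POST"}},
    {"service": "catalog", "name": "GET /items", "attributes": {HTTP_METHOD: "GET"}},
    {"service": "catalog", "name": "render_template", "attributes": {}},
]


def spans_with_method(spans: List[Dict], method: str) -> List[Dict]:
    """One filter that works for every service, because the key is shared."""
    return [s for s in spans if s["attributes"].get(HTTP_METHOD) == method]


post_services = sorted(s["service"] for s in spans_with_method(SPANS, "POST"))
```

If checkout had called the field method and billing had called it verb, that one filter would have needed a special case per team.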

Context propagation matters almost as much. When a request arrives with a W3C traceparent header, the service continues the trace and its downstream calls inherit it. The result is end-to-end traces across service boundaries with no additional plumbing, provided every service is instrumented.
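
The header itself is simple enough to sketch with the standard library. This is an illustration of the W3C Trace Context format (version 00), not how the SDK's propagators are implemented; the SpanContext tuple and both function names are invented for the example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the incoming context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero trace or span IDs are invalid in the spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for a downstream call: same trace ID, fresh span ID."""
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{new_span_id}-{flags}"
```

Parsing 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 and passing the result to child_traceparent yields a header with the same trace ID and a new span ID, which is all "inheriting the trace" means on the wire. In a real service the instrumentation libraries do this for you.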

The thing I keep coming back to is tail-based sampling in the collector. Because the decision is made after a trace has finished, you can keep every slow request and drop most fast ones. That's the right default for a production service, and it's a collector config change rather than a code change.
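
The decision rule is easy to sketch in a few lines. To be clear, this is not the collector's implementation, and in practice you configure this in the collector rather than write it yourself. The Span class, the 500 ms threshold, and the 5% baseline rate are assumptions picked for the example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    name: str
    duration_ms: float


def keep_trace(
    spans: List[Span],
    slow_ms: float = 500.0,       # assumed latency threshold
    baseline_rate: float = 0.05,  # assumed share of fast traces to keep
    rng: random.Random = random.Random(),
) -> bool:
    """Decide whether to keep a finished trace, given all of its spans."""
    if not spans:
        return False
    # The slowest span bounds the trace's duration from below.
    slowest = max(span.duration_ms for span in spans)
    if slowest >= slow_ms:
        return True  # keep every slow request
    # Keep a small random sample of fast requests for a baseline.
    return rng.random() < baseline_rate
```

The point of the sketch is the input: keep_trace sees the whole finished trace, which is exactly what a head-based sampler inside the service cannot do.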

What it doesn't solve

OpenTelemetry is a data-plane standard. It won't tell you what to alert on or which questions to ask during an incident. If a team has no clear opinions about its SLIs, OpenTelemetry won't rescue the post-mortem on its own.

The logging side is also less mature than traces and metrics. For a new service today I still route structured logs through the language's standard logger and let the collector pick them up, rather than using the OpenTelemetry Logs API directly. That'll probably change over the next few releases.

Closing

OpenTelemetry is the kind of tool that goes invisible once it's working. Install it on day one and spend an afternoon getting the collector right. The receipts will be there when the first bad incident shows up.
