View on GitHub

nikogura.com

Thoughts, opinions, and occasionally rantings of a passionate technologist.

Distributed Tracing: A Practical Guide

What Distributed Tracing Is

Distributed tracing captures the complete journey of a single request as it passes through multiple services. A trace is composed of spans — each span represents a discrete unit of work (an HTTP call, a database query, a queue publish). Spans carry a shared Trace ID (128-bit, globally unique) and parent-child relationships, so you can reconstruct the full causal chain: “this API call triggered that service call, which triggered that database query, which took 800ms and is why the user saw a 2-second response.”

The mechanism is context propagation — trace context (Trace ID, Span ID, sampling flags) travels across service boundaries via HTTP headers (traceparent / W3C Trace Context standard), gRPC metadata, or message queue headers.

What It Does

What It Does Not Do Well

The Sampling Problem

This is the fundamental tension. Capturing every request is not feasible at scale (Alibaba generates ~20 PB of trace data per day). So you sample, and every sampling strategy has tradeoffs:

The paradox: the traces you most need to see (errors, rare edge cases) are exactly the ones most likely to be dropped by sampling.


Tracing vs Metrics vs Logs vs Events

Signal Question Data Type Cost Best For
Metrics What is happening? Numeric, aggregated Low Alerting, dashboards, SLOs, capacity planning
Logs Why did it happen? Text/structured records High Detailed debugging, audit trails, compliance
Traces Where did it happen? Spans & relationships Medium Cross-service latency, dependency mapping, error propagation
Events What changed? Structured occurrences Varies Deployment tracking, config changes, incident correlation

The diagnostic workflow: Metrics surface the symptom (error rate spike). Traces narrow it to a specific operation (the database call in service X). Logs explain why (the specific SQL error, the malformed input). Events provide context (a deployment happened 5 minutes before the spike).

None is sufficient alone. Correlation across all four is the goal. OpenTelemetry unifies the first three under one framework with shared context (Trace ID, resource attributes). The Grafana stack (Loki + Tempo + Mimir + Grafana) and Datadog both implement cross-signal correlation in the UI.


How Traces Get Produced

Traces don’t appear by magic. Something has to produce the spans, and something has to collect them. There are three levels of instrumentation, each producing spans at different depths.

Level 1: Infrastructure (Mesh/Proxy)

Service meshes like Istio use Envoy sidecar proxies that sit in the data path and generate spans for every network hop. Entry/exit spans at service boundaries — HTTP method, status code, latency, upstream/downstream service names. No code changes required.

But Envoy is a proxy. It sees network traffic. It does not see what happens inside the application between receiving a request and sending one out. It observes the envelope, not the letter.

Critical caveat: Envoy generates spans per hop, but it cannot correlate an outbound request to the inbound request that caused it. The application must propagate trace headers from incoming to outgoing requests. Without this, you get isolated per-hop spans, not connected end-to-end traces. This is the single most common problem teams encounter when setting up tracing in a service mesh.

Level 2: Auto-Instrumentation (OTel SDK, Zero-Code)

OpenTelemetry provides auto-instrumentation agents that hook into well-known frameworks and libraries at runtime. In Java, it’s a -javaagent JVM flag. In Python, it’s opentelemetry-instrument wrapping the process. In Go, it’s compile-time instrumentation libraries (Go doesn’t have a runtime agent model).

What auto-instrumentation captures depends on the language and libraries in use, but typically:

This is where most of the practical value comes from. Auto-instrumentation covers the common I/O boundaries without touching application code. It also handles context propagation automatically — the trace headers get forwarded from incoming to outgoing requests, which is exactly what the mesh layer alone does not do.

Level 3: Manual Instrumentation (Custom Spans)

For anything auto-instrumentation doesn’t cover — business logic, internal algorithms, conditional branches, custom processing steps — there’s code to write:

ctx, span := tracer.Start(ctx, "processOrder",
    trace.WithAttributes(
        attribute.String("order.id", orderID),
        attribute.Int("order.items", len(items)),
    ),
)
defer span.End()

// ... business logic ...

if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, "order processing failed")
}

This is the only way to get:

The Practical Stack

In most production setups, all three layers combine:

[Mesh/Proxy spans]     ← Envoy, service boundary, automatic
        +
[Auto-instrumented]    ← OTel SDK, library-level I/O, near-automatic
        +
[Manual spans]         ← Application code, business logic, requires effort
        =
[Complete trace]       ← The full picture from ingress to database and back
Level Effort Coverage
Mesh only (Envoy) Zero code changes Service boundary hops only. No internal visibility. Broken traces without header propagation.
Auto-instrumentation SDK dependency + agent flag + env vars HTTP, gRPC, database, cache, queue spans. Header propagation handled. Solid coverage for most services.
Manual instrumentation Code per operation Business logic, custom attributes, full internal visibility. The only way to get domain-specific context.

Most teams start with auto-instrumentation (80% of the value for 5% of the effort) and add manual spans selectively where deeper visibility is needed.


How Traces Get Collected

Once spans are produced, they need to reach a backend. The collection pipeline:

Application (OTel SDK)
      │
      │ OTLP (gRPC :4317 or HTTP :4318)
      ▼
OTel Collector (agent mode, DaemonSet or sidecar)
      │
      │ batch, filter, enrich (add k8s metadata, etc.)
      ▼
OTel Collector (gateway mode, optional)
      │
      │ tail sampling, routing, fan-out
      ▼
Backend (Tempo, Jaeger, Datadog, etc.)

Export from the Application

The OTel SDK batches completed spans in memory and exports them periodically (default: every 5 seconds or when the batch hits 512 spans). The export target is configured via environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=checkout-service
OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=prod,k8s.pod.name=$(POD_NAME)

The SDK handles batching, retry on failure, and back-pressure (drops spans if the export queue is full, rather than blocking the application).

Collector Agent (DaemonSet)

In Kubernetes, the typical pattern is an OTel Collector DaemonSet — one per node. Every pod on that node exports to the local collector. The agent:

Collector Gateway (Optional)

For larger deployments, a centralized gateway collector handles:

Backend Ingestion

The backend receives spans over OTLP and writes them to storage. From here it’s the backend’s problem — Tempo writes Parquet blocks to S3, Jaeger indexes into Elasticsearch, etc.

In a service mesh deployment, the Envoy sidecar acts as its own span producer and exports spans directly to the tracing backend without going through the application’s OTel SDK. The application’s SDK-produced spans and Envoy’s proxy-produced spans share the same Trace ID (assuming header propagation is working), so the backend stitches them together into one trace.


OpenTelemetry (OTel)

The de facto industry standard as of 2025. Second most active CNCF project after Kubernetes. All three core signals (traces, metrics, logs) are now stable. Semantic Conventions 1.0 shipped in 2025, standardizing attribute names across all languages and exporters.

Signal Maturity

Signal Status Notes
Traces Stable SDKs are v1.0+ across major languages
Metrics Stable Data model released as part of OTLP
Logs Stable Log Bridge API for existing frameworks
Profiling In Development Will support bi-directional links with traces/metrics/logs

Key Components

Collector Architecture

[Receivers] --> [Processors] --> [Exporters]

Receivers (data ingress): OTLP (gRPC/HTTP), Prometheus scrape, Kafka, Jaeger, Zipkin, Fluent Forward, and many more.

Processors (transformation): Batch (groups telemetry for efficient export), Memory Limiter (prevents OOM), Attributes (add/modify/delete span attributes), Filter (drop unwanted data), Tail Sampling (sampling decisions after seeing complete traces).

Exporters (data egress): OTLP (to any compatible backend), Prometheus Remote Write, Debug (stdout), plus backend-specific exporters for Jaeger, Zipkin, Datadog, New Relic, etc.

The instrumentation layer is settled. The choice is in the backend.


Jaeger

History and Current State

Open-sourced by Uber in 2015, CNCF graduated project. Jaeger v2 shipped November 2024 (current: v2.13). Jaeger v1 reaches end-of-life December 31, 2025.

The v2 Rewrite

The defining change: Jaeger v2 is built on top of the OpenTelemetry Collector framework. The Jaeger binary directly imports OTel Collector code as a library. It’s not a fork — it’s a customized OTel Collector distribution with Jaeger’s storage backends and UI.

What this means practically:

Deployment Roles

The single binary runs in different roles:

Storage Backends

Backend Notes
Elasticsearch 7.x/8.x Best query performance. Recommended for most deployments.
OpenSearch 1.0+ Drop-in Elasticsearch alternative
Cassandra 4.0+ Good for write-heavy workloads, limited analytics
ClickHouse Becoming first-class. Column-oriented, superior for analytics on trace data.
Kafka Buffering layer for durability and spike absorption, not storage itself

Strengths

Weaknesses


Grafana Tempo

Philosophy

Fundamentally different design from Jaeger: no indexing, object storage only.

Tempo stores traces as Parquet blocks in S3/GCS/Azure Blob. No Elasticsearch. No Cassandra. No database to operate. Object storage is cheap enough to store 100% of traces without sampling.

Architecture

Storage Backends

TraceQL

Originally Tempo was trace-ID-lookup only — finding a trace required knowing its ID, which meant discovering traces through correlated logs (Loki) or metric exemplars (Prometheus/Mimir).

That’s no longer the case. TraceQL is Tempo’s query language:

{ span.http.status_code = 500 }
{ span.http.method = "GET" && duration > 2s }
{ resource.service.name = "checkout" && span.db.system = "postgresql" }

TraceQL Metrics (public preview) can create aggregate metrics from traces, similar to how LogQL creates metrics from logs.

Current Versions

The Grafana Stack Integration

Tempo’s real power is in the integrated stack:


Jaeger vs Tempo: Head to Head

  Jaeger Tempo
Storage Elasticsearch, Cassandra, ClickHouse Object storage (S3, GCS, Azure)
Indexing Full indexing No traditional indexing; Parquet blocks
Sampling Typically 1-10% (storage cost pressure) Designed for 100% (storage is cheap)
Query Rich indexed search from day one TraceQL (newer, catching up)
UI Built-in Jaeger UI Grafana
Operational burden Higher (database clusters to manage) Lower (object storage, no indexes)
Ecosystem Standalone / OTel Deep Grafana stack integration
Cost at scale Higher (indexed storage is expensive) Lower (object storage is cheap)

Migration trend: Red Hat published Jaeger-to-Tempo migration guidance in April 2025 as OpenShift deprecated the Jaeger-based tracing platform. Both accept OTLP, so the data pipeline doesn’t change — only the backend.


Istio / Service Mesh Integration

Service meshes provide tracing with near-zero application changes, but the coverage has important limits.

What the Mesh Provides Automatically

What Requires Manual Work

Capability Automatic (mesh) Manual (application)
Span generation at service boundary Yes
Latency measurement per hop Yes
Service dependency graph Yes
End-to-end trace correlation No Header propagation required
Application-internal spans No Full SDK instrumentation
Business context attributes No Custom attributes via SDK
Database query tracing No Library instrumentation

The recommended approach for header propagation is OTel SDK auto-instrumentation, which handles it transparently. The alternative is manual middleware that copies headers from incoming to outgoing requests.


Other Tracing Systems