Distributed Tracing: A Practical Guide

What Distributed Tracing Is

Distributed tracing captures the complete journey of a single request as it passes through multiple services. A trace is composed of spans — each span represents a discrete unit of work (an HTTP call, a database query, a queue publish). Spans carry a shared Trace ID (128-bit, globally unique) and parent-child relationships, so you can reconstruct the full causal chain: “this API call triggered that service call, which triggered that database query, which took 800ms and is why the user saw a 2-second response.”

The mechanism is context propagation — trace context (Trace ID, Span ID, sampling flags) travels across service boundaries via HTTP headers (traceparent / W3C Trace Context standard), gRPC metadata, or message queue headers.

What It Does

Latency analysis: Which service or operation is the bottleneck in a multi-service request
Error propagation: Where an error originated and how it cascaded through services
Dependency mapping: Which services call which, and how deeply
Root cause analysis: From “something is slow” to “this specific database query in service X is slow when called by service Y”

What It Does Not Do Well

Aggregate system health — metrics are better for “is the system healthy right now?”
Deep context on a single event — logs are better for “what exactly went wrong in this function?”
Long-running processes — traces are designed for request/response patterns; batch jobs and stream processing are awkward to model
Complete coverage — in practice, not every service is instrumented, not every library supports propagation, gaps in traces are common
Business analytics — traces capture technical operations, not business events

The Sampling Problem

This is the fundamental tension. Capturing every request is not feasible at scale (Alibaba generates ~20 PB of trace data per day). So you sample, and every sampling strategy has tradeoffs:

Head-based sampling (decide at entry): Simple, low-overhead, but the decision is made before knowing if the request will be interesting. Errors and anomalies in the unsampled majority are invisible.
Tail-based sampling (decide after trace completes): Can keep all errors and slow traces. But requires buffering all spans in memory before the decision — expensive, and hard to determine when a trace is “complete” since spans arrive out of order.

The paradox: the traces you most need to see (errors, rare edge cases) are exactly the ones most likely to be dropped by sampling.

Tracing vs Metrics vs Logs vs Events

Signal	Question	Data Type	Cost	Best For
Metrics	What is happening?	Numeric, aggregated	Low	Alerting, dashboards, SLOs, capacity planning
Logs	Why did it happen?	Text/structured records	High	Detailed debugging, audit trails, compliance
Traces	Where did it happen?	Spans & relationships	Medium	Cross-service latency, dependency mapping, error propagation
Events	What changed?	Structured occurrences	Varies	Deployment tracking, config changes, incident correlation

The diagnostic workflow: Metrics surface the symptom (error rate spike). Traces narrow it to a specific operation (the database call in service X). Logs explain why (the specific SQL error, the malformed input). Events provide context (a deployment happened 5 minutes before the spike).

None is sufficient alone. Correlation across all four is the goal. OpenTelemetry unifies the first three under one framework with shared context (Trace ID, resource attributes). The Grafana stack (Loki + Tempo + Mimir + Grafana) and Datadog both implement cross-signal correlation in the UI.

How Traces Get Produced

Traces don’t appear by magic. Something has to produce the spans, and something has to collect them. There are three levels of instrumentation, each producing spans at different depths.

Level 1: Infrastructure (Mesh/Proxy)

Service meshes like Istio use Envoy sidecar proxies that sit in the data path and generate spans for every network hop. Entry/exit spans at service boundaries — HTTP method, status code, latency, upstream/downstream service names. No code changes required.

But Envoy is a proxy. It sees network traffic. It does not see what happens inside the application between receiving a request and sending one out. It observes the envelope, not the letter.

Critical caveat: Envoy generates spans per hop, but it cannot correlate an outbound request to the inbound request that caused it. The application must propagate trace headers from incoming to outgoing requests. Without this, you get isolated per-hop spans, not connected end-to-end traces. This is the single most common problem teams encounter when setting up tracing in a service mesh.

Level 2: Auto-Instrumentation (OTel SDK, Zero-Code)

OpenTelemetry provides auto-instrumentation agents that hook into well-known frameworks and libraries at runtime. In Java, it’s a -javaagent JVM flag. In Python, it’s opentelemetry-instrument wrapping the process. In Go, it’s compile-time instrumentation libraries (Go doesn’t have a runtime agent model).

What auto-instrumentation captures depends on the language and libraries in use, but typically:

HTTP servers/clients (net/http, Express, Spring, Flask) — spans for every inbound and outbound HTTP request
gRPC — spans for every RPC call
Database drivers (JDBC, pgx, database/sql) — spans for every query, including the SQL statement
Message queues (Kafka producers/consumers, RabbitMQ) — spans for publish and consume, with context propagation through message headers
Redis, Memcached — spans for cache operations
AWS SDK calls — spans for S3, DynamoDB, SQS, etc.

This is where most of the practical value comes from. Auto-instrumentation covers the common I/O boundaries without touching application code. It also handles context propagation automatically — the trace headers get forwarded from incoming to outgoing requests, which is exactly what the mesh layer alone does not do.

Level 3: Manual Instrumentation (Custom Spans)

For anything auto-instrumentation doesn’t cover — business logic, internal algorithms, conditional branches, custom processing steps — there’s code to write:

ctx, span := tracer.Start(ctx, "processOrder",
    trace.WithAttributes(
        attribute.String("order.id", orderID),
        attribute.Int("order.items", len(items)),
    ),
)
defer span.End()

// ... business logic ...

if err != nil {
    span.RecordError(err)
    span.SetStatus(codes.Error, "order processing failed")
}

This is the only way to get:

Business context (order ID, user ID, tenant ID) as span attributes
Visibility into application-internal operations
Custom error recording with domain-specific detail
Spans around logic that doesn’t touch I/O (validation, transformation, computation)

The Practical Stack

In most production setups, all three layers combine:

[Mesh/Proxy spans]     ← Envoy, service boundary, automatic
        +
[Auto-instrumented]    ← OTel SDK, library-level I/O, near-automatic
        +
[Manual spans]         ← Application code, business logic, requires effort
        =
[Complete trace]       ← The full picture from ingress to database and back

Level	Effort	Coverage
Mesh only (Envoy)	Zero code changes	Service boundary hops only. No internal visibility. Broken traces without header propagation.
Auto-instrumentation	SDK dependency + agent flag + env vars	HTTP, gRPC, database, cache, queue spans. Header propagation handled. Solid coverage for most services.
Manual instrumentation	Code per operation	Business logic, custom attributes, full internal visibility. The only way to get domain-specific context.

Most teams start with auto-instrumentation (80% of the value for 5% of the effort) and add manual spans selectively where deeper visibility is needed.

How Traces Get Collected

Once spans are produced, they need to reach a backend. The collection pipeline:

Application (OTel SDK)
      │
      │ OTLP (gRPC :4317 or HTTP :4318)
      ▼
OTel Collector (agent mode, DaemonSet or sidecar)
      │
      │ batch, filter, enrich (add k8s metadata, etc.)
      ▼
OTel Collector (gateway mode, optional)
      │
      │ tail sampling, routing, fan-out
      ▼
Backend (Tempo, Jaeger, Datadog, etc.)

Export from the Application

The OTel SDK batches completed spans in memory and exports them periodically (default: every 5 seconds or when the batch hits 512 spans). The export target is configured via environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_SERVICE_NAME=checkout-service
OTEL_RESOURCE_ATTRIBUTES=k8s.namespace.name=prod,k8s.pod.name=$(POD_NAME)

The SDK handles batching, retry on failure, and back-pressure (drops spans if the export queue is full, rather than blocking the application).

Collector Agent (DaemonSet)

In Kubernetes, the typical pattern is an OTel Collector DaemonSet — one per node. Every pod on that node exports to the local collector. The agent:

Receives spans over OTLP
Enriches them with Kubernetes metadata (pod name, namespace, node, labels) via the k8sattributes processor
Batches for efficient network transfer
Applies memory limits to prevent OOM
Forwards to a gateway or directly to the backend

Collector Gateway (Optional)

For larger deployments, a centralized gateway collector handles:

Tail-based sampling — must happen centrally because it needs to see the complete trace before deciding whether to keep it, and spans from different services land on different agent nodes
Routing — send traces to Tempo, metrics to Mimir, logs to Loki
Fan-out — send to multiple backends simultaneously (e.g., Tempo for storage + a real-time analytics pipeline)

Backend Ingestion

The backend receives spans over OTLP and writes them to storage. From here it’s the backend’s problem — Tempo writes Parquet blocks to S3, Jaeger indexes into Elasticsearch, etc.

In a service mesh deployment, the Envoy sidecar acts as its own span producer and exports spans directly to the tracing backend without going through the application’s OTel SDK. The application’s SDK-produced spans and Envoy’s proxy-produced spans share the same Trace ID (assuming header propagation is working), so the backend stitches them together into one trace.

OpenTelemetry (OTel)

The de facto industry standard as of 2025. Second most active CNCF project after Kubernetes. All three core signals (traces, metrics, logs) are now stable. Semantic Conventions 1.0 shipped in 2025, standardizing attribute names across all languages and exporters.

Signal Maturity

Signal	Status	Notes
Traces	Stable	SDKs are v1.0+ across major languages
Metrics	Stable	Data model released as part of OTLP
Logs	Stable	Log Bridge API for existing frameworks
Profiling	In Development	Will support bi-directional links with traces/metrics/logs

Key Components

SDKs for 12+ languages (Java, Go, Python, JavaScript/TypeScript, .NET, Rust, C++, Ruby, PHP, Swift, Kotlin, Erlang) with auto-instrumentation for common frameworks
OTLP protocol (gRPC/HTTP + protobuf) — the standard wire format everything speaks
Collector — vendor-agnostic pipeline: Receivers → Processors → Exporters

Collector Architecture

[Receivers] --> [Processors] --> [Exporters]

Receivers (data ingress): OTLP (gRPC/HTTP), Prometheus scrape, Kafka, Jaeger, Zipkin, Fluent Forward, and many more.

Processors (transformation): Batch (groups telemetry for efficient export), Memory Limiter (prevents OOM), Attributes (add/modify/delete span attributes), Filter (drop unwanted data), Tail Sampling (sampling decisions after seeing complete traces).

Exporters (data egress): OTLP (to any compatible backend), Prometheus Remote Write, Debug (stdout), plus backend-specific exporters for Jaeger, Zipkin, Datadog, New Relic, etc.

The instrumentation layer is settled. The choice is in the backend.

Jaeger

History and Current State

Open-sourced by Uber in 2015, CNCF graduated project. Jaeger v2 shipped November 2024 (current: v2.13). Jaeger v1 reaches end-of-life December 31, 2025.

The v2 Rewrite

The defining change: Jaeger v2 is built on top of the OpenTelemetry Collector framework. The Jaeger binary directly imports OTel Collector code as a library. It’s not a fork — it’s a customized OTel Collector distribution with Jaeger’s storage backends and UI.

What this means practically:

Single binary replaces multiple v1 binaries (collector, agent, query, ingester)
Configured via OTel Collector YAML format
Natively processes OTLP — no translation step
Gets tail-based sampling, batch processing, filtering, and every other OTel Collector processor for free
A separate OTel Collector in front of Jaeger is no longer necessary (though it can still be used)

Deployment Roles

The single binary runs in different roles:

Collector: Receives trace data, writes to storage
Query: Serves APIs and the Jaeger UI for querying/visualizing traces
Ingester: Consumes from Kafka, writes to storage
All-in-one: Collector + Query in a single process (development/testing)

Storage Backends

Backend	Notes
Elasticsearch 7.x/8.x	Best query performance. Recommended for most deployments.
OpenSearch 1.0+	Drop-in Elasticsearch alternative
Cassandra 4.0+	Good for write-heavy workloads, limited analytics
ClickHouse	Becoming first-class. Column-oriented, superior for analytics on trace data.
Kafka	Buffering layer for durability and spike absorption, not storage itself

Strengths

Mature, battle-tested at Uber scale
Built-in UI for trace visualization
Rich indexed queries (search by service, operation, tags, duration)
Fully aligned with OTel — no more divergent instrumentation formats

Weaknesses

Operational cost: Running Elasticsearch or Cassandra clusters is non-trivial
Storage backends require their own capacity planning, scaling, and backup
Indexing everything is expensive — pushes teams toward aggressive sampling (1-10%)

Grafana Tempo

Philosophy

Fundamentally different design from Jaeger: no indexing, object storage only.

Tempo stores traces as Parquet blocks in S3/GCS/Azure Blob. No Elasticsearch. No Cassandra. No database to operate. Object storage is cheap enough to store 100% of traces without sampling.

Architecture

Distributor: Accepts spans (OTLP, Jaeger, Zipkin protocols), routes to ingesters via consistent hash ring
Ingester: Buffers spans, builds Parquet blocks, flushes to object storage
Querier: Looks up traces in ingesters (recent) or object storage (historical)
Query Frontend: Splits queries across queriers for parallelism
Compactor: Compresses, deduplicates, expires blocks

Storage Backends

Amazon S3
Google Cloud Storage (GCS)
Azure Blob Storage
MinIO (S3-compatible)
Local filesystem (development only)

TraceQL

Originally Tempo was trace-ID-lookup only — finding a trace required knowing its ID, which meant discovering traces through correlated logs (Loki) or metric exemplars (Prometheus/Mimir).

That’s no longer the case. TraceQL is Tempo’s query language:

{ span.http.status_code = 500 }
{ span.http.method = "GET" && duration > 2s }
{ resource.service.name = "checkout" && span.db.system = "postgresql" }

TraceQL Metrics (public preview) can create aggregate metrics from traces, similar to how LogQL creates metrics from logs.

Current Versions

Tempo 2.8 (June 2025): New TraceQL functions, memory optimizations
Tempo 2.9 (October 2025): MCP server support (LLMs can query traces via TraceQL), TraceQL metrics sampling

The Grafana Stack Integration

Tempo’s real power is in the integrated stack:

Loki (logs) → find trace IDs in log lines → jump to trace in Tempo
Tempo (traces) → span-level detail, TraceQL queries
Mimir/Prometheus (metrics) → exemplars link directly to trace IDs → jump to trace
Grafana → unified UI correlating all three signals

Jaeger vs Tempo: Head to Head

	Jaeger	Tempo
Storage	Elasticsearch, Cassandra, ClickHouse	Object storage (S3, GCS, Azure)
Indexing	Full indexing	No traditional indexing; Parquet blocks
Sampling	Typically 1-10% (storage cost pressure)	Designed for 100% (storage is cheap)
Query	Rich indexed search from day one	TraceQL (newer, catching up)
UI	Built-in Jaeger UI	Grafana
Operational burden	Higher (database clusters to manage)	Lower (object storage, no indexes)
Ecosystem	Standalone / OTel	Deep Grafana stack integration
Cost at scale	Higher (indexed storage is expensive)	Lower (object storage is cheap)

Migration trend: Red Hat published Jaeger-to-Tempo migration guidance in April 2025 as OpenShift deprecated the Jaeger-based tracing platform. Both accept OTLP, so the data pipeline doesn’t change — only the backend.

Istio / Service Mesh Integration

Service meshes provide tracing with near-zero application changes, but the coverage has important limits.

What the Mesh Provides Automatically

Envoy sidecars generate spans for every inbound/outbound request at the service boundary
Latency measurement per hop
Service dependency graph
Sampling rate controlled via Telemetry API (default 1%)

What Requires Manual Work

End-to-end trace correlation — Envoy generates spans per hop, but cannot correlate an outbound request to the inbound request that caused it. The application must propagate trace headers (traceparent/tracestate for W3C, or x-b3-* for B3 format) from incoming to outgoing requests. Without this, traces appear as disconnected per-hop spans.
Application-internal spans — anything inside the code (function calls, business logic) requires SDK instrumentation
Database/cache query tracing — requires library-level instrumentation
Business context (user ID, order ID) — requires custom span attributes via SDK

Capability	Automatic (mesh)	Manual (application)
Span generation at service boundary	Yes	—
Latency measurement per hop	Yes	—
Service dependency graph	Yes	—
End-to-end trace correlation	No	Header propagation required
Application-internal spans	No	Full SDK instrumentation
Business context attributes	No	Custom attributes via SDK
Database query tracing	No	Library instrumentation

The recommended approach for header propagation is OTel SDK auto-instrumentation, which handles it transparently. The alternative is manual middleware that copies headers from incoming to outgoing requests.

Other Tracing Systems

Zipkin: The original (Twitter, 2012). Simple, lightweight. Its B3 propagation format was the standard before W3C Trace Context. Hasn’t evolved for modern scale. Increasingly superseded by Jaeger and Tempo.
AWS X-Ray: Managed service, deep AWS integration (Lambda, ECS, API Gateway). Good for AWS-only shops. Limited outside the ecosystem. Supports OTel for instrumentation.
Datadog APM: Commercial SaaS. Automatic instrumentation, AI anomaly detection (Watchdog), tightly integrated traces/logs/metrics. Pricing escalates fast (~$31/host/month + per-GB ingestion charges).
Honeycomb: Purpose-built for high-cardinality trace debugging. “BubbleUp” feature for automatic anomaly detection. OTel-native. Best for teams that prioritize deep trace-based debugging over breadth.
Lightstep / ServiceNow Cloud Observability: Founded by OTel co-creators, acquired by ServiceNow. ServiceNow announced EOL for Lightstep in 2025. Teams are migrating to Grafana stack, Honeycomb, or SigNoz.