OpenTelemetry

What is OpenTelemetry?

OpenTelemetry (OTel) is an open-source observability framework that provides a standardized way to collect, process, and export telemetry data (traces, metrics, and logs) from your applications and infrastructure.

Key Benefits:

  • Vendor-agnostic: No lock-in to specific observability platforms
  • Standardized instrumentation across languages and frameworks
  • Unified telemetry collection and processing
  • Strong community support and industry adoption

Core Concepts

┌─────────────────────────────────────────────────────────────────┐  
│                        Your Application                         │  
│                                                                 │  
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │  
│  │   Traces     │    │   Metrics    │    │     Logs     │       │  
│  └──────┬───────┘    └───────┬──────┘    └────────┬─────┘       │  
│         │                    │                    │             │  
└─────────┼────────────────────┼────────────────────┼─────────────┘  
          │                    │                    │  
          └────────────────────┼────────────────────┘  
                               │  
                               ▼  
                    ┌──────────────────────┐  
                    │  OTLP (Protocol)     │  
                    └──────────┬───────────┘  
                               │  
                               ▼  
                    ┌──────────────────────┐  
                    │  OTel Collector      │  
                    └──────────┬───────────┘  
                               │  
                               ▼  
                    ┌──────────────────────┐  
                    │  Observability       │  
                    │  Backend             │  
                    │  (Datadog, Dynatrace,│  
                    │   Grafana, etc.)     │  
                    └──────────────────────┘

Signals

OpenTelemetry supports four types of telemetry signals:

1. Traces

Distributed traces track requests as they flow through distributed systems.

Request Flow:  
┌──────────┐      ┌──────────┐      ┌──────────┐      ┌──────────┐  
│  User    │─────▶│  API GW  │─────▶│ Service  │─────▶│ Database │  
└──────────┘      └────┬─────┘      └─────┬────┘      └──────┬───┘  
                       │                  │                  │  
                  [Span A]           [Span B]           [Span C]  
                       │                  │                  │  
                       └──────────────────┴──────────────────┘  
                                    │  
                              [Complete Trace]

Use cases:

  • Request latency analysis
  • Service dependency mapping
  • Root cause analysis for failures

2. Metrics

Numerical measurements of system behavior over time.

Types:

  • Counter: Cumulative value that only increases (e.g., total requests, total errors)
  • Gauge: Point-in-time value that can go up or down (e.g., CPU usage, memory consumption, queue depth)
  • Histogram: Distribution of values with configurable buckets (e.g., request durations, response sizes)
    • Server-side quantile calculation
    • Efficient for aggregation across dimensions
  • Summary: Pre-calculated quantiles (e.g., p50, p90, p99)
    • Client-side quantile calculation
    • Cannot be aggregated across dimensions
    • Common in Prometheus ecosystem

Use cases:

  • Performance monitoring
  • Alerting and SLOs
  • Capacity planning
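
The first three types above map directly onto the Python metrics API; a minimal sketch (instrument names are illustrative):

from opentelemetry import metrics

meter = metrics.get_meter(__name__)

# Counter: cumulative, only increases
request_counter = meter.create_counter("app.requests.total")
request_counter.add(1, {"route": "/api/orders"})

# Histogram: distribution of values (e.g., latency in ms)
latency_histogram = meter.create_histogram("app.request.duration")
latency_histogram.record(245, {"route": "/api/orders"})

# Gauge: point-in-time value, reported via a callback
def read_queue_depth(options):
    yield metrics.Observation(42, {"queue": "orders"})

meter.create_observable_gauge("app.queue.depth", callbacks=[read_queue_depth])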

3. Logs

Timestamped text records of discrete events.

Use cases:

  • Debugging and troubleshooting
  • Audit trails
  • Event analysis

Two Types of Logs in OpenTelemetry

It’s important to distinguish between two different categories of logs when working with the OpenTelemetry Collector:

1. Collector-Level Logs (Operational Logs)

These are logs from the collector itself about its own operation, health, and internal state.

Purpose:

  • Monitor collector health and performance
  • Debug collector configuration issues
  • Track collector startup, shutdown, and errors
  • Operational metrics about the collector process

Configuration:

service:
  telemetry:
    logs:
      level: info  # debug, info, warn, error
      encoding: json  # json or console
      output_paths:
        - stderr
        - /var/log/otelcol.log

Example Collector Logs:

2024-01-15T10:23:45.123Z  info  service/service.go:123  Starting otelcol...
2024-01-15T10:23:45.234Z  info  extensions/extensions.go:45  Extension is starting...
2024-01-15T10:23:45.345Z  warn  batchprocessor/batch.go:89  Queue is 80% full
2024-01-15T10:23:45.456Z  error exporterhelper/export.go:234  Exporting failed  {"error": "connection refused"}

Key Characteristics:

  • Generated by the collector binary itself
  • Configured under service.telemetry.logs
  • Used for operational monitoring and troubleshooting
  • Not part of the telemetry pipeline

2. Pipeline Logs (Application Logs)

These are application logs flowing through the collector as telemetry data, being received, processed, and exported.

Purpose:

  • Collect application logs from services
  • Process and enrich log data
  • Export logs to observability backends
  • Correlate logs with traces and metrics

Configuration:

receivers:
  filelog:
    include: [/var/log/app/*.log]
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
  attributes:
    actions:
      - key: environment
        value: production
        action: insert

exporters:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    logs:
      receivers: [filelog, otlp]
      processors: [batch, attributes]
      exporters: [otlp]

Example Pipeline Logs:

{
  "timestamp": "2024-01-15T10:23:45.123Z",
  "severity": "ERROR",
  "body": "Failed to process payment",
  "attributes": {
    "service.name": "payment-service",
    "user_id": "12345",
    "transaction_id": "abc-123"
  },
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}

Key Characteristics:

  • Generated by your applications/services
  • Flows through the logs pipeline (receivers → processors → exporters)
  • Can be correlated with traces via trace_id and span_id
  • Subject to pipeline processing (filtering, transformation, enrichment)

Comparison Summary

Aspect          | Collector-Level Logs   | Pipeline Logs
----------------|------------------------|-------------------------------------------------
Source          | OTel Collector itself  | Applications/services
Purpose         | Collector operations   | Application telemetry
Config Location | service.telemetry.logs | service.pipelines.logs
Destination     | Local files, stderr    | Observability backends
Processing      | No pipeline processing | Full pipeline (receivers, processors, exporters)
Use Case        | Monitor the collector  | Monitor your applications

Why This Matters:

  • Troubleshooting collector issues: Check collector-level logs
  • Analyzing application behavior: Query pipeline logs in your backend
  • Meta-monitoring: You can send collector-level logs through a separate pipeline to monitor the collector as an application

4. Baggage

Key-value pairs propagated across service boundaries.

Use cases:

  • Passing metadata through distributed systems
  • Feature flags
  • User context (tenant ID, user ID)

Service A                Service B                Service C  
┌─────────┐             ┌─────────┐             ┌─────────┐  
│ user_id │────────────▶│ user_id │────────────▶│ user_id │  
│ tenant  │  (Baggage)  │ tenant  │  (Baggage)  │ tenant  │  
└─────────┘             └─────────┘             └─────────┘
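
As a minimal sketch with the Python SDK (the tenant.id and user.id keys are illustrative), baggage is set and read through the opentelemetry.baggage API:

from opentelemetry import baggage, context

# Each set_baggage call returns a new Context carrying the entry
ctx = baggage.set_baggage("tenant.id", "acme")
ctx = baggage.set_baggage("user.id", "12345", context=ctx)

# Attach the context so instrumented clients propagate the entries
# across service boundaries (via the W3C `baggage` header)
token = context.attach(ctx)
try:
    print(baggage.get_baggage("tenant.id"))  # "acme"
finally:
    context.detach(token)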

Instrumentation Approaches

1. Zero-Code Instrumentation

Automatic instrumentation without modifying application code.

Method: Attach an agent at runtime

# Java example  
java -javaagent:path/to/opentelemetry-javaagent.jar \  
     -Dotel.service.name=my-service \  
     -jar myapp.jar

Pros:

  • No code changes required
  • Quick to implement
  • Covers common frameworks automatically

Cons:

  • Limited customization
  • May not capture business-specific metrics

2. Code-Based Instrumentation

Manual instrumentation using OpenTelemetry SDKs.

Example (Python):

from opentelemetry import trace  
  
tracer = trace.get_tracer(__name__)  
  
with tracer.start_as_current_span("process_order"):  
    # Your business logic  
    process_payment()  
    update_inventory()

Pros:

  • Full control over what’s instrumented
  • Custom attributes and metrics
  • Business-specific observability

Cons:

  • Requires code changes
  • More development effort

3. Library Instrumentation

Pre-instrumented libraries for popular frameworks.

Examples:

  • opentelemetry-instrumentation-flask (Python)
  • @opentelemetry/instrumentation-express (Node.js)
  • Framework-specific auto-instrumentation

OpenTelemetry Protocol (OTLP)

OTLP is the native protocol for OpenTelemetry, designed for efficient telemetry data transmission.

Key Features:

  • Binary format using Protocol Buffers (efficient)
  • gRPC and HTTP transports (protobuf payloads, or JSON over HTTP)
  • Supports all signal types (traces, metrics, logs)

Endpoints:

  • collector:4317 - gRPC (default)
  • http://collector:4318/v1/<signal> - HTTP (default; e.g., /v1/traces)
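
A minimal Python SDK setup exporting traces to the gRPC endpoint above (the endpoint value and insecure flag are illustrative):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Point the exporter at the collector's OTLP/gRPC port
exporter = OTLPSpanExporter(endpoint="collector:4317", insecure=True)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo"):
    pass  # spans are batched and sent over OTLP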

OpenTelemetry Collector

The Collector is a vendor-agnostic proxy that receives, processes, and exports telemetry data.

┌────────────────────────────────────────────────────────────────┐  
│                     OpenTelemetry Collector                    │  
│                                                                │  
│  ┌──────────────────────────────────────────────────────────┐  │  
│  │                       RECEIVERS                          │  │  
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────────┐   │  │  
│  │  │   OTLP   │  │Prometheus│  │  Jaeger  │  │  Zipkin │   │  │  
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────┬────┘   │  │  
│  └───────┼─────────────┼─────────────┼─────────────┼────────┘  │  
│          │             │             │             │           │  
│          └─────────────┴─────────────┴─────────────┘           │  
│                            │                                   │  
│  ┌─────────────────────────▼──────────────────────────────┐    │  
│  │                     PROCESSORS                         │    │  
│  │  ┌──────────┐  ┌──────────┐  ┌────────────────────┐    │    │  
│  │  │ Batch    │  │ Filter   │  │ Attribute          │    │    │  
│  │  │          │  │          │  │ Enrichment         │    │    │  
│  │  └────┬─────┘  └────┬─────┘  └────────┬───────────┘    │    │  
│  └───────┼─────────────┼─────────────────┼────────────────┘    │  
│          │             │                 │                     │  
│          └─────────────┴─────────────────┘                     │  
│                            │                                   │  
│  ┌─────────────────────────▼──────────────────────────────┐    │  
│  │                      EXPORTERS                         │    │  
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │    │  
│  │  │   OTLP   │  │ Datadog  │  │Dynatrace │              │    │  
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘              │    │  
│  └───────┼─────────────┼─────────────┼────────────────────┘    │  
└──────────┼─────────────┼─────────────┼─────────────────────────┘  
           │             │             │  
           ▼             ▼             ▼  
    ┌──────────┐  ┌──────────┐  ┌──────────┐  
    │ Backend  │  │ Backend  │  │ Backend  │  
    │    A     │  │    B     │  │    C     │  
    └──────────┘  └──────────┘  └──────────┘

Why Use a Collector?

  • Decouples telemetry generation from export
  • Centralized processing and transformation
  • Reduces load on applications
  • Enables multi-backend export
  • Data buffering and retry logic

Collector Components Deep Dive

Receivers

Ingest telemetry data from various sources.

Common Receivers:

  • otlp: Native OpenTelemetry protocol
  • prometheus: Scrapes Prometheus metrics
  • jaeger: Receives Jaeger traces
  • zipkin: Receives Zipkin traces
  • hostmetrics: Collects host-level metrics

Processors

Transform, filter, and enrich telemetry data.

Common Processors:

  • batch: Batches telemetry for efficient export
  • memory_limiter: Prevents memory overload
  • attributes: Add/remove/modify attributes
  • filter: Filter out unwanted telemetry
  • resource: Modify resource attributes
  • transform: Advanced data transformation

Attributes Processor

Add, modify, or remove attributes from spans, metrics, or logs.

Configuration Example:

processors:  
  attributes:  
    actions:  
      # Add a new attribute  
      - key: environment  
        value: production  
        action: insert  
  
      # Update existing attribute  
      - key: http.url  
        action: update  
        value: redacted  
  
      # Remove sensitive attributes  
      - key: credit_card  
        action: delete  
  
      # Hash PII data  
      - key: user_email  
        action: hash  
  
      # Extract value from existing attribute  
      - key: http.url  
        pattern: ^https?://([^/]+).*  
        action: extract

Use Cases:

  • Adding deployment/environment metadata
  • Removing PII or sensitive data
  • Normalizing attribute names
  • Enriching telemetry with context

Filter Processor

Filter out entire spans, metrics, or logs based on conditions.

Configuration Example:

processors:  
  filter:  
    # Filter traces  
    traces:  
      span:  
        # Exclude health check endpoints  
        - 'attributes["http.url"] == "/health"'  
        - 'attributes["http.url"] == "/ready"'  
  
        # Exclude successful requests from specific services  
        - 'resource.attributes["service.name"] == "frontend" and attributes["http.status_code"] < 400'  
  
    # Filter metrics  
    metrics:  
      metric:  
        # Exclude specific metrics  
        - 'name == "system.cpu.time"'  
        - 'type == METRIC_DATA_TYPE_HISTOGRAM and name matches "test.*"'  
  
    # Filter logs  
    logs:  
      log_record:  
        # Exclude debug logs in production  
        - 'severity_text == "DEBUG" and resource.attributes["environment"] == "production"'  
        - 'body matches ".*noise.*"'

Use Cases:

  • Reducing data volume by filtering health checks
  • Excluding noisy or low-value telemetry
  • Filtering test data from production pipelines
  • Compliance: excluding entire data points containing sensitive information

Transform Processor

Advanced data transformation using OpenTelemetry Transformation Language (OTTL).

Configuration Example:

processors:  
  transform:  
    # Transform traces  
    trace_statements:  
      - context: span  
        statements:  
          # Redact sensitive data with regex  
          - replace_pattern(attributes["http.url"], "/user/\\d+", "/user/{id}")  
          - replace_pattern(attributes["http.url"], "/account/[^/]+", "/account/{id}")  
  
          # Remove password parameters from URLs  
          - replace_pattern(attributes["http.url"], "password=[^&]*", "password=***")  
  
          # Mask credit card numbers  
          - replace_pattern(attributes["request.body"], "\\d{4}-\\d{4}-\\d{4}-\\d{4}", "****-****-****-****")  
  
          # Delete sensitive attributes  
          - delete_key(attributes, "authorization")  
          - delete_key(attributes, "api_key")  
          - delete_key(attributes, "session_token")  
  
          # Truncate long values  
          - truncate_all(attributes, 4096)  
  
          # Set default values  
          - set(attributes["environment"], "production") where attributes["environment"] == nil  
  
          # Normalize HTTP methods to uppercase  
          - set(attributes["http.method"], ConvertCase(attributes["http.method"], "upper"))  
  
    # Transform metrics  
    metric_statements:  
      - context: metric  
        statements:  
          # Rename metrics  
          - set(name, "new.metric.name") where name == "old.metric.name"  
  
          # Add labels/attributes  
          - set(attributes["cluster"], "us-west-2")  
  
    # Transform logs  
    log_statements:  
      - context: log  
        statements:  
          # Redact email addresses  
          - replace_pattern(body, "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", "***@***.***")  
  
          # Redact IP addresses  
          - replace_pattern(body, "\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b", "***.***.***.**")  
  
          # Remove sensitive fields from JSON logs  
          - delete_key(attributes, "password")  
          - delete_key(attributes, "ssn")

Use Cases:

  • PII Redaction: Remove or mask personally identifiable information
  • Credential Scrubbing: Remove passwords, API keys, tokens
  • URL Sanitization: Replace dynamic path segments with placeholders
  • Data Normalization: Standardize formats, casing, units
  • Attribute Management: Rename, restructure, or enrich attributes

Redaction Processor (via Transform)

While the core OpenTelemetry Collector distribution has no dedicated redaction processor (the contrib distribution ships one), redaction can also be achieved using the transform processor with allowlist patterns.

Allowlist Pattern Example:

processors:  
  transform:  
    trace_statements:  
      - context: span  
        statements:  
          # Define allowed attributes  
          - keep_keys(attributes, ["http.method", "http.status_code", "service.name", "environment"])  
  
          # Everything else is automatically dropped

Blocklist Pattern Example:

processors:  
  transform:  
    trace_statements:  
      - context: span  
        statements:  
          # Delete specific sensitive attributes  
          - delete_matching_keys(attributes, ".*password.*")  
          - delete_matching_keys(attributes, ".*token.*")  
          - delete_matching_keys(attributes, ".*secret.*")  
          - delete_matching_keys(attributes, ".*api[_-]?key.*")  
          - delete_matching_keys(attributes, ".*credit[_-]?card.*")

Use Cases:

  • GDPR/CCPA compliance
  • Security: preventing credential leaks
  • Cost optimization: keeping only essential attributes

Connectors

Bridge different pipelines, enabling signal transformation.

Example: Spanmetrics Connector

Trace Pipeline                Metrics Pipeline  
┌────────────┐               ┌────────────┐  
│   Traces   │──────────────▶│  Metrics   │  
│            │  Connector    │            │  
└────────────┘               └────────────┘  
    │                             │  
    ▼                             ▼  
[Export to       [Export metrics with  
 trace backend]   exemplars to metrics backend]

Use Case: Generate RED metrics (Rate, Errors, Duration) from traces.

Exporters

Send processed telemetry to backends.

Common Exporters:

  • otlp: Export to OTLP-compatible backends
  • prometheus: Expose Prometheus metrics endpoint
  • debug (formerly logging): Debug output to console
  • datadog: Export to Datadog
  • jaeger: Export to Jaeger
  • loadbalancing: Distribute load across multiple backend endpoints

Loadbalancing Exporter

The loadbalancing exporter distributes telemetry data across multiple backend instances for better scalability and reliability.

Key Features:

  • Consistent hashing for trace ID-based routing
  • Multiple resolver types for endpoint discovery
  • Automatic failover and health checking
  • Best suited to traces (trace ID routing); support for logs and metrics routing varies by collector version

Why Use It?

Without Loadbalancer:             With Loadbalancer:
┌──────────┐                      ┌──────────┐
│Collector │                      │Collector │
└────┬─────┘                      └────┬─────┘
     │ All traffic to one              │ Distributed
     │ backend instance                │ by trace_id
     ▼                                 ▼
┌──────────┐                      ┌──────────┬─────────┬─────────┐
│ Backend  │                      │ Backend1 │Backend2 │Backend3 │
└──────────┘                      └──────────┴─────────┴─────────┘
Single point of failure           Load distributed, HA

Configuration:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
        tls:
          insecure: false
    resolver:
      static:
        hostnames:
          - backend-1.example.com:4317
          - backend-2.example.com:4317
          - backend-3.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [loadbalancing]

Resolver Types

The loadbalancing exporter supports three resolver types for discovering backend endpoints:

1. Static Resolver

Hardcoded list of backend endpoints.

Configuration:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
    resolver:
      static:
        hostnames:
          - backend-1.example.com:4317
          - backend-2.example.com:4317
          - backend-3.example.com:4317

Use Case:

  • Fixed backend infrastructure
  • Simple deployments
  • Testing and development

Pros: Simple, predictable

Cons: Manual updates required when backends change

2. DNS Resolver

Dynamically discovers backends via DNS A/AAAA records.

Configuration:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
    resolver:
      dns:
        hostname: backends.example.com
        port: 4317
        interval: 5s  # How often to refresh DNS
        timeout: 1s

How It Works:

DNS Query: backends.example.com
     │
     ▼
DNS Server Returns:
  - 10.0.1.10
  - 10.0.1.11
  - 10.0.1.12
     │
     ▼
Collector Updates Backend List:
  - 10.0.1.10:4317
  - 10.0.1.11:4317
  - 10.0.1.12:4317

Use Case:

  • Dynamic backend scaling
  • Cloud environments
  • DNS-based service discovery

Pros: Automatic endpoint discovery

Cons: Depends on DNS infrastructure, potential DNS caching issues

3. Kubernetes Resolver

Discovers backends from Kubernetes service endpoints.

Configuration:

exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 1s
    resolver:
      k8s:
        service: otel-collector-headless
        namespace: observability
        ports:
          - 4317

Requirements:

  • Collector must run in Kubernetes
  • Headless service pointing to backend pods
  • Collector needs RBAC permissions to list endpoints

RBAC Configuration:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8s-resolver
rules:
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8s-resolver
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
roleRef:
  kind: ClusterRole
  name: otel-collector-k8s-resolver
  apiGroup: rbac.authorization.k8s.io

Headless Service Example:

apiVersion: v1
kind: Service
metadata:
  name: otel-collector-headless
  namespace: observability
spec:
  clusterIP: None  # Headless service
  selector:
    app: otel-collector
    role: backend
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317

How It Works:

Collector watches Kubernetes API
     │
     ▼
Discovers Endpoints:
  - otel-collector-backend-0.otel-collector-headless:4317
  - otel-collector-backend-1.otel-collector-headless:4317
  - otel-collector-backend-2.otel-collector-headless:4317
     │
     ▼
Automatically updates as pods scale

Use Case:

  • Kubernetes-native deployments
  • Auto-scaling backends
  • StatefulSet backends

Pros: Native K8s integration, automatic scaling

Cons: K8s-specific, requires RBAC setup

Load Balancing Strategy

The loadbalancing exporter uses consistent hashing based on trace ID.

Key Behavior:

  • All spans belonging to the same trace go to the same backend
  • Ensures complete traces are stored together
  • Maintains data locality for trace queries

Trace ID: abc123...  ──┐
                       ├─▶ Hash ──▶ Backend 1
Trace ID: def456...  ──┘

Trace ID: ghi789...  ──▶ Hash ──▶ Backend 2

Why Trace ID Hashing?

  • Keeps entire traces together for efficient querying
  • Avoids data fragmentation across backends
  • Enables backend-side trace aggregation
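
A toy Python illustration of the routing idea (the real exporter uses a consistent-hash ring so that adding or removing backends remaps few traces; plain modulo hashing here is only for intuition):

import hashlib

backends = ["backend-1:4317", "backend-2:4317", "backend-3:4317"]

def route(trace_id: str) -> str:
    # Hash the trace ID so every span of a trace picks the same backend
    digest = hashlib.sha256(trace_id.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

print(route("4bf92f3577b34da6a3ce929d0e0e4736"))  # stable per trace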
Complete Example: Kubernetes Deployment

# Collector Config
exporters:
  loadbalancing:
    protocol:
      otlp:
        timeout: 5s
        sending_queue:
          enabled: true
          num_consumers: 10
          queue_size: 1000
        retry_on_failure:
          enabled: true
          initial_interval: 1s
          max_interval: 30s
    resolver:
      k8s:
        service: jaeger-collector-headless
        namespace: observability
        ports:
          - 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]

Benefits:

  • Distributes trace load across multiple Jaeger collectors
  • Automatic scaling as Jaeger pods scale
  • High availability through multiple backends
  • Trace locality maintained via consistent hashing

Service Pipelines

Define the complete data flow for each signal type.

Example Configuration:

service:  
  pipelines:  
    traces:  
      receivers: [otlp, jaeger]  
      processors: [memory_limiter, batch]  
      exporters: [otlp, jaeger]  
  
    metrics:  
      receivers: [otlp, prometheus]  
      processors: [batch, attributes]  
      exporters: [prometheus, datadog]  
  
    traces/2:  
      receivers: [otlp]  
      processors: [batch]  
      exporters: [spanmetrics]  
  
    metrics/spanmetrics:  
      receivers: [spanmetrics]  
      processors: [batch]  
      exporters: [prometheus]

Span Batch Processing Configuration

The Batch Span Processor batches spans before exporting to improve efficiency.

Key Configuration Parameters

otel.bsp.max.export.batch.size

Maximum number of spans to export in a single batch.

  • Default: 512
  • Recommended: 512-2048 (depends on span size)
  • Configuration methods:
    • Environment variable: OTEL_BSP_MAX_EXPORT_BATCH_SIZE=1024
    • JVM option: -Dotel.bsp.max.export.batch.size=1024

otel.bsp.max.queue.size

Maximum queue size for buffering spans before batching.

  • Default: 2048

  • Important: Should be ≥ max.export.batch.size

  • If max.export.batch.size is larger than the queue size, it won’t be able to form a batch of that size.

Relationship Diagram:

Span Generation  
      │  
      ▼  
┌─────────────────────────────┐  
│    Queue (max: 2048)        │  
│  ┌───┐┌───┐┌───┐┌───┐       │  
│  │ S ││ S ││ S ││ S │...    │  
│  └───┘└───┘└───┘└───┘       │  
└─────────────┬───────────────┘  
              │ Batch (max: 512)  
              ▼  
┌─────────────────────────────┐  
│      Exporter               │  
└─────────────────────────────┘

otel.bsp.schedule.delay

Maximum time to wait before exporting a partial batch.

  • Default: 5000ms
  • Use: Ensures timely export even with low traffic

Best Practices:

  • Set max.export.batch.size ≤ max.queue.size
  • Monitor queue saturation and dropped spans
  • Tune based on span size and throughput
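
In the Python SDK, for example, these knobs map onto BatchSpanProcessor constructor arguments (values shown are illustrative):

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Queue must hold at least one full batch (2048 >= 512)
processor = BatchSpanProcessor(
    ConsoleSpanExporter(),
    max_queue_size=2048,
    max_export_batch_size=512,
    schedule_delay_millis=5000,  # flush partial batches every 5s
)

provider = TracerProvider()
provider.add_span_processor(processor)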

Error Propagation in Spans

When errors occur in distributed systems, they must be properly recorded and propagated through the trace hierarchy.

Span Status

Spans have a status code that indicates the outcome of the operation:

  • UNSET (default): Status not explicitly set
  • OK: Operation completed successfully
  • ERROR: Operation failed

Recording Errors

Python Example:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("database_query") as span:
    try:
        result = db.query("SELECT * FROM users")
    except Exception as e:
        # Record the exception
        span.record_exception(e)
        # Set span status to ERROR
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise

Java Example:

Span span = tracer.spanBuilder("database_query").startSpan();
try (Scope scope = span.makeCurrent()) {
    result = db.query("SELECT * FROM users");
} catch (Exception e) {
    // Record the exception
    span.recordException(e);
    // Set span status to ERROR
    span.setStatus(StatusCode.ERROR, e.getMessage());
    throw e;
} finally {
    span.end();
}

Standard Exception Attributes

When recording exceptions, these semantic attributes are automatically added:

  • exception.type: Exception class name (e.g., ValueError, TimeoutError, SQLException)
  • exception.message: Exception message
  • exception.stacktrace: Full stack trace (configurable)

Error Propagation Pattern

Error propagation in distributed traces depends on how services handle downstream failures. The root span status is the most critical indicator of overall request success.

What is the Root Span?

The root span is the first span created in a trace - typically at your system’s entry point (e.g., API Gateway, load balancer, or first application service).

Key characteristics:

  • parent_id = null/undefined - has no parent span
  • Generates the trace_id - creates the unique identifier for the entire trace
  • Entry point - first instrumented component to receive the external request
  • Most critical for overall status - root span status determines trace success/failure
  • Measures end-to-end latency - captures complete request duration from entry to exit

Pattern 1: Error Propagation (Root Span = ERROR)

When a critical downstream operation fails and cannot be recovered:

Service A (ERROR)                       Duration: 500ms
├─ Span: API Call                       Status: ERROR
│  error: "Payment Failed"
│
│  └─ Service B (ERROR)                 Duration: 450ms
│     ├─ Span: Process Payment          Status: ERROR
│     │  error: "Insufficient Funds"
│     │
│     └─ Service C (OK)                 Duration: 50ms
│        └─ Span: Check Balance         Status: OK

Overall Trace Status: ERROR (determined by root span)

Even though the balance check succeeded, the trace is failed because:

  • Root span = ERROR → Request failed from user’s perspective
  • Critical operation (Process Payment) failed and couldn’t be recovered
  • Service A detected the failure and set its own span to ERROR

Pattern 2: Graceful Error Handling (Root Span = OK)

When failures are handled gracefully with retries or fallbacks:

Service A (OK)                          Duration: 500ms
├─ Span: API Call                       Status: OK
│  note: "Succeeded via retry"
│
│  ├─ Service B (ERROR)                 Duration: 200ms
│  │  └─ Span: Primary Payment          Status: ERROR
│  │     error: "Gateway timeout"
│  │
│  └─ Service B (OK)                    Duration: 250ms
│     └─ Span: Retry Payment            Status: OK

Overall Trace Status: OK (determined by root span)

The trace succeeded despite containing ERROR spans because:

  • Root span = OK → Request succeeded from user’s perspective
  • Primary payment failed but retry succeeded
  • Service A handled the error gracefully and completed the request
  • Observability backends may still flag as “contains errors” but the request succeeded

Key Behaviors

  • Child span errors don’t automatically propagate to parent spans

    • Each service is responsible for setting its own span status
    • Parent spans must explicitly catch errors from downstream calls and decide whether to:
      • Propagate the error: Set parent to ERROR
      • Handle gracefully: Recover via retry/fallback/circuit breaker, then set parent to OK if recovery succeeds
  • Root span status determines the overall trace outcome

    • Root span status = trace status from user’s perspective
    • This is what matters for user-facing SLIs/SLOs
    • Observability backends use root span status as the primary success/failure indicator
  • Error attributes propagate with the span

    • Exception details (type, message, stacktrace) are stored with the span
    • Available in the observability backend for analysis
    • Each span carries its own error context
  • Span status affects sampling decisions

    • Tail-based samplers can prioritize ERROR spans
    • Ensures errors are captured even with low sampling rates
    • Critical for maintaining visibility into failures
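
A sketch of Pattern 2 in Python (charge_card is a hypothetical gateway call that fails once): the child span records its failure, while the parent sets the overall status after the retry succeeds:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)
attempts = {"n": 0}

def charge_card():
    # Hypothetical payment gateway: times out on the first attempt
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise TimeoutError("Gateway timeout")

def attempt_payment():
    with tracer.start_as_current_span("process_payment") as span:
        try:
            charge_card()
        except Exception as e:
            # The child span records its own failure...
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            raise

with tracer.start_as_current_span("api_call") as root:
    try:
        attempt_payment()
    except TimeoutError:
        attempt_payment()  # retry; the ERROR child span stays in the trace
    # ...but the parent decides the overall outcome
    root.set_status(trace.Status(trace.StatusCode.OK))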

Best Practices

  • Always set span status when catching exceptions

    • Don’t leave error spans with UNSET status
    • Provides clear signal that operation failed
  • Use record_exception() to capture stack traces

    • Invaluable for debugging
    • Configure stack trace depth based on privacy/size concerns
  • Add context-specific attributes

    • user_id, transaction_id, order_id
    • Helps correlate errors with business context
  • Don’t swallow errors silently

    • Even if handled gracefully, record them
    • Helps identify patterns and potential issues
  • Set meaningful error messages

    • Include relevant context in the status message
    • Example: “Payment failed: insufficient funds for user 12345”
  • Consider partial failures

    • If operation partially succeeds, use OK status with warning attributes
    • Reserve ERROR for complete failures

Spanmetrics and Exemplars

Spanmetrics Connector

Generates metrics from trace spans (replaces deprecated spanmetrics processor).

Flow:

Trace Ingestion  
         │  
         ▼  
┌──────────────────────┐  
│  Trace Pipeline      │  
│                      │  
│  Span: /api/checkout │  
│    duration: 245ms   │  
│    status: OK        │  
│    service: web      │  
└────────┬─────────────┘  
         │  
         ▼  
┌──────────────────────┐  
│ Spanmetrics          │  
│ Connector            │  
└────────┬─────────────┘  
         │  
         ▼  
┌──────────────────────┐  
│  Metrics Pipeline    │  
│                      │  
│  duration_sum        │  
│  duration_count      │  
│  calls_total         │  
│  + exemplars         │  
└──────────────────────┘

Generated Metrics:

  • duration_milliseconds_sum: Total duration
  • duration_milliseconds_count: Number of calls
  • calls_total: Total call count
  • Dimensions: service, operation, status_code
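
Conceptually, the connector folds every finished span into per-dimension counters and duration aggregates; a toy Python reduction (field names are illustrative):

from collections import defaultdict

spans = [
    {"service": "web", "operation": "/api/checkout", "status": "OK", "duration_ms": 245},
    {"service": "web", "operation": "/api/checkout", "status": "ERROR", "duration_ms": 512},
]

calls_total = defaultdict(int)
duration_sum = defaultdict(float)

for span in spans:
    key = (span["service"], span["operation"], span["status"])
    calls_total[key] += 1                      # -> calls_total
    duration_sum[key] += span["duration_ms"]   # -> duration_sum

print(dict(calls_total), dict(duration_sum))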

Exemplars

Link specific trace examples to aggregated metrics.

Value Proposition:

Metric Alert: High latency on /api/checkout  
      │  
      ▼  
┌─────────────────────────────────────┐  
│  Metric: avg(duration) = 2.3s       │  
│  Exemplar: trace_id=abc123          │<─── Click to jump  
└─────────────────────────────────────┘  
      │  
      ▼  
┌─────────────────────────────────────┐  
│  Trace abc123                       │  
│  Shows exact slow request           │  
│  with all spans and context         │  
└─────────────────────────────────────┘

Enables:

  • Direct navigation from metrics to traces
  • Faster root cause analysis
  • Context-rich debugging

OpenTelemetry Operator

Kubernetes operator for managing OpenTelemetry Collectors.

Architecture

┌────────────────────────────────────────────────────────┐  
│                  Kubernetes Cluster                    │  
│                                                        │  
│  ┌──────────────────────────────────────────────────┐  │  
│  │         OpenTelemetry Operator                   │  │  
│  │                                                  │  │  
│  │  • Manages Collector lifecycle                   │  │  
│  │  • Auto-instrumentation injection                │  │  
│  │  • Configuration management                      │  │  
│  └─────────────────────┬────────────────────────────┘  │  
│                        │                               │  
│                        │ Creates/Manages               │  
│                        ▼                               │  
│  ┌──────────────────────────────────────────────────┐  │  
│  │      OpenTelemetry Collector (CRD)               │  │  
│  │                                                  │  │  
│  │  ┌───────────────────────────────────────────┐   │  │  
│  │  │  Target Allocator (optional)              │   │  │  
│  │  │  • Distributes scrape targets             │   │  │  
│  │  │  • ServiceMonitor discovery               │   │  │  
│  │  │  • Dynamic target allocation              │   │  │  
│  │  └───────────────┬───────────────────────────┘   │  │  
│  │                  │                               │  │  
│  │                  ▼                               │  │  
│  │  ┌──────────────────────────────────────────┐    │  │  
│  │  │  Collector Instances                     │    │  │  
│  │  │  (Deployment/DaemonSet/StatefulSet)      │    │  │  
│  │  └──────────────────────────────────────────┘    │  │  
│  └──────────────────────────────────────────────────┘  │  
└────────────────────────────────────────────────────────┘

Target Allocator

Distributes Prometheus scrape targets across multiple collector instances.

Benefits:

  • Even load distribution
  • Automatic target discovery
  • Scales with collector instances

How It Works:

Target Allocator  
      │  
      ├─── Discovers targets (ServiceMonitors, PodMonitors)  
      │  
      ├─── Assigns targets to collectors  
      │         │  
      │         ├─── Collector 1: [target-a, target-b]  
      │         ├─── Collector 2: [target-c, target-d]  
      │         └─── Collector 3: [target-e, target-f]  
      │  
      └─── Reassigns on collector scale events

[Diagram: metric collection with the Target Allocator]

Sampling Strategies

Sampling reduces data volume while maintaining observability.

Types of Sampling

1. Head-Based Sampling

Decision made at trace start (root span) and propagated to all spans in the trace.

Strategies:

  • Always On: Sample everything (100%)
  • Always Off: Sample nothing (0%)
  • Trace ID Ratio: Sample X% based on trace ID hash
  • Rate Limiting: Sample N traces per second

Pros: Simple, predictable data volume

Cons: May miss important traces

┌──────────────┐  
│  Root Span   │──▶ Sample? (Random 10%)  
└──────┬───────┘  
       │  
       ├─────▶ Span A  │  
       ├─────▶ Span B  │──▶ All kept or all dropped  
       └─────▶ Span C  │
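
With the Python SDK, a 10% head-based sampler that respects the parent's decision might look like this:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample ~10% of new traces; child spans follow the root's decision
sampler = ParentBased(TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))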

2. Tail-Based Sampling

Decision made after trace completion.

Criteria:

  • Error status
  • Latency threshold
  • Specific attributes (e.g., user_id)

Pros: Captures important traces (errors, slow requests)

Cons: Requires buffering, more complex

Complete Trace  
      │  
      ▼  
┌─────────────────┐  
│  Evaluation     │  
│  • Duration >5s │──▶ Keep  
│  • Has errors   │──▶ Keep  
│  • Random 1%    │──▶ Maybe keep  
└─────────────────┘

3. Probabilistic Sampling

Samples in the collector pipeline based on a hash of the trace ID, so spans of the same trace receive a consistent decision without buffering complete traces.

Use Case: High-throughput systems where tail-based sampling is impractical.

Sampling Configuration Example

processors:  
  probabilistic_sampler:  
    sampling_percentage: 10  
  
  tail_sampling:  
    policies:  
      - name: errors  
        type: status_code  
        status_code: {status_codes: [ERROR]}  
      - name: slow  
        type: latency  
        latency: {threshold_ms: 5000}  
      - name: random  
        type: probabilistic  
        probabilistic: {sampling_percentage: 1}

Deployment Patterns

1. Agent Pattern

Collector runs alongside application (sidecar or DaemonSet).

┌─────────────────────────────┐  
│         Node/Pod            │  
│                             │  
│  ┌──────────┐               │  
│  │   App    │               │  
│  └────┬─────┘               │  
│       │ localhost:4317      │  
│       ▼                     │  
│  ┌──────────┐               │  
│  │Collector │               │  
│  │ (Agent)  │               │  
│  └────┬─────┘               │  
└───────┼─────────────────────┘  
        │  
        ▼  
   Backend

Pros:

  • Low latency
  • Simplified application configuration
  • Resource isolation

Cons:

  • Resource overhead per node/pod
  • Harder to centralize configuration

2. Gateway Pattern

Centralized collector cluster.

┌───────┐  ┌───────┐  ┌───────┐  
│ App 1 │  │ App 2 │  │ App 3 │  
└───┬───┘  └───┬───┘  └───┬───┘  
    │          │          │  
    └──────────┼──────────┘  
               │  
               ▼  
    ┌──────────────────┐  
    │   Collector      │  
    │   Gateway        │  
    │   (Cluster)      │  
    └──────────┬───────┘  
               │  
               ▼  
            Backend

Pros:

  • Centralized processing and configuration
  • Reduced resource usage per application
  • Easy to scale independently

Cons:

  • Additional network hop
  • Single point of failure (mitigated by clustering)

3. Hybrid Pattern

Combines agent and gateway patterns.

┌─────────────────┐  
│  App + Agent    │  
└────────┬────────┘  
         │ (lightweight)  
         ▼  
┌─────────────────┐  
│   Gateway       │  
│   (heavy        │  
│   processing)   │  
└────────┬────────┘  
         │  
         ▼  
      Backend

Use Case:

  • Agents handle basic batching
  • Gateway performs expensive processing (tail sampling, enrichment)

Context Propagation

Context propagation is the mechanism that allows trace information to flow across service boundaries, enabling distributed tracing in microservices architectures.

The Problem: Tracking Requests Across Services

Imagine a user request that flows through multiple services:

User Request: "Get Order #12345"  
    │  
    ├──▶ API Gateway (generates span)  
    │       │  
    │       └──▶ Order Service (generates span)  
    │               │  
    │               ├──▶ Payment Service (generates span)  
    │               └──▶ Inventory Service (generates span)

Without context propagation:

  • Each service creates independent, disconnected spans
  • You can’t connect spans together to see the full request flow
  • No way to know which spans belong to the same user request

With context propagation:

  • All spans are linked by a common trace_id
  • You can reconstruct the entire request journey
  • End-to-end visibility across all services

What is Trace Context?

Trace Context is metadata that gets passed between services to maintain tracing continuity. It contains:

  • trace_id: Unique identifier for the entire request (stays the same across all services)
  • span_id: Unique identifier for the current operation (changes at each service)
  • trace_flags: Sampling decisions and other flags

Think of it like a package delivery:

  • trace_id = Tracking number (same for the entire journey)
  • span_id = Each checkpoint’s receipt ID (different at each location)
  • The tracking number connects all checkpoints together

How It Works: Step by Step

1. User makes request  
   ↓  
2. Service A (API Gateway)  
   • Creates NEW trace_id: "abc123"  
   • Creates span_id: "span-001"  
   • Processes request  
   • Calls Service B  
   • Attaches trace_id + span_id to HTTP request  
   ↓  
3. Service B (Order Service) receives request  
   • Extracts trace_id: "abc123" (keeps the same!)  
   • Extracts parent_span_id: "span-001" (for linking)  
   • Creates NEW span_id: "span-002"  
   • Processes request  
   • Calls Service C  
   • Attaches trace_id + NEW span_id to HTTP request  
   ↓  
4. Service C (Payment Service) receives request  
   • Extracts trace_id: "abc123" (still the same!)  
   • Extracts parent_span_id: "span-002"  
   • Creates NEW span_id: "span-003"  
   • Processes request

Result: All spans share trace_id: "abc123" and can be visualized as a connected trace:

Trace: abc123  
│  
├─ Span: span-001 (API Gateway)    [200ms]  
│  │  
│  ├─ Span: span-002 (Order Service)    [150ms]  
│  │  │  
│  │  ├─ Span: span-003 (Payment Service)    [80ms]  
│  │  └─ Span: span-004 (Inventory Service)  [40ms]

How is Context Transmitted?

Context is transmitted via HTTP headers (or equivalent for other protocols like gRPC, message queues).

Example HTTP Request:

GET /api/orders/12345 HTTP/1.1  
Host: order-service.example.com  
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01  
tracestate: vendor1=value1,vendor2=value2

The receiving service reads these headers, extracts the trace context, and creates its own span within the same trace.
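
In the Python SDK this exchange is handled by the configured propagator; a minimal sketch of both sides (the HTTP call itself is elided):

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer(__name__)

# Sending side: serialize the current span context into headers
with tracer.start_as_current_span("client_call"):
    headers = {}
    inject(headers)  # adds the traceparent (and tracestate) header

# Receiving side: continue the same trace from the incoming headers
ctx = extract(headers)
with tracer.start_as_current_span("server_handle", context=ctx):
    pass  # this span's trace_id matches the client's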

W3C Trace Context Standard

W3C Trace Context is the standardized format for transmitting trace context across services. It defines exactly how trace information should be encoded in HTTP headers.

The traceparent Header

This is the required header that contains core trace context.

Format:

traceparent: VERSION-TRACE_ID-PARENT_SPAN_ID-TRACE_FLAGS

Real Example:

traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

Breaking it down piece by piece:

00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01  
│   │                                │                │  
│   │                                │                └─── [4] Trace Flags  
│   │                                └──────────────────── [3] Parent Span ID  
│   └───────────────────────────────────────────────────── [2] Trace ID  
└───────────────────────────────────────────────────────── [1] Version

Component Details:

  • Version (00):

    • Format version of W3C Trace Context
    • Currently 00 (version 0)
  • Trace ID (0af7651916cd43dd8448eb211c80319c):

    • 32 hex characters = 128 bits
    • Uniquely identifies the entire trace across all services
    • Never changes as the request flows through services
    • Generated once by the first service
  • Parent Span ID (b7ad6b7169203331):

    • 16 hex characters = 64 bits
    • Identifies the span that made this request (the parent)
    • Changes at each service hop
    • Used to build the parent-child relationship in the trace tree
  • Trace Flags (01):

    • 2 hex characters = 8 bits
    • Bit flags for sampling decisions
    • 01 = sampled (trace is being recorded)
    • 00 = not sampled (trace is ignored)
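
The fixed layout makes the header trivial to parse; a quick Python illustration:

header = "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"

version, trace_id, parent_span_id, flags = header.split("-")
assert len(trace_id) == 32 and len(parent_span_id) == 16
sampled = int(flags, 16) & 0x01  # lowest bit = sampled flag

print(version, trace_id, parent_span_id, bool(sampled))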

Visual Example of Header Propagation

┌─────────────────────────────────────────────────────────────────┐  
│                     Service A (API Gateway)                     │  
│                                                                 │  
│  1. Receives request with NO traceparent header                 │  
│  2. Creates NEW trace:                                          │  
│     • trace_id = abc123...                                      │  
│     • span_id  = 111111...                                      │  
│  3. Makes HTTP call to Service B with header:                   │  
│                                                                 │  
│     traceparent: 00-abc123...-111111...-01                      │  
│                      │         │         │                      │  
│                      │         │         └─ Sampled             │  
│                      │         └─────────── Parent span         │  
│                      └─────────────────── Same trace ID         │  
└─────────────────────────────────────────────────────────────────┘  
                               │  
                               ▼  
┌─────────────────────────────────────────────────────────────────┐  
│                     Service B (Order Service)                   │  
│                                                                 │  
│  1. Receives header:                                            │  
│     traceparent: 00-abc123...-111111...-01                      │  
│                                                                 │  
│  2. Extracts context:                                           │  
│     • trace_id = abc123... (KEEP THIS!)                         │  
│     • parent_span_id = 111111... (for linking)                  │  
│                                                                 │  
│  3. Creates NEW span_id = 222222...                             │  
│                                                                 │  
│  4. Makes HTTP call to Service C with header:                   │  
│     traceparent: 00-abc123...-222222...-01                      │  
│                      │         │         │                      │  
│                      │         │         └─ Sampled             │  
│                      │         └─────────── NEW parent span     │  
│                      └─────────────────── SAME trace ID         │  
└─────────────────────────────────────────────────────────────────┘  
                               │  
                               ▼  
┌─────────────────────────────────────────────────────────────────┐  
│                   Service C (Payment Service)                   │  
│                                                                 │  
│  1. Receives header:                                            │  
│     traceparent: 00-abc123...-222222...-01                      │  
│                                                                 │  
│  2. Extracts context:                                           │  
│     • trace_id = abc123... (SAME trace!)                        │  
│     • parent_span_id = 222222...                                │  
│                                                                 │  
│  3. Creates NEW span_id = 333333...                             │  
│                                                                 │  
│  4. Processes payment (no further calls)                        │  
└─────────────────────────────────────────────────────────────────┘

Key Insight:

  • trace_id never changes → All spans belong to the same trace
  • span_id changes at each hop → Creates parent-child relationships

The tracestate Header (Optional)

The tracestate header allows vendors to add their own proprietary information without breaking the standard.

Format:

tracestate: key1=value1,key2=value2

Example:

tracestate: datadog=s:2;o:rum,congo=t61rcWkgMzE

Use Cases:

  • Vendor-specific sampling decisions
  • Additional vendor context
  • A/B testing flags
  • Regional routing information

Real-World Example

Let’s trace an actual user request with real headers:

User Action: Order a pizza

1. Frontend (Browser) → API Gateway

POST /api/orders HTTP/1.1  
Host: api.pizza.com  
Content-Type: application/json  
  
{"pizza": "Margherita", "size": "Large"}

API Gateway generates:

trace_id: 4bf92f3577b34da6a3ce929d0e0e4736  
span_id:  00f067aa0ba902b7

2. API Gateway → Order Service

POST /orders HTTP/1.1  
Host: order-service.internal  
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01  
  
{"pizza": "Margherita", "size": "Large"}

Order Service extracts trace_id: 4bf92f3577b34da6a3ce929d0e0e4736

Order Service creates span_id: 1234567890abcdef

3. Order Service → Payment Service

POST /payments HTTP/1.1  
Host: payment-service.internal  
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-1234567890abcdef-01  
  
{"amount": 12.99, "order_id": "789"}

Payment Service extracts trace_id: 4bf92f3577b34da6a3ce929d0e0e4736

Payment Service creates span_id: fedcba0987654321

4. Result in Observability Backend:

Trace: 4bf92f3577b34da6a3ce929d0e0e4736  
│  
├─ [API Gateway]     span_id: 00f067aa0ba902b7    Duration: 523ms  
│  │  
│  └─ [Order Service]  span_id: 1234567890abcdef  Duration: 450ms  
│     │  
│     └─ [Payment Service] span_id: fedcba0987654321 Duration: 230ms

Why This Matters

Without W3C Trace Context:

  • Every vendor had their own format (X-B3-TraceId, X-Trace-Id, etc.)
  • Services instrumented with different vendors couldn’t propagate traces
  • Breaking compatibility when switching observability tools

With W3C Trace Context:

  • Standardized format everyone implements
  • Works across vendors (Datadog, Dynatrace, Jaeger, etc.)
  • Future-proof as you switch tools
  • Interoperability in polyglot architectures

Summary:

  • Trace Context = Metadata identifying which trace a span belongs to
  • Context Propagation = Passing that metadata between services
  • W3C Trace Context = The standardized format (via traceparent header)
  • Purpose = Connect all spans in a distributed request into a single cohesive trace

Testing and Telemetry Generation

Manual Trace Generation with telemetrygen

# Generate traces with mTLS  
telemetrygen traces \  
  --otlp-endpoint "collector.example.com:4317" \  
  --service "test-service" \  
  --duration 1m \  
  --rate 1 \  
  --client-cert "client.chain.pem" \  
  --client-key "client-key.pem" \  
  --ca-cert "trusted-root.pem" \  
  --mtls  
  
# Generate metrics  
telemetrygen metrics \  
  --otlp-endpoint "localhost:4317" \  
  --duration 30s \  
  --rate 10  
  
# Generate logs  
telemetrygen logs \  
  --otlp-endpoint "localhost:4317" \  
  --duration 1m \  
  --rate 5

Use Cases:

  • Collector testing
  • Load testing observability pipelines
  • Validating configurations
  • Demo and training

Additional Important Concepts

Data Transformation

Transform telemetry data in-flight using processors.

Examples:

  • Redacting sensitive data (PII, credentials)
  • Adding resource attributes (cluster, region)
  • Normalizing attribute names
  • Converting units

Transform Processor Example:

processors:  
  transform:  
    trace_statements:  
      - context: span  
        statements:  
          - set(attributes["environment"], "production")  
          - delete_key(attributes, "password")  
          - replace_pattern(name, "/user/\\d+", "/user/{id}")

Resource Detection

Automatically detect and add resource attributes.

Detectors:

  • env: From environment variables
  • ec2: AWS EC2 metadata
  • gcp: Google Cloud metadata
  • kubernetes: Kubernetes pod/node info
  • docker: Docker container info

Example:

processors:  
  resourcedetection:  
    detectors: [env, kubernetes, gcp]  
    timeout: 5s

High Availability

Strategies:

  • Run multiple collector instances
  • Use load balancers
  • Implement health checks
  • Configure retry logic
  • Set up persistent queues

Scrape Jobs

Scrape Job         | Source Component            | Metrics Focus                            | Deployed As | One Per | Scraped From
-------------------|-----------------------------|------------------------------------------|-------------|---------|-------------------------------------------
kube-state-metrics | Kubernetes API via exporter | Cluster object states                    | Deployment  | Cluster | kube-state-metrics.kube-system.svc:8080
kubelet            | Kubelet on each node        | Node (k8s-specific) & pod resource usage | Built-in    | Node    | https://:10250/metrics
cadvisor           | Embedded in Kubelet         | Container-level resource usage           | Embedded    | Node    | https://:10250/metrics/cadvisor
node_exporter      | Node-level agent            | Generic host OS metrics                  | DaemonSet   | Node    | http://:9100/metrics
envoy-stats        | Envoy proxy sidecar         | Service mesh traffic stats               | Sidecar     | Pod     | 127.0.0.1:15000/stats/prometheus (in pod)
istiod             | Istio control plane         | Mesh config & control plane              | Deployment  | Cluster | istiod.istio-system.svc/metrics
istio-ingress      | Istio ingress gateway       | External traffic observability           | Deployment  | Cluster | :15090/stats/prometheus (on ingress pod)

To set up Zipkin traces in the OpenTelemetry Collector (OTel Collector):

  • Add the Zipkin receiver in the OTel Collector config:

receivers:  
  zipkin:  
    endpoint: 0.0.0.0:9411

  • Expose the Zipkin port (usually 9411) on the OTel Collector (agent):

    • If running as a container: expose 9411/tcp
    • If running in Kubernetes: expose port 9411 in the Service and containerPort in the Pod

  • Point your application at the OTel Collector's Zipkin endpoint:

ZIPKIN_ENDPOINT=http://<otel-agent-service>:9411/api/v2/spans

  • Ensure your app uses a Zipkin-compatible exporter (see the sketch below).
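
For example, a Python service can emit Zipkin-format spans with the opentelemetry-exporter-zipkin-json package (the service URL is illustrative):

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.zipkin.json import ZipkinExporter

# Send Zipkin-format spans to the collector's Zipkin receiver
exporter = ZipkinExporter(endpoint="http://otel-agent-service:9411/api/v2/spans")

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)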

Best Practices Summary

  • Start with auto-instrumentation, add manual instrumentation for business logic
  • Use the Collector for production deployments
  • Implement sampling for high-throughput systems
  • Enable exemplars to bridge metrics and traces
  • Propagate context correctly across all services
  • Monitor the Collector itself (meta-monitoring)
  • Use semantic conventions for attribute naming
  • Tune batch processing based on throughput
  • Secure OTLP endpoints with mTLS in production
  • Test configurations with telemetrygen before deploying
