Distributed tracing is a method for IT and DevOps teams to monitor applications and discover where failures can occur within a system. It helps teams maintain distributed applications and microservice architectures and keep them running as smoothly as possible.
Distributed tracing is similar to application performance monitoring -- both work to monitor and manage applications -- but distributed tracing requires code-level support because it collects data about requests as they move between services. Developers must instrument an application's code to give admins the information they need to analyze the application's performance and debug issues.
How distributed tracing works
Distributed tracing works with traces and spans. A trace represents the complete path of a request, and each trace is made up of spans. A span is a tagged time interval that represents the activity within one of the system's individual components or services. IT admins can determine a problem's source by assessing each span within a trace.
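The relationship between traces and spans can be sketched in a few lines of Python. This is a minimal illustration rather than any tool's actual data model: the Span and Trace classes, their field names and the sample request are all hypothetical. It also assumes child spans run one after another, so a span's own time is its duration minus the durations of its direct children.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    """One tagged time interval of work inside a single service (hypothetical model)."""
    span_id: str
    service: str
    operation: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None
    tags: Dict[str, object] = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


@dataclass
class Trace:
    """All spans recorded for one end-to-end request."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    def self_time_ms(self, span: Span) -> float:
        # Time spent in the span itself, excluding its direct children.
        # Assumes children do not overlap in time.
        children = [s for s in self.spans if s.parent_id == span.span_id]
        return span.duration_ms - sum(c.duration_ms for c in children)

    def slowest_by_self_time(self) -> Span:
        return max(self.spans, key=self.self_time_ms)

    def failed_spans(self) -> List[Span]:
        return [s for s in self.spans if s.tags.get("error")]


# Hypothetical request path: gateway -> orders -> payments.
trace = Trace("t1", [
    Span("a", "gateway", "GET /checkout", 0, 250),
    Span("b", "orders", "create_order", 10, 240, parent_id="a"),
    Span("c", "payments", "charge_card", 40, 230, parent_id="b", tags={"error": True}),
])

culprit = trace.slowest_by_self_time()
print(culprit.service, trace.self_time_ms(culprit))
```

Ranking by self time rather than total duration matters here because a parent span always lasts at least as long as its children, so the root span would otherwise look like the slowest step in every trace.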
As applications grow in size with the addition of new technologies, such as containers, cloud and serverless, new points of failure are introduced, which adds more pressure for IT admins to solve problems as quickly as possible.
In addition, microservices might bring advantages for DevOps teams, but they reduce system visibility, and IT teams can lose sight of the big picture as the scope spreads across microservices, teams and functions. IT teams could spend countless hours looking for issues in the wrong place without proper guidance.
Distributed tracing provides a broad overview of an application and pinpoints where communication between microservices runs into errors. It tracks and logs each request as it crosses the services in an IT infrastructure. For example, with distributed tracing, system architects can visualize the performance of each iteration of a function. This way, IT teams can pinpoint exactly which function instance is causing latency and address the problem.
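To make the idea of pinpointing a slow function instance concrete, the sketch below groups span durations by instance and reports the one with the highest mean latency. The function name, instance IDs and durations are invented sample data, and a real tracing back end would run this kind of aggregation over stored spans rather than an in-memory list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical span records: (function name, instance ID, duration in ms).
spans = [
    ("resize_image", "instance-1", 120),
    ("resize_image", "instance-2", 115),
    ("resize_image", "instance-3", 940),
    ("resize_image", "instance-1", 130),
    ("resize_image", "instance-3", 1010),
]


def mean_latency_by_instance(records):
    """Return the mean duration for each (function, instance) pair."""
    grouped = defaultdict(list)
    for function, instance, duration_ms in records:
        grouped[(function, instance)].append(duration_ms)
    return {key: mean(values) for key, values in grouped.items()}


def slowest_instance(records):
    """Return the (function, instance) pair with the highest mean duration."""
    averages = mean_latency_by_instance(records)
    return max(averages, key=averages.get)


print(slowest_instance(spans))
```

The mean is used here for brevity. Percentiles such as p95 or p99 are often a better fit for latency data because a few slow requests can hide behind a healthy-looking average.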
Distributed tracing tools
Organizations use distributed tracing tools because they can help accurately find issues and reveal service dependencies. Jaeger, Prometheus, VictoriaMetrics, Cortex and Zipkin are all distributed tracing tools that present data collected from individual applications.
This list is not ranked.
Jaeger is an open source system that focuses on monitoring and troubleshooting microservice-based distributed systems. Jaeger includes features for distributed transaction monitoring, performance and latency optimization, root cause analysis, service dependency analysis, and distributed context propagation.
Inspired by Dapper and OpenZipkin, the project provides an OpenTracing-style data model, adaptive sampling, system topology graphs and service performance monitoring. OpenTracing is a method to trace transactions with vendor-neutral APIs and instrumentation. It helps ensure all the tracers within one system can coexist.
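Sampling decides which traces are kept, because recording every request is often too expensive. The sketch below shows fixed-probability sampling keyed on the trace ID, which is the kind of baseline that an adaptive sampler adjusts over time. It is not Jaeger's adaptive sampling algorithm. The function names and the 64-bit trace ID width are assumptions made for illustration.

```python
import secrets

MAX_TRACE_ID = 2 ** 64  # assumption: trace IDs are 64-bit unsigned integers


def new_trace_id() -> int:
    """Generate a random 64-bit trace ID."""
    return secrets.randbits(64)


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when its ID falls below rate * 2**64.

    The decision depends only on the trace ID, so every service that sees
    the same ID and uses the same rate reaches the same answer.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return trace_id < int(rate * MAX_TRACE_ID)


# Roughly 10% of randomly generated IDs should be kept at rate=0.1.
kept = sum(should_sample(new_trace_id(), 0.1) for _ in range(10_000))
print(f"kept {kept} of 10000 traces")
```

An adaptive scheme would replace the constant rate with a value that is recalculated from observed traffic, for example lowering the rate for busy endpoints and raising it for rarely called ones.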
Jaeger's back end, web UI and instrumentation libraries all support OpenTracing standards. In addition, Jaeger version 1.35 can receive data from OpenTelemetry SDKs in their native OpenTelemetry Protocol -- but the internal data representation and UI follow the OpenTracing specification model.
Jaeger's back-end components expose Prometheus metrics by default for observability. All back-end components are implemented in Go. However, Jaeger supports instrumentation only for applications written in C#, Java, Node.js, Python and Go.
For storage, Jaeger provides multiple back ends by default, including Cassandra, Elasticsearch and in-memory storage, but it can also run with other databases. At the time of publication, Jaeger plans to include a post-collection data processing pipeline in a future version update.
Jaeger can be downloaded for free.
Prometheus is an open source service that collects and stores metrics as time-series data. It retrieves data via an HTTP pull model and stores it in a time-series database. It does not rely on distributed storage, and each single server node is autonomous. Prometheus finds targets with service discovery or static configuration.
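In the pull model, the monitored application exposes its current metric values over HTTP, and the Prometheus server fetches them on a schedule. The sketch below uses only the Python standard library to render a counter in the Prometheus text exposition format and serve it at /metrics. The metric name, the sample counts and port 8000 are assumptions for illustration. Production services would normally use an official Prometheus client library instead of formatting the text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process request counters, keyed by HTTP method.
REQUEST_COUNTS = {"GET": 42, "POST": 7}


def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for method, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values at /metrics for a scraper to pull."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and serve metrics until interrupted (not called automatically)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()


print(render_metrics(REQUEST_COUNTS), end="")
```

Calling serve() starts the endpoint. A Prometheus server configured with this host and port as a static target would then collect the values at each scrape interval.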
Each Prometheus server stands alone. It's not dependent on network storage or other remote services, which enables it to diagnose an issue in the event of an outage.
Prometheus can record numeric time-series data and supports cross-platform data collection and querying. In addition, it provides integrations with more than 150 third-party systems, such as Splunk, Kafka, Thanos, Gnocchi and Wavefront.
Prometheus is a monitoring tool and is not a good choice to collect and analyze information such as billing data. This is because the data it collects will not be detailed enough and cannot provide 100% accuracy. In addition, its local storage has a 15-day retention period, so long-term storage requires external platforms or services.
Prometheus is free to download.
VictoriaMetrics processes high volumes of data and supports long-term data storage. It enables IT teams to build monitoring platforms without having to worry about scalability or operational burdens. In addition, it provides Grafana dashboards, as well as a single binary and Kubernetes operator for the clustered version.
IT teams can run VictoriaMetrics in Docker, in Kubernetes or on bare metal. It accepts metrics sent over the InfluxDB, Graphite and OpenTSDB protocols, as well as in CSV format.
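As an illustration of one of these ingestion formats, the sketch below formats a single data point in the InfluxDB line protocol, which has the shape measurement,tags fields timestamp. The to_line helper and the sample values are hypothetical. The sketch handles float fields only and escapes tag keys and values but assumes that measurement and field names need no escaping.

```python
def _escape_tag(value: str) -> str:
    """Escape commas, equals signs and spaces, which act as separators in tags."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Format one point as an InfluxDB line protocol line (float fields only)."""
    if not fields:
        raise ValueError("at least one field is required")
    tag_part = "".join(
        f",{_escape_tag(key)}={_escape_tag(value)}"
        for key, value in sorted(tags.items())
    )
    field_part = ",".join(f"{key}={float(value)}" for key, value in fields.items())
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


line = to_line(
    "cpu",
    {"host": "web 1", "region": "eu"},
    {"usage": 0.64},
    1700000000000000000,
)
print(line)
```

A client would send lines like this, separated by newlines, in the body of an HTTP request to the ingestion endpoint described in the VictoriaMetrics documentation.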
VictoriaMetrics has four tools: VictoriaMetrics, VictoriaMetrics Enterprise, Managed VictoriaMetrics and Monitoring of Monitoring (MoM).
The enterprise version provides more security options and limits access based on identity. It also includes enterprise support, downsampling to save disk space, and multi-tenant statistics to limit excess use and avoid system overload. Managed VictoriaMetrics enables organizations to run the system on AWS without the need to perform tasks such as configuration, updates or backups. Organizations can also use MoM and VictoriaMetrics to monitor system issues and send notifications if anything goes wrong.
VictoriaMetrics is free. For the Managed VictoriaMetrics service, solid-state drive storage is $0.002 per unit per 10 GB-hour; compute unit is $0.25 per unit per vCPU hour; and outward-flowing data is $0.09 per unit per GB of data transferred. For the enterprise version and MoM, VictoriaMetrics provides pricing upon request.
Cortex provides organizations visibility into their microservices' status and quality. The service helps development and engineering teams build software at scale, and ensures site reliability engineers, software engineers and developers are all on the same page with uniform data sets. It also reduces or eliminates application sprawl with one central dashboard.
Cortex provides a wide range of integration options, such as AWS, Azure DevOps, Datadog, Kubernetes and New Relic. Cortex also uses Jaeger for distributed tracing. IT teams must set up a Jaeger deployment to send traces with Cortex.
It also offers service templates to help IT teams implement microservices across the organization and enables admins to create customizable catalogs.
Cortex can run as a single binary for easier deployment or as multiple independent microservices for production use. It includes Cortex Query Language (CQL) -- along with the service catalog -- so that teams can receive automatic answers about a service's status and keep accurate documentation. CQL and the integrations join the collected data so that IT teams can find what they need.
Cortex provides pricing upon request.
Zipkin is an open source project based on Google's Dapper tool that helps capture timing data to troubleshoot latency problems in distributed systems. The system is implemented in Java and has an OpenTracing-compatible API. With Zipkin, IT teams can send, receive, store and visualize traces both within and between services.
Zipkin's architecture includes a collector that validates, stores and indexes trace data for lookups, and a reporter that transports spans and trace data from tracer libraries back to Zipkin. It has a web UI to view traces, and an API to query and extract traces. It also attaches a trace ID to each request to identify that request across services. In addition, Zipkin compares traces to determine which services or operations take longer than others.
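Trace IDs travel between services as request metadata. Zipkin-instrumented services commonly use the B3 HTTP headers for this, and the sketch below shows how a service could continue an incoming trace when it calls a downstream service. The helper names are hypothetical, and the code treats header names as case-sensitive dictionary keys for simplicity, whereas real HTTP headers are case-insensitive.

```python
import secrets
from typing import Dict, Optional


def new_id(bits: int) -> str:
    """Return a random lowercase hex ID of the given bit length."""
    return secrets.token_hex(bits // 8)


def inject(
    headers: Dict[str, str],
    trace_id: str,
    span_id: str,
    parent_span_id: Optional[str] = None,
    sampled: bool = True,
) -> Dict[str, str]:
    """Return a copy of the headers with B3 trace context added."""
    out = dict(headers)
    out["X-B3-TraceId"] = trace_id
    out["X-B3-SpanId"] = span_id
    if parent_span_id:
        out["X-B3-ParentSpanId"] = parent_span_id
    out["X-B3-Sampled"] = "1" if sampled else "0"
    return out


def continue_trace(incoming: Dict[str, str]) -> Dict[str, str]:
    """Build headers for a downstream call from an incoming request.

    The trace ID is reused, a new span ID is generated, and the incoming
    span becomes the parent. A new trace starts if no trace ID is present.
    """
    trace_id = incoming.get("X-B3-TraceId") or new_id(128)
    parent = incoming.get("X-B3-SpanId")
    sampled = incoming.get("X-B3-Sampled", "1") == "1"
    return inject({}, trace_id, new_id(64), parent, sampled)


first_hop = inject({}, new_id(128), new_id(64))
second_hop = continue_trace(first_hop)
print(first_hop["X-B3-TraceId"] == second_hop["X-B3-TraceId"])
```

Because every hop keeps the same trace ID and records its caller as the parent span, the back end can later reassemble the spans into one tree per request.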
Zipkin's built-in UI is a limited and self-contained web application. The UI provides dependency graphs that display how many trace requests went through each application to help find the exact error location.
To report trace data to Zipkin, IT admins must instrument applications and send the data over one of the supported transports: HTTP, Apache Kafka, Apache ActiveMQ or gRPC. For scalable back-end storage, Zipkin supports Cassandra and Elasticsearch.
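For the HTTP transport, a reporter sends batches of spans as a JSON list to Zipkin's v2 API. The sketch below builds such a batch with the standard library. The build_span, encode_batch and send helpers are hypothetical, the service and operation names are sample data, and the URL assumes a local Zipkin instance on its default port, 9411. In practice, a Zipkin tracer library would handle this step.

```python
import json
import secrets
import time
import urllib.request
from typing import Dict, List, Optional

# Assumption: a Zipkin instance is listening locally on its default port.
ZIPKIN_URL = "http://localhost:9411/api/v2/spans"


def build_span(
    service: str,
    name: str,
    trace_id: str,
    start_us: int,
    duration_us: int,
    parent_id: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Build one span in Zipkin's v2 JSON shape. Times are in microseconds."""
    span = {
        "traceId": trace_id,
        "id": secrets.token_hex(8),  # 64-bit span ID as 16 hex characters
        "name": name,
        "timestamp": start_us,
        "duration": duration_us,
        "kind": "SERVER",
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},  # tag values must be strings
    }
    if parent_id:
        span["parentId"] = parent_id
    return span


def encode_batch(spans: List[dict]) -> bytes:
    """Serialize a list of spans as the JSON body of a report request."""
    return json.dumps(spans).encode("utf-8")


def send(spans: List[dict], url: str = ZIPKIN_URL) -> int:
    """POST the batch to Zipkin and return the HTTP status (not called here)."""
    request = urllib.request.Request(
        url,
        data=encode_batch(spans),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


now_us = int(time.time() * 1_000_000)
batch = [build_span("orders", "create_order", secrets.token_hex(16), now_us, 35_000)]
print(encode_batch(batch).decode("utf-8"))
```

Calling send(batch) against a running instance would make the span appear in the Zipkin UI. The other transports carry the same span data through a message queue or gRPC instead of a direct HTTP request.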
To build and run a Zipkin instance, IT teams can use Java or Docker, or run from source.