There are many potential data sources for observing applications or infrastructure. But for most observability use cases, three types of data matter the most: logs, metrics and traces.
These data types play such a key role in cloud-native observability workflows that they're known as the three pillars of observability. Each pillar provides a different perspective of an organization's resources. When these data sources are combined and analyzed, the organization gains a holistic understanding of what's happening within its complex application environments.
What are logs?
Logs are files that record events, warnings and errors as they occur within a software environment. Most logs include contextual information, such as the time an event occurred and which user or endpoint was associated with it.
For example, a log file for a web server might include when the server started, requests from clients and how the server responded to those requests. It records information about each successful transaction as well as errors such as failed connections to clients.
Errors and warnings are sometimes recorded in separate log files, but all types of logging data can be recorded in a single file. For observability purposes, it doesn't matter how logs are organized, as most observability tools aggregate data from multiple log files and analyze it collectively.
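As a rough illustration of the web server example above, the sketch below uses Python's standard logging module to write one JSON object per event, each with a timestamp and some context. The logger name, the `client` and `status` fields, and the IP addresses are hypothetical and chosen only for this example. They are not taken from any particular web server.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Hypothetical context fields, supplied through the `extra` argument.
            "client": getattr(record, "client", None),
            "status": getattr(record, "status", None),
        }
        return json.dumps(entry)


def build_logger(stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("webserver")  # assumed logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


# An in-memory stream stands in for a log file in this sketch.
stream = io.StringIO()
log = build_logger(stream)

log.info("server started")
log.info("GET /index.html", extra={"client": "203.0.113.7", "status": 200})
log.error("connection failed", extra={"client": "203.0.113.9"})

# Parse the lines back, as a log aggregation tool might.
records = [json.loads(line) for line in stream.getvalue().splitlines()]
errors = [r for r in records if r["level"] == "ERROR"]
```

Writing events in a structured form like this is one way to make it easier for a tool to aggregate entries from several log files and filter them by level, client or time.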
Benefits and limitations of event logs
Logs are a pillar of observability because they provide a comprehensive record of all events and errors that take place during the lifecycle of software resources. If you want to know when a problem occurred, or which events or trends correlate with it, logs are an excellent source of visibility.
However, logs have important limitations. One of the biggest is that they contain only the events, warnings and errors the logging software has been configured to record. If a piece of information isn't covered by your logging settings, it won't appear in your log files.
Another challenge with logs from an observability perspective is that log data isn't always persistent. For instance, logs that a containerized application writes inside its container are typically lost when the container is removed, unless they have been copied elsewhere first. Engineers can address this issue by moving the log data to external storage while the container is still running, but there is still a risk that some log files will be overlooked or lost.
What are metrics?
Metrics are quantifiable measurements that reflect the health and performance of applications or infrastructure. For example, application metrics might track how many transactions the application handles per second, while infrastructure metrics measure how much CPU or memory is consumed on a server.
There are many possible types of metrics that can be tracked. Two popular methods of defining metrics are Weaveworks' RED Method, which focuses on rates, errors and request duration; and Google's Golden Signals method, which measures latency, traffic, errors and saturation.
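To make the RED Method more concrete, here is a minimal sketch that derives a request rate, an error ratio and a duration percentile from a batch of request samples. The `RequestSample` type, the `red_summary` function and the choice of a nearest-rank 95th percentile are assumptions made for illustration. Real monitoring systems usually compute these values from counters and histograms rather than from raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One handled request (hypothetical shape for this example)."""
    duration_ms: float
    ok: bool


def red_summary(samples, window_seconds):
    """Summarize rate, errors and duration over one time window."""
    total = len(samples)
    if total == 0 or window_seconds <= 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}

    errors = sum(1 for s in samples if not s.ok)
    durations = sorted(s.duration_ms for s in samples)
    # Nearest-rank percentile: the smallest value with at least 95%
    # of the samples at or below it.
    rank = math.ceil(0.95 * total)

    return {
        "rate_per_s": total / window_seconds,   # Rate
        "error_ratio": errors / total,          # Errors
        "p95_ms": durations[rank - 1],          # Duration
    }


# Ten requests observed in a five-second window; two of them failed.
samples = [
    RequestSample(duration_ms=10.0 * (i + 1), ok=(i not in (3, 7)))
    for i in range(10)
]
summary = red_summary(samples, window_seconds=5)
```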
Benefits and limitations of metrics
The main benefit of metrics is that they provide real-time insight into the state of resources. If you want to know how responsive your application is or identify anomalies that could be early signs of a performance issue, metrics are a key source of visibility.
By correlating metrics with data from logs and traces, organizations gain the fullest possible context on system performance or potential availability issues. This is why metrics are particularly important for observability.
However, like logs, metrics only keep track of the application and infrastructure data they were designed to record. In addition, metrics aren't typically useful for pinpointing the source of a problem, especially in a complex distributed system. For example, metrics data might indicate that your application is experiencing a high rate of errors, but aggregate metrics usually aren't granular or detailed enough to identify exactly which service within a microservices architecture is triggering them. They show that the application is experiencing errors, not where those errors originate.
What are distributed traces?
A distributed trace is data that tracks an application request as it flows through the various parts of an application. The trace records how long it takes each application component to process the request and pass the result to the next component. Traces can also identify which parts of the application trigger an error.
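The sketch below models a trace as a list of spans, one per component that handled the request, to show how timing and error data can point to a specific service. The `Span` fields, the service names and the timings are hypothetical, and the self-time calculation assumes that a span's children run one after another without overlapping. Tracing systems such as OpenTelemetry use a richer data model than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One component's share of a traced request (illustrative shape)."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float
    error: bool = False

    @property
    def duration_ms(self):
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes the children do not overlap in time.
    """
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def error_origins(spans):
    """Spans that failed without any failing child beneath them."""
    failing_parents = {s.parent_id for s in spans if s.error}
    return [s for s in spans if s.error and s.span_id not in failing_parents]


# A single request passing through four made-up services.
trace = [
    Span("t1", "a", None, "gateway", 0, 120, error=True),
    Span("t1", "b", "a", "checkout", 10, 110, error=True),
    Span("t1", "c", "b", "payment", 20, 100, error=True),
    Span("t1", "d", "b", "inventory", 100, 108),
]

slowest = max(trace, key=lambda s: self_time_ms(s, trace))
origins = error_origins(trace)
```

Here the error is visible on the gateway, checkout and payment spans, but only the payment span has no failing child, which suggests the failure started there.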
Benefits and limitations of distributed traces
If you need to research the root cause of a problem, distributed traces are the most effective way to accomplish this. Although logs and metrics might help you know a problem exists, it's difficult to pinpoint the source of the problem in microservices environments without running traces.
The major limitation of distributed traces is that only a fraction of all application requests are traced in most cases. Running traces takes too much time and consumes too many resources to trace every request an application receives. This means you might not always have tracing data available when an error occurs.
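One common way to trace only a fraction of requests is to make the decision from the trace ID itself, so that every service handling the request reaches the same answer. The sketch below is an assumption about how such sampling could be done, not a description of any specific tracing product. The `should_sample` function, the use of SHA-256 and the 10% rate are all illustrative.

```python
import hashlib


def should_sample(trace_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether to record a trace.

    The trace ID is hashed into a 64-bit bucket, and the trace is kept
    when the bucket falls within the first `sample_rate` share of the
    range. The same ID always yields the same decision.
    """
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Made-up trace IDs; a real system would generate random identifiers.
trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = [t for t in trace_ids if should_sample(t, 0.10)]
```

With a 10% rate, roughly one in ten of these IDs is kept, which also means roughly nine in ten failing requests would leave no trace behind.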
In addition, because every application request can be unique, the data in one distributed trace doesn't necessarily enable you to troubleshoot problems related to other requests. The data associated with the requests, endpoints and client-side configurations is likely to vary between requests, so the extent to which you can extrapolate on the basis of one trace to draw conclusions about the application as a whole is limited.
How do logs, metrics and traces work together?
As noted earlier, logs, metrics and traces each provide a valuable, but limited, level of visibility into software environments. However, when you combine these sources, you get a relatively complete picture of what's happening in an environment.
For instance, you might notice from continuous metrics tracking that the application's response times are increasing, which could indicate a performance issue. But before assuming there's a problem, you'd want to look at the application's logs to check whether the slower responses can be explained by a benign change, such as the app handling more complex transactions than it normally does. If you determine that the performance degradation reflects a real problem, you could then use distributed trace data to identify which specific microservice is causing it.
Criticisms of the three pillars
Although analyzing logs, metrics and traces together enables engineers to gain a broad understanding of the state of an environment, teams should not limit themselves to these three data sources alone. Additional data sources can give observability workflows context that logs, metrics and traces lack.
It can be useful, for example, to contextualize logs, metrics and traces with data from a CI/CD pipeline to help you determine which application update or redeployment correlates with a performance degradation. Likewise, business metrics, such as customer retention rates, could be correlated with technical observability metrics to help gauge the effects of technical problems on business performance.
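As a small sketch of the CI/CD example, the code below looks up the most recent deployment that preceded the start of a performance degradation. The deployment records, version labels and timestamps are invented for illustration. In practice they would come from a pipeline's deployment history and from a metrics alert.

```python
from bisect import bisect_right
from datetime import datetime


def last_deploy_before(deploys, degraded_at):
    """Return the latest (time, version) pair at or before `degraded_at`.

    `deploys` must be sorted by time. Returns None if no deployment
    happened before the degradation began.
    """
    times = [deployed_at for deployed_at, _ in deploys]
    index = bisect_right(times, degraded_at)
    return deploys[index - 1] if index else None


# Hypothetical deployment history from a CI/CD pipeline.
deploys = [
    (datetime(2024, 5, 1, 9, 0), "v1.4.0"),
    (datetime(2024, 5, 2, 14, 30), "v1.4.1"),
    (datetime(2024, 5, 3, 11, 15), "v1.5.0"),
]

# Hypothetical time at which a latency metric began to degrade.
degraded_at = datetime(2024, 5, 2, 15, 5)
suspect = last_deploy_before(deploys, degraded_at)
```

A match like this shows correlation only. The deployment is a starting point for investigation, not proof of cause.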
If you want to observe cloud-native environments, start by collecting and analyzing logs, metrics and traces. These aren't the only potential sources of observability, but they are the most important ones, which is what makes them the three pillars of observability.