The 3 pillars of observability: Logs, metrics and traces

Logs, metrics and traces offer their individual perspectives on system performance. When analyzed together, they provide a complete picture of your infrastructure.

By Chris Tozzi

Published: 07 Jun 2022

There are many potential data sources for observing applications or infrastructure. But for most observability use cases, three types of data matter the most: logs, metrics and traces.

These data types play such a key role in cloud-native observability workflows that they're known as the three pillars of observability. Each pillar provides a different perspective of an organization's resources. When these data sources are combined and analyzed, the organization gains a holistic understanding of what's happening within its complex application environments.

What are logs?

Logs are files that record events, warnings and errors as they occur within a software environment. Most logs include contextual information, such as the time an event occurred and which user or endpoint was associated with it.

For example, a log file for a web server might include when the server started, requests from clients and how the server responded to those requests. It records information about each successful transaction as well as errors such as failed connections to clients.
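As a minimal sketch of what such a web server log could look like, the Python snippet below uses the standard `logging` module to write timestamped entries for a startup event, a successful request and a failed connection. The logger name, message fields and IP addresses are illustrative assumptions, not the format of any particular web server.

```python
import io
import logging


def build_logger(stream):
    """Return a logger that writes timestamped, levelled lines to `stream`."""
    logger = logging.getLogger("webserver_sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # Each entry carries the time, the severity and the event details
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()  # stands in for a log file in this sketch
log = build_logger(buffer)

log.info("server started port=%d", 8080)
log.info("request client=%s path=%s status=%d", "203.0.113.7", "/index.html", 200)
log.error("connection failed client=%s reason=%s", "203.0.113.9", "timeout")

log_lines = buffer.getvalue().splitlines()
```

Each line in `log_lines` records when the event happened, how severe it was and which client was involved, which is the contextual information described above.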

Errors and warnings are sometimes recorded in separate log files, but all types of logging data can be recorded in a single file. For observability purposes, it doesn't matter how logs are organized, as most observability tools aggregate data from multiple log files and analyze it collectively.
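To illustrate the idea of aggregation, here is a rough Python sketch that merges two separate log sources into a single timeline. It assumes every line begins with an ISO 8601 UTC timestamp and that each source is already in time order; the log contents and the `merge_logs` helper are hypothetical, and real observability tools do considerably more than this.

```python
import heapq

# Hypothetical contents of two separate log files, each already time-ordered
access_log = [
    "2022-06-07T10:00:01Z INFO request client=203.0.113.7 status=200",
    "2022-06-07T10:00:05Z INFO request client=203.0.113.8 status=200",
]
error_log = [
    "2022-06-07T10:00:03Z ERROR connection failed client=203.0.113.9",
]


def merge_logs(*sources):
    """Merge already time-ordered log sources into one timeline."""
    # Timestamps in the same ISO 8601 UTC format sort correctly as plain strings
    return list(heapq.merge(*sources, key=lambda line: line.split(" ", 1)[0]))


timeline = merge_logs(access_log, error_log)
errors = [line for line in timeline if " ERROR " in line]
```

Once the entries share one timeline, the error at 10:00:03 can be read in the context of the requests immediately before and after it, regardless of which file it came from.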

Benefits and limitations of event logs

Logs are a pillar of observability because they provide a detailed, timestamped record of the events and errors that take place during the lifecycle of software resources. If you want to know when a problem occurred, or which events or trends correlate with it, logs are an excellent source of visibility.

However, logs can have important limitations. One of the biggest is that they record only the events, warnings and errors the logging software has been configured to record. Unless your logging tools and settings are configured to register certain information, it won't appear in your log files.

Another challenge with logs from an observability perspective is that log data isn't always persistent. For instance, logs stored inside a container typically disappear permanently once the container is shut down and removed. Engineers can address this issue by moving the log data somewhere else while the container is still running, but there is still a risk that some log files will be overlooked or lost.

What are metrics?

Metrics are quantifiable measurements that reflect the health and performance of applications or infrastructure. For example, application metrics might track how many transactions the application handles per second, while infrastructure metrics measure how many CPU or memory resources are consumed on a server.

Many types of metrics can be tracked. Two popular methods of defining metrics are Weaveworks' RED Method, which focuses on request rate, errors and duration, and Google's Golden Signals method, which measures latency, traffic, errors and saturation.
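As a rough illustration of the RED approach, the Python sketch below summarizes a window of requests as a request rate, an error ratio and a mean duration. The `Request` record, the `red_metrics` function and the sample values are assumptions made for this example; production systems usually report duration as percentiles or histograms rather than a simple mean.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # how long the request took to serve
    ok: bool            # False if the request ended in an error


def red_metrics(requests, window_seconds):
    """Summarize a window of requests as rate, error ratio and mean duration."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


# Four hypothetical requests observed over a two-second window
sample = [
    Request(120.0, True),
    Request(80.0, True),
    Request(400.0, False),
    Request(200.0, True),
]
summary = red_metrics(sample, window_seconds=2)
```

For this sample, `summary` works out to two requests per second, an error ratio of 0.25 and a mean duration of 200 ms.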

Benefits and limitations of metrics

The main benefit of metrics is that they provide real-time insight into the state of resources. If you want to know how responsive your application is or identify anomalies that could be early signs of a performance issue, metrics are a key source of visibility.

By correlating metrics with data from logs and traces, organizations gain the fullest possible context on system performance or potential availability issues. This is why metrics are particularly important for observability.

However, like logs, metrics track only the application and infrastructure data they were designed to record. In addition, metrics aren't typically useful for pinpointing the source of a problem, especially in a complex distributed system. For example, metrics data might indicate that your application is experiencing a high rate of errors, but metrics usually aren't granular or detailed enough to identify exactly which service within a microservices architecture is triggering them.

What are distributed traces?

A distributed trace is data that tracks an application request as it flows through the various parts of an application. The trace records how long it takes each application component to process the request and pass the result to the next component. Traces can also identify which parts of the application trigger an error.
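One way to picture a trace is as a set of timed spans that share a trace ID, one span per component that handled the request. The sketch below is a simplified Python model, not the data format of any specific tracing system; the `Span` fields, service names and durations are hypothetical, and the self-time calculation assumes child spans run one after another.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str           # shared by every span in the same request
    name: str               # component that did the work
    parent: Optional[str]   # name of the calling component, None for the root
    duration_ms: float
    error: bool = False


# One hypothetical request flowing through four components
trace = [
    Span("t-1", "gateway", None, 250.0),
    Span("t-1", "checkout", "gateway", 230.0),
    Span("t-1", "inventory", "checkout", 20.0),
    Span("t-1", "payment", "checkout", 190.0, error=True),
]


def self_time(span, spans):
    """Time spent in the span itself, excluding its direct children.

    Assumes children run sequentially, so their durations can be subtracted.
    """
    children = sum(s.duration_ms for s in spans if s.parent == span.name)
    return span.duration_ms - children


def failing_spans(spans):
    """Names of the components whose spans recorded an error."""
    return [s.name for s in spans if s.error]


slowest = max(trace, key=lambda s: self_time(s, trace))
failed = failing_spans(trace)
```

In this example the `payment` span both accounts for most of the request's time and is the one that recorded the error, which is the kind of per-component detail a trace is meant to expose.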

Benefits and limitations of distributed traces

If you need to find the root cause of a problem, distributed traces are often the most effective tool. Although logs and metrics might tell you that a problem exists, it's difficult to pinpoint its source in microservices environments without running traces.

The major limitation of distributed traces is that, in most cases, only a fraction of all application requests are traced. Tracing every request an application receives would take too much time and consume too many resources. This means you might not always have tracing data available when an error occurs.
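A common way to trace only a fraction of requests is probabilistic sampling keyed on the trace ID, so that every service makes the same keep-or-drop decision for a given request. The following Python sketch shows the idea under that assumption; the `should_sample` function, the 10% rate and the trace ID format are illustrative choices rather than the behavior of any particular tracing tool.

```python
import hashlib


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically keep roughly `rate` of traces, keyed on trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace IDs for 10,000 incoming requests
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = [t for t in trace_ids if should_sample(t, rate=0.1)]
```

With a rate of 0.1, roughly one request in 10 is traced. The trade-off is the one described above: a request that fails may well be among the nine that were not sampled.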

In addition, because every application request can be unique, the data in one distributed trace doesn't necessarily enable you to troubleshoot problems related to other requests. The data associated with the requests, endpoints and client-side configurations is likely to vary between requests, so the extent to which you can extrapolate on the basis of one trace to draw conclusions about the application as a whole is limited.

The three pillars of observability: logs, metrics and traces.

How do logs, metrics and traces work together?

As noted earlier, logs, metrics and traces each provide a valuable, but limited, level of visibility into software environments. However, when you combine these sources, you get a relatively complete picture of what's happening in an environment.

For instance, you might notice from continuous metrics tracking that application response times are increasing, which could indicate a performance issue. But before assuming there's a problem, you'd want to look at the application's logs to check whether the slower responses can be explained by a benign change, such as the app handling more complex transactions than it normally does. If you determine that the application performance degradation reflects a problem, you could then use distributed trace data to identify which specific microservice is triggering it.

Criticisms of the 3 pillars

Although analyzing logs, metrics and traces simultaneously enables engineers to gain a broad understanding of the state of an environment, teams should not limit themselves to these three data sources alone. The more data you have to inform observability workflows, the better.

It can be useful, for example, to contextualize logs, metrics and traces with data from a CI/CD pipeline to help you determine which application update or redeployment correlates with a performance degradation. Likewise, business metrics, such as customer retention rates, could be correlated with technical observability metrics to help gauge the effects of technical problems on business performance.
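As a simple sketch of the CI/CD example, the Python code below finds the first latency sample that crosses a threshold and then looks up the most recent deployment before that point. The deployment list, latency samples, threshold and helper names are all hypothetical, and a deployment that precedes a degradation is only a suspect, not proof of the cause.

```python
from bisect import bisect_right

# Hypothetical data, both lists sorted by timestamp (seconds since some epoch)
deployments = [(100, "v1.4.0"), (460, "v1.4.1"), (900, "v1.5.0")]
latency_p95_ms = [(300, 180.0), (420, 185.0), (480, 410.0), (600, 420.0)]


def first_breach(samples, threshold_ms):
    """Timestamp of the first sample above the threshold, or None."""
    for ts, value in samples:
        if value > threshold_ms:
            return ts
    return None


def last_deploy_before(deploys, ts):
    """Most recent (timestamp, version) deployed at or before `ts`, or None."""
    times = [t for t, _ in deploys]
    idx = bisect_right(times, ts)
    return deploys[idx - 1] if idx else None


breach_ts = first_breach(latency_p95_ms, threshold_ms=300.0)
suspect = last_deploy_before(deployments, breach_ts) if breach_ts is not None else None
```

Here latency first exceeds the threshold at timestamp 480, shortly after version `v1.4.1` was deployed at 460, so that release is the natural place to start investigating.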

If you want to observe cloud-native environments, start by collecting and analyzing logs, metrics and traces. These aren't the only potential sources of observability, but they are the most important ones, which is what makes them the three pillars of observability.
