In software architecture, observability is the ability to infer a system's internal state from its external outputs. Most complex, highly distributed applications emit measurable signals that reflect the internal back-end state of the system, as well as the impact of external inputs on that state. With the right mix of monitoring, logging, documentation and visualization tools, software teams can assemble a sound distributed systems observability strategy.
However, teams must apply these tools in a way that provides adequate transparency but doesn't waste resources, dampen application performance or hamper development operations. This requires adherence to some basic observability guidelines and practices. Let's review some of the important metrics to monitor, ways to maintain efficient event logs, practical observability tooling approaches, effective visualization strategies and, finally, some of the pitfalls to watch out for.
Focus on the right metrics
A well-designed observability approach makes it possible to predict the onset of a potential error or failure, and then identify where the root cause of those problems might reside -- rather than react to problematic situations as they occur. In addition to other monitoring and testing tools, a variety of data collection and analytics mechanisms play a heavy role in the quest for transparency.
For starters, a distributed systems observability plan should focus on a set of metrics called the four golden signals: latency, traffic, errors and saturation. Point-in-time metrics, such as those gathered by an external data store that continually scrapes state data, help track the internal state of the system over time. This high-level state data might not be particularly granular, but it provides a picture of when and how often a certain error occurs. Combining this information with other data, such as event logs, makes it easier to pinpoint the underlying origin of a problem.
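As a rough illustration, the four golden signals can be derived from a window of request records. The sketch below is a minimal example, not a production collector: the RequestRecord shape, the golden_signals function and the idea of measuring saturation as in-flight requests over capacity are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One served request (hypothetical shape chosen for this example)."""
    duration_ms: float   # how long the request took to serve
    status_code: int     # HTTP-style status; 5xx counts as an error here


def golden_signals(records, window_seconds, in_flight, capacity):
    """Summarize latency, traffic, errors and saturation for one time window."""
    saturation = in_flight / capacity  # assumed proxy: share of capacity in use
    if not records:
        return {"latency_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}
    errors = sum(1 for r in records if r.status_code >= 500)
    return {
        "latency_ms": mean(r.duration_ms for r in records),  # average latency
        "traffic_rps": len(records) / window_seconds,        # requests per second
        "error_rate": errors / len(records),                 # fraction that failed
        "saturation": saturation,
    }
```

Tracking these four numbers per time window gives the coarse, high-level picture described above; the finer detail comes from logs and traces.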
Stay on top of event logs
Event logs are a rich source of distributed system observability data for architecture and development teams. Log management tools, such as Splunk, capture and record occurrences, often alongside a metrics system such as Prometheus. These occurrences include the successful completion of an application process, a major system failure, periods of unexpected downtime or overload-inducing influxes of traffic.
Event logs combine timestamps and sequential records to provide a detailed breakdown of what happened, so teams can quickly pinpoint when an incident occurred and the sequence of events that led up to it. This is particularly important for debugging and error handling, since it gives developers key forensic information to identify faulty components or problematic component interactions.
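To make the idea of timestamped, sequential records concrete, here is a minimal sketch that uses Python's standard logging module to write each event as one JSON line. The JsonEventFormatter and build_event_logger names and the field layout are assumptions for illustration; a real deployment would follow whatever schema its log management tool expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonEventFormatter(logging.Formatter):
    """Render each log record as one JSON line with a timestamp and sequence number."""

    def __init__(self):
        super().__init__()
        self._sequence = 0  # per-process counter that preserves event order

    def format(self, record):
        self._sequence += 1
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "sequence": self._sequence,
            "level": record.levelname,
            "component": record.name,   # logger name doubles as the component name
            "message": record.getMessage(),
        }
        return json.dumps(event)


def build_event_logger(name, stream):
    """Return a logger that writes JSON events for one component to `stream`."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False            # avoid duplicate output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonEventFormatter())
    logger.handlers = [handler]
    return logger
```

A call such as build_event_logger("checkout", sys.stderr).error("inventory service timeout") would then emit a single machine-readable line that can be sorted by timestamp or sequence during an investigation.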
Provide toggle switches for tools
Comprehensive event logging processes can significantly increase a system's data throughput and processing requirements, and add troublesome levels of cardinality. Because of this, logging tools can quickly drain application performance and resource availability. They also can become unsustainable when the system's scaling requirements grow over time, which is frequently the case in complex, cloud-based distributed systems.
To strike a balance, development teams should build in mechanisms that start, stop or adjust logging operations without the need to fully restart an application or update large sections of code. For example, resource-heavy debugging tools should activate only when error rates in a single system exceed a predetermined limit, rather than run continuously and consume application resources.
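One possible shape for such a toggle is sketched below: it tracks the outcome of recent requests and raises a logger to DEBUG only while the error rate is above a limit. The ErrorRateLogToggle class, the 5% default threshold and the 100-request window are assumptions for this example, not recommended values.

```python
import logging
from collections import deque


class ErrorRateLogToggle:
    """Switch a logger to DEBUG while the recent error rate exceeds a threshold."""

    def __init__(self, logger, threshold=0.05, window=100):
        self.logger = logger
        self.threshold = threshold
        self.outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, failed):
        """Record one request outcome and adjust the log level if needed."""
        self.outcomes.append(bool(failed))
        error_rate = sum(self.outcomes) / len(self.outcomes)
        level = logging.DEBUG if error_rate > self.threshold else logging.WARNING
        if self.logger.level != level:
            self.logger.setLevel(level)  # takes effect immediately, no restart
        return error_rate
```

Because the level changes at runtime through setLevel, verbose output appears only while the system is misbehaving and stops once the error rate falls back under the limit. Note that with very few recorded requests a single failure can trip the toggle, so a real version would likely require a minimum sample size.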
Perform diligent request tracing
Request tracing is a process that tracks the individual calls made to and from a given system, as well as the execution time of each call from start to finish. On its own, request tracing information cannot explain why, for instance, a particular request failed. However, it provides valuable information about where exactly the problem occurred within an application's workflow and where teams should focus their attention.
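The core mechanics can be sketched in a few lines: every call made on behalf of one request is timed and tagged with a shared trace ID. The Tracer class below is an in-memory stand-in written for this example, not the API of any particular tracing product.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Collect timed spans that share a trace ID (in-memory, for illustration only)."""

    def __init__(self):
        self.spans = []

    @contextmanager
    def span(self, trace_id, name):
        """Time the enclosed block and record whether it raised an exception."""
        start = time.perf_counter()
        status = "ok"
        try:
            yield
        except Exception:
            status = "error"
            raise  # the tracer observes failures, it does not swallow them
        finally:
            self.spans.append({
                "trace_id": trace_id,
                "name": name,
                "duration_ms": (time.perf_counter() - start) * 1000,
                "status": status,
            })


def new_trace_id():
    """Generate an identifier that ties together all spans of one request."""
    return uuid.uuid4().hex
```

Wrapping each downstream call in "with tracer.span(trace_id, 'charge-card'):" yields a list of spans showing which step was slow or failed. As noted above, the span says where the failure happened, not why.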
Like event logs, request tracing generates high levels of data throughput and cardinality, which make trace data expensive to store. Again, it's important that teams reserve resource-heavy request tracing tools for periods of unusual activity or errors. In some cases, teams can use request tracing to sample individual transaction histories at regular intervals, creating an economical and resource-friendly way to continuously monitor a distributed system.
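A simple way to sample on a regular schedule is to trace one request out of every N. The EveryNthSampler class below is a hypothetical helper written for this example; production tracing systems often use probabilistic or error-triggered sampling instead.

```python
import itertools


class EveryNthSampler:
    """Select one request out of every `n` for full tracing."""

    def __init__(self, n):
        if n < 1:
            raise ValueError("n must be at least 1")
        self.n = n
        self._counter = itertools.count()  # counts requests seen so far

    def should_trace(self):
        """Return True for the 1st, (n+1)th, (2n+1)th ... request."""
        return next(self._counter) % self.n == 0
```

With n set to 100, storage and processing costs for traces drop to roughly 1% of tracing every request, while the team still sees a steady stream of representative transactions.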
Create accessible data visualizations
Once a team manages to aggregate observability data, the next step is to condense the information into a readable and shareable format. Often, this is done by building visual representations of that data using tools like Kibana or Grafana. From there, team members can share that information with each other or distribute it to other teams who also work on the application.
In a system that handles millions of downstream requests, visualizing everything can be taxing, and median response times reveal little on their own. Instead, most teams will be better served by focusing on the response times that 95% to 99% of requests meet, that is, the 95th and 99th percentiles, and matching those numbers against the requirements of the SLA. A healthy-looking median can hide a slow tail of requests that violates the SLA, so the percentile figures are the ones worth putting front and center on a dashboard.
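One way to read this advice is in terms of percentile latencies. The sketch below uses Python's statistics module to compute the median, 95th and 99th percentile of a set of response times and checks the 99th percentile against an SLA limit. The latency_report function, the choice of the 99th percentile as the SLA yardstick and the sample numbers are assumptions for illustration.

```python
from statistics import median, quantiles


def latency_report(durations_ms, sla_ms):
    """Compare median, p95 and p99 latency against an SLA threshold.

    Needs at least two data points, as required by statistics.quantiles.
    """
    # n=100 returns 99 cut points; index 94 is the 95th percentile,
    # index 98 is the 99th percentile.
    cuts = quantiles(durations_ms, n=100, method="inclusive")
    p95, p99 = cuts[94], cuts[98]
    return {
        "median_ms": median(durations_ms),
        "p95_ms": p95,
        "p99_ms": p99,
        "meets_sla": p99 <= sla_ms,
    }


# Example: 90 fast requests and 10 very slow ones.
sample = [100.0] * 90 + [2000.0] * 10
report = latency_report(sample, sla_ms=500)
```

In this sample the median is 100 ms and looks healthy, yet the 95th and 99th percentiles sit at 2,000 ms, so the report flags an SLA miss that a median-only chart would not show.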
A couple of common pitfalls
While observability can bring transparency to a system, a poorly managed approach can result in two adverse effects, one related to alerts and the other to data volume.
The first of these effects is that distributed systems observability tools often generate large amounts of statistical noise. Teams may feel overwhelmed with constant alerts that may or may not require attention, and those alerts become useless if developers increasingly ignore them. As a result, critical events go undetected until complete catastrophe strikes.
The second effect is that troubleshooting with logs and traces can take a long time if the logs lack a certain level of granularity or fail to provide the situational context of an event. IT teams may be able to identify the onset of failure, but it can still be difficult and time-consuming to sort through the vast amount of contextual data needed to find the root cause of the problem. Again, the solution is to give developers the ability to adjust how much data individual logging tools return, or to disable those tools if needed.