In software architecture, observability is the ability to infer a system's internal state from its external outputs. Deliberately building resilient systems requires a well-planned and well-executed strategy for automated monitoring, alerting and recovery.
What is an observability strategy?
A successful observability strategy requires that organizations consider several key factors:
- Ingrain it as part of a development practice. The approach to observability must be consistent across systems. To make it ubiquitous, support for building it into new systems must be readily available. This ensures wider coverage and a unified view of systems.
- Merge monitoring, alerting, incident management and recovery. Observability combines all signals to create a holistic view of the health of the system. It goes even beyond that and provides tools to identify issues early and help recover from a failure based on past learnings.
- Make it automated and assistive. Manual intervention should be avoided where possible. As cloud-native businesses become comfortable with dynamic capacity and more mature infrastructural controls, a lot of recoveries can be automated. When they can't, it makes sense to invest in tools that create an assistive experience for teams that are responsible for uptime.
- Dampen the downstream noise. Even well-implemented observability setups can generate considerable noise. It's crucial to identify issues as close to the source as possible so teams and systems can start solving the problem right away instead of spending precious time uncovering the source of the issue.
A good strategy identifies issues before or as they occur, provides the right prioritization and alerting mechanisms and assists in recovery using automated processes or procedures that can be triggered. Finally, it must define a blueprint that's consistent across systems.
1. Determine your business goals
When it comes to customer experience, a negative experience is often more powerful than a positive one. High-quality observability is a critical part of systems that aim to build sticky user experiences. However, to define the right observability strategy, it's crucial to identify business goals first.
A good observability setup can improve the bottom line by optimizing infrastructure spend, assisting capacity planning for growth or improving important business metrics such as mean time to recovery. It can help establish transparency or even build a strong customer experience by providing more contextual data to support personnel. However, the observability setup for each of these goals can be very different. Identify key business objectives and then chart out an observability strategy to achieve them.
2. Focus on the right metrics
A well-designed observability approach makes it possible to predict the onset of a potential error or failure, and then identify where the root cause of those problems might reside -- rather than react to problematic situations as they occur. In addition to other monitoring and testing tools, a variety of data collection and analytics mechanisms play a heavy role in the quest for transparency.
For starters, a distributed systems observability plan should focus on a set of metrics called the four golden signals: latency, traffic, errors and saturation. Point-in-time metrics, such as those gathered by an external data store that continually scrapes state data, help track the internal state of the system over time. This high-level state data might not be particularly granular, but it provides a picture of when and how often a certain error occurs. Combining this information with other data, such as event logs, makes it easier to pinpoint the underlying origin of a problem.
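As a rough illustration, the four golden signals can be derived from a window of request records. This is a minimal sketch: the `Request` type, the `golden_signals` function, and the `in_flight` and `capacity` inputs are hypothetical names chosen for the example, and treating HTTP 5xx statuses as errors is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int  # HTTP-style status code (assumption for this sketch)


def golden_signals(requests, window_seconds, in_flight, capacity):
    """Summarize one time window into the four golden signals."""
    if not requests:
        raise ValueError("window contains no requests")
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Upper median when the number of requests is even.
        "latency_ms_median": durations[len(durations) // 2],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests),
        "saturation": in_flight / capacity,
    }


# Hypothetical two-second window of traffic.
window = [Request(120, 200), Request(80, 200), Request(950, 500), Request(100, 200)]
signals = golden_signals(window, window_seconds=2, in_flight=30, capacity=40)
print(signals)
```

In practice a metrics store would compute these continuously; the point here is only how each signal relates to the raw request data.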
3. Stay on top of event logs
Event logs are a rich source of distributed system observability data for architecture and development teams. Dedicated log management tools, such as Splunk, capture and record occurrences. These occurrences include the successful completion of an application process, a major system failure, periods of unexpected downtime and overload-inducing influxes of traffic.
Event logs combine timestamps and sequential records to provide a detailed breakdown of what happened, so teams can quickly pinpoint when an incident occurred and the sequence of events that led up to it. This is particularly important for debugging and error handling because it provides key forensic information for developers to identify faulty components or problematic component interactions.
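A minimal sketch of timestamped, sequential event records, using Python's standard `logging` module. The `JsonFormatter` class, the `orders` logger name, the `seq` field and the sample events are all hypothetical, and the log is written to an in-memory stream only to keep the example self-contained.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "seq": getattr(record, "seq", None),  # sequence number passed via `extra`
            "level": record.levelname,
            "event": record.getMessage(),
        })


stream = io.StringIO()  # stands in for a file or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("orders")  # hypothetical service name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False

sample_events = [
    (logging.INFO, "order received"),
    (logging.WARNING, "inventory service slow"),
    (logging.ERROR, "order failed"),
]
for seq, (level, event) in enumerate(sample_events, start=1):
    log.log(level, event, extra={"seq": seq})

# Reading the log back yields the ordered sequence that led up to the failure.
events = [json.loads(line) for line in stream.getvalue().splitlines()]
for e in events:
    print(e["seq"], e["ts"], e["level"], e["event"])
```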
4. Provide toggle switches for tools
Comprehensive event logging processes can significantly increase a system's data throughput and processing requirements, and add troublesome levels of cardinality. Because of this, logging tools can quickly drain application performance and resource availability. They also can become unsustainable when the system's scaling requirements grow over time, which is frequently the case in complex, cloud-based distributed systems.
To strike a balance, development teams should build in mechanisms that start, stop or adjust logging operations without the need to fully restart an application or update large sections of code. For example, resource-heavy debugging tools should activate only when error rates in a system exceed a predetermined limit, rather than continuously consuming application resources.
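One possible shape for such a switch is a small helper that raises a logger's verbosity only while the recent error rate is above a limit. This is a sketch under assumptions: the `DebugToggle` class, the `payments` logger name, the 20% limit and the five-request window are all made up for illustration.

```python
import logging
from collections import deque


class DebugToggle:
    """Switch a logger to DEBUG while the recent error rate exceeds a limit."""

    def __init__(self, logger, limit=0.2, window=10):
        self.logger = logger
        self.limit = limit
        self.outcomes = deque(maxlen=window)  # True means the request failed

    def record(self, failed):
        self.outcomes.append(bool(failed))
        rate = sum(self.outcomes) / len(self.outcomes)
        # Adjust verbosity at runtime; no restart or code change is needed.
        self.logger.setLevel(logging.DEBUG if rate > self.limit else logging.INFO)
        return rate


logger = logging.getLogger("payments")  # hypothetical service name
toggle = DebugToggle(logger, limit=0.2, window=5)

for failed in [False, False, False, False, False]:
    toggle.record(failed)
level_when_healthy = logger.level

for failed in [True, True]:
    toggle.record(failed)
level_when_failing = logger.level

print(logging.getLevelName(level_when_healthy), logging.getLevelName(level_when_failing))
```

Because the window slides, verbosity drops back to the normal level once enough successful requests push the error rate under the limit again.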
5. Perform diligent request tracing
Request tracing tracks the individual calls made to and from a system, as well as the execution time of each call from start to finish. On its own, request tracing information cannot explain, for instance, what went wrong when a request failed. However, it provides valuable information about where exactly the problem occurred within an application's workflow and where teams should focus their attention.
Like event logs, request tracing creates elevated levels of data throughput and cardinality that make the data expensive to store. Again, it's important that teams reserve resource-heavy request tracing tools for unusual activity or errors. In some cases, teams can use request tracing to pull individual samples of transaction histories on a regular schedule, creating an economical and resource-friendly way to continuously monitor a distributed system.
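The idea can be sketched with a small timing context manager that records each call, its parent and whether it succeeded. The `span` helper, the `handle_checkout` workflow and the simulated failure are hypothetical, and this single-threaded version is only an illustration rather than a substitute for a real tracing library.

```python
import time
from contextlib import contextmanager

spans = []   # finished spans, in completion order
_stack = []  # names of the spans currently open (single-threaded sketch)


@contextmanager
def span(name):
    """Time one call and record its parent and outcome."""
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    ok = True
    try:
        yield
    except Exception:
        ok = False
        raise
    finally:
        _stack.pop()
        spans.append({
            "name": name,
            "parent": parent,
            "seconds": time.perf_counter() - start,
            "ok": ok,
        })


def handle_checkout():
    with span("checkout"):
        with span("load_cart"):
            time.sleep(0.01)  # simulated work
        with span("charge_card"):
            raise TimeoutError("payment gateway did not respond")  # simulated failure


try:
    handle_checkout()
except TimeoutError:
    pass

# The innermost failed span shows where in the workflow the problem occurred.
failed = [s["name"] for s in spans if not s["ok"]]
print(failed)
```

The trace locates the failure in `charge_card` but says nothing about why the gateway timed out, which is where event logs come in.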
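A simple sampling rule along these lines keeps every trace that contains an error plus a routine sample of the rest. The `should_keep` function, the one-in-ten rate and the trace records are assumptions made for this sketch.

```python
def should_keep(trace_index, had_error, every_nth=10):
    """Keep every trace with an error, plus every Nth trace as a routine sample."""
    return had_error or trace_index % every_nth == 0


# Hypothetical stream of 30 traces in which only trace 7 hit an error.
traces = [{"id": i, "error": i == 7} for i in range(30)]
kept = [t["id"] for t in traces if should_keep(t["id"], t["error"])]
print(kept)
```

Only four of the 30 traces are stored here, yet the failing one is never dropped.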
6. Create accessible data visualizations
Once a team manages to aggregate observability data, the next step is to condense the information into a readable and shareable format. Often, this is done by building visual representations of that data using tools like Kibana or Grafana. From there, team members can share that information or distribute it to other teams that also work on the application.
Such data visualization can tax a system with millions of downstream requests, so choose what to chart with care. Median response times should not be the main concern. Instead, most teams are better served by focusing on the response time that 95% to 99% of requests stay within and matching that number against the requirements of the SLA. A healthy-looking median can hide a slow tail that breaches the SLA, so the high percentiles are the figures to watch.
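One way to read the 95% to 99% figure is as a latency percentile: the response time that the given share of requests stays within. The sketch below uses the nearest-rank method; the `percentile` helper, the sample latencies and the 500 ms SLA limit are all hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: most requests are fast, a few are very slow.
latencies_ms = [100] * 90 + [400] * 8 + [2000] * 2

SLA_MS = 500  # assumed SLA limit for this example

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"p50={p50}ms p95={p95}ms p99={p99}ms")
print("p95 within SLA:", p95 <= SLA_MS)
print("p99 within SLA:", p99 <= SLA_MS)
```

In this sample the median looks comfortable, while the 99th percentile exceeds the assumed limit, which is why the tail percentiles are the ones to compare against the SLA.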
7. Choose the right observability platform
At the heart of an observability setup sit a log and metrics store, a querying engine and a visualization dashboard, among other components. Several independent platforms map to these capabilities. Some of them work together particularly well to create a comprehensive observability setup. However, each one must be picked with care to meet the specific needs of the business and the system.
In addition to the observability components, consider the demands of the system being observed in the long term. The requirements of a monolithic application can differ considerably from a distributed setup. Tools and platforms must be chosen appropriately. There are viable open source options available alongside commercial offerings.
For instance, a popular open source platform, Loki by Grafana Labs, is a log store that indexes only the labels attached to logs rather than their full content. Elasticsearch, on the other hand, can decompose logs into individual fields using a log parser and transformer, such as Logstash. The two tools have different performance characteristics and tradeoffs. It's cheaper to index logs in Loki, but it's easier to query logs with text data in Elasticsearch.
On the commercial side, there are a multitude of platforms, such as Honeycomb.io and Splunk, that offer features to spot outliers in the data proactively and flag likely errors early.
When choosing a platform, take stock of the number of services, data volume, level of transparency and business objectives. The volume of data directly affects cost and performance, so it is wise to pick a tool that handles the expected volume at acceptable cost and performance.
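The tradeoff can be pictured with a toy comparison of the two indexing styles. This is only a conceptual sketch and does not reflect the internals of Loki or Elasticsearch; the sample logs and the `query_by_label` and `query_by_text` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical log entries: (labels, message text).
logs = [
    ({"app": "cart", "level": "error"}, "timeout calling payment gateway"),
    ({"app": "cart", "level": "info"}, "cart loaded"),
    ({"app": "auth", "level": "error"}, "token expired"),
]

# Label-style index: only label pairs are indexed, so the index stays small.
label_index = defaultdict(list)
for i, (labels, _) in enumerate(logs):
    for pair in labels.items():
        label_index[pair].append(i)

# Full-text-style index: every word of every message is indexed up front.
text_index = defaultdict(set)
for i, (_, message) in enumerate(logs):
    for word in message.split():
        text_index[word].add(i)


def query_by_label(pair, contains):
    """Narrow by label first, then scan the matching messages for the text."""
    return [logs[i][1] for i in label_index.get(pair, []) if contains in logs[i][1]]


def query_by_text(word):
    """Look the word up directly in the text index."""
    return [logs[i][1] for i in sorted(text_index.get(word, ()))]


print(len(label_index), "label index entries vs", len(text_index), "text index entries")
print(query_by_label(("app", "cart"), "timeout"))
print(query_by_text("expired"))
```

The label index is cheaper to build and store, but text searches must scan messages at query time, whereas the larger text index answers word lookups directly.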
8. Establish a culture of observability
To fully realize the benefits of observability, organizations must use it to identify and solve problems proactively. This often stems from a culture of questioning, where the key metrics are identified, and mechanisms are used to obtain answers. To use observability to its fullest extent, user education and training might be required.
Once observability becomes a mindset and people start seeking answers to the right questions, the effect of observability reinforces itself. Answers to problems can be sought from the data. The data also guides the evolution and strategy of businesses and systems. A well-architected observability setup can champion this approach by making information available and visible.
9. Use AI and machine learning to augment staff capabilities
Machine learning algorithms and AI are increasingly used to help identify imminent failures, suggest remedies and triage issues. Although some of these capabilities are still at a nascent stage, they can often provide reliable assistive support by automatically highlighting an issue that hasn't been seen before, identifying its effect and severity, and generating alerts.
This can mitigate errors early in the lifecycle, thereby preventing major problems ahead of time. Some offerings will mature over time, but there is value in assessing new capabilities that might offer some benefits to the teams that already heavily use these systems.
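As a rough picture of what automatic highlighting involves, the sketch below flags samples that deviate sharply from a rolling baseline. A plain z-score check stands in here for the machine learning the text describes; the `find_outliers` function, its threshold and the sample error counts are hypothetical.

```python
from statistics import mean, stdev


def find_outliers(samples, threshold=3.0, baseline=20):
    """Flag indexes whose value is more than `threshold` standard deviations
    away from the mean of the preceding `baseline` samples."""
    flagged = []
    for i in range(baseline, len(samples)):
        window = samples[i - baseline:i]
        mu, sigma = mean(window), stdev(window)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical per-minute error counts with one sudden spike at index 21.
error_counts = [4, 5, 6, 5] * 5 + [5, 40, 6]
outliers = find_outliers(error_counts)
print(outliers)
```

Commercial tools use far richer models, but the principle is similar: learn what normal looks like from recent data and raise anything that falls well outside it.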
A couple of common pitfalls
Although observability can bring transparency to a system, a poorly managed approach can result in two particularly adverse effects, related to alerts and data volume.
The first of these effects is that distributed systems observability tools often generate large amounts of statistical noise. Teams can feel overwhelmed with constant alerts that might or might not require attention, and those alerts become useless if developers increasingly ignore them. As a result, critical events go undetected until complete catastrophe strikes.
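One common way to cut this kind of noise is to suppress repeats of the same alert for a cooldown period. The sketch below assumes alerts arrive as `(timestamp_seconds, key)` pairs in time order; the `suppress_repeats` function, the five-minute cooldown and the alert names are made up for illustration.

```python
def suppress_repeats(alerts, cooldown_seconds=300):
    """Deliver an alert only if the same key was not delivered within the cooldown."""
    last_sent = {}
    delivered = []
    for timestamp, key in alerts:
        previous = last_sent.get(key)
        if previous is None or timestamp - previous >= cooldown_seconds:
            delivered.append((timestamp, key))
            last_sent[key] = timestamp
    return delivered


# Hypothetical alert stream: the same latency alert fires repeatedly.
alerts = [
    (0, "db-latency"),
    (30, "db-latency"),
    (60, "disk-full"),
    (90, "db-latency"),
    (400, "db-latency"),
]
delivered = suppress_repeats(alerts)
print(delivered)
```

Five raw alerts become three delivered ones, and a distinct problem such as `disk-full` still gets through immediately.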
The second effect is that logging and tracing efforts can take a long time if logs lack a certain level of granularity or fail to provide the situational context of an event. IT teams might be able to identify the onset of failure, but it can still be difficult and time-consuming to sort through the vast amount of contextual data needed to find the root cause of the problem. A good way to avoid this is to give developers the ability to adjust how much data individual logging tools return or disable them if needed.
Finally, an observability setup must fit the platform for which it is meant. Otherwise, it can either become an overwhelming system that eats into operational cost or an underwhelming one that does not provide enough visibility. The strategy also needs to identify and articulate the key questions that the setup must help the organization answer. Without that guidance, observability runs the risk of becoming a tangled web of mixed concerns that might not provide a coherent and consistent user experience and support as intended.