What is data observability?
Data observability is a process and set of practices that aim to help data teams understand the overall health of the data in their organization's IT systems. Through the use of data observability techniques, data management and analytics professionals can monitor the quality, reliability and delivery of data and identify issues that need to be addressed.
The concept of data observability was first described by Barr Moses, co-founder and CEO of software vendor Monte Carlo Data. Moses coined the term in 2019, when she wrote a blog post about applying the general principles of observability for IT systems to data.
Observability, a central tenet of DevOps processes, enables IT teams to view the current state of systems and pinpoint the cause of performance problems. In a similar way, data observability provides increased visibility into complex data environments and helps data management teams create more robust and reliable data sets and data pipelines.
Why data observability can help organizations
In the past, organizations had limited ways to store and analyze data. Now, though, they commonly generate large amounts of data in different systems and collect even more from outside data sources. To process, store and manage all of that data from various sources, they often build multiple data repositories -- data warehouses, data marts, data lakes and more. Different data pipelines are then needed to distribute the data to operational systems and growing numbers of end users, who analyze it to glean information and gain insights.
As a result, the typical enterprise data environment is growing both in size and complexity, which makes ensuring data quality and reliability increasingly difficult. Data quality monitoring tools can be used to identify problems in data sets, but they only recognize specific predefined issues and provide a limited view of overall data health. That's where data observability comes in: It's designed to help ensure healthy data pipelines, high data reliability and timely data delivery, and to eliminate data downtime -- periods when errors, missing values or other issues make data sets unusable.
What is data observability's role in DataOps processes?
Data observability dovetails with DataOps, a growing practice used by teams of data engineers, data scientists, data analysts and other data professionals.
Like data observability, DataOps was adapted from software development and IT processes. The principles of DataOps come from DevOps, which combines software developers and IT operations staffers into a single team that uses Agile development methodologies to ensure that new software code can run efficiently and effectively when it's deployed. Similarly, DataOps calls for using Agile approaches to design, implement and maintain an organization's data architecture and data pipelines.
The goal of DataOps is to ensure that data is complete and accurate and can be accessed by the users who need it when they need it. This in turn can help ensure that an organization gets the maximum business value from its data. DataOps involves continuous integration and continuous delivery (CI/CD) and testing, orchestration and monitoring of data assets and pipelines.
That list of tasks can now also include data observability. As part of DataOps processes, data engineers and other data professionals can use data observability techniques and tools to do the following:
- monitor data sets, whether they're at rest or in motion;
- confirm that data formats, types and value ranges are as expected;
- check for any anomalies, such as schema changes or sudden changes to data values, that could indicate a problem requiring remediation; and
- track and optimize the performance of data pipelines.
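To make those tasks concrete, here's a minimal sketch of the kind of checks a data engineer might script as part of DataOps monitoring. The schema, field names and value range are illustrative assumptions, not taken from any particular data observability tool:

```python
# Hypothetical example: validate records in a data set at rest.
# EXPECTED_SCHEMA and AMOUNT_RANGE are assumptions for this sketch.
EXPECTED_SCHEMA = {"order_id": int, "amount": float, "created_at": str}
AMOUNT_RANGE = (0.0, 10_000.0)  # expected value range for the amount field


def check_record(record: dict) -> list[str]:
    """Return a list of issues found in one record."""
    issues = []
    # Confirm data formats and types are as expected
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            issues.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            issues.append(f"{field}: expected {expected_type.__name__}")
    # Confirm values fall within the accepted range
    amount = record.get("amount")
    if isinstance(amount, float) and not AMOUNT_RANGE[0] <= amount <= AMOUNT_RANGE[1]:
        issues.append(f"amount out of range: {amount}")
    return issues


def monitor(dataset: list[dict]) -> dict:
    """Scan a data set and summarize anything anomalous for remediation."""
    flagged = {
        i: issues
        for i, record in enumerate(dataset)
        if (issues := check_record(record))
    }
    return {"records": len(dataset), "flagged": len(flagged), "issues": flagged}
```

In practice, checks like these would run automatically against each batch a pipeline delivers, with the summary feeding an alerting system rather than being inspected by hand.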
In a survey of data professionals in the U.S. and Canada on DataOps practices, done in 2022 by TechTarget's Enterprise Strategy Group (ESG) division, 75% of the 403 respondents said data observability plays a very important role in their organization's DataOps initiatives and another 15% said it's a critical part of those efforts. ESG analyst Mike Leone wrote in a report about the survey published in August 2022 that data observability is "the next evolution of data quality" and that it makes effective DataOps possible by enabling organizations to better understand the state and health of their data.
The 5 pillars of data observability
Data observability also borrows the idea of key pillars from general IT observability, which is based on three: logs, metrics and traces. Data observability, as outlined by Moses, has five pillars that are meant to work in concert to provide insights into the quality and reliability of an organization's data.
These are the five pillars and what each one contributes to the process:
- Freshness. This involves confirming that data is up to date and is being updated as required, thereby identifying any gaps in time when it hasn't been properly updated and helping to prevent freshness or timeliness issues in data pipelines.
- Distribution. This focuses on confirming that data values are within accepted ranges by measuring items -- such as null values and abnormal representation of expected values -- that are indicators of a data set's health at the field level.
- Volume. This is an assessment of whether data sets are complete -- for example, checking if data tables contain the right number of rows and columns to detect possible problems in source systems.
- Schema. This involves monitoring and auditing changes to how data tables are organized, an important component because schema changes are often a sign of broken data and the cause of data downtime incidents.
- Lineage. This is the process of documenting and understanding the organization's full data landscape, including upstream data sources, downstream target systems and who interacts with data and at what stages. The data lineage process enables data teams to more quickly pinpoint where there's a break in the data when problems arise. Moreover, because data lineage involves the collection of metadata, it helps support data governance programs.
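The five pillars can be pictured as a set of checks run against a table's metadata. The sketch below is a simplified illustration: the metadata shape, thresholds and field names are assumptions made for this example, not part of any real observability platform's API:

```python
# Illustrative sketch of the five pillars as checks on one table's metadata.
# Thresholds (24 hours, 5% nulls, 10% volume drift) are arbitrary examples.
from datetime import datetime, timedelta


def pillar_checks(table_meta: dict, now: datetime) -> dict:
    """Evaluate a table's metadata against the five pillars."""
    results = {}
    # Freshness: was the table updated recently enough?
    results["freshness_ok"] = now - table_meta["last_updated"] <= timedelta(hours=24)
    # Distribution: is the share of null values within the accepted range?
    results["distribution_ok"] = table_meta["null_fraction"] <= 0.05
    # Volume: does the row count roughly match what the source should produce?
    expected = table_meta["expected_rows"]
    results["volume_ok"] = abs(table_meta["row_count"] - expected) / expected <= 0.1
    # Schema: have columns been added, dropped or renamed since the last audit?
    results["schema_ok"] = table_meta["columns"] == table_meta["audited_columns"]
    # Lineage: carry upstream sources along so a break can be traced quickly
    results["lineage"] = table_meta["upstream_sources"]
    return results
```

When one of the first four checks fails, the lineage information is what lets the team walk upstream to find where the break occurred.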
Benefits of data observability
According to its proponents, data observability gives data teams better visibility into their data ecosystems. That improved visibility in turn can deliver the following benefits:
- More efficient improvements to data sets. Data observability enables data teams to spot issues in data sets and analyze their root cause faster than they could before, with less effort required. They also have a better chance of finding new types of issues.
- A more reliable and resilient data environment. Data observability can surface potential issues and help remedy them before they become a problem. It also enables more effective troubleshooting of problems when they do occur and delivers contextual information for planning and prioritizing remediation efforts.
- Minimized data downtime. Through data observability, data teams can detect and evaluate data downtime incidents in real time and take corrective actions immediately. For example, data observability platforms that use artificial intelligence (AI) and machine learning to detect anomalies and other issues can alert teams to the need for fixes to make the affected data usable again.
At a higher level, data observability can benefit an organization by ensuring that the data used to drive operational and strategic decision-making is accurate, complete, reliable and trustworthy.
Data observability challenges
An effective data observability initiative requires a DataOps culture and data professionals who are willing to learn, implement and follow the prescribed practices. It also requires a set of tools, potentially including a data observability platform, to support those efforts.
Even then, organizations often face the following challenges when adopting a data observability approach:
- A lack of organizational support. Even if data teams want to adopt data observability practices and tools, they typically need funding and executive support to implement the required elements. Getting that support could be difficult in some organizations.
- Siloed and inaccessible data. The continuing presence of standalone data silos and data stored in spreadsheets or other hidden repositories prevents data teams from gaining end-to-end visibility of an organization's data assets.
- Not fully integrating data systems into an observability platform. Even if data silos aren't a factor, connecting all of an organization's systems to a data observability tool might not be easy or even feasible. Any systems left out keep the tool from providing a continuous and comprehensive view of data pipelines.
- Failure to integrate data observability into a broader data management program. Data observability doesn't replace other components of a data management program, such as data quality management and master data management (MDM). Instead, it needs to work with them to be successful.
Data observability vs. data quality
Data quality measures whether the condition of data sets is good enough for their intended uses in operational and analytics applications. To make that determination, data is examined based on various dimensions of quality, such as accuracy, completeness, consistency, validity, reliability and timeliness.
Data observability supports data quality, but the two are different aspects of managing data. While data observability practices can point out quality problems in data sets, they can't on their own guarantee good data quality -- that requires efforts to fix data issues and to prevent them from occurring in the first place. On the other hand, an organization can have strong data quality even if it doesn't implement a data observability initiative.
Data observability vs. data governance
Similarly, data observability and data governance might seem synonymous at first glance, but they're also complementary processes that support each other.
Data governance aims to ensure that an organization's data is available, usable, consistent and secure and that it's used properly, in compliance with internal standards and policies. Governance programs often incorporate or are closely tied to data quality improvement efforts.
A strong data governance program helps eliminate the data silos, data integration problems and poor data quality that can limit the value of data observability practices. In turn, data observability can aid the governance program by monitoring changes in data quality, availability and lineage.
Data observability vendors and tools
Data observability software is still an emerging product category -- Gartner added it to its annual Hype Cycle report on new data management technologies in 2022, saying the technology could take another five to 10 years to fully mature. But more than a half-dozen startup vendors now offer commercial data observability tools, including the following companies:
- Monte Carlo Data
The technology has also caught the attention of some larger data management vendors. IBM acquired Databand in July 2022 and now operates it as a subsidiary, while vendors such as Collibra and Precisely have added data observability features to their data management tools. In addition, several open source technologies for data observability are available to use.
A general list of key features built into data observability tools includes the following items:
- end-to-end visibility of data environments and data stacks;
- data pipeline and data quality monitoring to detect performance problems and data issues;
- anomaly detection capabilities driven by machine learning;
- automated alerting on data issues, with customizable notification capabilities;
- data lineage information to track data flows and how problems affect them;
- root cause analysis to identify the underlying reason for data issues;
- analysis of data usage to help prioritize monitoring of data assets; and
- triage functions to help plan remediation efforts and assign projects.
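As a simplified stand-in for the machine learning-driven anomaly detection and automated alerting these tools provide, the sketch below flags days whose pipeline row counts deviate abnormally using a z-score test. Real platforms use learned models rather than a fixed statistical rule; this example only illustrates the detect-and-alert pattern:

```python
# Simplified stand-in for ML-driven anomaly detection: flag days whose
# row counts are statistical outliers, then format an alert message.
from statistics import mean, stdev


def detect_anomalies(daily_row_counts: list[int], threshold: float = 2.0) -> list[int]:
    """Return indices of days whose row count deviates abnormally."""
    mu = mean(daily_row_counts)
    sigma = stdev(daily_row_counts)
    if sigma == 0:
        return []  # no variation, nothing to flag
    return [
        i for i, count in enumerate(daily_row_counts)
        if abs(count - mu) / sigma > threshold
    ]


def alert(anomalous_days: list[int]) -> str:
    """Format a notification; real tools make this customizable per channel."""
    if not anomalous_days:
        return "all clear"
    return f"anomaly on day(s): {anomalous_days}"
```

A commercial platform would pair a detector like this with lineage and root cause analysis so the alert points not just at the symptom but at the upstream system responsible.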