Data pipelines are getting bigger and more complex as the amount of data organizations collect and analyze continues to grow. Data observability tools aid these efforts by monitoring potential issues throughout the pipeline and alerting data teams about necessary interventions.
Data observability is a practice, supported by software tools, that provides organizations with end-to-end oversight of the entire data pipeline and monitors the overall health of the system. If the software identifies any errors or issues, it alerts the right people within the organization to the area that needs attention.
"Analytics and data projects are very heavily dependent on the data preparation and pipeline processes," said David Menninger, senior vice president and research director at Ventana Research. "We have enough info to know what the data should look like. If the current run of that pipeline doesn't match that shape, that's an indication that maybe there's an issue."
Data observability is a relatively new aspect of the data marketplace and has been growing in prominence over the past four years. Data vendor staples, such as Monte Carlo, are designing data observability tools, and new vendors are also emerging as the importance of monitoring pipeline health increases.
Data observability is the data spinoff of observability, which organizations use to keep track of the most important issues in a system. The three core pillars of observability are metrics, logs and traces:
- Metrics. A numerical representation of data.
- Logs. Records of events, typically in text or readable form. Logs can come from infrastructure elements or software. Logs are typically historical or retrospective, but some offer capabilities for real-time event collection or telemetry data.
- Traces. Traces of the information pathways or workflows throughout every process for a request or action, such as a transaction.
Data observability focuses on five of its own pillars to make managing metrics, logs and traces more effective and improve overall data quality.
5 pillars of data observability
Each pillar covers a different aspect of the data pipeline and complements the other four pillars. The five pillars of data observability are the following:
Freshness tracks how up to date data is and how frequently it is updated. Freshness is one of the forms of monitoring most requested by customers of data observability platform Bigeye, said Kyle Kirwan, CEO and co-founder of Bigeye.
"If I have a pipeline and it's feeding an analytics dashboard, the first question is -- the data is supposed to be reset every six hours -- is it refreshing on time, or is it delayed?" he said.
Confirming data is both updated and coming in at the appropriate rate is particularly useful when it comes to data governance and data catalogs.
"[Freshness] is probably the most closely reported; it's a gross indicator of whether there's an issue," Menninger said. "Data catalogs have become very popular … and typically report freshness."
Using the wrong numbers, or not knowing why those numbers negatively affect machine learning models, undermines data-driven decision-making.
Using a data observability tool that automates checking data freshness can free up vital staff hours and save costs. The car buying service Peddle LLC created an internal process to facilitate and automate metric creation; the company has several thousand metrics to monitor.
"Freshness is the big one. It's the one we rely on a lot. It's telling us, 'Hey, did your ETL [extract, transform and load] work when it was supposed to?'" said Tim Williamson, senior data warehouse engineer at Peddle, which uses Bigeye to improve data quality. "If we had to build this out by hand, I'd hate to think how much time it would have taken."
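A freshness check of this kind can be approximated in a few lines. The sketch below assumes a table's last load timestamp and expected refresh interval are already known; the `check_freshness` function and the six-hour interval are illustrative, not Peddle's or Bigeye's actual implementation.

```python
from datetime import datetime, timedelta

def check_freshness(last_loaded: datetime,
                    expected_interval: timedelta,
                    now: datetime) -> bool:
    """Return True if the table refreshed within its expected interval."""
    return now - last_loaded <= expected_interval

# A dashboard fed every six hours: loaded at 07:00, checked at 12:00 -> on time.
check_freshness(datetime(2024, 1, 1, 7, 0), timedelta(hours=6),
                datetime(2024, 1, 1, 12, 0))   # True
# Loaded at 05:00, checked at 12:00 -> seven hours stale, raise an alert.
check_freshness(datetime(2024, 1, 1, 5, 0), timedelta(hours=6),
                datetime(2024, 1, 1, 12, 0))   # False
```

In practice, a scheduler would run a check like this against pipeline metadata, such as a warehouse table's last-commit timestamp, and route failures to an alerting channel.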
Distribution describes whether the data an organization collects falls within its expected values. If data doesn't match the expected values, that can indicate an issue with the data's reliability. Extreme variance in data also indicates accuracy concerns.
Data observability tools can monitor data values for errors or outliers that fall outside the expected range. They can alert the appropriate parties about inconsistencies to address issues quickly before those issues affect other parts of the pipeline.
Data quality is an essential part of the distribution pillar because poor quality can cause the issues that distribution monitors for. Inaccurate data -- either erroneous or missing fields -- getting into the pipeline can cascade through different parts of the organization and undermine decision-making. Distribution helps address problematic elements if the data observability tool detects poor quality.
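One common way to implement this kind of distribution monitoring is a z-score test, flagging values that sit far from the mean of a batch. The function below is a minimal sketch; the three-standard-deviation threshold is a widely used convention, not any vendor's default.

```python
import statistics

def find_outliers(values: list[float], z_threshold: float = 3.0) -> list[float]:
    """Return values more than z_threshold standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []  # all values identical -> nothing to flag
    return [v for v in values if abs(v - mean) / stdev > z_threshold]

# Twenty typical order values and one erroneous entry.
find_outliers([10.0] * 20 + [500.0])  # [500.0]
```

A production monitor would compute the baseline from historical batches rather than the batch being checked, so a heavily corrupted batch can't mask its own outliers.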
Volume tracks the completeness of data tables and, like distribution, offers insights into the overall health of data sources. Organizations know where they collect data from, what time periods the data is gathered during, and information about products, accounts and customer data, Menninger said. For example, if a table that should contain all 50 U.S. states suddenly drops to 25, something is wrong.
Volume is one of the main focuses for Williamson.
"Right now, the primary [checks] we're using are things like row counts -- [asking], 'Are the row counts different than historically?' -- looking for duplicate primary keys," he said.
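Both of the checks Williamson describes can be sketched simply: compare the current row count against a historical baseline, and scan for duplicate primary keys. The function names, tolerance value and alert rule below are assumptions for illustration, not Peddle's actual monitors.

```python
import statistics
from collections import Counter

def volume_alert(current_rows: int, historical_rows: list[int],
                 tolerance: float = 0.5) -> bool:
    """Alert when the row count deviates from the historical median
    by at least the tolerance fraction."""
    baseline = statistics.median(historical_rows)
    return abs(current_rows - baseline) >= tolerance * baseline

def duplicate_keys(rows: list[dict], key: str) -> list:
    """Return primary-key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return [k for k, n in counts.items() if n > 1]

# A states table that should hold ~50 rows but now holds 25 triggers an alert.
volume_alert(25, [50, 50, 51, 49])                        # True
duplicate_keys([{"id": 1}, {"id": 2}, {"id": 1}], "id")   # [1]
```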
Schema monitors the organization of data, such as data tables, for breaks in the table or data. Rows or columns can be changed, moved, modified or deleted, disrupting data operations. The larger the databases in use, the more difficult it can be for data teams to pin down where the break occurred.
"Schema is an indicator your pipelines need to be modified," Menninger said. "The information could be used for some amount of automated remediation."
Bigeye offers more than 70 built-in data quality monitors covering checks such as format and schema IDs, numeric values and outlier distributions. Williamson uses Snowflake Information Schema tables to identify dbt tests for tables with row counts greater than zero; Bigeye helps him monitor the view and identify failed tests immediately, enabling a faster reaction to issues as they arise, he said.
"Bigeye has notified me row counts were out of whack for this particular table," he said. "We found a bug in our ETL pipeline code. We were able to fix it. … Bigeye has on several occasions alerted me to problems that were legitimate problems."
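A basic form of schema monitoring diffs the live table's columns against an expected column-to-type mapping, such as one pulled from a warehouse's information schema. The sketch below is a generic illustration, not a reflection of Bigeye's internal checks.

```python
def schema_diff(expected: dict[str, str],
                actual: dict[str, str]) -> dict[str, list[str]]:
    """Compare an expected column->type mapping against a live table schema."""
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "added": sorted(set(actual) - set(expected)),
        "type_changed": sorted(c for c in shared if expected[c] != actual[c]),
    }

# A dropped column, a new column and a type change all surface in the diff.
schema_diff(
    {"id": "int", "state": "text", "price": "float"},
    {"id": "int", "state": "varchar", "amount": "float"},
)
# {'missing': ['price'], 'added': ['amount'], 'type_changed': ['state']}
```

Each entry in the diff maps to a different remediation: a missing column usually means a broken upstream job, while a type change often means the pipeline's transformations need updating.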
Lineage is the largest of the pillars because it encompasses the entirety of the data pipeline. It maps out the data pipeline, where the sources come from, where the data goes, where it's stored, when it's transformed and the users it's distributed to.
Lineage enables an organization to look at each step of the data pipeline and how it links from one to the other, revealing how issues in one part impact other areas. When there's an error within the pipeline, lineage provides the holistic view for IT or the data team to look at the big picture and trace not only where the problem is, but also where it originated and what it has affected.
"Lineage is a harder one to assess," Menninger said. "If the type of transformation changes, is that an issue? It may or may not be. There may be a new product that's introduced, so lineage process needs to be modified to recognize the new product code."
In some cases, lineage can backtrace an issue from further down the pipeline back up to its source in a different part of the pipeline. It can also help track infrastructure costs, answer documentation questions and identify who is using the data, acting as a force multiplier for the data team.
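Lineage tooling typically models the pipeline as a directed graph and traverses it in both directions: downstream to find everything an issue affects, and upstream to backtrace it to a source. The tiny graph and asset names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage graph: each dataset maps to its direct downstream consumers.
EDGES = {
    "raw_orders": ["staging_orders"],
    "staging_orders": ["orders_fact"],
    "orders_fact": ["revenue_dashboard", "ml_features"],
}

def downstream(node: str, edges: dict) -> list[str]:
    """All assets reachable from `node` -- everything an issue there affects."""
    seen, stack = [], list(edges.get(node, []))
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.append(n)
            stack.extend(edges.get(n, []))
    return sorted(seen)

def upstream(node: str, edges: dict) -> list[str]:
    """All ancestors of `node` -- candidates for where an issue originated."""
    parents = defaultdict(list)
    for src, dsts in edges.items():
        for dst in dsts:
            parents[dst].append(src)
    return downstream(node, parents)

downstream("staging_orders", EDGES)  # ['ml_features', 'orders_fact', 'revenue_dashboard']
upstream("orders_fact", EDGES)       # ['raw_orders', 'staging_orders']
```

Real lineage systems build this graph automatically by parsing SQL, ETL job definitions or query logs rather than maintaining it by hand.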
Observability is a valuable tool for organizations to catch issues, but catching the issue is only half the battle. Any alert should start the process of remedying the issue.
"It's great to observe an issue; that doesn't do you much good by itself," Menninger said. "Observability needs to include remediation. We should be able to automate some of the remediation."