Getty Images

Tip

5 pillars of data observability bolster data pipeline

Data observability provides holistic oversight of the entire data pipeline in an organization. Use the five pillars to ensure efficient, accurate data operations.

Peter Spotts

By

Peter Spotts, Site Editor

Published: 06 Jan 2023

Data pipelines are getting bigger and more complex as the amount of data organizations collect and analyze continues to grow. Data observability tools aid these efforts by monitoring potential issues throughout the pipeline and alerting data teams about necessary interventions.

Data observability is a tool that provides organizations with end-to-end oversight of the entire data pipeline and monitors the overall health of the system. If it identifies any errors or issues, the software alerts the right people within the organization to the area that needs addressing.

"Analytics and data projects are very heavily dependent on the data preparation and pipeline processes," said David Menninger, senior vice president and research director at Ventana Research. "We have enough info to know what the data should look like. If the current run of that pipeline doesn't match that shape, that's an indication that maybe there's an issue."

Data observability is a relatively new aspect of the data marketplace and has been growing in prominence over the past four years. Data vendor staples, such as Monte Carlo, are designing data observability tools, and new vendors are also emerging as the importance of monitoring pipeline health increases.

Data observability is the data spinoff of observability, which organizations use to keep track of the most important issues in a system. The three core pillars of observability are metrics, logs and traces:

Metrics. A numerical representation of data.
Logs. Records of events, typically in text or readable form. Logs can come from infrastructure elements or software. Logs are typically historical or retrospective, but some offer capabilities for real-time event collection or telemetry data.
Traces. Traces of the information pathways or workflows throughout every process for a request or action, such as a transaction.

Data observability focuses on five of its own pillars to make metric, data and trace management more effective and improve overall data quality.

5 pillars of data observability

Each pillar covers a different aspect of the data pipeline and complements the other four pillars. The five pillars of data observability are the following:

freshness
distribution
volume
schema
lineage

1. Freshness

Freshness tracks how up to date the data is and the frequency data is updated. Freshness is one of the most requested forms of monitoring that data observability platform Bigeye has from its customers, said Kyle Kirwan, CEO and co-founder of Bigeye.

"If I have a pipeline and it's feeding an analytics dashboard, the first question is -- the data is supposed to be reset every six hours -- is it refreshing on time, or is it delayed?" he said.

Confirming data is both updated and coming in at the appropriate rate is particularly useful when it comes to data governance and data catalogs.

"[Freshness] is probably the most closely reported; it's a gross indicator of whether there's an issue," Menninger said. "Data catalogs have become very popular … and typically report freshness."

Using the wrong numbers or not knowing why numbers negatively impact machine learning models undermines data-driven decision-making.

Using a data observability tool that can automate the process of checking data freshness can free up vital staff hours and save costs. The car buying service Peddle LLC created an internal process to facilitate and automate metric creations, of which they have several thousand to monitor.

"Freshness is the big one. It's the one we rely on a lot. It's telling us, 'Hey, did your ETL [extract, transform and load] work when it was supposed to?'" said Tim Williamson, senior data warehouse engineer at Peddle, which uses Bigeye to improve data quality. "If we had to build this out by hand, I'd hate to think how much time it would have taken."

Chart depicting the five pillars of data observability: freshness, distribution, volume, schema and lineage — The five pillars of data observability

2. Distribution

Distribution is the expected values of data organizations collect. If data doesn't match the expected values, it can be an indication there's an issue with the reliability of the data. Extreme variance in data also indicates accuracy concerns.

Data observability tools can monitor data values for errors or outliers that fall outside the expected range. They can alert the appropriate parties about inconsistencies to address issues quickly before those issues affect other parts of the pipeline.

Data quality is an essential part of the distribution pillar because poor quality can cause the issues that distribution monitors for. Inaccurate data -- either erroneous or missing fields -- getting into the pipeline can cascade through different parts of the organization and undermine decision-making. Distribution helps address problematic elements if the data observability tool detects poor quality.

3. Volume

Volume tracks the completeness of data tables and, like distribution, offers insights into the overall health of data sources. Organizations know where they collect data from, what time periods the data is gathered during, and information about products, accounts and customer data, Menninger said. For example, if a table has the 50 states of the United States in it and it reduces down to 25 states, something is wrong.

Volume is one of the main focuses for Williamson.

"Right now, the primary we're using are things like row counts -- [asking], 'Are the row counts different than historically?' Looking for duplicate primary keys," he said.

It's great to observe an issue. That doesn't do you much good by itself. Observability needs to include remediation.

David MenningerSenior vice president and research director, Ventana Research

4. Schema

Schema monitors the organization of data, such as data tables, for breaks in the table or data. Rows or columns can be changed, moved, modified or deleted and disrupt data operations. The larger the databases in use, the more difficult it can be for data teams to pin down where the break could be.

"Schema is an indicator your pipelines need to be modified," Menninger said. "The information could be used for some amount of automated remediation."

Bigeye has built-in format and schema ID checks, numeric values and outlier distributions covered by more than 70 data quality monitors. When using Snowflake Information Schema tables to identify dbt tests for tables with row counts greater than zero, Bigeye helps Williamson monitor the view and identify failed tests immediately. This enables faster reaction to issues as they arise, Williamson said.

"Bigeye has notified me row counts were out of whack for this particular table," he said. "We found a bug in our ETL pipeline code. We were able to fix it. … Bigeye has on several occasions alerted me to problems that were legitimate problems."

5. Lineage

Lineage is the largest of the pillars because it encompasses the entirety of the data pipeline. It maps out the data pipeline, where the sources come from, where the data goes, where it's stored, when it's transformed and the users it's distributed to.

Lineage enables an organization to look at each step of the data pipeline and how it links from one to the other, revealing how issues in one part impact other areas. When there's an error within the pipeline, lineage provides the holistic view for IT or the data team to look at the big picture and trace not only where the problem is, but also where it originated and what it has affected.

"Lineage is a harder one to assess," Menninger said. "If the type of transformation changes, is that an issue? It may or may not be. There may be a new product that's introduced, so lineage process needs to be modified to recognize the new product code."

In some cases, lineage can backtrace an issue from further down the pipeline back up to the source in a different part of the pipeline. It can also help track infrastructure costs and answer documentation questions, as well as identify who is using the data and its force multiplier.

Observability is a valuable tool for organizations to catch issues, but catching the issue is only half the battle. Any alert should start the process of remedying the issue.

"It's great to observe an issue; that doesn't do you much good by itself," Menninger said. "Observability needs to include remediation. We should be able to automate some of the remediation."

Dig Deeper on Data integration

Part of: Data observability boosts data pipeline performance

Up Next

5 pillars of data observability bolster data pipeline

Data observability provides holistic oversight of the entire data pipeline in an organization. Use the five pillars to ensure efficient, accurate data operations.

Data observability benefits entire data pipeline performance

Data observability benefits include improving data quality and identifying issues in the pipeline process, but also has challenges organizations must solve for success.

6 data observability open source tools to consider

Learn about six data observability open source options helping organizations pursue data science experiments that are more budget-friendly and flexible than commercial tools.

7 expert recommended data observability tools

Commercial data observability tools can offer organizations pre-built components and plenty of vendor support for data use cases including monitoring, security and decision-making.

Search Business Analytics

Master these skills to get the right data scientist role
Data science offers many professional opportunities. Balance education and experience to present yourself as an adaptable and ...
Best practices for using simulation models in business
Simulation models provide businesses with a framework for forecasting and strategy through tested practices in finance, ...
Incorta unveils features as part of new focus on AI
The vendor's new AI-focused features include a Model Context Protocol layer for developing agents and prebuilt agentic AI ...

Search AWS

Compare Datadog vs. New Relic for IT monitoring in 2024
Compare Datadog vs. New Relic capabilities including alerts, log management, incident management and more. Learn which tool is ...
AWS Control Tower aims to simplify multi-account management
Many organizations struggle to manage their vast collection of AWS accounts, but Control Tower can help. The service automates ...
Break down the Amazon EKS pricing model
There are several important variables within the Amazon EKS pricing model. Dig into the numbers to ensure you deploy the service ...

Search Content Management

How to create a digital signature in Adobe, Preview or Word
Business executives can use different tools and methods to get digital signatures to close deals, but some important security ...
How to conduct a content audit: Step-by-step with template
To conduct a content audit, enterprises should follow eight comprehensive steps to improve their content performance, audience ...
Contentsquare previews agentic AI-driven content analytics
Content analytics can be complicated, but natural language and agentic AI can simplify the process.

Search Oracle

Click-to-launch tools pull apps through Oracle Cloud Infrastructure marketplace
Oracle has made it easier for customers to choose and launch third-party software onto its cloud. Now, the question is whether ...
Willis develops app to put a personal touch back in voluntary benefits
Part two of a two-part article: Willis uses PeopleSoft 9.1 to bring back the personal feel to automated insurance selection for ...
Willis develops app for real-time voluntary benefit selection
Part one of a two-part article: Willis uses PeopleSoft 9.1 to create real-time automated insurance selection for voluntary ...

Search SAP

SAP pitches role-based Joule assistants as ERP work partners
New AI-driven applications for supply chain, procurement and CX also shared the spotlight as SAP strives to portray its broad ...
There are '50 shades of clean core' for SAP customers
In this Q&A, Michael Lemashov and Denis Malov of JDC Group discuss the strategies for SAP customers to achieve a clean core and ...
AI tackles customizations in SAP clean core migrations
New tools from Lemongrass and Kyndryl are designed to find custom code and figure out what to do with it when moving to a clean ...

Close