Analysts and end users have sought data observability for years, but a recent shift has changed how business processes use these tools, leaving organizations with plenty to consider when deciding which tool to use and whether a commercial investment is worth it.
Observability tools have traditionally focused on capturing and analyzing log data to improve application performance monitoring and security. Data observability turns the focus back on the data to improve data quality, tune data infrastructure and identify problems in data engineering pipelines and processes.
"Data analysts and business users are the primary consumers of this data," said Steven Zhang, director of engineering at Hippo Insurance. "But it's becoming increasingly common that data engineers, who produce this data alongside product engineers, are also struggling with it."
This calls into question the trustworthiness of the data in terms of accuracy, reliability and freshness. This is where data observability tools come into play.
A good data observability tool captures these problems and presents them in a clean structure. It helps consumers understand conceptually where the data went wrong and helps engineers identify the root causes.
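As a rough illustration of what checking data for freshness and accuracy can mean in practice, here is a minimal Python sketch. It is not taken from any of the tools discussed in this article; the function names, the thresholds and the `premium` column used in the test data are all hypothetical.

```python
from datetime import datetime, timedelta


def is_fresh(last_loaded_at, now, max_age):
    """Return True if the table was loaded within max_age of now."""
    return now - last_loaded_at <= max_age


def null_rate(rows, column):
    """Fraction of rows where the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def check_table(rows, last_loaded_at, now, max_age, required_column, max_null_rate):
    """Collect the names of the checks that failed (hypothetical labels)."""
    issues = []
    if not is_fresh(last_loaded_at, now, max_age):
        issues.append("stale")
    if null_rate(rows, required_column) > max_null_rate:
        issues.append("too_many_nulls")
    return issues
```

A commercial tool would typically run checks like these on a schedule and present the failures alongside lineage and context, rather than returning a bare list.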
Why choose a commercial data observability tool?
There are many open source and commercial tools for organizations implementing data observability workflows. Commercial tools can fast-track this process with pre-built components for common workflows and offer plenty of vendor support. They also include better support for important enterprise use cases like data quality monitoring, security and improved decision-making.
"A modern data infrastructure is often a combination of best-in-class but disjointed set of software environments that requires to be monitored and managed in a unified manner," said Sumit Misra, vice president of data engineering at LatentView Analytics, an analytics consultancy.
For example, when a data job fails in an environment, another seemingly unrelated data environment must know and react to the job's failure. Observable, responsive and self-treating data flows are becoming essential for businesses.
Commercial data observability tools can help organizations accelerate their time to deliver value from data quality initiatives, particularly when they are small or employ more business talent than IT talent, Misra said.
What to look for in a data observability tool
Enterprises often end up deploying more tools than required or incorporating tools that are not specific or relevant to their business cases.
"Investments in commercial data observability tools and initiatives need to be made from the perspective of the overall business, internal users and customers," said Alisha Mittal, a vice president in IT services at Everest Group.
More tools do not always mean higher visibility. In fact, at times, these tools increase the system's complexity. Enterprises should strategically invest in observability tools by examining their current architecture, IT operations landscape and the skill development training and hiring required to handle the tools.
Various data quality and security functions are conventionally performed by the data teams of an organization. However, the value of data observability tools lies in how these activities fit into the end-to-end data operations workflow and the level of context they provide on data issues.
Enterprises should consider how different data observability functions align with the following data quality management processes, Mittal said:
- Monitoring offers a functional perspective of enterprise data systems or pipelines.
- Alerting produces alerts/notifications both for expected events and anomalies.
- Tracking provides the ability to set and track specific data-related events.
- Logging keeps a record of events in a consistent way to facilitate quicker resolution.
- Analysis involves an issue detection mechanism that provides insight into data pipelines and logs.
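To make the five functions above more concrete, the following Python sketch shows how they might fit together in a toy observer for a single pipeline. It is an illustration under stated assumptions, not how any vendor implements them: the `PipelineObserver` class, its method names and the simple z-score anomaly rule are all hypothetical.

```python
import logging
from statistics import mean, pstdev

logger = logging.getLogger("pipeline_observer")


class PipelineObserver:
    """Toy observer covering monitoring, alerting, tracking, logging and analysis."""

    def __init__(self, z_threshold=3.0):
        self.z_threshold = z_threshold
        self.history = {}  # monitoring: metric name -> list of observed values
        self.events = []   # tracking: explicitly recorded data events
        self.alerts = []   # alerting: (metric, value) pairs flagged as anomalous

    def track(self, event, **details):
        """Tracking and logging: record a named event in a consistent shape."""
        self.events.append((event, details))
        logger.info("event=%s details=%s", event, details)

    def record(self, metric, value):
        """Monitoring and alerting: store a metric and flag large deviations."""
        past = self.history.setdefault(metric, [])
        if len(past) >= 3:  # need some history before judging a new value
            mu, sigma = mean(past), pstdev(past)
            if sigma > 0 and abs(value - mu) / sigma > self.z_threshold:
                self.alerts.append((metric, value))
                logger.warning("anomaly metric=%s value=%s", metric, value)
        past.append(value)

    def analyze(self):
        """Analysis: summarize what has been observed so far."""
        return {
            metric: {"runs": len(values), "mean": mean(values)}
            for metric, values in self.history.items()
        }
```

For example, recording a daily row count that suddenly drops far below its recent values would add an entry to `alerts`, while `analyze()` gives a per-metric summary that a person can scan.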
Top commercial data observability tools
Here are some of the top commercial data observability tools based on interviews with experts and users. These top choices are good at addressing enterprise considerations around investment, implementation and viability, Mittal said.
They also include a well-defined value-based approach that aligns to business goals like operational efficiency or cost savings and have pre-built tool stacks to help enterprises realize immediate value.
These tools focus specifically on data observability and are seeing enterprise adoption. Other traditional observability tools might eventually expand into this space.
Acceldata has various tools to provide data observability for the cloud, Hadoop and the enterprise. Top capabilities include data pipeline monitoring, end-to-end data reliability and quality, multilayer data observability, extensive cloud capabilities, compute and cost optimization, and rapid setup.
Acceldata is good for acquiring thorough, cross-functional visibility into complicated, frequently interrelated data systems, Mittal said. This makes it the preferred observability tool in the payments and financial sector. It also excels in combining signals from many tasks and layers on a single pane of glass. This allows different teams to collaborate more efficiently.
One caveat is these tools might not be preferable for enterprises using many different external monitoring tools, Mittal said.
Databand is an IBM company providing a data observability platform to help teams detect and resolve data issues. One top feature is support for proactive capabilities that help detect and resolve data incidents earlier in the development cycle. It includes tools for collecting metadata, profiling behavior, detecting and alerting on data incidents, and triaging data quality issues.
It supports incident management, lineage, reliability monitoring in production and data quality metrics. It also supports an open source library to help engineers build custom extensions. IBM acquired the company in 2022, and it can be a good choice for companies with an extensive IBM infrastructure.
Datafold focuses more on the data engineering development pipeline than other tools. It supports automated testing capabilities, allowing data engineers to test quality and performance issues for new data workflows before pushing them into production without needing to write SQL tests.
It also supports alerting for tracking data quality issues in production. Engineers can explore how data flows through models, and teams can assess the impact of data model changes on existing reports.
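The general idea of assessing the impact of a change before it reaches production can be sketched as a comparison between two versions of the same table. The Python below is a generic illustration, not Datafold's implementation or API; the `diff_tables` function and the assumption that every row carries a unique key column are hypothetical.

```python
def diff_tables(before, after, key):
    """Compare two versions of a table, given as lists of dicts, by a key column.

    Returns the keys that were added, removed or changed between versions.
    Assumes each row has a unique value in the key column.
    """
    before_by_key = {row[key]: row for row in before}
    after_by_key = {row[key]: row for row in after}

    added = sorted(after_by_key.keys() - before_by_key.keys())
    removed = sorted(before_by_key.keys() - after_by_key.keys())
    changed = sorted(
        k
        for k in before_by_key.keys() & after_by_key.keys()
        if before_by_key[k] != after_by_key[k]
    )
    return {"added": added, "removed": removed, "changed": changed}
```

Running a comparison like this between the production output and the output of a modified workflow shows which rows a change would affect, without anyone having to write a test for each column by hand.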
Monte Carlo provides a comprehensive data observability capability. It's an end-to-end platform focusing on fixing faulty data pipelines, Mittal said. It helps engineers ensure dependability and troubleshoot issues before they cause an outage.
"Monte Carlo offers comprehensive functionality that covers the different observability needs of most organizations at a reasonable price and features a very entertaining Twitter account," Zhang said.
Top features include data catalogs, automated alerting and observability on several criteria. In addition, it ensures business data never leaves the enterprise network. It also supports a fully automated setup.
One caveat is some Monte Carlo clients experience UI problems, mainly when working with high volumes of data, Mittal said. Monte Carlo is ideal for data engineers and analytics teams that wish to avoid expensive downtime.
Precisely provides a data catalog and data integrity suite. Its data observability capabilities help teams identify impacts from adverse data integrity issues. It focuses on ease of use, intelligent analysis that minimizes alert fatigue, and interoperability with modern tech stacks and data infrastructure. It extensively uses AI and ML capabilities to identify issues and root causes.
It links data observability capabilities into a data catalog, helping teams find and integrate data into new workflows. This improves the quality of data in workflows involving enrichment with business, location or consumer data.
Soda is an AI-powered data observability platform. It also includes extensive collaboration capabilities to help data owners, data engineers and data analytics teams work through issues.
This platform targets sophisticated data consumers. Enterprises can rapidly examine their data, define rules to test and validate it, and respond programmatically anytime a test fails.
For instance, enterprises can immediately halt data operations and quarantine data. The Soda SQL command-line tool can also scan data and display the Soda SQL results.
The community support is not as strong as that of some other tools that have been around longer, Mittal said.
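The pattern of defining validation rules and quarantining data when a test fails can be sketched in a few lines of Python. This is a generic illustration and does not use Soda's own syntax or API; the `validate_and_quarantine` function, the rule names and the `amount` and `currency` fields are all hypothetical.

```python
def validate_and_quarantine(rows, rules):
    """Split rows into those that pass every rule and those set aside.

    rules maps a rule name to a function that takes a row and returns True or False.
    """
    passed, quarantined = [], []
    for row in rows:
        failed = [name for name, rule in rules.items() if not rule(row)]
        if failed:
            # Keep the row together with the reasons it was rejected.
            quarantined.append({"row": row, "failed_rules": failed})
        else:
            passed.append(row)
    return passed, quarantined


# Hypothetical rules for a payments table.
rules = {
    "amount_is_positive": lambda row: row.get("amount") is not None and row["amount"] > 0,
    "currency_present": lambda row: bool(row.get("currency")),
}
```

Only the rows in `passed` would continue downstream, while the quarantined rows stay available for an engineer to inspect along with the names of the rules they failed.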
Unravel is a DataOps platform supporting AI optimization and automated governance. A recommendation engine can tell teams how to fix the root cause of many data problems. For example, if a container in one part of a job is improperly sized, Unravel can recommend the proper settings.
It also stitches together telemetry and metadata from a multitude of components, assets, technologies and processes across the full data stack. This gives data teams a unified, in-depth understanding of the behavior, performance, cost and health of their data and data workflows.
AI-cost governance features help data teams analyze when data pipelines run on more expensive cloud instances than necessary and recommend adjustments that don't affect service requirements. It's also good at stitching together details from configurations, infrastructure and layout.