What is a data mesh?
Data mesh is a decentralized data management architecture for analytics and data science. The term was coined by Zhamak Dehghani while at the consultancy Thoughtworks to help address fundamental shortcomings in traditional centralized architectures like data warehouses and data lakes. Beyond those traditional architectures, modern streaming analytics approaches embrace cloud services, support real-time data and unify batch and stream processing.
These traditional and modern data management methods tend to create infrastructure bottlenecks in the data preparation process, observed Dehghani, now CEO and founder of a stealth startup. She also said the goals of data preparers don't align with those of data users mainly because the emphasis is on data infrastructure instead of data use. A more efficient method, she argued, is to give business teams the appropriate infrastructure to create their own data products that support quality, security, privacy, governance and performance. The assumption is that business teams creating and using the data have greater incentive to ensure the data is trustworthy and performant.
That means data engineering teams would have to shift their focus from engineering-specific workflows to enabling business teams to provision their own data sets. This distributed data mesh approach can help data scientists, business users and developers weave the data into new analytics, data science, machine learning and AI products with appropriate guardrails.
All the technology required to create and manage a data mesh is already available, Dehghani said. But enterprises would need to change their focus and adopt new workflows. This can be challenging for data experts traditionally focused on data infrastructure instead of creating domain-specific data products.
Modern data infrastructure is composed of operational and analytical data. Data mesh focuses on the management and access to analytical data at scale. Existing operational data management architectures are relatively mature and generally meet enterprise needs for scalability, Dehghani surmised, although there's room for improvement.
Some enterprises have started to implement a data mesh, including Intuit, Netflix, Roche, Saxo Bank, Vistaprint, Zalando and the U.S. Department of Veterans Affairs.
4 core principles of data mesh architecture
Dehghani advocates four core principles that underlie data mesh architecture for analytics and data science applications.
1. Domain-oriented data ownership and architecture
A data mesh builds on author Eric Evans's theory of domain-driven design, which explores how to deconstruct applications into distributed services aligned around business capabilities. But instead of thinking only about services, data teams also need to host and serve domain data sets in a way that's easily consumed by others across the organization. Rather than pushing data into a central platform for ingestion, these teams need to think about how to host data in a decentralized way so it can be pulled by different users.
The core principle is that data should be the responsibility of the business teams closest to the data. Domain teams should have access to tools that create analytics data, its metadata and all the computations required to serve it.
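This pull model can be made concrete with a minimal sketch. The class and method names below are illustrative, not part of any real data mesh product: a domain team publishes a data set it owns and serves, and any consumer in the organization pulls it on demand, rather than a central team ingesting it.

```python
class DomainDataServer:
    """Toy sketch: a domain hosts its own analytical data sets for others to pull."""

    def __init__(self, domain: str):
        self.domain = domain
        self._datasets: dict[str, list[dict]] = {}

    def publish(self, name: str, rows: list[dict]) -> None:
        # The owning team serves the data itself; no central team ingests it.
        self._datasets[name] = rows

    def pull(self, name: str) -> list[dict]:
        # Any consumer across the organization pulls the data on demand.
        return list(self._datasets[name])


# The checkout domain team owns and serves its own orders data set.
checkout = DomainDataServer("checkout")
checkout.publish("orders", [{"order_id": "o-1", "amount": 42}])
```

The key design point is that ownership and serving live with the domain object itself; a central platform would only supply the hosting machinery, not operate the data set.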
2. Data as a product
The software industry has been transitioning from project management to product management. A data mesh applies the same concept to data products. Domain experts must focus on improving various aspects of these data products, such as data quality, lead time of data consumption and user satisfaction.
A fundamental principle is that accountability shifts as close to the data source as possible rather than to a data engineering team that may be less familiar with how the data was collected, what it means and how it might be used. Data engineering teams need to focus on implementing the infrastructure that works across domains so it's easier to create and manage these products through capabilities like discoverability, explorability, security, trustworthiness and understandability.
A data product is built on several structural components, including the following:
- The code comprises the data pipelines, the APIs that access the data and the access control policies.
- The data can consist of events, tables, batch files or graphs, while the metadata describes what it means.
- The infrastructure helps physically provision and manage the data.
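One way to picture these components together is as a single descriptor that bundles code references, data and metadata, and infrastructure details. The fields, file paths and example values below are hypothetical, a sketch of what a data product record might carry rather than any standard format.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Hypothetical descriptor bundling a data product's three structural components."""
    name: str
    owner_domain: str
    # Code: pipeline entry point and a reference to an access control policy
    pipeline_entrypoint: str
    access_policy: str
    # Data and metadata: the output form plus schema and description
    output_format: str                      # e.g. "table", "event-stream", "batch-file"
    schema: dict = field(default_factory=dict)
    description: str = ""
    # Infrastructure: where the product is physically provisioned
    storage_uri: str = ""


# An illustrative product owned by a checkout domain team.
checkout_orders = DataProduct(
    name="checkout-orders",
    owner_domain="checkout",
    pipeline_entrypoint="pipelines/orders_daily.py",
    access_policy="policies/orders_read.rego",
    output_format="table",
    schema={"order_id": "string", "amount": "decimal", "event_ts": "timestamp"},
    description="Cleaned daily orders emitted by the checkout domain",
    storage_uri="s3://checkout-domain/orders/daily/",
)
```

Keeping metadata like the schema and description alongside the code and storage references is what makes the product discoverable and understandable to consumers outside the owning domain.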
3. Self-service data platform
Business teams aren't data engineers, nor should they be. Data engineers must build the appropriate infrastructure to provide these domain experts with domain autonomy. This infrastructure might take advantage of existing data platforms and tools, but it also needs to support self-service provisioning capabilities for data products that are accessible to a broader audience. These users should be able to work with data storage formats, create data product schemas, set up data pipelines, manage data product lineage and automate governance.
One approach is to set up a multiplane data platform analogous to the different planes in network routing. A data infrastructure provisioning plane helps set up the underlying infrastructure. A data product developer experience plane simplifies development workflows with tools to create, read, version, secure and build data products. A data mesh supervision plane helps implement new services across the infrastructure for things like discovering data products or correlating multiple data products together.
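The three planes can be sketched, very loosely, as one toy platform object. The method names and URI scheme below are invented for illustration; a real platform would expose these as separate services.

```python
class SelfServicePlatform:
    """Toy sketch collapsing the three planes into one object; all names illustrative."""

    def __init__(self):
        self._storage: dict[str, str] = {}  # provisioning plane: domain -> storage URI
        self._catalog: dict[str, set] = {}  # supervision plane: product -> tags

    # -- data infrastructure provisioning plane --
    def provision_storage(self, domain: str) -> str:
        uri = f"warehouse://{domain}/products"
        self._storage[domain] = uri
        return uri

    # -- data product developer experience plane --
    def create_product(self, domain: str, name: str, tags: list[str]) -> str:
        # Provisions underlying storage on demand, then registers the product.
        base = self._storage.get(domain) or self.provision_storage(domain)
        self._catalog[f"{domain}.{name}"] = set(tags)
        return f"{base}/{name}"

    # -- data mesh supervision plane --
    def discover(self, tag: str) -> list[str]:
        # Mesh-wide discovery across every domain's registered products.
        return sorted(p for p, tags in self._catalog.items() if tag in tags)


platform = SelfServicePlatform()
uri = platform.create_product("checkout", "orders", ["sales"])
```

The point of the layering is that a domain team only touches the developer experience plane; provisioning and mesh-wide supervision happen beneath and above it without the team having to be data engineers.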
4. Federated computational governance
A data mesh needs a decentralized governance model that can automate the execution of decisions across the platform. This model ensures interoperability across the different data sources. It can also help correlate, join and perform other operations across multiple data products at scale.
That differs from traditional data governance approaches for analytics that try to centralize all decision-making. Each domain is responsible for some decisions, such as the domain data model and quality assurance. A centralized data engineering team shifts its focus to automating many aspects of governance, such as implementing tools to detect and recover from errors, automate processes and establish service-level objectives for the enterprise.
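The "computational" part of this governance model can be sketched as a small set of global, machine-checkable rules that every domain's products are tested against automatically, while each domain keeps control of its own data model. The rule names and fields below are illustrative, not a standard.

```python
# Global rules the federation has agreed on. Each rule is a predicate over a
# product descriptor (here just a dict); domains remain free to shape the rest
# of their schemas however they like.
GLOBAL_RULES = [
    # every product must declare its owning domain
    ("has_owner", lambda p: bool(p.get("owner"))),
    # timestamps must use one shared column name so products can be joined
    ("standard_timestamp", lambda p: "event_ts" in p.get("schema", {})),
]


def check_product(product: dict) -> list[str]:
    """Return the names of any global rules the product violates."""
    return [name for name, rule in GLOBAL_RULES if not rule(product)]


orders = {"owner": "checkout",
          "schema": {"order_id": "string", "event_ts": "timestamp"}}
clicks = {"owner": "web",
          "schema": {"click_id": "string", "ts": "timestamp"}}
```

Here the `clicks` product would fail the shared-timestamp rule, which is exactly the kind of interoperability defect that automated, federated governance is meant to catch before products are correlated or joined at scale.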
Potential benefits of implementing a data mesh
A data mesh architecture provides several benefits, including the following:
- improves the experience for teams using and consuming data;
- lowers the costs of setting up new analytics and data science products;
- unlocks new trend analysis and business intelligence use cases;
- democratizes access to data to enable faster decision-making;
- speeds innovation of data products;
- reduces technical debt;
- improves interoperability across data sets;
- increases the reuse of data; and
- automates security and compliance for analytics and AI use cases.
Data mesh vs. data lake
A data lake centralizes all data to improve reuse across the organization. The focus is on improving the infrastructure for ingesting data and then transforming it after it has been stored. Concerns about data quality and transformation are applied after the fact by data engineering teams that might not be familiar with how the data was collected and what it means.
A data mesh engages domain experts to clean up data as it enters the system. As with DevOps in software development, this process identifies defects much earlier in the data lifecycle, where it's cheaper and easier to remediate. Data engineering teams focus on the infrastructure to enable data domain experts to create their own data products.
Data mesh vs. data fabric
A data fabric is a design concept that uses technical analysis and semantic data to support the design, deployment and use of data infrastructure. Technologies like semantic knowledge graphs, active metadata management and machine learning help monitor and tune data infrastructure. Data fabrics attempt to automate tasks such as discovering and aligning schemas, healing data pipeline failures and profiling data.
A data mesh architecture shifts the focus from a centralized data infrastructure to domain-specific data products. Data engineering teams are required to develop a platform for empowering federated business teams to manage more aspects of data quality on their own. A data mesh may take advantage of data fabrics to help set up a self-service data infrastructure platform along with other data management applications and platforms.
Data mesh design and implementation: Challenges and best practices
Enterprises need to approach data mesh design as an organizational problem. The biggest challenge is change management. Business units that generate data might not be familiar with how to create data products. Data domain experts need to learn about concepts like data quality, service-level objectives and experience design for data users. Business units may perceive this shift as an additional burden. Business teams might also lack the data literacy needed to communicate data requirements and judge a data set's suitability for different use cases.
It's probably best to start with a small project to create a set of data products that are critical to different areas of the business. These teams should identify essential data requirements for existing use cases and then collaborate on the first data product prototype. Over time, they can refine these requirements and establish best practices to ensure high quality and improve the data consumption experience.
It might also be helpful to bring in application product experts to help guide these discussions. Enterprises might also want to introduce data literacy across the organization to help identify ways to use these early data products. Once a baseline is established, data engineering teams can shortlist the kind of self-service infrastructure that might help automate the process of creating and sharing data products.