A DataOps pipeline is an Agile framework that many enterprises have adopted to better manage their data. It provides a backbone for streamlining the lifecycle of data aggregation, preparation, management and development for AI, machine learning and analytics. It promises substantial improvement over traditional approaches to data management in terms of the agility, utility, governance and quality of data-enhanced applications. The core idea lies in architecting data processes and technology to embrace change.
"DataOps brings a software engineering perspective and approach to managing data pipelines similar to the trend created in DevOps," said Sheel Choksi, solutions architect at Ascend.io, a data engineering company.
Traditional data management approaches focused on handling schema changes, but Choksi emphasized that teams also need to account for shifting business requirements, delivering for new stakeholders and integrating new data sources.
DataOps pipeline planning needs to include automated tools that support rapid change management through version control, data quality tests and continuous integration/continuous delivery (CI/CD) pipeline deployment. This can enable a more iterative process that increases a team's output while decreasing overhead.
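As a rough illustration of the data quality tests mentioned above, the following minimal sketch shows the kind of check a CI job could run before a pipeline change is deployed. The data set, the field names (`order_id`, `amount`) and the function `find_quality_issues` are hypothetical and are not taken from any tool named in this article.

```python
from __future__ import annotations

from typing import Any

# Hypothetical expectations for an illustrative "orders" data set.
REQUIRED_FIELDS = ("order_id", "amount")


def find_quality_issues(rows: list[dict[str, Any]]) -> list[str]:
    """Return a human-readable list of data quality problems found in rows."""
    issues: list[str] = []
    seen_ids: set[Any] = set()
    for index, row in enumerate(rows):
        # Completeness: every required field must be present and non-null.
        for field in REQUIRED_FIELDS:
            if row.get(field) is None:
                issues.append(f"row {index}: missing {field}")
        # Uniqueness: an order_id must not appear more than once.
        order_id = row.get("order_id")
        if order_id is not None:
            if order_id in seen_ids:
                issues.append(f"row {index}: duplicate order_id {order_id}")
            seen_ids.add(order_id)
        # Validity: a numeric amount must not be negative.
        amount = row.get("amount")
        if isinstance(amount, (int, float)) and amount < 0:
            issues.append(f"row {index}: negative amount {amount}")
    return issues
```

A CI step could call this function on a sample of the data and fail the build whenever the returned list is not empty, so that a bad change is caught before it reaches production.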
What is a DataOps pipeline?
DataOps is unique because it borrows established DevOps practices from software engineering and applies them to analytics, which speeds up analytics projects' time to value, said Sean Spediacci, senior product marketing manager at Fivetran.
DataOps is not a technology, but rather an approach to building and deploying analytic models efficiently. Teams need to consider the various steps, including maintenance, testing, data models, documentation and logging.
A DataOps pipeline builds on the core ideas of DataOps to solve the challenge of managing multiple data pipelines from a growing number of data sources in a way that supports multiple data users for different purposes, said Jason Tolu, product marketing director at Talend. This requires an overarching data management and orchestration structure to help provide a single source of truth to make decisions and run the business.
A DataOps pipeline can also be particularly effective for data scientists operating in multi-cloud environments, where data governance and ownership are growing concerns. As companies continue to scale on the cloud and enterprises find themselves collecting data from more and more disparate sources, having a single well to draw from will be essential.
Data teams have traditionally encountered a different set of conflicts with operations teams than developers have. Where developers only need to focus on code quality, data teams also need to address integration, data quality and governance. This motivated the advent of DataOps, which brings data creators and data consumers together to develop and deliver analytics, explained Ameesh Divatia, co-founder and CEO at Baffle, a cloud data protection firm.
This collaboration has become even more important with the rise in public cloud adoption. Analysts want to analyze data quickly, but that data must first be moved into the analytics domain in the cloud, where applications query it.
Spediacci said the main benefits of a DataOps pipeline are the repeatability, automation, consolidation and standardization of processes that are often ad hoc and not well documented. This enables data professionals to focus on scaling data delivery to their customers -- both internally and externally -- and encourages collaboration between teams that were previously siloed within their own unique workflows.
A DataOps pipeline can also increase the ROI of data science teams by automating the process of curating data sources and managing infrastructure, said Tolu. In addition, DataOps pipelines can automate the governance and security controls needed to keep data secure and in compliance with data protection laws. For example, a DataOps pipeline could potentially flag personally identifiable information that has been aggregated into a data set where it does not belong.
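To make the idea of flagging personally identifiable information more concrete, here is a minimal sketch of a pattern-based scan over records. It is an assumption-laden illustration: the two regular expressions only catch email-like strings and values formatted like U.S. Social Security numbers, the names `PII_PATTERNS` and `flag_pii` are hypothetical, and a production pipeline would likely need far broader detection.

```python
import re

# Illustrative patterns only; real PII detection needs many more rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def flag_pii(records):
    """Return (record index, field name, PII kind) for each suspected match."""
    findings = []
    for index, record in enumerate(records):
        for field, value in record.items():
            # Only string values are scanned in this sketch.
            if not isinstance(value, str):
                continue
            for kind, pattern in PII_PATTERNS.items():
                if pattern.search(value):
                    findings.append((index, field, kind))
    return findings
```

A pipeline step could run such a scan on data sets that are meant to be free of personal data and alert the owning team, or block publication, whenever the list of findings is not empty.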
Managing DataOps challenges
One big challenge in building a DataOps pipeline lies in repurposing data warehouses and data lakes to be more agile. Data repositories were created to break application silos, but they were not designed to handle operational use cases in real time, said Yuval Perlov, CTO at K2View, a data management tools provider.
Enterprises may also face challenges vetting the privacy and security capabilities of a DataOps pipeline, particularly as the scale of data increases. Divatia recommended that firms bring in security teams to work through some of these issues as a sort of DataSecOps process, akin to the shift from DevOps toward DevSecOps for secure coding in the development community. Doing so helps ensure privacy and security are built into the foundation of a cloud transformation, reducing the risk of data exposure.
Teams also need to strike the right balance between people, process and technology in building out a DataOps pipeline.
"Determining which approach makes the most sense for a particular organization, and in what order, is up to wide discussion in the DataOps community," Choksi said.
A DataOps pipeline also needs to address various integration challenges to smooth the flow of data across legacy systems, databases, data lakes and data warehouses that may reside on premises or in the cloud. Choksi recommended teams explore several tactical approaches for addressing integration challenges, including separating pipelines into smaller and more composable pieces, adding self-service so all stakeholders are empowered to pursue their goals, and investing in new tools and testing processes to lower the risk of change.
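One way to picture the recommendation to separate pipelines into smaller, composable pieces is shown in the sketch below, where each step is a small function that can be tested on its own and then chained with others. The step names (`drop_incomplete`, `normalize_names`), the `customer` field and the `compose` helper are hypothetical and serve only as an illustration.

```python
from functools import reduce
from typing import Callable

Row = dict
Step = Callable[[list], list]


def drop_incomplete(rows: list) -> list:
    """Keep only rows that have a non-null 'customer' value."""
    return [row for row in rows if row.get("customer") is not None]


def normalize_names(rows: list) -> list:
    """Trim whitespace and lowercase the 'customer' value of each row."""
    return [{**row, "customer": row["customer"].strip().lower()} for row in rows]


def compose(*steps: Step) -> Step:
    """Chain steps so the output of each one feeds the next, left to right."""
    def pipeline(rows: list) -> list:
        return reduce(lambda data, step: step(data), steps, rows)
    return pipeline


# Small steps can be recombined into different pipelines as needs change.
clean_customers = compose(drop_incomplete, normalize_names)
```

Because each step has the same input and output shape, a team can add, remove or reorder steps with a smaller blast radius than editing one monolithic job.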
Why choose DataOps
A DataOps pipeline may provide companies with the appropriate framework for extracting meaningful value from data.
"For the last few years now, everyone has been saying the same thing: Data is the new oil," Perlov said. "But the application of data doesn't matter if you don't have the right data, at the right place, at the right time."
The management and delivery of data are critical to meeting modern data requirements. DataOps embraces the idea that the data pipelines that guide how data is collected, modeled and delivered are just as critical as the applications and analytics they power, Perlov said.