In today's data-driven economy, companies can't afford to have data-related issues, but many still do. Despite the exploding volume of data organizations continue to amass, they're still having trouble accessing and using that data.
To improve the speed and accuracy of analytics insights, data engineers are constructing data analytics pipelines, or data pipelines, to operationalize data.
What is a data analytics pipeline?
An analytics pipeline streamlines data flow to improve the speed and quality of insights. Similar to a continuous integration/continuous delivery (CI/CD) pipeline used by a DevOps team, the speed advantage of an analytics pipeline hinges on automating tasks.
"If the owner of a finance group asks me for a cash flow report, I may have to extract the data manually [and] update that record myself," said Dan Maycock, principal of engineering and analysis at hop farm Loftus Labs. "When I'm manually extracting data every time it's requested, it doesn't happen as frequently. If I have a pipeline, that's happening automatically."
According to Pieter Vanlperen, managing partner at PWV Consultants, a process modernization consultancy, other areas that need at least some automation in the analytics pipeline, depending on how advanced the pipeline is, include data governance, data quality, data usability and categorization.
Organizations commonly run more than one analytics pipeline because each may serve a different purpose. Colleen Tartow, director of engineering at Starburst Data, a distributed SQL query engine platform provider, said data engineering is critical to pipeline function because pipelines are often complex and vary in maturity.
"You could have a straightforward cloud-native pipeline using a modern data stack, or you could have a data center-based infrastructure that requires constant management alongside the actual data pipeline itself," she said.
Maycock uses one pipeline to transport data from its original source to a central repository and another pipeline to transport data from the central repository to a map, BI tool or data model.
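A two-stage arrangement like this can be sketched in a few lines. The example below is a minimal illustration, not Maycock's actual setup: the sample records, the `sales` table and both function names are hypothetical, and an in-memory SQLite database stands in for the central repository.

```python
import sqlite3

# Hypothetical source records; a real pipeline would pull these from an API or file.
SOURCE_ROWS = [
    {"region": "north", "amount": 120.0},
    {"region": "south", "amount": 95.5},
    {"region": "north", "amount": 80.0},
]


def ingest_to_repository(rows, conn):
    """Pipeline 1: move raw records from the source into the central repository."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales (region, amount) VALUES (:region, :amount)", rows
    )
    conn.commit()
    return len(rows)


def publish_summary(conn):
    """Pipeline 2: read from the repository and shape data for a BI tool or model."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return {region: total for region, total in cursor}


conn = sqlite3.connect(":memory:")  # stand-in for the central repository
loaded = ingest_to_repository(SOURCE_ROWS, conn)
summary = publish_summary(conn)
```

Keeping the two stages separate means the repository can be reloaded, or the downstream output reshaped, without touching the other stage.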
"In the early 2000s when I started, you were pretty much on your own building and maintaining [pipelines], but that isn't the case anymore," he said.
Other benefits of an analytics pipeline
Analytics pipelines can help organizations achieve higher levels of agility and resiliency, especially when they're built iteratively.
"The idea is that you're iterating on your designs through the canvas on which the pipeline is built. The benefit is higher productivity," said Arvind Prabhakar, CTO of StreamSets, a DataOps platform provider.
Analytics pipelines, like CI/CD pipelines, also provide visibility across the engineering and operations functions, which enables continuous feedback loops, faster iteration and quicker issue resolution. According to Prabhakar, the previous generation of platforms and tooling treated data operations as hidden workloads.
"In this new world of DataOps where every end point, every pipeline is [potentially] the weakest link, you need the ability to constantly monitor and manage because the pipelines themselves are a reflection of how your data architecture is evolving," Prabhakar said.
And cross-functional visibility into the analytics pipeline can help enable process improvements. Data observability makes sure business needs and processes are modeled in the analytics pipeline as well, Prabhakar said.
"These pipelines are not just artifacts of the design choices that data engineers made," he said. "They actually reflect business processes that are engrained in the fabric of the enterprise's data architecture."
Analytics pipeline scalability
Scalability is essential so the data analytics pipeline can adapt to growing data volumes. Scalability alone is not enough, however. The pipeline must also integrate with the analytics capabilities that already exist in the data architecture.
When building a scalable data analytics pipeline, consider both input data and output data. Knowing the context and volume of the input data helps determine the format in which to store it and the technology to use. For output data, consider the end users. Data analysts rely heavily on this information, so the output data must be accessible and transparent for them.
Also consider how much data the analytics pipeline can ingest. Infrastructure must be able to handle a sudden change in data volume, for example, due to business growth. One option is to set up the pipeline in the cloud to allow for further flexibility and, ultimately, scalability.
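One way to keep ingestion stable when volume jumps is to process records in fixed-size batches, so memory use depends on the batch size rather than on the size of the input. The sketch below is only an illustration under that assumption. The `batches` and `ingest` helpers are hypothetical names, and the write step is a placeholder for a real load into a store.

```python
from itertools import islice


def batches(records, size):
    """Yield lists of at most `size` records from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def ingest(records, size=1000):
    """Consume records batch by batch; return (records seen, batches written)."""
    total = 0
    batch_count = 0
    for chunk in batches(records, size):
        # Placeholder for the real write, e.g. a bulk insert into a warehouse.
        total += len(chunk)
        batch_count += 1
    return total, batch_count


result = ingest(range(2500), size=1000)
```

Because `batches` pulls from an iterator, the same code handles 2,500 records or 2.5 billion. Only the run time changes, not the memory footprint.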
Challenges with creating an analytics pipeline
The point of an analytics pipeline is to expedite the delivery of data, but a common obstacle is the data itself.
"I might have built a pipeline, but I really don't have any more information because the data warehouse or the data lake I built is so poorly governed that it's a swamp," Vanlperen said.
He said poor governance can quickly make data unusable. It's important to understand which data sources matter and tune them so they are useful, he said.
The diversity of data sources can also be problematic.
"Every software platform can have its own API and their own data model [because] there's not necessarily a role in software development specifying how data is presented to a data pipeline or an ETL platform," Maycock said. "Being able to connect to and extract data, depending on how foreign that platform is, can be somewhat difficult, as well as being able to access the information in a consistent way."
Another issue organizations face is that no one is responsible for understanding the full inventory of what data is available in-house and from third-party sources. Some argue that's a telltale sign of needing a chief data officer or at least someone responsible for understanding and operationalizing data.
"Ten years ago, the data engineer was expected to know everything, and they were given a big docket which contained all the specifications of the data infrastructures," Prabhakar said. "Now, the data engineer has no clue of where the data is coming from, who owns it [or] where it originated, let alone the schema, structure and semantics."
Ten years ago, data engineers and operations personnel also often worked in silos. That should no longer be the case, because disconnects between groups create friction that slows value delivery. Cross-functional disconnects can also hurt business operations. For example, if the analytics pipeline starts losing 10% of its data, the downstream analytics results become dubious.
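Silent data loss of this kind can be caught with a simple reconciliation check that compares record counts entering and leaving a stage. The following is a minimal sketch, not a feature of any particular DataOps product. The function names, the stage name and the 1% default threshold are all assumptions chosen for illustration.

```python
def loss_ratio(rows_in, rows_out):
    """Fraction of records that entered a stage but did not leave it."""
    if rows_in <= 0:
        raise ValueError("rows_in must be positive")
    return max(0.0, (rows_in - rows_out) / rows_in)


def check_stage(name, rows_in, rows_out, max_loss=0.01):
    """Return an alert message if loss exceeds max_loss, otherwise None."""
    ratio = loss_ratio(rows_in, rows_out)
    if ratio > max_loss:
        return (
            f"ALERT {name}: lost {ratio:.1%} of records "
            f"({rows_in} in, {rows_out} out)"
        )
    return None


# A stage that drops 10% of its input is flagged; a near-lossless run is not.
alert = check_stage("load_warehouse", rows_in=10_000, rows_out=9_000)
ok = check_stage("load_warehouse", rows_in=10_000, rows_out=9_995)
```

In practice the returned message would be routed to whatever alerting channel the operations team already watches, so engineers and operators see the same signal.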
"When you talk about continuous operations, the goal of the pipeline is to establish a tight feedback loop between the data engineers and the operators," Prabhakar said. "You want the pipelines to automatically start raising a flag that something has changed."
Analytics pipelines are essential for any insight-driven organization. When designed and implemented well, they can help a company meet its strategic goals sooner.