Data pipelines deliver the fuel for data science, analytics
While analytics and data science lead to informed decision-making, the systems that ingest data from its source and prepare it for consumption are what make analysis possible.
Data pipelines are the movers of data.
They're what gets data from its source to its destination where it can be used for analytics and data science to fuel the decisions and actions that move organizations forward, and they're what prepares all that data for analysis along the way.
Without data pipelines, organizations only have a morass of untapped data.
But data pipelines are complex.
They're not made up of single tools that simply transport data from its source into a BI platform. Data must go through numerous steps along the way to ready it for analysis, and each of those steps is a segment of the data pipeline unto itself.
At the beginning is data ingestion. That's followed by data transformation, data operations (DataOps) and, finally, data orchestration.
And with both the number of sources organizations collect data from and the volume of data growing exponentially, the data that pipelines must move and prepare is also increasingly complex.
Meanwhile, there are vendors that specialize in each of the stages of the data pipeline, as well as some that provide tools for all the stages.
Recently, in advance of a virtual conference about data pipelines hosted by consulting firm Eckerson Group, analyst Kevin Petrie discussed the technology.
In an interview, Petrie, vice president of research at Eckerson Group, spoke about the evolution of data pipelines as data complexity has increased, the individual components that make up modern data pipelines and the many vendors developing tools to meet organizations' pipeline needs.
In addition, he spoke about the future of data pipelines and what trends might emerge over the next few years.
What components make up a modern data pipeline?
Kevin Petrie: You have a source that includes traditional systems such as mainframes. And it might include SAP databases, cloud databases or SaaS applications. Connecting a source to a target -- oftentimes for analytics -- is the pipeline in between.
The data pipeline is going to ingest data, meaning it is going to extract it and load it. The pipeline might be capturing that data, and it might be streaming that data in real time or low-latency increments to the target. Once it arrives at the target, you could be appending that data or merging it with data sets. Along the way -- or perhaps once it arrives at the target -- it needs to be transformed. That will be filtering data sets, combining tables or other objects. It might be restructuring the data or applying a schema to it. It might be cleansing it by removing duplicates and validating data quality.
This construct applies to structured data in rows and columns, semistructured data that is labeled, or fully unstructured data such as open text or pictures.
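As a rough illustration of the steps Petrie describes, the following Python sketch ingests rows from a source, transforms them by validating and removing duplicates, and then merges them into a target rather than appending. It is a minimal sketch and not any vendor's implementation: the function names, the `id` and `amount` fields, the validation rule and the sample data are all assumptions made for this example.

```python
from typing import Iterable

Row = dict[str, object]


def ingest(source: Iterable[Row]) -> list[Row]:
    """Extract rows from a (hypothetical) source and load them unchanged."""
    return [dict(row) for row in source]


def transform(rows: list[Row]) -> list[Row]:
    """Validate rows and drop duplicates, keyed on the assumed 'id' field."""
    seen: set[object] = set()
    clean: list[Row] = []
    for row in rows:
        amount = row.get("amount")
        # Validation rule (an assumption): an id is required and the
        # amount must be a non-negative number.
        if row.get("id") is None:
            continue
        if not isinstance(amount, (int, float)) or amount < 0:
            continue
        # De-duplication: keep only the first row seen for each id.
        if row["id"] in seen:
            continue
        seen.add(row["id"])
        clean.append(row)
    return clean


def merge(target: dict[object, Row], rows: list[Row]) -> dict[object, Row]:
    """Merge rows into the target by id, updating existing records."""
    merged = dict(target)
    for row in rows:
        merged[row["id"]] = {**merged.get(row["id"], {}), **row}
    return merged


# Made-up sample data for illustration only.
source = [
    {"id": 1, "amount": 10.0},
    {"id": 1, "amount": 10.0},    # duplicate, dropped
    {"id": 2, "amount": -5.0},    # fails validation, dropped
    {"id": None, "amount": 3.0},  # missing id, dropped
    {"id": 3, "amount": 7.5},
]
target = {1: {"id": 1, "amount": 2.0, "region": "EU"}}

result = merge(target, transform(ingest(source)))
print(result)
```

In this sketch, the record with `id` 1 is updated in place and keeps its existing `region` value, while the record with `id` 3 is added as new. A real pipeline would typically perform the same steps with dedicated ingestion and transformation tools rather than in-memory dictionaries.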
How does a modern data pipeline differ from the data pipelines of the past?
Petrie: There are pretty significant differences.
There was the era from the late 1990s until the late 2000s when there was periodic batch loading from a relational database to a monolithic on-premises data warehouse. It was the old extract, transform and load paradigm. That supported periodic operational reporting or dashboards. Then, the next chapter was the 2010s up to 2020 when there was an attempt at consolidation by bringing data from a lot of different sources to an often cloud-based data warehouse. That could be done with incremental batches as well as real-time streaming increments. That supported BI/analytics and data science.
Now, we're in an era of what I call synchronization where there are highly distributed environments -- hybrid, multi-cloud platforms and complex, multidirectional data flows. It's supporting business intelligence, data science and the merging of analytics with operations, which is automating actions and analytical outputs.
Pipelines have gotten faster, more numerous and more complex.
What has changed on a macro level to force the micro-level changes within data pipelines?
Petrie: I look at it in terms of supply and demand, and both are booming.
There's a proliferation of the supply of data -- new data sources, higher data volumes, higher data varieties -- and things are speeding up. There's the three V's -- which are volume, variety, velocity -- that are all increasing on the supply side. On the demand side, the business craves to empower its decision-makers with data to help drive decisions. You've got data democratization putting data and analytics in the hands of more business people, you've got advanced analytics seeking to predict what's around the corner, and you've also got a deepening of business intelligence. It's all intended to help businesses compete more effectively.
And between booming supply and booming demand, you have data pipelines, which are feeling a lot of strain.
If they're feeling strained by booming supply and demand, what are the biggest challenges for data pipelines?
Petrie: There are a lot of challenges, which I would put into three buckets.
The first one is that business requirements can change. The business craves data and analytical outputs more than ever, but also needs to respond quickly to changes in the market, customer actions and competitor actions. As a result, you can have a need for both low-latency data and rapid changes to the actual pipelines.
The second one is bottlenecks. Given the explosions in supply and demand, it's not surprising that bottlenecks develop in the pipelines that connect the two.
The third challenge is complexity. There are more users, use cases, projects, devices, tools, applications and platforms. There's quite a web of elements that pipelines seek to connect and serve.
What are some prominent vendors that provide tools for the different stages of data pipelines?
Petrie: Within data pipeline management, we've defined four market segments -- ingestion, transformation, data operations and orchestration.
Within data ingestion, a primary example of a vendor would be Fivetran. Fivetran really specializes in getting data at low latency with high throughput between many sources and many targets. On the transformation side, there are dbt Labs and Coalesce. Another example is Prophecy. On the DataOps side, there are a lot of different vendors. One example is DataOps.live, which supports DataOps and orchestration in Snowflake environments, in particular. And for orchestration of pipelines, and scheduling and monitoring workloads between pipelines and applications that consume them, [Apache] Airflow is a great example. That's an open source tool, and the vendor that provides that is Astronomer.
There are also vendors that play in multiple segments. Nexla addresses many segments, helping organizations build data products and deliver those. Rivery offers a suite that provides all of those capabilities. And StreamSets provides a lot in terms of ingestion and DataOps capabilities.
What criteria should organizations use when choosing tools to build their data pipelines?
Petrie: There are five parts.
One is performance -- what's the speed, the latency, the throughput at which you can deliver multistructured data from point A to point B? There's also ease of use. Automation is a critical way to reduce the burden on overwhelmed data engineers. A third criterion can be support for and integration with the ecosystem of commercial and open source elements in an environment. That's critical. You can't have proprietary pipelines. You have to be able to integrate with a wide variety of elements.
There's also governance. Pipelines are not governance solutions in themselves, but they need to assist governance within their capabilities and contribute to governance activities. The final one would be breadth of support, both in terms of breadth of capabilities and also breadth of sources and targets. If you can standardize on one pipeline tool that can support not just your current sources and targets, but your potential future ones, you've reduced your change complexity for the future.
Why don't you consider the analytics layer part of the data pipeline?
Petrie: The way I view it is the pipeline is delivering data to the BI layer or the data science layer, which is consuming the output of the pipeline.
That said, there are companies like Tableau that are BI tools reaching down into the pipeline layer.
Looking forward, how will data pipelines evolve over the next few years?
Petrie: There will be consolidation. I can count 20 or 30 data pipeline vendors across the various segments without taking a breath, so there will be some consolidation. There's also the reality that big cloud providers offer many of these capabilities. That's one trend that will happen.
Another trend is that there will continue to be a healthy startup ecosystem. Cloud platforms, with their open ecosystems of APIs and their elastic infrastructure, can foster pretty creative startups that offer a full suite of tools across all four segments -- ingestion, transformation, DataOps and orchestration. Rivery is just one of several startup companies that are offering a suite, kind of going counter to what you would expect of a startup, which is to focus on one niche. With that said, there will continue to be a lot of specialty tools -- Airflow and dbt Labs are examples of specialty capabilities that are going deep on one or two segments.
Is the data pipeline space more ripe for startups than the BI/analytics space?
Petrie: Probably. There's so much going on at both ends of the spectrum -- proliferation of data sources and data targets -- that it does create a lot of opportunity for startups addressing suites or providing specialty capabilities. So, yes, I think there is going to continue to be a lot of innovation, and it might be more than what's happening at the analytics layer.
The interesting thing about the data management space is that there are a lot of moving parts, so there are different combinations of capabilities. We've identified four data pipeline segments, but there are also aspects of data management like governance, cataloging, data fabric and data mesh. All of those have synergies with data pipelines. There's also workflow management at a higher level of the stack. All of that fosters innovation, and startups can offer different combinations of capabilities across the stack.
Editor's note: This Q&A has been edited for clarity and conciseness.
Eric Avidon is a senior news writer for TechTarget Editorial and a journalist with more than 25 years of experience. He covers analytics and data management.