A data pipeline is a set of network connections and processing steps that moves data from a source system to a target location and transforms it for planned business uses. Data pipelines are commonly set up to deliver data to end users for analysis, but they can also feed data from one system to another as part of operational applications.
As more companies integrate data and analytics into their business operations, the role and importance of data pipelines grow accordingly. Organizations can have thousands of data pipelines that move data from source systems to target systems and applications. With so many pipelines, it's important to simplify them as much as possible to reduce management complexity.
To effectively support data pipelines, organizations require the following components:
The data pipeline is a key element in the overall data management process. Its purpose is to automate and scale repetitive data flows and associated data collection, transformation and integration tasks. A properly constructed data pipeline can accelerate the processing that's required as data is gathered, cleansed, filtered, enriched and moved to downstream systems and applications.
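As a simple illustration of that flow, the Python sketch below chains pipeline stages together as generator functions so records are cleansed, filtered, enriched and then moved to a target; the record fields and business rules are purely hypothetical.

# Minimal sketch of a data pipeline as a chain of processing stages.
# The record fields (user_id, country, spend) are illustrative only.

def cleanse(records):
    """Drop records missing required fields and normalize casing."""
    for rec in records:
        if rec.get("user_id") and rec.get("country"):
            rec["country"] = rec["country"].strip().upper()
            yield rec

def filter_active(records):
    """Keep only records that meet a business rule (non-zero spend)."""
    for rec in records:
        if rec.get("spend", 0) > 0:
            yield rec

def enrich(records):
    """Add a derived field that downstream consumers can use directly."""
    for rec in records:
        rec["spend_band"] = "high" if rec["spend"] >= 100 else "low"
        yield rec

def load(records, target):
    """Move the processed records to the target system (a list here)."""
    target.extend(records)

source = [
    {"user_id": 1, "country": " us ", "spend": 250},
    {"user_id": 2, "country": "de", "spend": 0},
    {"user_id": None, "country": "fr", "spend": 40},
]

warehouse = []
load(enrich(filter_active(cleanse(source))), warehouse)
print(warehouse)  # [{'user_id': 1, 'country': 'US', 'spend': 250, 'spend_band': 'high'}]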
Well-designed pipelines also enable organizations to take advantage of big data assets that often include large amounts of structured, unstructured and semistructured data. In many cases, some of that is real-time data generated and updated on an ongoing basis. As the volume, variety and velocity of data continue to grow in big data systems, the need for data pipelines that can linearly scale -- whether in on-premises, cloud or hybrid cloud environments -- is becoming increasingly critical to analytics initiatives and business operations.
A data pipeline is needed for any analytics application or business process that requires regular aggregation, cleansing, transformation and distribution of data to downstream data consumers. Typical data pipeline users include the following:
To make it easier for business users to access relevant data, pipelines can also feed that data into BI dashboards and reports, as well as operational monitoring and alerting systems.
The data pipeline development process starts by defining what, where and how data is generated or collected. That includes capturing source system characteristics, such as data formats, data structures, data schemas and data definitions -- information that's needed to plan and build a pipeline. Once it's in place, the data pipeline typically involves the following steps:
Many data pipelines also apply machine learning and neural network algorithms to create more advanced data transformations and enrichments. Examples include segmentation, regression analysis, clustering and the creation of advanced indices and propensity scores.
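As one illustration of such an enrichment step, the sketch below uses k-means clustering to attach a segment label to each customer record. It assumes scikit-learn and NumPy are installed, and the feature values are made up for illustration.

# Illustrative enrichment step that adds a customer segment label with
# k-means clustering (assumes scikit-learn is installed; the feature
# columns are hypothetical).
import numpy as np
from sklearn.cluster import KMeans

def add_segments(features: np.ndarray, n_segments: int = 3) -> np.ndarray:
    """Return a cluster/segment label for each input row."""
    model = KMeans(n_clusters=n_segments, n_init=10, random_state=0)
    return model.fit_predict(features)

# Each row: [annual_spend, visits_per_month] for one customer (made-up data).
customers = np.array([
    [120.0, 2], [150.0, 3], [900.0, 12],
    [880.0, 10], [40.0, 1], [60.0, 1],
])
segments = add_segments(customers)
print(segments)  # e.g., [1 1 0 0 2 2] -- label numbering may vary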
In addition, logic and algorithms can be built into a data pipeline to add intelligence.
As machine learning -- and, especially, automated machine learning (AutoML) -- processes become more prevalent, data pipelines likely will become increasingly intelligent. With these processes, intelligent data pipelines could continuously learn and adapt based on the characteristics of source systems, required data transformations and enrichments, and evolving business and application requirements.
These are the primary operating modes for a data pipeline architecture:
Event-driven processing can also be useful in a data pipeline when a predetermined event on the source system triggers an urgent action, such as a fraud detection alert at a credit card company. When that event occurs, the data pipeline extracts the required data and transfers it to designated users or another system.
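The simplified sketch below shows that trigger logic, with a hypothetical fraud rule and in-memory lists standing in for real alerting and batch systems.

# Simplified sketch of an event-driven trigger: when a record matches a
# predetermined condition (a hypothetical fraud rule), it is routed to an
# alerting target instead of the normal batch path.

FRAUD_THRESHOLD = 5_000  # illustrative rule: flag unusually large charges

def route(transaction, alert_queue, batch_queue):
    """Send suspicious transactions to the alert path, the rest to batch."""
    if transaction["amount"] >= FRAUD_THRESHOLD:
        alert_queue.append(transaction)   # triggers urgent downstream action
    else:
        batch_queue.append(transaction)   # processed on the normal schedule

alerts, batch = [], []
for txn in [{"id": 1, "amount": 75}, {"id": 2, "amount": 9_200}]:
    route(txn, alerts, batch)

print(alerts)  # [{'id': 2, 'amount': 9200}]
print(batch)   # [{'id': 1, 'amount': 75}]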
Data pipelines commonly require the following technologies:
Open source tools are becoming more prevalent in data pipelines. They're most useful when an organization needs a low-cost alternative to a commercial product. Open source software can also be beneficial when an organization has the specialized expertise to develop or extend the tool for its processing purposes.
An ETL pipeline refers to a set of integration-related batch processes that run on a scheduled basis. ETL jobs extract data from one or more systems, do basic data transformations and load the data into a repository for analytics or operational uses.
A data pipeline, on the other hand, involves a more advanced set of data processing activities for filtering, transforming and enriching data to meet user needs. As mentioned above, a data pipeline can handle batch processing but also run in real-time mode, either with streaming data or triggered by a predetermined rule or set of conditions. As a result, an ETL pipeline can be seen as one form of a data pipeline.
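For comparison, the sketch below shows what a minimal batch ETL job could look like using only the Python standard library; the file name, table name and column names are hypothetical.

# Minimal batch ETL sketch: extract rows from a CSV file, apply a basic
# transformation, and load them into a SQLite table for analytics use.
import csv
import sqlite3

def extract(path):
    """Read source records from a CSV file."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Apply basic cleanup and type conversion to each record."""
    for row in rows:
        yield (row["order_id"], row["region"].strip().upper(), float(row["amount"]))

def load(rows, conn):
    """Write the transformed records to the target repository."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("analytics.db") as conn:
        load(transform(extract("orders.csv")), conn)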
Many data pipelines are built by data engineers or big data engineers. To create effective pipelines, it's critical that they develop their soft skills -- meaning their interpersonal and communication skills. This will help them collaborate with data scientists, other analysts and business stakeholders to identify user requirements and the data that's needed to meet them before launching a data pipeline development project. Such skills are also necessary for ongoing conversations to prioritize new development plans and manage existing data pipelines.
Other best practices for data pipelines include the following: