A data pipeline is a set of automated processes that moves data from a source location to a target location.
Organizations can have thousands of intelligent data pipelines that move data from specialized source systems to equally specialized target systems and applications. With so many pipelines, it is important to simplify them as much as possible. As more organizations seek to integrate data and analytics into their business operations, the role and importance of intelligent data pipelines will grow as well.
To support data pipelines, organizations require:
- A graphical specification and development environment. These environments are used for defining, developing, quality testing, deploying and version controlling the library of specialized data pipelines.
- A data pipeline monitoring application that monitors, manages and troubleshoots all data pipelines.
Organizations also need data pipeline development, maintenance and management processes that treat data pipelines as specialized software assets.
Additionally, logic and algorithms can be built into a data pipeline to convert it into an intelligent pipeline. Intelligent pipelines can then be specialized for the source systems they pull data from, the data transformations that occur within the pipeline and any unique data and analytics requirements of the target system or application.
As machine learning and AutoML become more prevalent, data pipelines will increasingly become more intelligent. With these processes, data pipelines could continuously learn and adapt based upon the source systems, the required data transformations and enrichments, and the evolving business and operational requirements of the target systems and applications.
What is the purpose of a data pipeline?
The purpose of a data pipeline is to automate and scale common and repetitive data acquisition, transformation, movement and integration tasks. A properly constructed data pipeline strategy can accelerate and automate the processing associated with gathering, cleansing, transforming, enriching and moving data to downstream systems and applications. As the volume, variety and velocity of data continue to grow, the need for data pipelines that can linearly scale within cloud and hybrid cloud environments is becoming increasingly critical to the operations of a business.
Who needs a data pipeline?
A data pipeline is needed for any operational or business process that requires regular automated aggregating, cleansing, transforming and distribution of data to downstream data consumers. Typical data consumers include:
- operational monitoring and alerting systems;
- management reporting and dashboards;
- business intelligence analysis;
- business analysts;
- data science teams; and
- data stores such as data warehouses, data lakes and cloud data warehouses.
Many data pipelines also move data between advanced data enrichment and transformation modules, where neural network and machine learning algorithms can create more advanced data transformations and enrichments. This includes segmentation, regression analysis, clustering, and the creation of advanced indices and propensity scores.
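An enrichment step of this kind can be as simple as a rule-based segmentation module that tags each record before it moves downstream. The following is a minimal sketch; the spend thresholds, segment labels and customer data are illustrative assumptions, not a reference to any particular tool:

```python
# Hypothetical enrichment module: segment customers by total spend.
# Thresholds and labels are illustrative assumptions.

def segment(spend):
    """Assign a coarse segment label based on total spend."""
    if spend >= 100:
        return "high"
    if spend >= 20:
        return "medium"
    return "low"

customers = {"alice": 150.0, "bob": 45.0, "carol": 5.0}

# Enrich each record with its segment before passing it downstream.
enriched = {name: {"spend": s, "segment": segment(s)}
            for name, s in customers.items()}
print(enriched)
```

In a production pipeline, this rule-based step would typically be replaced by a trained clustering or propensity model, but the pipeline's role is the same: apply the enrichment and forward the augmented records.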
How does a data pipeline work?
A data pipeline automates the processing of moving data from one source system to another downstream application or system. The data pipeline development process starts by defining what, where and how data is collected. It captures source system characteristics such as data formats, data structures, data schemas and data definitions.
The data pipeline then automates the processes of extracting, transforming, combining, validating and loading data. The data is then used for operational reporting, business analysis, advanced data science analytics and data visualizations.
Data processing activities of a pipeline can either be processed as a sequential, time-series flow, or divided into smaller processing chunks that can take advantage of parallel processing capabilities.
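The sequential flow described above can be sketched as three small stages wired together. This is a minimal illustration, not a specific product's API; the record fields and cleansing rules are assumptions:

```python
# Minimal sketch of a sequential data pipeline: extract -> transform -> load.
# Source records, field names and cleaning rules are illustrative assumptions.

def extract(source):
    """Pull raw records from the source system (here, an in-memory list)."""
    return list(source)

def transform(records):
    """Cleanse and enrich: drop incomplete rows, normalize fields."""
    cleaned = []
    for rec in records:
        if rec.get("name") and rec.get("amount") is not None:
            cleaned.append({"name": rec["name"].strip().title(),
                            "amount": float(rec["amount"])})
    return cleaned

def load(records, target):
    """Write the transformed records to the target store (a list stand-in)."""
    target.extend(records)
    return len(records)

source = [{"name": " alice ", "amount": "10.5"},
          {"name": None, "amount": "3"},   # dropped: missing name
          {"name": "bob", "amount": 7}]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
print(loaded, warehouse)
```

For larger volumes, the same transform stage could be applied to smaller chunks of the record set in parallel, since each record here is processed independently.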
What are the different types of data pipeline technologies?
Here are some general data pipeline operating modes:
- Batch. Batch processing is most useful when an organization wants to move large volumes of data at a regularly scheduled interval, and the data does not need to be moved in real time. For example, it might be useful for integrating marketing data into a larger system for analysis.
- Real time or streaming. Real-time or streaming processing is useful when an organization processes data from a streaming source, such as data from financial markets or internet of things (IoT) devices and sensors. Real-time processing captures data as it comes off the source systems and performs rudimentary data transformations (filtering, sampling, aggregating, calculating averages, determining minimum and maximum values) before passing the data to downstream processes.
- Event-driven. Event-driven processing is useful when a predetermined event occurs on the source system that triggers an urgent action (such as anti-lock brakes, airbags, fraud detection or fire detection). When the predetermined event occurs, the data pipeline extracts the required data and transfers the data to a downstream process.
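The rudimentary streaming transformations mentioned above can be sketched as a generator that filters each incoming reading and maintains running minimum, maximum and average values before emitting a summary downstream. The sensor values and the validity threshold are assumptions for illustration:

```python
# Sketch of streaming processing: filter each reading as it arrives and
# maintain running min/max/average. Values and threshold are assumptions.

def stream_summarize(readings, min_valid=0.0):
    """Yield a running summary after each accepted reading."""
    count, total = 0, 0.0
    lo = hi = None
    for value in readings:
        if value < min_valid:        # filter: discard invalid readings
            continue
        count += 1
        total += value
        lo = value if lo is None else min(lo, value)
        hi = value if hi is None else max(hi, value)
        # emit the current summary to the downstream process
        yield {"count": count, "min": lo, "max": hi, "avg": total / count}

last = None
for summary in stream_summarize([21.5, -999.0, 23.0, 22.0]):
    last = summary
print(last)
```

An event-driven variant would wrap the same logic in a handler that fires only when a predetermined condition (for example, a reading exceeding an alert threshold) occurs.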
Each of these data pipeline operating modes requires the following technologies:
- Extract, transform, load (ETL) is the process of copying data from one or more source systems into a target system. Rudimentary data transformations (such as filtering, aggregations, sampling, calculating averages or in-row calculations) are performed before the data is loaded into the target system.
- Data integration tools perform the data extraction, transformations, cleansing and mapping of data. Data integration tools can be integrated with data management, data governance and data quality tools.
- Scripting languages are programming languages for a special runtime environment that automate the execution of tasks. Scripting languages are often interpreted rather than compiled.
- SQL is a domain-specific programming language designed for managing data held in relational database management systems, or for stream processing in a relational data stream management system.
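A typical pipeline transformation expressed in SQL might aggregate source rows before loading them downstream. The following sketch uses Python's built-in sqlite3 module as a stand-in database; the orders table, column names and aggregation are illustrative assumptions about a hypothetical source schema:

```python
# SQL-based transformation step, sketched with Python's built-in sqlite3.
# Table, columns and data are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)])

# Aggregate per customer: a common transformation in an ETL pipeline.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer").fetchall()
print(rows)
conn.close()
```

In a real pipeline, the same SQL statement would run against the organization's relational database, with the results loaded into the downstream store.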
Open source tools are becoming more prevalent regarding data pipeline tools. Open source tools are most useful when an organization needs a low-cost alternative to a commercial vendor. They are also useful when an organization has the specialized expertise to develop or extend the tool for their unique processing purposes.
What is the difference between an ETL pipeline and a data pipeline?
An ETL pipeline refers to a set of processes that run on a scheduled basis. ETL pipelines extract data from one system, perform rudimentary transformations on the data and load the data into a repository such as a database, data warehouse or data lake.
A data pipeline, however, refers to a set of advanced data processing activities that integrates both operational and business logic to perform advanced sourcing, transformation and loading of data. A data pipeline can run on a scheduled basis, in real time (streaming) or when triggered by a predetermined rule or set of conditions.
Data pipeline summary
It is critical for any data engineer or big data engineer to develop their soft skills -- meaning their interpersonal and communication skills. These skills help engineers collaborate comfortably with their business and operational stakeholders to identify, validate and prioritize data before launching a data pipeline development project, and to discuss any reporting and analytics requirements. They are also necessary for ongoing conversations with an organization's stakeholders to prioritize new development, existing maintenance and end-of-life decisions for data pipelines.