Definition

AWS Data Pipeline (Amazon Data Pipeline)

AWS Data Pipeline is an Amazon Web Services (AWS) tool that enables an IT professional to process data and move it between AWS compute and storage services and on-premises resources.

AWS Data Pipeline manages and streamlines data-driven workflows, including the scheduling of data movement and processing. The service is useful for customers who want to move data along a defined pipeline of sources, destinations and data-processing activities.

Using a Data Pipeline template, an IT pro can access information from a data source, process it and then automatically transfer the results to another system or service. AWS Data Pipeline is accessible through the AWS Management Console, the command-line interface or the service APIs.
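As a rough illustration of the API route, the sketch below creates, defines and activates a pipeline. It assumes the `client` argument behaves like the object returned by `boto3.client("datapipeline")`. The function name `deploy_pipeline` and its arguments are hypothetical, and the error handling is deliberately minimal.

```python
def deploy_pipeline(client, name, unique_id, pipeline_objects):
    """Create a pipeline, upload its definition, and activate it.

    `client` is assumed to expose the boto3 Data Pipeline client methods
    create_pipeline, put_pipeline_definition and activate_pipeline.
    """
    # Step 1: register an empty pipeline and keep its ID.
    pipeline_id = client.create_pipeline(name=name, uniqueId=unique_id)["pipelineId"]

    # Step 2: upload the objects (schedules, data nodes, activities, ...).
    result = client.put_pipeline_definition(
        pipelineId=pipeline_id,
        pipelineObjects=pipeline_objects,
    )
    if result.get("errored"):
        raise ValueError(f"definition rejected: {result.get('validationErrors')}")

    # Step 3: start the pipeline.
    client.activate_pipeline(pipelineId=pipeline_id)
    return pipeline_id
```

Passing the client in as a parameter keeps the function easy to exercise with a stub, without AWS credentials.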

An activity is an action that AWS Data Pipeline performs, such as a SQL query or command-line script. A developer can associate an optional precondition with a data source or activity; the precondition must be met before the activity runs. AWS Data Pipeline includes several standard activities and preconditions for services like Amazon DynamoDB and Amazon Simple Storage Service (S3).
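To make the activity and precondition relationship concrete, here is a minimal sketch of a pipeline definition fragment built as a Python dictionary. The object types `ShellCommandActivity` and `S3KeyExists` come from the service's pipeline definition syntax. The IDs, bucket name and command are made-up placeholders, and a complete definition would also need a schedule and a compute resource.

```python
import json

# Precondition: only proceed once a marker object exists in S3.
# The bucket and key are hypothetical placeholders.
precondition = {
    "id": "InputReady",
    "type": "S3KeyExists",
    "s3Key": "s3://example-bucket/input/ready.txt",
}

# Activity: a command-line script that refers to the precondition by ID.
activity = {
    "id": "CountLines",
    "type": "ShellCommandActivity",
    "command": "wc -l /tmp/input.csv",
    "precondition": {"ref": "InputReady"},
}

definition = {"objects": [precondition, activity]}
print(json.dumps(definition, indent=2))
```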

A developer can manage compute resources directly or let AWS Data Pipeline manage them. Resources that AWS Data Pipeline can manage include Amazon EC2 instances and Amazon Elastic MapReduce (EMR) clusters. The service provisions an EC2 instance or EMR cluster as needed and terminates it when the activity finishes.
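A service-managed resource might be declared roughly as follows. This sketch assumes the `Ec2Resource` object type with its `instanceType` and `terminateAfter` fields. The instance type, timeout and IDs are illustrative values only, not recommendations.

```python
import json

# A resource that the service is asked to launch and later terminate.
resource = {
    "id": "WorkerInstance",
    "type": "Ec2Resource",
    "instanceType": "t2.micro",      # illustrative; availability varies
    "terminateAfter": "30 Minutes",  # safety limit on instance lifetime
}

# The activity points at the resource it should run on.
activity = {
    "id": "RunScript",
    "type": "ShellCommandActivity",
    "command": "echo hello",
    "runsOn": {"ref": "WorkerInstance"},
}

definition = {"objects": [resource, activity]}
print(json.dumps(definition, indent=2))
```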

Examples

For example, a data scientist could schedule a job in AWS Data Pipeline that reads log data from Amazon S3 every hour and then transfers it to a relational database or a NoSQL database for later analysis. AWS Data Pipeline can also transform data into a SQL format, make copies of distributed data, send data to Amazon EMR applications, or run scripts that send data to Amazon S3, Amazon Relational Database Service (RDS) or Amazon DynamoDB.
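The hourly log-copy scenario could be sketched as the set of objects below. The object types (`Schedule`, `S3DataNode`, `SqlDataNode`, `CopyActivity`) are taken from the pipeline definition syntax. All IDs, paths and table names are invented for illustration, and the database connection and compute resource are omitted. The helper `unresolved_refs` is a hypothetical local check, not part of any AWS library.

```python
objects = [
    # Run once per hour, starting when the pipeline is first activated.
    {"id": "Hourly", "type": "Schedule", "period": "1 hours",
     "startAt": "FIRST_ACTIVATION_DATE_TIME"},
    # Source: a log file in S3 (placeholder path).
    {"id": "LogInput", "type": "S3DataNode",
     "filePath": "s3://example-bucket/logs/latest.csv",
     "schedule": {"ref": "Hourly"}},
    # Destination: a relational table (connection details omitted here).
    {"id": "LogTable", "type": "SqlDataNode", "table": "access_logs",
     "schedule": {"ref": "Hourly"}},
    # The copy step that ties source, destination and schedule together.
    {"id": "CopyLogs", "type": "CopyActivity",
     "input": {"ref": "LogInput"}, "output": {"ref": "LogTable"},
     "schedule": {"ref": "Hourly"}},
]


def unresolved_refs(objs):
    """Return the set of referenced IDs that no object defines."""
    ids = {o["id"] for o in objs}
    missing = set()
    for o in objs:
        for value in o.values():
            if isinstance(value, dict) and "ref" in value and value["ref"] not in ids:
                missing.add(value["ref"])
    return missing


print(unresolved_refs(objects))
```

Checking references locally is a cheap way to catch typos in IDs before a definition is uploaded.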

The AWS Data Pipeline service is suited to workflows already optimized for AWS, but it can also connect to on-premises data sources, as well as third-party data sources. When installed on local servers, the Java-based Task Runner package continuously polls AWS Data Pipeline for work, which enables the service to operate on on-premises resources.
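The polling pattern can be approximated with the service APIs. The sketch below is not Task Runner's actual implementation. It assumes a `client` that behaves like `boto3.client("datapipeline")`, which offers `poll_for_task` and `set_task_status`. The function `poll_once`, the worker group name and the `handler` callback are hypothetical.

```python
def poll_once(client, worker_group, handler):
    """Ask the service for one task, run it locally, and report the outcome.

    Returns "FINISHED" or "FAILED", or None when no task was available.
    """
    response = client.poll_for_task(workerGroup=worker_group)
    task = response.get("taskObject") or {}
    task_id = task.get("taskId")
    if not task_id:
        return None  # nothing to do right now

    try:
        handler(task)  # local work, e.g. reading an on-premises database
        status = "FINISHED"
    except Exception:
        status = "FAILED"

    client.set_task_status(taskId=task_id, taskStatus=status)
    return status
```

A long-running worker would call `poll_once` in a loop. Activities are routed to such workers when they name the matching `workerGroup` in the pipeline definition.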

Pricing

AWS Data Pipeline charges vary according to the region in which customers use the service, whether they run on premises or in the cloud, and the number of preconditions and activities they use each month.

AWS provides a free tier of service for AWS Data Pipeline. New customers receive three free low-frequency preconditions and five free low-frequency activities each month for one year. These low-frequency activities and preconditions run no more than once a day.
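As a simple illustration of the free-tier allowance described above, this sketch counts how many low-frequency activities and preconditions would fall outside it in a given month. The function name is hypothetical, and no prices are included because rates vary by region.

```python
FREE_LOW_FREQ_ACTIVITIES = 5     # per month, from the free tier above
FREE_LOW_FREQ_PRECONDITIONS = 3  # per month, from the free tier above


def billable_counts(low_freq_activities, low_freq_preconditions):
    """Return how many low-frequency items exceed the free-tier allowance."""
    return {
        "activities": max(0, low_freq_activities - FREE_LOW_FREQ_ACTIVITIES),
        "preconditions": max(0, low_freq_preconditions - FREE_LOW_FREQ_PRECONDITIONS),
    }


# Seven activities and two preconditions: two activities are billable.
print(billable_counts(7, 2))
```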

This was last updated in May 2017
