AWS Glue and Azure Data Factory serve similar purposes. Both provide managed extract, transform and load services. Organizations can use these services to build integrated data pipelines in the cloud.
There are, however, important differences between Glue and Data Factory. The main differences lie in pricing models as well as support for data connectors and SQL Server Integration Services (SSIS) packages. Understanding these differences is crucial when it’s time to choose the right cloud ETL service for your workloads.
What is AWS Glue?
AWS Glue is a managed service on the Amazon cloud. It lets users collect, process and move data across data pipelines.
AWS Glue is a serverless offering; it doesn’t require that users set up and manage the underlying ETL hosting infrastructure. AWS Glue provides the functionality businesses need to create ETL pipelines. The only requirement for the user is defining a data pipeline and the processes they want to run as data moves through it.
What is Azure Data Factory?
Azure Data Factory is a managed ETL service on the Microsoft Azure cloud. Like AWS Glue, Azure Data Factory is designed to simplify processing and moving data across user-defined pipelines.
Data Factory is also a serverless offering; Azure provides and manages all the underlying infrastructure. Users can focus on their data pipelines without worrying about how those pipelines are hosted.
How to use AWS Glue and Azure Data Factory
The prerequisites for using AWS Glue and Azure Data Factory are essentially the same:
- Data sources. Data sources are the places where data originates. They could be a database; an object storage service, such as Amazon S3 or Azure Blob Storage; or any other place where data is stored or streamed. Although Glue and Data Factory integrate most easily with data sources in their own respective clouds, they can support external data sources.
- Data targets. Data targets are where data is located after it has been processed as part of an ETL pipeline. Usually, data targets are an object storage service or a database.
To build a data pipeline on either service, you must configure rules that define how the service collects data from data sources. The rules also define which processing methods, if any, are applied to data after collection. For example, you could pull data from an external data set to fill in missing information within a data source as part of a data pipeline. Alternatively, you could remove duplicate data entries from a data source.
The rules also specify which targets the data should be pushed to after processing is complete. Users can manage Glue and Data Factory through cloud consoles or command-line interface tools.
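The collect, process and push flow described above can be sketched in plain Python. This is only an illustration of the two example transformations (removing duplicates and filling in missing values from an external data set); it does not use the Glue or Data Factory APIs, and every function name and sample record here is hypothetical.

```python
from typing import Iterable

Record = dict[str, object]


def extract(source: Iterable[Record]) -> list[Record]:
    """Collect records from a data source (here, any iterable of dicts)."""
    return list(source)


def deduplicate(records: list[Record], key: str) -> list[Record]:
    """Keep only the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique


def fill_missing(
    records: list[Record], lookup: dict, key: str, field: str
) -> list[Record]:
    """Fill a missing `field` using an external lookup keyed by `key`."""
    filled = []
    for record in records:
        if record.get(field) is None and record[key] in lookup:
            # Copy the record so the source data is left untouched.
            record = {**record, field: lookup[record[key]]}
        filled.append(record)
    return filled


def load(records: list[Record], target: list[Record]) -> None:
    """Push processed records to a data target (here, an in-memory list)."""
    target.extend(records)


# Hypothetical sample data standing in for a real source and external data set.
source = [
    {"id": 1, "city": "Boston"},
    {"id": 2, "city": None},
    {"id": 1, "city": "Boston"},  # duplicate entry
]
external_cities = {2: "Seattle"}
target: list[Record] = []

rows = extract(source)
rows = deduplicate(rows, key="id")
rows = fill_missing(rows, external_cities, key="id", field="city")
load(rows, target)
print(target)
```

In a managed service, the source and target would be data stores such as object storage or a database rather than in-memory lists, and the service would run the steps on infrastructure it provisions.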
AWS Glue vs. Azure Data Factory: The main differences
Although Glue and Data Factory offer similar services and operate comparably, there are small but important differences between them.
Glue's pricing model is more standardized and, as a result, likely more predictable.
Glue charges mainly by data processing unit (DPU) hours. The DPU charge is consistent across most types of AWS Glue jobs and operations. At the time of publication, the DPU charge is $0.44/DPU-Hour in the AWS U.S. East (Ohio) region. There may be extra fees to extract data from data sources and for data catalog storage. There are no extra fees for services like pipeline runs.
With Azure Data Factory, you face more cost variables. Azure charges $0.25 per data integration unit (DIU) hour, a unit similar to the DPU-hour. Azure also charges separate fees for the amount of time a pipeline is operational, data reads and writes, and total pipeline runs.
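To make the difference in cost variables concrete, here is a rough estimating sketch. It uses only the two rates quoted above and assumes the Data Factory rate applies per unit-hour, as the Glue rate does. The additional Data Factory fees are left as parameters because their rates are not given here; the function and parameter names are hypothetical, and real bills depend on region and current pricing.

```python
GLUE_RATE_PER_DPU_HOUR = 0.44  # U.S. East (Ohio) rate quoted in the article
ADF_RATE_PER_DIU_HOUR = 0.25   # assumed to be an hourly rate per integration unit


def glue_job_cost(dpus: int, hours: float,
                  rate: float = GLUE_RATE_PER_DPU_HOUR) -> float:
    """Estimate a Glue job's cost as DPUs x hours x rate."""
    return dpus * hours * rate


def data_factory_cost(diu_hours: float, pipeline_runs: int, fee_per_run: float,
                      other_fees: float = 0.0,
                      rate: float = ADF_RATE_PER_DIU_HOUR) -> float:
    """Estimate a Data Factory cost from unit-hours plus separate fees.

    `fee_per_run` and `other_fees` are placeholders for charges the
    article mentions but does not price; supply current rates yourself.
    """
    return diu_hours * rate + pipeline_runs * fee_per_run + other_fees


# A job using 10 DPUs for 30 minutes consumes 5 DPU-hours.
print(f"Glue estimate: ${glue_job_cost(dpus=10, hours=0.5):.2f}")
```

The Glue estimate needs one rate and two inputs, while the Data Factory estimate needs several inputs that must be looked up separately, which is why its total tends to be harder to predict.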
AWS Glue and Azure Data Factory both provide a variety of data connectors. Connectors let the services connect to data stores that serve as data sources. However, there are some differences in which data stores they support.
Data Factory offers broader connector support for Microsoft products. Glue can integrate with the most popular Microsoft-based data stores, including SharePoint, but it doesn’t currently offer connectors for stores like Microsoft Access.
AWS Glue and Azure Data Factory can both import SSIS packages. SSIS is a platform for creating data pipelines using Microsoft SQL Server. It is commonly used to set up data pipelines in on-premises environments. Some businesses have SSIS packages that they want to migrate into the cloud to serve as the basis of their serverless ETL pipelines.
Importing SSIS packages into AWS Glue requires more effort than importing them into Azure Data Factory. Glue requires that packages be converted, whereas Azure Data Factory lets users deploy and run SSIS packages directly without converting or migrating them.