What is AWS Glue?
AWS Glue is a cloud service that prepares data for analysis through automated extract, transform and load (ETL) processes. The managed service offers a simple and cost-effective method for categorizing and managing big data in the enterprise. It provides organizations with a data integration tool that formats information from disparate data sources and organizes it in a central repository, where it can be used to inform business decisions.
How AWS Glue works
Glue uses ETL jobs to extract data from a combination of other cloud services offered by Amazon Web Services (AWS) and incorporates it into data lakes and data warehouses. It uses application programming interfaces (APIs) to transform the extracted data set for integration, and to help users monitor jobs.
Users can put ETL jobs on a schedule or pick events that will trigger a job. Once triggered, Glue extracts the data, transforms it based on code that Glue generates automatically, and loads it into Amazon S3 or Amazon Redshift. Glue then writes metadata from the job into the embedded AWS Glue Data Catalog.
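The extract, transform, load and catalog steps described above can be sketched in plain Python. This is only a conceptual analogy, not Glue's generated code: the sample CSV, the `orders` table name and the in-memory `warehouse` and `catalog` objects are all hypothetical stand-ins for a source data store, Amazon S3 or Redshift, and the Data Catalog.

```python
import csv
import io

# Hypothetical raw input standing in for a file in a source data store.
RAW_CSV = "order_id,amount,country\n1,19.99,us\n2,5.00,de\n"


def extract(raw_text):
    """Read CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Cast types and normalize values so every row shares one clean format."""
    return [
        {
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        }
        for row in rows
    ]


def load(rows, target):
    """Append rows to the target store (a list stands in for S3 or Redshift)."""
    target.extend(rows)
    return len(rows)


def run_job(raw_text, target, catalog):
    """Run extract, transform and load, then record job metadata in a catalog dict."""
    rows = transform(extract(raw_text))
    loaded = load(rows, target)
    catalog["orders"] = {
        "columns": sorted(rows[0].keys()) if rows else [],
        "row_count": loaded,
    }
    return loaded


warehouse, catalog = [], {}
run_job(RAW_CSV, warehouse, catalog)
```

After the run, `warehouse` holds the cleaned rows and `catalog["orders"]` holds the column names and row count, mirroring how a job writes metadata once loading finishes.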
The service can automatically find an enterprise's structured or unstructured data when it is stored within data lakes in S3, data warehouses in Amazon Redshift and other databases that are part of the Amazon Relational Database Service. Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud.
The service then profiles data in the Data Catalog, a metadata repository for all data assets that holds details such as table definitions, locations and other attributes. A team can also use the Glue Data Catalog as an alternative to the Apache Hive Metastore for Amazon EMR (Elastic MapReduce) applications.
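To make the idea of a catalog entry concrete, the sketch below builds a dictionary shaped like the `TableInput` argument of the Glue `CreateTable` API. Only a few fields are filled in, and the table name, S3 location and column list are hypothetical.

```python
def build_table_input(name, location, columns):
    """Build a dict shaped like the TableInput argument of the Glue CreateTable API.

    Only a handful of fields are populated; all values are illustrative.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Location": location,
            "Columns": [{"Name": col, "Type": typ} for col, typ in columns],
        },
    }


table = build_table_input(
    "orders",                       # hypothetical table name
    "s3://example-bucket/orders/",  # hypothetical S3 location
    [("order_id", "bigint"), ("amount", "double")],
)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as `boto3.client("glue").create_table(DatabaseName="example_db", TableInput=table)`, where `example_db` is again an assumed name.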
To pull metadata into the Data Catalog, the service uses Glue crawlers, which scan raw data stores and extract schema and other attributes. An IT professional can customize crawlers as needed.
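As a rough illustration of what schema extraction involves, the following sketch infers a column type from sample string values. It is a simplified stand-in, not the logic Glue crawlers actually use; the function names, the sample records and the choice of Hive-style type names are assumptions.

```python
def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""

    def fits(cast):
        for value in values:
            if value == "":
                continue  # treat empty strings as missing values
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if fits(int):
        return "bigint"
    if fits(float):
        return "double"
    return "string"


def crawl(records):
    """Infer a column-name-to-type schema from a list of string-valued records."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}


sample = [
    {"id": "1", "price": "9.50", "city": "Lisbon"},
    {"id": "2", "price": "12", "city": "Oslo"},
]
schema = crawl(sample)
```

Here `price` is widened to `double` because one value has a decimal part, while `city` falls back to `string`.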
Features of AWS Glue
The core features of Glue are as follows:
- Automatic schema discovery. Glue allows developers to automate crawlers to obtain schema-related information and store it in the data catalog, which can then be used to manage jobs.
- Job scheduler. Glue jobs can run on a flexible schedule, in response to event-based triggers or on demand. Several jobs can be started in parallel, and users can specify dependencies between jobs.
- Developer endpoints. Developers can use these to debug Glue, as well as to create custom readers, writers and transformations, which can then be imported into custom libraries.
- Automatic code generation. The ETL process automatically generates code, and the only input necessary is a location/path for the data to be stored. The code is in either Scala or Python.
- Integrated data catalog. Acts as a single metadata store for data from disparate sources in the AWS pipeline. Each AWS account has one catalog per AWS Region.
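The job scheduler's combination of parallel starts and inter-job dependencies can be pictured as a dependency graph that is processed in waves. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) to do that; it is not a Glue API, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each key maps to the set of jobs it depends on.
dependencies = {
    "load_warehouse": {"clean_orders", "clean_customers"},
    "clean_orders": {"extract_orders"},
    "clean_customers": {"extract_customers"},
}


def execution_waves(deps):
    """Group jobs into waves; jobs in one wave have no unmet dependencies."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # jobs that could start in parallel
        waves.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
```

Each inner list in `waves` contains jobs that could start at the same time, and a later wave starts only after the earlier one has finished.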
Benefits and drawbacks of using Glue
The benefits of AWS Glue are as follows:
- Fault tolerance. Failed jobs in Glue can be retried, and job logs are available for debugging.
- Filtering. Glue can filter out bad data.
- Support. Supports several non-native Java Database Connectivity (JDBC) data sources.
- Maintenance and deployment. Simple maintenance and deployment, because the service is completely managed by AWS.
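The filtering benefit listed above can be pictured with a small, library-free sketch that separates records with missing required fields from usable ones. This is only an illustration of the idea, not Glue's own filtering transform; the field names and sample rows are hypothetical.

```python
def split_valid(rows, required):
    """Separate rows that have every required field populated from those that do not."""
    good, bad = [], []
    for row in rows:
        complete = all(row.get(field) not in (None, "") for field in required)
        (good if complete else bad).append(row)
    return good, bad


rows = [
    {"order_id": "1", "amount": "19.99"},
    {"order_id": "", "amount": "5.00"},   # missing order_id
    {"order_id": "3"},                    # missing amount
]
good_rows, bad_rows = split_valid(rows, required=["order_id", "amount"])
```

Keeping the rejected rows in a separate list, rather than dropping them silently, makes it easier to inspect why records were excluded.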
The drawbacks of AWS Glue include:
- Limited compatibility. While AWS Glue does work with a variety of commonly used data sources, it only works with services running on AWS. Organizations may need a third-party ETL service if sources are not AWS-based.
- No incremental data sync. All data is staged on S3 first, so Glue is not the best option for real-time ETL jobs.
- Learning curve. Teams using Glue should have a strong understanding of Apache Spark.
- Relational database queries. Glue offers limited support for querying traditional relational databases; only SQL queries are supported.
AWS Glue use cases
The primary data processing functions Glue performs to organize enterprise data are as follows:
- Data extraction. Glue extracts data in a variety of formats.
- Data transformation. Glue reformats data for storage.
- Data integration. Glue integrates data into enterprise data lakes and warehouses.
This is useful for organizations that manage big data and want to avoid data lake pollution, that is, hoarding more data than the organization can use. Specifically, Glue suits organizations that want to run ETL jobs on a serverless Apache Spark-based platform.
Some more specific common use case examples for Glue are as follows:
- Glue can integrate with the Snowflake data warehouse to help manage the data integration process.
- Glue can integrate with a data lake built on AWS.
- Glue can integrate with Amazon Athena to create schemas.
- Sample ETL code for Glue is available on GitHub.
After data is cataloged, it is searchable and ready for ETL jobs. AWS Glue includes an ETL script recommendation system that generates PySpark (Python on Apache Spark) code, as well as an ETL library to execute jobs. A developer can write ETL code with the Glue custom library, or write PySpark code in the AWS Glue console script editor.
A developer can also import custom PySpark code or libraries. In addition, developers could upload code for existing ETL jobs to an S3 bucket, then create a new Glue job to process the code. AWS also provides sample code for Glue in a GitHub repository.
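As a hedged sketch of the "upload a script to S3, then create a job" workflow, the function below assembles keyword arguments shaped like a Glue `CreateJob` request. It only builds the dictionary and makes no AWS call; the job name, role ARN, bucket and script path are hypothetical.

```python
def build_job_definition(name, role_arn, script_location):
    """Build keyword arguments shaped like the Glue CreateJob API request."""
    if not script_location.startswith("s3://"):
        raise ValueError("script_location must be an s3:// URI")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # where the uploaded script lives
            "PythonVersion": "3",
        },
        "MaxRetries": 1,                        # illustrative retry setting
    }


job = build_job_definition(
    "nightly-orders",                                   # hypothetical job name
    "arn:aws:iam::123456789012:role/ExampleGlueRole",   # hypothetical IAM role
    "s3://example-bucket/scripts/orders_etl.py",        # hypothetical script path
)
```

With boto3 installed and credentials configured, the result could be submitted as `boto3.client("glue").create_job(**job)`.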
Schedule, orchestrate ETL jobs
AWS Glue jobs can execute on a schedule. A developer can schedule ETL jobs at a minimum of five-minute intervals, so scheduled batch jobs are a poor fit for real-time processing of streaming data.
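Glue schedules are written as six-field cron expressions (minutes, hours, day of month, month, day of week, year). The helper below is a small sketch that builds such an expression for a fixed interval and enforces the five-minute minimum mentioned above; the function name and the restriction to intervals that divide an hour evenly are assumptions made for this example.

```python
def every_n_minutes(n):
    """Return a Glue-style cron expression that fires every n minutes.

    Only intervals of at least 5 minutes that divide 60 evenly are accepted,
    so the spacing between runs stays constant across hour boundaries.
    """
    if n < 5 or 60 % n != 0:
        raise ValueError("interval must be >= 5 minutes and divide 60 evenly")
    # Field order: minutes hours day-of-month month day-of-week year
    return f"cron(0/{n} * * * ? *)"


schedule = every_n_minutes(15)
```

The resulting string could be supplied as the schedule of a scheduled trigger.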
If a dev team prefers to orchestrate its workloads, the service allows scheduled, on-demand and job completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger executes when prompted by the user. With a job completion trigger, single or multiple jobs can execute when jobs finish. These jobs can trigger at the same time or sequentially, and they can also trigger from an outside service, such as AWS Lambda.
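The three trigger types can be compared side by side as request payloads. The sketch below builds dictionaries shaped like Glue `CreateTrigger` requests without calling AWS; the helper names, trigger names and job names are hypothetical, and only a minimal set of fields is shown.

```python
def scheduled_trigger(name, job_name, schedule):
    """Trigger that starts a job at the times given by a cron expression."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
    }


def on_demand_trigger(name, job_name):
    """Trigger that starts a job only when a user activates it."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": [{"JobName": job_name}]}


def job_completion_trigger(name, upstream_jobs, downstream_jobs):
    """Trigger that starts downstream jobs once every upstream job has succeeded."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Logical": "AND",
            "Conditions": [
                {"LogicalOperator": "EQUALS", "JobName": job, "State": "SUCCEEDED"}
                for job in upstream_jobs
            ],
        },
        "Actions": [{"JobName": job} for job in downstream_jobs],
    }


# Hypothetical names throughout.
nightly = scheduled_trigger("nightly", "extract_orders", "cron(0 2 * * ? *)")
manual = on_demand_trigger("manual-run", "extract_orders")
chained = job_completion_trigger(
    "after-extract", ["extract_orders", "extract_customers"], ["load_warehouse"]
)
```

With boto3 installed and credentials configured, any of these dictionaries could be passed as `boto3.client("glue").create_trigger(**chained)`.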
AWS Glue pricing
AWS charges users a monthly fee to store and access metadata in the Glue Data Catalog. ETL job and crawler execution is billed per second, with a minimum of either 10 minutes or 1 minute, depending on the Glue version in use. AWS also bills per second for connecting to a development endpoint for interactive development.
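A short worked example shows how the billing minimum affects a brief job. Glue meters compute in data processing units (DPUs), a detail the text above does not spell out, and the rate used below is a made-up figure chosen for easy arithmetic, not a published AWS price.

```python
def job_cost(run_seconds, dpus, rate_per_dpu_hour, minimum_seconds):
    """Estimate the cost of one run with per-second billing and a minimum duration."""
    billed_seconds = max(run_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600 * rate_per_dpu_hour


HYPOTHETICAL_RATE = 0.50  # assumed price per DPU-hour, for illustration only

# A 30-second run on 6 DPUs under a 10-minute minimum versus a 1-minute minimum.
cost_10_min_minimum = job_cost(30, 6, HYPOTHETICAL_RATE, minimum_seconds=600)
cost_1_min_minimum = job_cost(30, 6, HYPOTHETICAL_RATE, minimum_seconds=60)
```

Under these assumptions the same 30-second run is billed as 600 seconds in one case and 60 seconds in the other, so the shorter minimum makes the run ten times cheaper.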