https://www.techtarget.com/searchaws/definition/AWS-Glue
AWS Glue is a cloud-based, serverless data integration service that helps users prepare data for analysis through automated extract, transform and load (ETL) processes. This managed service offers a simple and cost-effective way to categorize and manage big data in the enterprise and to use it for applications such as machine learning (ML), application development and analytics.
AWS Glue from Amazon Web Services (AWS) simplifies the discovery, preparation, movement, integration and formatting of information from disparate data sources, both on premises and in the AWS cloud. It also makes it easy to manage and organize data in the centralized AWS Glue Data Catalog, where it can be analyzed and the results used to inform business decisions.
Multiple data integration capabilities are available with AWS Glue, spanning data discovery, ETL, ELT, batch processing, streaming and centralized cataloging.
The service also includes DataOps tools for authoring jobs, running jobs and implementing business workflows. These capabilities and tools support both technical and nontechnical users -- from developers to business users. AWS Glue's integration interfaces and tools also support different kinds of workloads, including ETL and streaming, enabling organizations to maximize the value and usability of their data.
AWS Glue orchestrates ETL jobs and extracts data from many cloud services offered by AWS. The service generates output in the format the target application requires and incorporates it into data lakes and data warehouses. It uses application programming interfaces (APIs) to transform the extracted data sets for integration and to help users monitor jobs.
Users must define jobs in AWS Glue to run the ETL process that moves data from a source to a target. They must also specify which source data populates the target (destination) and where the target data resides so that AWS Glue can generate the data transformation code. These sources and targets can be AWS services such as Amazon Simple Storage Service (S3), Amazon Redshift and Amazon Relational Database Service (RDS).
In addition, AWS Glue supports Java Database Connectivity (JDBC)-accessible databases, MongoDB, other marketplace connectors and Apache Spark plugins as data sources and destinations.
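As a rough sketch of what defining such a job can look like programmatically, the following Python snippet uses the boto3 Glue client; the job name, IAM role, script location and worker settings are hypothetical placeholders, not values prescribed by AWS Glue:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Define a Spark ETL job whose script reads from a source and writes to a target.
# The IAM role, script path and job name below are hypothetical placeholders.
glue.create_job(
    Name="orders-etl-job",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    Command={
        "Name": "glueetl",  # Spark ETL job type
        "ScriptLocation": "s3://example-bucket/scripts/orders_etl.py",
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

# Run the job on demand; scheduled and event-based runs use triggers instead.
glue.start_job_run(JobName="orders-etl-job")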
Users can set up triggers to run ETL jobs on a schedule or in response to specific events. Once triggered, AWS Glue extracts the data, transforms it using scripts that are either generated automatically by AWS Glue or provided by the user through the AWS Glue console or API, and loads it into the target. The scripts contain the programming logic required to perform the data transformation. The transformation is driven by the metadata table definitions in the AWS Glue Data Catalog, which are populated by user-defined crawlers for each data store source, and by the triggers defined to initiate jobs.
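A minimal sketch of a scheduled trigger, again with boto3 and reusing the hypothetical job name from the previous example:

import boto3

glue = boto3.client("glue")

# Run the (hypothetical) ETL job every day at 06:00 UTC.
glue.create_trigger(
    Name="orders-etl-daily",
    Type="SCHEDULED",
    Schedule="cron(0 6 * * ? *)",
    Actions=[{"JobName": "orders-etl-job"}],
    StartOnCreation=True,
)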
The service can automatically find an enterprise's structured or unstructured data when it is stored in data lakes in Amazon S3, data warehouses in Amazon Redshift and other databases that are part of Amazon RDS. Glue also supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances in an Amazon Virtual Private Cloud (VPC).
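To reach databases that run inside a VPC, Glue relies on connection definitions. The following sketch registers a hypothetical JDBC connection with boto3; the URL, credentials, subnet and security group are placeholders, and a real deployment would typically store credentials in AWS Secrets Manager rather than in plain text:

import boto3

glue = boto3.client("glue")

# Register a JDBC connection to a (hypothetical) PostgreSQL instance in a VPC.
glue.create_connection(
    ConnectionInput={
        "Name": "postgres-in-vpc",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://10.0.1.25:5432/sales",
            "USERNAME": "glue_user",
            "PASSWORD": "example-only",  # use Secrets Manager in real deployments
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0abc1234",
            "SecurityGroupIdList": ["sg-0def5678"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)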
The service then profiles data in the Data Catalog, which is a metadata repository for all data assets, containing details such as table definitions, locations and other attributes. A team can also use the Glue Data Catalog as an alternative to the Apache Hive Metastore for Amazon EMR applications.
To pull metadata into Glue Data Catalog, the service uses Glue crawlers, which scan raw data stores and extract schema and other attributes. An IT professional can customize crawlers as needed.
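A minimal sketch of defining, running and then inspecting the output of a crawler with boto3; the crawler name, role, database and S3 path are hypothetical:

import boto3

glue = boto3.client("glue")

# Crawl a raw S3 prefix and record the inferred schemas in the Data Catalog.
glue.create_crawler(
    Name="raw-orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueServiceRole",
    DatabaseName="sales_db",
    Targets={"S3Targets": [{"Path": "s3://example-bucket/raw/orders/"}]},
    Schedule="cron(0 * * * ? *)",  # optional: re-crawl hourly
)
glue.start_crawler(Name="raw-orders-crawler")

# Once the crawler finishes, the tables it created can be inspected.
for table in glue.get_tables(DatabaseName="sales_db")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])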
The core features of AWS Glue include automatic data discovery through crawlers, the centralized Glue Data Catalog, automatic generation of ETL code, flexible job scheduling and triggers, and built-in data quality monitoring.
AWS Glue offers benefits for users looking to discover, prepare, move and integrate data for analytics use cases. For example, it offers a range of data integration capabilities. In addition, it supports many types of workloads, including ETL, ELT, batch, centralized cataloging and streaming, without locking users into a single approach. Depending on the workload, users can select from several serverless, scalable data processing and integration engines, such as AWS Glue for Ray, AWS Glue for Python Shell or AWS Glue for Apache Spark.
These capabilities are available in a single serverless service. Consequently, users don't have to worry about managing the underlying infrastructure and can focus on discovering, preparing and integrating data for their specific applications.
AWS Glue can scale on demand at the petabyte level and supports different data types and schemas. It also offers pay-as-you-go hourly billing for any data size, meaning users only pay for the time the ETL job takes to run. This enables users to set up appropriate ETL jobs to maximize data value, while controlling costs.
Another benefit of AWS Glue is seamless integration with AWS analytics services and Amazon S3 data lakes. Its integration interfaces make it easy to integrate data across the organization's infrastructure and use it for different analytics applications and workloads.
Finally, AWS Glue includes a built-in Data Quality feature that helps maintain data quality across data lakes and pipelines by generating useful and actionable metrics. It also enables users to automatically create, manage and monitor data quality rules. If quality deteriorates, AWS Glue raises alerts that notify users and enable them to take appropriate action.
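As a sketch of how such rules can be expressed, the snippet below creates a ruleset with the boto3 Glue client using Glue Data Quality's rule language; the database, table and thresholds are hypothetical examples rather than recommended values:

import boto3

glue = boto3.client("glue")

# Attach a small set of data quality rules to a (hypothetical) catalog table.
glue.create_data_quality_ruleset(
    Name="orders-quality-rules",
    TargetTable={"DatabaseName": "sales_db", "TableName": "curated_orders"},
    Ruleset=(
        'Rules = ['
        ' IsComplete "order_id",'        # no missing order IDs
        ' ColumnValues "amount" >= 0,'   # no negative order amounts
        ' RowCount > 1000'               # expect at least 1,000 rows (placeholder)
        ' ]'
    ),
)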
The following are additional benefits of AWS Glue:
The drawbacks of AWS Glue include the following:
The primary data processing functions performed by AWS Glue are the following:
These capabilities are useful for organizations that manage big data and want to avoid data lake pollution -- hoarding more data than an organization can use. AWS Glue is also useful for organizations that want to visually compose data transformation workflows and run ETL jobs of any complexity on a serverless Apache Spark-based ETL engine.
Some other common use cases for Glue are the following:
Some more specific common use case examples for Glue are as follows:
After data is cataloged, it is searchable and ready for ETL jobs. AWS Glue includes an ETL script recommendation system to create Python and Spark (PySpark) code, as well as an ETL library to execute jobs. A developer can write ETL code via the Glue custom library or write PySpark code via the AWS Glue console script editor.
A developer can import custom PySpark code or libraries. In addition, developers can upload existing ETL code to an S3 bucket and then create a new Glue job that runs it. AWS also provides sample code for Glue in a GitHub repository.
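The sketch below shows the general shape of a Glue PySpark script, assuming a hypothetical catalog database, table and output path; an auto-generated script follows a similar read, map and write pattern:

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read from a Data Catalog table populated by a crawler (names are placeholders).
source = glueContext.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Rename and cast columns as part of the transformation step.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])

# Write the result to S3 as Parquet; other targets such as Redshift or JDBC work similarly.
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/orders/"},
    format="parquet")

job.commit()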
AWS Glue jobs can execute on a schedule; a developer can schedule batch ETL jobs at a minimum of five-minute intervals. For data that arrives continuously, AWS Glue also offers streaming ETL jobs.
If a dev team prefers to orchestrate its workloads, the service provides scheduled, on-demand and job completion triggers. A scheduled trigger executes jobs at specified intervals, while an on-demand trigger executes when prompted by the user. With a job completion trigger, one or more jobs execute when other jobs finish; these jobs can run in parallel or sequentially, and they can also be started from an outside service, such as AWS Lambda.
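A sketch of a job completion (conditional) trigger with boto3, continuing the hypothetical job names used earlier; the downstream job name is likewise a placeholder:

import boto3

glue = boto3.client("glue")

# Start a downstream job only after the upstream ETL job succeeds.
glue.create_trigger(
    Name="load-after-etl",
    Type="CONDITIONAL",
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "JobName": "orders-etl-job",
                "State": "SUCCEEDED",
            }
        ],
    },
    Actions=[{"JobName": "orders-load-job"}],
    StartOnCreation=True,
)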
AWS charges users an hourly rate, billed by the second for data discovery crawlers, ETL jobs and provisioning development endpoints to interactively develop ETL code. The exception to this billing scheme is DataBrew jobs, which are billed per minute, and DataBrew interactive sessions, which are billed per session.
AWS also charges users a monthly fee to store and access metadata in AWS Glue Data Catalog. That said, the first million objects stored in the catalog, along with the first million accesses, are free. Usage of AWS Glue Schema Registry is also offered free by AWS.
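As a rough illustration only, if the ETL rate were $0.44 per data processing unit (DPU) per hour, a job that ran for 15 minutes on 10 DPUs would cost about 10 DPUs × 0.25 hours × $0.44, or roughly $1.10; actual rates vary by region and job type, so this figure is purely indicative.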
AWS Glue and Azure Data Factory have key differences, despite being similar ETL services. Compare AWS Glue vs. Azure Data Factory to see which best suits your organization's data integration and ETL requirements. Also, check out how different companies use AWS serverless tools and technologies to process and analyze data as part of their IT strategies.
20 Aug 2024