
Prep data for machine learning with AWS analytics services
Data preparation is crucial when building and training machine learning models with SageMaker AI. What AWS analytics services can admins use to simplify the process?
Organizations that want to take advantage of machine learning capabilities require a comprehensive data preparation strategy.
Data preparation consists of making data sets available to ML algorithms. In many cases, these algorithms need access to large amounts of data. Before these ML algorithms can access that data, it needs to be imported, processed and stored in a format suitable for analysis. This involves complex processes, as well as large storage and compute capacity.
Here, explore some of the key capabilities of Amazon Athena, EMR and Redshift -- three data analytics services that integrate seamlessly with SageMaker AI to help IT teams navigate the data selection process. Understanding the unique strengths of each service empowers businesses to deliver more accurate, reliable ML models.
Select the right AWS analytics service
Amazon SageMaker AI is an AWS-managed service that delivers cloud infrastructure, workflows and development tools to build, train, deploy and maintain ML models in the cloud. While SageMaker AI supports access to multiple tools for data preparation tasks, the nature of the application and its data requirements dictate the best AWS analytics service for a particular ML use case.
Amazon Athena
Athena is a query service that analyzes data files in S3 using SQL statements. Since it is serverless, users do not need to set up any or manage infrastructure. It is a cost-efficient option because users only pay for the queries they run. It is also a flexible service since it supports files in various formats, such as JSON, CSV, Apache ORC and Apache Parquet. It is also the best option to run ad hoc queries for data in S3.
One common use case for Athena is log analysis to identify issues and troubleshoot. Queuing log data can also help businesses optimize their processes by analyzing performance metrics.
Amazon EMR
Amazon EMR, previously Elastic MapReduce, is a big data processing service. It launches and manages clusters that run open source data analytics frameworks, such as Apache Spark, Apache Hadoop, Apache Flink, Apache Hive and Trino. EMR can access data in a cluster's local file system, Hadoop Distributed File System (HDFS) or S3. Although EMR manages compute infrastructure using EC2 instances, it also supports a serverless configuration. Athena can query data using Amazon EMR, and it supports the same data formats.
EMR provisioned clusters are a good option for jobs that require long processing tasks with a predictable workload and accessing data in HDFS or externally in S3.
Amazon Redshift
Redshift follows a data warehouse model, where extract, transform and load processes store large data sets from various sources inside a cluster. Once in the cluster, SQL statements can analyze these data sets. It is a useful tool to run queries that need to fetch and join data from multiple large tables. Redshift also manages the cluster's compute infrastructure, which is typically provisioned on EC2 instances. However, it also has the option to configure serverless compute capacity.
Redshift is a good option for predictable, high-volume workloads with data that has been converted and stored internally in a Redshift cluster.

Integrate AWS analytics services
SageMaker Unified Studio is an integrated development environment (IDE) that gives users access to AWS' data, analytics and AI/ML capabilities in a single platform. It integrates with Athena, EMR and Redshift using its SQL extension feature to ease data preparation tasks. In many cases, organizations already use these services for data analytics tasks outside of SageMaker AI. This makes it easier to reuse existing infrastructure and access it for ML building and training processes.
AWS Glue manages connections and catalogs for the data sources queried by Athena, EMR and Redshift. Users must ensure they can analyze data from their AWS analytics service using SQL statements through their IDE interfaces or SDK APIs. It's recommended to first create, execute and fine-tune these SQL statements from the AWS analytics service before running these queries from Sagemaker AI workflows.
Remember to grant the required Identity and Access Management permissions to the SageMaker domain that will run these data analysis tasks. These permissions must include access to relevant S3 buckets, AWS Glue catalogs and databases, as well as permissions to execute tasks in the respective AWS analytics service. Users must also configure network access, such as VPC routing and security groups, between SageMaker Unified Studio and the data analytics platform.
The SQL extension in JupyterLab notebooks is recommended for getting started with these data analytics integrations. It provides a SQL editor UI where developers can type specific SQL commands pointing to connections and databases managed by AWS Glue. Amazon Q Developer is also available in JupyterLab, which is a useful generative AI-based tool that can assist and guide developers through the process.
Ernesto Marquez is owner and project director at Concurrency Labs, where he helps startups launch and grow their applications on AWS. He enjoys building serverless architectures, building data analytics solutions, implementing automation and helping customers cut their AWS costs.