Tip

Process large data sets quickly with Amazon Elastic MapReduce

Learn how to upload large amounts of data to Amazon S3 and distribute it across an Amazon EMR cluster.

Judith Myerson

Published: 20 Jun 2014

Amazon Elastic MapReduce (EMR) is a useful tool for Amazon Web Services developers who have to process large data sets. EMR can quickly process vast amounts of data -- structured and unstructured -- on a resizable Hadoop cluster. It suits a variety of workloads, such as log analysis, Web indexing, data warehousing, machine learning, financial analysis, scientific simulations and bioinformatics.

The most common way of getting data onto a cluster is to upload the data to Amazon S3 and use Amazon Elastic MapReduce (EMR) to load the data onto the cluster. Uploads are faster when contiguous parts of the data are uploaded in parallel than when the data is uploaded as a whole in a single operation.

Hadoop is installed on Amazon EMR clusters and integrated with Amazon S3. You can distribute large amounts of data stored in S3 buckets across the cluster in the Hadoop Distributed File System (HDFS).

Before you can create a cluster, you must create an Amazon S3 bucket and upload to it any scripts or data that the cluster will reference. The following are example locations within a bucket:

s3://myawsbucket/script/MapperScript.py

s3://myawsbucket/logs

s3://myawsbucket/input

s3://myawsbucket/output
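
All four example locations share one bucket, myawsbucket; the rest of each path is a key (a folder or an object) inside that bucket. As a minimal sketch of that split, using only the Python standard library (the helper name split_s3_uri is hypothetical, not part of any AWS tool):

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key). The key may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The path starts with "/", which is not part of the object key.
    return parsed.netloc, parsed.path.lstrip("/")


if __name__ == "__main__":
    for uri in (
        "s3://myawsbucket/script/MapperScript.py",
        "s3://myawsbucket/logs",
        "s3://myawsbucket/input",
        "s3://myawsbucket/output",
    ):
        print(split_s3_uri(uri))
```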

To create an Amazon S3 bucket:

Log in at the Amazon S3 console.
Click Create Bucket.
Enter a bucket name, such as "myawsbucket".
Select the Region for your bucket. Choose the same region as your cluster.
Click Create.
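
The same bucket could also be created from a script instead of the console. The sketch below assumes the boto3 library for Python, which this article does not otherwise use, and a hypothetical region of us-west-2. It only builds the arguments and passes them to whatever client object you supply, so it can be read and tried without AWS credentials.

```python
def create_bucket_request(bucket, region):
    """Build keyword arguments for an S3 client's create_bucket call."""
    kwargs = {"Bucket": bucket}
    # us-east-1 is the default location and is not sent as a constraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(s3_client, bucket, region):
    """Create the bucket with any client that exposes create_bucket()."""
    return s3_client.create_bucket(**create_bucket_request(bucket, region))


# Assumed usage, with boto3 installed and credentials configured:
#   import boto3
#   region = "us-west-2"  # hypothetical; choose the region of your cluster
#   create_bucket(boto3.client("s3", region_name=region), "myawsbucket", region)
```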

Next, you give yourself (the owner) read and write access and authenticated users read access.

To set permissions for yourself:

In the Buckets pane, right-click the bucket you just created.
Select Properties.
In the Properties pane, select the Permissions tab.
Leave the Grantee field as is. It indicates you're the owner.
To the right of the Grantee field, leave List, Upload/Delete and View Permissions at their default settings.

To set restricted permissions for others:

Click Add more permissions.
Select Authenticated Users in the Grantee field.
To the right of the Grantee field, select List. Do not select any other permissions.
Click Save.

To create EMR cluster logs:

Click Logging on the Permissions pane.
Select Enabled.
In the Target Bucket box, enter the name of the bucket that will receive the logs.
In the Target Prefix box, enter logs.
Click Save.

If you enable logging in the Create a Bucket wizard, it enables only bucket access logs, not Amazon EMR cluster logs.

To create folders in the bucket:

Click the bucket.
Right-click and select Create Folder.
Type in "input" and click the check button.
Repeat the previous two steps to create the output and script folders.
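
Amazon S3 has no real directories: a console "folder" is a zero-byte object whose key ends in a slash. As a hedged sketch, the folders above could be created programmatically like this, assuming a boto3-style S3 client is passed in (the function names are hypothetical):

```python
def folder_keys(names):
    """Return the placeholder keys that represent folders in a bucket."""
    # A folder is only a key prefix; the trailing slash marks the placeholder.
    return [name.strip("/") + "/" for name in names]


def create_folders(s3_client, bucket, names):
    """Create one zero-byte placeholder object per folder name."""
    for key in folder_keys(names):
        s3_client.put_object(Bucket=bucket, Key=key, Body=b"")


# Assumed usage, with boto3 installed and credentials configured:
#   import boto3
#   create_folders(boto3.client("s3"), "myawsbucket", ["input", "output", "script"])
```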

To upload files:

Click the bucket and then a folder.
Right-click and select Upload.
To upload a file, click Add Files. A single Amazon S3 object can be up to 5 TB in size.
In a separate window, look for the files you want to upload and then click Open.
Click Start Upload.

To upload a folder, click Enable Enhanced Uploader (BETA). This Java applet requires Java SE 7 Update 51 or later.

When an object reaches 100 MB, consider uploading it with the multipart upload API, for example through the AWS SDK for Java. Multipart upload lets you upload a single object as a set of contiguous parts in parallel. You can upload these parts independently and in any order over time. By default, a multipart upload that has been started does not expire; it stays open until you complete or abort it.
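
To make "a set of contiguous parts" concrete, the sketch below splits an object of a given size into numbered byte ranges. It is plain Python rather than the AWS SDK for Java mentioned above, and the default part size of 8 MiB is an arbitrary assumption. The two constants reflect Amazon S3's multipart limits: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts.

```python
MIB = 1024 * 1024
MIN_PART_SIZE = 5 * MIB  # S3 minimum for every part except the last
MAX_PARTS = 10_000       # S3 maximum number of parts per upload


def plan_parts(total_size, part_size=8 * MIB):
    """Return (part_number, offset, length) tuples covering total_size bytes."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if part_size < MIN_PART_SIZE:
        raise ValueError("part_size is below the 5 MiB minimum")
    # Grow the part size if the object would otherwise need too many parts.
    min_size_for_limit = -(-total_size // MAX_PARTS)  # ceiling division
    part_size = max(part_size, min_size_for_limit)
    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


if __name__ == "__main__":
    for part in plan_parts(100 * MIB):
        print(part)
```

Each tuple describes one part that can be read from the file at its offset and sent on its own, which is what allows the parts to travel in parallel.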

If transmission of any part fails because of a network error, you can retransmit that part without affecting the others. A retry therefore resends only one part rather than the whole object, so it takes less time. The smaller the part size, the smaller the cost of restarting a failed upload.
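
The retry behavior can be illustrated without touching AWS at all. In the sketch below, send_part is a hypothetical caller-supplied function standing in for the real upload of one part; the example simulates a single network failure on part 2 and shows that only that part is sent again.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_parts(part_numbers, send_part, max_attempts=3, workers=4):
    """Send every part in parallel, retrying only the parts that fail.

    send_part(part_number) is a hypothetical caller-supplied function that
    uploads one part and raises OSError on a network failure.
    Returns a dict mapping part number to the number of attempts it needed.
    """
    def send_with_retry(number):
        for attempt in range(1, max_attempts + 1):
            try:
                send_part(number)
                return number, attempt
            except OSError:
                if attempt == max_attempts:
                    raise  # give up on this part after the last attempt

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(send_with_retry, part_numbers))


if __name__ == "__main__":
    failed_once = set()

    def flaky_send(number):
        # Simulate a network error on the first attempt for part 2 only.
        if number == 2 and number not in failed_once:
            failed_once.add(number)
            raise OSError("simulated network error")

    print(upload_parts([1, 2, 3, 4], flaky_send))
    # prints {1: 1, 2: 2, 3: 1, 4: 1}
```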

After all parts of your object are uploaded, Amazon S3 assembles them in the proper order as a single object. An Amazon EMR cluster can also use multipart upload when it writes to Amazon S3. If you do not want that behavior, you can disable it with a predefined bootstrap action when you create the cluster.

To disable multipart uploads:

Open the Amazon Elastic MapReduce console.
Click Create Cluster.
Go to the Bootstrap Actions section.
In the Add bootstrap action field, select Configure Hadoop and click Configure and Add.
In Optional arguments, replace the default value with the following:

-c fs.s3n.multipart.uploads.enabled=false

Click Add.
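
If the cluster is created from a script rather than the console, the same setting can be expressed as a bootstrap action entry. The sketch below builds the dictionary in the shape that boto3's EMR run_job_flow call accepts in its BootstrapActions list. The script path is an assumption, based on the location AWS documented for the Configure Hadoop bootstrap action; verify it against the EMR documentation for your AMI version. The action name is arbitrary.

```python
# Assumed location of the predefined Configure Hadoop bootstrap action.
CONFIGURE_HADOOP = "s3://elasticmapreduce/bootstrap-actions/configure-hadoop"


def disable_multipart_bootstrap_action():
    """Build one bootstrap action entry that turns off multipart uploads."""
    return {
        "Name": "Disable multipart uploads",  # arbitrary display name
        "ScriptBootstrapAction": {
            "Path": CONFIGURE_HADOOP,
            # The same arguments entered in the console's Optional arguments.
            "Args": ["-c", "fs.s3n.multipart.uploads.enabled=false"],
        },
    }


if __name__ == "__main__":
    print(disable_multipart_bootstrap_action())
```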

To decide how to configure the cluster, see Plan an Amazon EMR Cluster in the Amazon EMR documentation. It covers which formats Amazon EMR can return, how to write data to an Amazon S3 bucket you don't own, and how to compress the output of your cluster.

In conclusion, uploading data to Amazon S3 is a practical way for developers to get data onto a Hadoop cluster in Amazon EMR. Multipart upload is the preferred method for large data sets because it sends the parts in parallel.

About the author:
Judith M. Myerson is the former ADP security officer/manager at a naval facility, where she led enterprise projects for its Materiel Management System. Currently a consultant and subject matter expert, she is the author of several books and articles on cloud use, compliance regulations, mobile security, software engineering, systems engineering and risk management. She received her master of science degree in engineering from the University of Pennsylvania and is certified in risk and information system control (CRISC).
