How to use batch operations to process S3 objects
AWS customers rely on Amazon S3 to store massive amounts of data across multiple buckets. There are a number of features available to help users get the most out of the service, including one that simplifies operations at scale.
Enterprises use Amazon S3 Batch Operations to process and move high volumes of data and billions of S3 objects. This S3 feature performs large-scale batch operations on S3 objects, such as invoking a Lambda function, replacing object tags, updating access control lists and restoring files from Amazon S3 Glacier.
In this short video tutorial, take a closer look at the Batch Operations feature and learn how to use it in your S3 environment. Navigate the S3 Management Console in a web browser to create an S3 batch job that copies S3 files to a desired location. This demonstration will start with three buckets: one that hosts the source files, another for inventory files and reports, and a third that serves as the destination for files.
Learn how to create an inventory of your files, and then use that inventory to start your first batch operation. The example in this video will demonstrate the copy operation used to move files from the source bucket to the destination bucket and all the other requirements along the way.
Transcript - How to use batch operations to process S3 objects
Hello, and welcome to this video where I'm going to show you how to create an Amazon Web Services S3 batch job. So, you can use Amazon S3 Batch Operations to perform large-scale batch operations on Amazon S3 objects. You can execute a single operation on lists of Amazon S3 objects that you specify. A job is a basic unit of work. A job contains all the information necessary to execute specific operations on a specific list of objects. Let's get started.
So, you're going to need an S3 bucket with some files in that bucket. I've got that, and it's called techsnipsuk. And I've got five test files in there. And I've also got a bucket, which I'm going to use as a destination bucket. So, in this video, I'm basically just going to copy those files to another bucket.
So, what's the first thing we need to do? Well, we need to get an inventory up and running. We need to go into our source bucket, which is techsnipsuk. Go into Management, and then click on Inventory. You'll see, we don't have an inventory defined at the moment, so we click on Add New. I'll give the inventory a name. This inventory will be used by the batch operations for moving our files. We'll just select the techsnipsuk inventory as our destination. And we are going to take the default output format of CSV. S3 Batch Operations can use this particular format without having to modify it. Object versions, we're going to stay with the current version, and we're not going to use any of the older versions that are within that bucket.
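The console settings described above map roughly onto an inventory configuration that can also be created through the API. What follows is a minimal Python sketch, assuming the boto3 SDK and its put_bucket_inventory_configuration call. The inventory ID, the inventory bucket name (techsnipsuk-inventory) and the daily schedule are assumptions made for illustration; the video does not state them.

```python
def build_inventory_configuration(inventory_id, destination_bucket):
    """Build the InventoryConfiguration argument: CSV output, current versions only."""
    return {
        "Id": inventory_id,
        "IsEnabled": True,
        # "Current" skips older object versions, as chosen in the demonstration.
        "IncludedObjectVersions": "Current",
        "Destination": {
            "S3BucketDestination": {
                "Bucket": f"arn:aws:s3:::{destination_bucket}",
                "Format": "CSV",
            }
        },
        # Assumption: the video does not mention the frequency; "Daily" is illustrative.
        "Schedule": {"Frequency": "Daily"},
    }


def create_inventory(source_bucket, inventory_id, destination_bucket):
    """Apply the configuration to the source bucket (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without the SDK

    s3 = boto3.client("s3")
    s3.put_bucket_inventory_configuration(
        Bucket=source_bucket,
        Id=inventory_id,
        InventoryConfiguration=build_inventory_configuration(
            inventory_id, destination_bucket
        ),
    )


# Hypothetical names: only "techsnipsuk" appears in the video.
config = build_inventory_configuration("batch-inventory", "techsnipsuk-inventory")
```

Calling create_inventory("techsnipsuk", "batch-inventory", "techsnipsuk-inventory") would then submit the configuration, provided the destination bucket allows S3 to write inventory files to it.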
And we can also specify optional fields, but we are going to go with the defaults for the purpose of this demonstration. [For] encryption, I'm selecting None at the moment, but it is available. So, we click on Save. Now, two things happen. First, it tells you the inventory is successfully saved. It can take up to 48 hours to deliver the first report. Then, you'll see the second thing that's happened is the bucket policy is successfully created. This policy is applied to the destination bucket that will hold the inventory. So, this just allows us to have the inventory file created and then written to the destination bucket.
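The bucket policy that the console creates is not shown in the video. As a rough guide, a policy that lets S3 deliver inventory files to a destination bucket usually follows the pattern sketched below in Python. The account ID and the inventory bucket name are hypothetical, and the exact statement the console generates may differ.

```python
import json


def build_inventory_destination_policy(source_bucket, destination_bucket, account_id):
    """Build a bucket policy letting S3 write inventory files to the destination."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "InventoryDestinationPolicy",
                "Effect": "Allow",
                "Principal": {"Service": "s3.amazonaws.com"},
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{destination_bucket}/*",
                "Condition": {
                    # Only accept deliveries that originate from the source bucket.
                    "ArnLike": {"aws:SourceArn": f"arn:aws:s3:::{source_bucket}"},
                    "StringEquals": {
                        "aws:SourceAccount": account_id,
                        "s3:x-amz-acl": "bucket-owner-full-control",
                    },
                },
            }
        ],
    }


# Hypothetical account ID and inventory bucket name, for illustration only.
policy = build_inventory_destination_policy(
    "techsnipsuk", "techsnipsuk-inventory", "111122223333"
)
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON string is what would be attached to the destination bucket, for example with the boto3 put_bucket_policy call.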
If I go into our inventory bucket, you'll see some inventory files that I created earlier. And here's the file we're looking for: It's manifest.json. So, to actually start a batch operation, we need to click on Create job. They ask for the manifest format. It's S3 inventory report, and then the path to that manifest. So, let's have a look in our techsnips inventory bucket, and we'll select the latest one we have.
You can put an object version ID; this is for buckets with versioning enabled. For the purpose of this demonstration, we're not going to put an ID in.
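For readers who script this step, the manifest choice can be expressed as the Manifest argument of the S3 Control create_job API. This is a minimal Python sketch; the bucket name, object key and ETag are placeholders rather than values from the video, and the API expects the ETag of the manifest.json object itself.

```python
def build_manifest(bucket, key, etag, version_id=None):
    """Build the Manifest argument for create_job from an S3 inventory manifest.json."""
    location = {
        "ObjectArn": f"arn:aws:s3:::{bucket}/{key}",
        "ETag": etag,
    }
    # The version ID only applies to buckets with versioning enabled, so it is optional.
    if version_id is not None:
        location["ObjectVersionId"] = version_id
    return {
        # Format identifier for a CSV inventory report.
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": location,
    }


# Placeholder values: the real key and ETag come from the inventory bucket.
manifest = build_manifest(
    bucket="techsnipsuk-inventory",
    key="example/path/manifest.json",
    etag="example-etag",
)
```

As in the demonstration, no version ID is passed, so the ObjectVersionId field is left out entirely.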
Now, we come to the operations. So, we can choose PUT, which is just basically a copy. We can invoke a Lambda function. We can replace the tags on the [objects in the] bucket. We can replace the access control list or we can actually restore files. For our purposes, we're going to choose PUT. It's asking us the destination bucket. We'll select browse, and we will select mytechsnipbucket.
OK, so I've received a warning, and it's basically that versioning is not enabled on the bucket. I'm not going to enable that at the moment; we don't need to for the purposes of this demonstration. We'll just click that we acknowledge the existing objects with the same name will be overwritten, which essentially means that if we run this two or three times, those files are going to be overwritten.
[For] storage class, we'll choose standard for frequently accessed data. And encryption, again, I'm just going to choose none at the moment. You can use S3-managed encryption keys or KMS [Key Management Service], as well. Access control list, we're not going to change this. We will stay with the same permissions as our source object. And tags, we're going to copy the existing tags. We'll also choose the option to copy existing metadata. We need to give this a description. Again, we'll just leave it as the default.
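The copy settings chosen above can be sketched as the Operation argument of the S3 Control create_job API. This Python example is an illustration under assumptions, not the configuration the console actually submits: it sets only the destination, the storage class and the metadata directive, and it assumes that leaving the ACL, tagging and encryption fields unset keeps the source object's values.

```python
def build_copy_operation(destination_bucket, storage_class="STANDARD"):
    """Build the Operation argument for a batch copy (the console's PUT operation)."""
    return {
        "S3PutObjectCopy": {
            "TargetResource": f"arn:aws:s3:::{destination_bucket}",
            "StorageClass": storage_class,
            # "COPY" keeps the metadata of the source objects.
            "MetadataDirective": "COPY",
            # Assumption: no ACL, tagging or encryption keys are set here,
            # so the source object's settings are expected to carry over.
        }
    }


operation = build_copy_operation("mytechsnipbucket")
```

Other operation types use a different top-level key in the same argument, such as LambdaInvoke or S3PutObjectTagging.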
The priority, we can modify this if there are numerous batch jobs running, and you can assign them a priority. Then, we have a completion report, which we'll request. And the scope of that will be on all tasks. So, the destination, let's just put that in the inventory bucket.
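The priority and completion report settings can be sketched as arguments to the same create_job call. In this Python example the priority value of 10, the report prefix and the inventory bucket name are assumptions for illustration; the video only says that a report covering all tasks is written to the inventory bucket.

```python
def build_report(report_bucket, prefix="batch-reports", all_tasks=True):
    """Build the Report argument for create_job."""
    return {
        "Enabled": True,
        "Bucket": f"arn:aws:s3:::{report_bucket}",
        "Format": "Report_CSV_20180820",
        "Prefix": prefix,  # hypothetical prefix, not shown in the video
        # "AllTasks" reports every object; "FailedTasksOnly" reports only failures.
        "ReportScope": "AllTasks" if all_tasks else "FailedTasksOnly",
    }


# Higher numbers mean higher priority; 10 is an illustrative value.
priority = 10
report = build_report("techsnipsuk-inventory")
```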
Now, permissions. We're going to use an IAM role, identity and access management role. I have defined one already, called S3 batch role. Now, defining IAM roles is outside the scope of this particular video. But if we go to click on View, you'll see that this is basically just a simple role. I've called it s3BatchRole. I've given it full access to S3. So, we just click on Next. Now, we come to the review page. It tells us the region and manifest. We're doing this in EU (London). The path to the manifest, now, remember, the manifest is your inventory of files. The operation we want to carry out, it's going to be a PUT operation. So, we're going to copy files to mytechsnipbucket.
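Although the video does not cover how the role was defined, a role used by S3 Batch Operations needs a trust policy that lets the service assume it. The Python sketch below shows that trust policy only. The permissions in the video are full access to S3; a narrower permissions policy is generally preferable in practice, but it is not shown here.

```python
import json


def build_batch_trust_policy():
    """Trust policy allowing S3 Batch Operations to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "batchoperations.s3.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


trust_policy_json = json.dumps(build_batch_trust_policy(), indent=2)
```

This JSON is what would be supplied as the assume-role policy document when creating a role such as s3BatchRole.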
You'll then see, the additional options, we requested our report after it. And the role we're going to use is s3BatchRole. So, we just click on Create job. And there we have the job there. So, we can go and we can view the details of the job. This gives us, this time, the status. So, you can see it's 0% complete, total succeeded, zero. What we want to do now is Confirm and run. See our job details now come up, and we click Run job.
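The two console steps, Create job followed by Confirm and run, have rough equivalents in the S3 Control API: create_job with confirmation required, then update_job_status to release the job. The Python sketch below assumes a boto3 "s3control" client is passed in, and that the job arguments (Operation, Manifest, Report, Priority, RoleArn) have been prepared separately. The helper names are illustrative.

```python
import uuid


def create_job(client, account_id, job_kwargs):
    """Create a batch job that waits for confirmation; returns the job ID."""
    response = client.create_job(
        AccountId=account_id,
        ConfirmationRequired=True,  # mirrors the console's Confirm and run step
        ClientRequestToken=str(uuid.uuid4()),  # makes the request idempotent
        **job_kwargs,
    )
    return response["JobId"]


def confirm_and_run(client, account_id, job_id):
    """Release a job that is awaiting confirmation so that it starts running."""
    # The job must have finished preparing (status "Suspended") before this call.
    client.update_job_status(
        AccountId=account_id,
        JobId=job_id,
        RequestedJobStatus="Ready",
    )
```

With real credentials, the client would be created as boto3.client("s3control"), and the job ID returned by create_job would be passed to confirm_and_run once the job shows as awaiting confirmation.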
So, we go back to batch operations, and we just refresh. This should run just shortly. There, you'll see it's now complete, and it's been a total of five objects moved. This is 100% complete. The completion report is in our inventory bucket.
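Instead of refreshing the console, progress can be read from the response of the S3 Control describe_job call. The Python sketch below computes a completion percentage from such a response. The sample response is hand-written to match the five objects in the demonstration and is not captured output.

```python
def summarize_progress(describe_job_response):
    """Return (status, percent_complete) from a describe_job response."""
    job = describe_job_response["Job"]
    summary = job.get("ProgressSummary", {})
    total = summary.get("TotalNumberOfTasks", 0)
    done = summary.get("NumberOfTasksSucceeded", 0) + summary.get(
        "NumberOfTasksFailed", 0
    )
    # Avoid dividing by zero while the job is still being prepared.
    percent = (done / total * 100) if total else 0.0
    return job["Status"], percent


# Hand-written sample shaped like a describe_job response.
sample_response = {
    "Job": {
        "Status": "Complete",
        "ProgressSummary": {
            "TotalNumberOfTasks": 5,
            "NumberOfTasksSucceeded": 5,
            "NumberOfTasksFailed": 0,
        },
    }
}
status, percent = summarize_progress(sample_response)
```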
So, let's go back to our buckets, and let's have a look in the destination bucket. And there you'll see our five files have been moved.
So, if we go into our inventory bucket, you'll now see the job report. And now, we've downloaded that report. Now, you'll see that it's told us the file name, the bucket and its status, which is successful.
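A downloaded completion report can also be checked with a short script rather than by eye. This Python sketch assumes the report is a headerless CSV whose columns are bucket, key, version ID, task status, error code, HTTP status code and result message; that column order is an assumption to verify against a real report. The sample rows and file names are invented for illustration.

```python
import csv
import io

# Assumed column order of a completion report row.
REPORT_COLUMNS = [
    "bucket", "key", "version_id", "task_status",
    "error_code", "http_status_code", "result_message",
]


def parse_report(csv_text):
    """Parse completion report text into a list of dictionaries, one per object."""
    reader = csv.reader(io.StringIO(csv_text))
    return [dict(zip(REPORT_COLUMNS, row)) for row in reader if row]


def failed_keys(rows):
    """Return the keys of objects whose task did not succeed."""
    return [row["key"] for row in rows if row["task_status"] != "succeeded"]


# Invented sample rows, not the report from the video.
sample_report = (
    "techsnipsuk,test1.txt,,succeeded,,200,Successful\n"
    "techsnipsuk,test2.txt,,failed,AccessDenied,403,Access Denied\n"
)
rows = parse_report(sample_report)
failures = failed_keys(rows)
```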