Notebooks have become an essential component of cloud-based AI research and analysis, so data scientists and developers should know how to use them if they deploy machine learning models on AWS.
Perhaps one of the most popular options today is Jupyter Notebook, an open source tool data scientists use to work with machine learning models and to process and analyze data. Jupyter notebooks are easy to share and teams can use them to collaborate on live code. Jupyter Notebook essentially provides an environment to document and run your code, then visualize those results.
Some Amazon cloud services, such as Amazon SageMaker, integrate Jupyter Notebook into their machine learning capabilities. SageMaker, AWS' managed machine learning service, relies on Jupyter Notebook capabilities for data visualization, statistical modeling, model training and more. And even though AWS continues to expand SageMaker's capabilities, users should still learn how to host their own Jupyter notebooks on AWS to get the most out of machine learning in the cloud.
In this tutorial, we will create a Jupyter notebook on an Amazon EMR cluster based on a small EC2 instance. Users can modify the size and capability of the hardware that supports the cluster depending on their workload.
To get started, we will create an EMR cluster in VPC mode using a private subnet with an attached network address translation (NAT) gateway so it can download a sample notebook from GitHub.
Next, we will create the required networking for our EMR cluster, which includes a VPC with two subnets. The first subnet will be public, and we will create and attach an internet gateway to our VPC. Then, we will create a routing rule for our public subnet to use the internet gateway. The second subnet will be private. We will create a NAT gateway in our public subnet, then add a route to the private subnet's route table that sends outbound traffic through that NAT gateway.
After we create the Jupyter notebook, we will need to modify the default security group by adding an outbound HTTPS rule. This will enable our notebook to communicate with the Git repository hosted in GitHub.
When setting up a Jupyter notebook on AWS, make sure the EMR cluster has the required software -- Hadoop, Spark and Livy. Then we will create a notebook on top of the cluster and, once our firewall rule allows outbound HTTPS traffic, link it to the GitHub repository. Finally, we will open Jupyter and load a sample notebook from the Git repository we connected to.
Follow along with the video above for a closer look at how to set up a Jupyter notebook on AWS.
Transcript - Set up a Jupyter notebook on AWS with this tutorial
In this snip, we will be creating a Jupyter notebook on top of an EMR cluster in AWS. To get started from the Amazon EMR service, click Create cluster. Then select Go to advanced options. We can click Next and go to the hardware section.
Now, we need to set up our networking. To do that, we will need a public and private subnet. Let's go ahead and click Create a VPC. This will open up a new tab where we can create our VPC. Click VPCs and Create. We will name ours NotebookVpc. And we'll use a 10.0.0.0/16 CIDR and click Create.
Next, we'll click the link to our VPC, and we can create our two subnets. Click Subnets and Create subnet. We'll name the first one notebookPublic_subnet, and we'll select the VPC we just created. We need to make sure the subnet's CIDR block falls inside our VPC CIDR block. I'll use 10.0.1.0/24.
Next, to make it public, we'll create an internet gateway. Select Internet Gateways and choose Create. We will name this notebook_gateway, and we'll attach it to our VPC. Click Actions and Attach to VPC and select the VPC we created.
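The containment rule for the subnet CIDR can be checked before anything is created. This is a minimal sketch using Python's standard ipaddress module; the function name subnet_fits is hypothetical and is not part of any AWS tooling.

```python
import ipaddress


def subnet_fits(vpc_cidr: str, subnet_cidr: str) -> bool:
    """Return True if subnet_cidr lies entirely inside vpc_cidr."""
    vpc = ipaddress.ip_network(vpc_cidr)
    subnet = ipaddress.ip_network(subnet_cidr)
    return subnet.subnet_of(vpc)


# The VPC and public subnet CIDR blocks used in this walkthrough.
print(subnet_fits("10.0.0.0/16", "10.0.1.0/24"))  # True
# A block outside the VPC range, for contrast.
print(subnet_fits("10.0.0.0/16", "10.1.0.0/24"))  # False
```

The same check applies to the 10.0.2.0/24 block used for the private subnet later in the transcript.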
Great. So this gives our public subnet a way to get out to the internet. But we have to add a routing rule so it knows how to use it. Go back to Subnets. Select our public subnet and click the route table. We will select the Routes tab. Then Edit routes and Add route. We will use all zeros [0.0.0.0/0] to specify all routable IP addresses and then select Internet gateway and Save.
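For readers who prefer to script these console steps, the following is a rough sketch with boto3, not the method shown in the video. It assumes ec2 is a boto3 EC2 client (for example, boto3.client("ec2")) with suitable permissions. The function name build_public_network is hypothetical, name tags such as NotebookVpc are omitted, and, unlike the console steps above, it creates a dedicated route table for the public subnet instead of editing the one the subnet already uses.

```python
def build_public_network(ec2, vpc_cidr="10.0.0.0/16", public_cidr="10.0.1.0/24"):
    """Create a VPC, a public subnet, an internet gateway and a default route.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    vpc_id = ec2.create_vpc(CidrBlock=vpc_cidr)["Vpc"]["VpcId"]
    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock=public_cidr)
    subnet_id = subnet["Subnet"]["SubnetId"]

    # The internet gateway gives the VPC a path out to the internet.
    igw = ec2.create_internet_gateway()
    igw_id = igw["InternetGateway"]["InternetGatewayId"]
    ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)

    # Assumption: use a new route table rather than editing an existing one.
    route_table = ec2.create_route_table(VpcId=vpc_id)
    route_table_id = route_table["RouteTable"]["RouteTableId"]

    # 0.0.0.0/0 matches every destination; send it to the internet gateway.
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        GatewayId=igw_id,
    )
    ec2.associate_route_table(RouteTableId=route_table_id, SubnetId=subnet_id)

    return {
        "vpc_id": vpc_id,
        "public_subnet_id": subnet_id,
        "internet_gateway_id": igw_id,
        "public_route_table_id": route_table_id,
    }


# Usage (requires boto3 and AWS credentials, and creates billable resources):
#   import boto3
#   ids = build_public_network(boto3.client("ec2"))
```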
Jupyter notebooks can only run on private networks. So let's set that up next. Go back to Subnets and Create subnet. We will name this one notebookPrivate_subnet and use our new VPC. And our new CIDR will be 10.0.2.0/24. Select Create and Close.
We want Jupyter Notebook to be able to reach out to Git in order to download and upload its Jupyter notebooks. To do that, we'll need a NAT gateway. Let's create one now.
Click NAT Gateways and Create NAT Gateway. We will select our public subnet and use an existing Elastic IP address or create a new one.
Click Create NAT Gateway. We'll click Edit route tables and select Create route table. This will be our private route table [privateTable]. We will put it in our new VPC. And then we'll edit the route table by selecting Routes and then Edit routes. Then, Add route 0.0.0.0/0. Our target will be our new NAT gateway.
Next, we'll need to attach this private route table to our private subnet. Click Actions, Edit subnet associations. Select notebookPrivate_subnet and Save.
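The NAT gateway and private route table steps can be sketched with boto3 in the same way. This is an illustration under assumptions rather than the procedure from the video: ec2 is assumed to be a boto3 EC2 client, the IDs of the VPC and both subnets are assumed to be known already, the function name add_nat_for_private_subnet is hypothetical, and the sketch always allocates a new Elastic IP address instead of reusing an existing one.

```python
def add_nat_for_private_subnet(ec2, vpc_id, public_subnet_id, private_subnet_id):
    """Create a NAT gateway in the public subnet and route the private subnet through it.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    # Assumption: allocate a fresh Elastic IP for the NAT gateway.
    allocation_id = ec2.allocate_address(Domain="vpc")["AllocationId"]

    nat = ec2.create_nat_gateway(
        SubnetId=public_subnet_id,  # the NAT gateway lives in the public subnet
        AllocationId=allocation_id,
    )
    nat_id = nat["NatGateway"]["NatGatewayId"]

    # Wait until the gateway is available before pointing a route at it.
    ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])

    # Private route table: all outbound traffic goes through the NAT gateway.
    route_table = ec2.create_route_table(VpcId=vpc_id)
    route_table_id = route_table["RouteTable"]["RouteTableId"]
    ec2.create_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock="0.0.0.0/0",
        NatGatewayId=nat_id,
    )
    ec2.associate_route_table(RouteTableId=route_table_id, SubnetId=private_subnet_id)

    return {"nat_gateway_id": nat_id, "private_route_table_id": route_table_id}
```

Note that a NAT gateway and its Elastic IP address incur charges while they exist, so a script like this would normally be paired with cleanup code.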
Now we have a private and public subnet set up and ready to build our EMR cluster on. So let's go back to our previous tab. Refresh this page.
A Jupyter notebook requires EMR version 5.18 or greater and Hadoop, Spark and Livy. We can choose our software configuration here, so we'll just pick the things that we need.
Go back to hardware. And now we can see our notebook VPC. Also, our private subnet should be selected as our EC2 subnet. Next, you can decide how big you want your EMR cluster to be. For demonstration purposes, I'll leave it at the default, but this is dependent on your workload. Click Next.
We'll name our cluster NotebookCluster. Click Next and Create cluster.
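The cluster settings chosen so far can also be written down as a request for the EMR API. The sketch below only builds the parameter dictionary for boto3's run_job_flow call. The function name cluster_request is hypothetical, and the instance type, instance count and the EMR_DefaultRole and EMR_EC2_DefaultRole role names are assumptions, not values from the video; the release label defaults to the minimum version mentioned above.

```python
def cluster_request(name, subnet_id, release_label="emr-5.18.0",
                    instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for the EMR run_job_flow API call."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        # The applications the notebook needs on the cluster.
        "Applications": [{"Name": app} for app in ("Hadoop", "Spark", "Livy")],
        "Instances": {
            # Assumption: same instance type for master and core nodes.
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # The private subnet created earlier.
            "Ec2SubnetId": subnet_id,
            # Keep the cluster running so a notebook can attach to it.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


# Usage (requires boto3 and AWS credentials, and starts billable instances):
#   import boto3
#   emr = boto3.client("emr")
#   emr.run_job_flow(**cluster_request("NotebookCluster", "<private-subnet-id>"))
```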
Great, now our cluster is starting up. Next thing we'll do is add a Git repository that points to where we want to store our Jupyter notebook. Click Git repositories and Add repository. For this demo, we'll be using a repository provided by AWS Labs. We will name our repository and use our Git URL (https://github.com/awslabs/amazon-sagemaker-examples). For Branch, we'll select master. Because this is a public Git repository, we can connect to it without credentials. But if you're using a private repo, you can set your credentials here.
Awesome. Now we have our cluster starting and our Git repository. Finally, we can create our notebook. Click Notebooks and Create notebook. We will name it [demoNotebook] and provide a description. We'll choose our cluster. We can use the default security group, but we'll need to edit it later. And we'll go ahead and create our notebook.
While our notebook is starting, select Services and VPC. Then select Security Groups.
At this point, we'll need to wait a few minutes for our cluster to be built and our notebook to be created inside of it. Once that's done, we can refresh this and see that we have a new security group with the word "editor" in it. And after just about four or five minutes, here it is -- ElasticMapReduceEditors-Editor.
We will edit the outbound rules and click Edit rules. We will add a rule and we'll leave it as custom TCP and add port range 443. This will allow all outbound HTTPS traffic, which is exactly what we need to connect to GitHub. Click Save rules and Close.
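The same outbound rule can be added from code. This is a hedged sketch, assuming ec2 is a boto3 EC2 client and that the security group has the name shown in the console above. The function name allow_outbound_https is hypothetical, and the 0.0.0.0/0 destination is an assumption standing in for "all outbound HTTPS traffic".

```python
HTTPS_EGRESS = {
    "IpProtocol": "tcp",
    "FromPort": 443,
    "ToPort": 443,
    # Assumption: allow HTTPS to any destination.
    "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
}


def allow_outbound_https(ec2, group_name="ElasticMapReduceEditors-Editor", vpc_id=None):
    """Find the notebook editor security group and allow outbound TCP 443.

    `ec2` is assumed to be a boto3 EC2 client.
    """
    filters = [{"Name": "group-name", "Values": [group_name]}]
    if vpc_id is not None:
        filters.append({"Name": "vpc-id", "Values": [vpc_id]})

    groups = ec2.describe_security_groups(Filters=filters)["SecurityGroups"]
    if not groups:
        # The group only appears once the notebook has been created.
        raise LookupError(f"security group {group_name!r} not found yet")

    group_id = groups[0]["GroupId"]
    ec2.authorize_security_group_egress(GroupId=group_id, IpPermissions=[HTTPS_EGRESS])
    return group_id
```

Because the group is created a few minutes after the notebook, a script would need to retry this call until the lookup succeeds.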
Now let's go back to our EMR console. And we can see our demoNotebook status is ready. Let's go ahead and click on it. We will scroll to the Git repository section and click Link new repository.
Here we will link our SageMaker demo. Or, if you have your own Git repository, you'd link that. Click Link repository. After about 30 seconds, you can refresh and you'll see the link status has changed to "linked."
Awesome. Now let's open our Jupyter notebook. Scroll to the top and select Open in Jupyter. We can see our SageMaker demo. Let's click on that folder. You can pick any of the examples, but I'm going to pick reinforcement_learning and then deeprace_robomaker_coach_gazebo, and I'll open up the DeepRacer notebook.
Awesome. Now we have a fully functioning Jupyter notebook. And we're ready to run some code and really take advantage of the cluster we built. Thanks for watching.