I have been working on projects whose datasets don’t fit in memory, or even on my laptop. I have also been taking a class where a number of classmates reached out for help setting up a ‘Big Data’ system on AWS. The class has us use Pig, Hive, and Spark to analyze healthcare records and apply machine learning and clustering methods to the data.
I found myself repeating the setup process, so I decided to write a post outlining how I set up a system on AWS with Hive, Pig, Spark, and Zeppelin installed.
Zeppelin is a relatively new notebook that works with Scala, SQL, and more!
Step 0 – Sign up for AWS
I feel like this goes without saying, but experience taught me otherwise. https://aws.amazon.com/
Step 1 – Launch EMR
First, log into your AWS console. Once in, click on the EMR icon/link.
That will bring you to a page where you can create your own new and shiny cluster.
To install all of the software we want on our cluster, we have to go to the advanced options.
AWS now supports the standard big data software out of the box. Just check what you want installed. In this case I make sure I have Hadoop 2.7, Spark 1.6, Hive 1.0, Pig 0.14, and Zeppelin 0.5.5.
Click next and you will be taken to the Hardware tab. I always add storage to my EMR clusters. My rule of thumb is to triple the size of the data I am going to load into the cluster. I also make sure I have enough space on the master node to store my data before I load it into the cluster.
I also choose my instance types at this point. For this cluster I am choosing r3.xlarge because of the increased memory. Each node is currently $0.09/hour, so this five-node cluster costs $0.45/hour. A relatively modest cost to analyze 100 GB of data. You can find the EMR pricing here.
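If you want to sanity check the numbers before you launch, here is a minimal shell sketch of the storage rule of thumb and the hourly cost. The variable names are my own, the node count of 5 is inferred from $0.45 / $0.09, and the price is the one quoted in this post, so check the current pricing before relying on it.

```shell
# Hypothetical inputs; adjust to your own data and instance choice.
DATA_GB=100          # size of the data to load, in GB
NODES=5              # assumed node count, inferred from $0.45 / $0.09
PRICE_PER_NODE=0.09  # USD per node-hour as quoted in this post

# Rule of thumb from this post: provision three times the data size.
STORAGE_GB=$((DATA_GB * 3))

# Shell arithmetic is integer-only, so awk handles the hourly cost.
HOURLY_COST=$(awk -v n="$NODES" -v p="$PRICE_PER_NODE" 'BEGIN { printf "%.2f", n * p }')

echo "Provision about ${STORAGE_GB} GB of storage"
echo "Cluster cost: \$${HOURLY_COST}/hour"
```

Multiply the hourly cost by the number of hours you expect to keep the cluster up to get a rough budget for a session.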
You can click next and give your cluster an identity. This is also a spot where you can choose bootstrap actions that will allow you to include additional software. I will skip this for this post.
Now it’s time to choose your keypair. If you have not done this before, this is where you can run into issues. Read this link completely: https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-plan-access-ssh.html
Now that you are done with all the steps, create the cluster and wait!
Step 2 – Edit Security Groups
While you are waiting for the cluster to load, make sure you have access to the master node. You will need to ssh in (port 22) and reach Zeppelin (port 8890). From the cluster details window you can click on “security group for Master”.
You will want to change the inbound settings so you can contact the server from your IP address.
Add SSH or custom TCP rules if they do not exist. I recommend you choose the “My IP” source option to limit access to your location. Bots scan AWS ports and can/will connect to anything left open. Open up ports 22 and 8890.
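If you prefer the command line to the console, the same inbound rules can be added with the AWS CLI. This is a sketch, not the only way to do it: open_port is a helper name I made up, the security group ID is a placeholder you would copy from the cluster details page, and it assumes the CLI is already configured with credentials on your laptop.

```shell
# open_port GROUP_ID PORT MY_IP: allow inbound TCP on PORT from one address only.
# (open_port is a hypothetical helper name, not part of the AWS CLI.)
open_port() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp \
    --port "$2" \
    --cidr "$3/32"
}

# Example calls (placeholder group ID; 203.0.113.10 is a documentation address):
# open_port sg-0123456789abcdef0 22   203.0.113.10   # ssh
# open_port sg-0123456789abcdef0 8890 203.0.113.10   # Zeppelin
```

The /32 suffix limits the rule to a single address, which matches what the “My IP” option does in the console. One way to look up your public address is curl -s https://checkip.amazonaws.com.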
Now wait until your cluster is ready.
Step 3 – SSH and Load Data
Once your cluster is ready and waiting for you, you can ssh into your master node to load your data into the Hadoop file system. I keep my data in private S3 buckets, so I will be taking advantage of the AWS Command Line Tool that comes installed with EMR.
This is where you will need your keypair to ssh into your master node. I will show the Mac/Linux method. Windows users will have to look up the PuTTY instructions.
FIRST TIME USERS: be sure to chmod 400 <keypair>.pem before you try to connect to your master node.
ssh -i <keypair>.pem hadoop@<master-node-address>
Now that you are connected to the master node you can copy data into hdfs. I am first going to copy my class data from an S3 bucket to the local master node.
I want to access the data from Zeppelin, and EMR sets up a Zeppelin user. To access the data from the Zeppelin notebook you have to load it into the right HDFS directory.
hdfs dfs -put <data> /user/zeppelin
The data is now loaded into your Hadoop file system in the cluster. It is also accessible from Zeppelin.
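The two copies in this step (S3 to the master node, then master node to HDFS) can be chained into one helper that you run on the master node. This is only a sketch: load_into_hdfs is a name I made up, and the bucket and key are whatever you use for your own data.

```shell
# load_into_hdfs BUCKET KEY: stage one S3 object locally, then put it in HDFS.
# (load_into_hdfs is a hypothetical helper name.)
load_into_hdfs() {
  file=$(basename "$2")
  # S3 -> local disk on the master node
  aws s3 cp "s3://$1/$2" "$file" || return 1
  # local disk -> the HDFS directory that Zeppelin reads from
  hdfs dfs -put "$file" /user/zeppelin || return 1
  # list the directory to confirm the file arrived
  hdfs dfs -ls /user/zeppelin
}

# Example call (placeholder bucket and key):
# load_into_hdfs my-class-bucket data/events.csv
```

Each step stops the helper if it fails, so a typo in the bucket name will not leave you wondering why the file never showed up in HDFS.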
Step 4 – Use Zeppelin
Now that your data is loaded, you can take advantage of Spark and Zeppelin. Open a browser and go to the address of your master node on port 8890. Create a new notebook.
A Spark context is loaded automatically. You can type ‘sc’ to see it. You can also load the data you put in HDFS under /user/zeppelin. What is nice about Zeppelin is that you have shell commands, Markdown, Scala, and Python at your fingertips.
Step 5 – Save your Notebooks
Nothing stored on the EMR cluster survives termination, so you need to save your notebooks before you terminate your cluster. I always make a new S3 bucket and then copy the local notebooks to it. Bucket names are globally unique and new buckets cannot contain underscores, so substitute your own name for the one below.
aws s3 mb s3://zeppelin-notebooks
aws s3 sync /var/lib/zeppelin s3://zeppelin-notebooks
Step 6 – Terminate your Cluster
Now that you have done your big data analysis with your fancy new cluster, you need to terminate your cluster to avoid accruing costs. From your cluster dashboard you can hit terminate.
I always have termination protection on, so I have to turn it off before I terminate.
After a few minutes the cluster will be terminated and you’re done.
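The same two clicks (turn off termination protection, then terminate) can also be done from the AWS CLI. Treat this as a sketch: terminate_cluster is a helper name I made up, and the cluster ID is the j-... value shown on your cluster dashboard.

```shell
# terminate_cluster CLUSTER_ID: lift termination protection, then terminate.
# (terminate_cluster is a hypothetical helper name, not an AWS command.)
terminate_cluster() {
  # Termination fails while protection is on, so turn it off first.
  aws emr modify-cluster-attributes --cluster-id "$1" --no-termination-protected || return 1
  aws emr terminate-clusters --cluster-ids "$1"
}

# Example call (placeholder cluster ID):
# terminate_cluster j-XXXXXXXXXXXXX
```

Make sure your notebooks are already in S3 before you run it, because there is no undo.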
I appreciate you taking the time to get this far into my post. Let me know if it was helpful.