How to set up and run an Apache Spark Cluster on EC2? This tutorial will walk you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. It includes each step I took, whether it failed or succeeded. While your experience may not match mine exactly, I hope these steps are helpful as you attempt to run an Apache Spark cluster on Amazon EC2. There are screencasts throughout the steps.
This post assumes you have already signed up and have a verified AWS account. If not, sign up at https://aws.amazon.com/. It also assumes you are familiar with running a Spark Standalone cluster and deploying applications to a Spark cluster.
I’m going to go through step by step and also show some screenshots and screencasts along the way. For example, there is a screencast that covers steps 1 through 5 below.
Spark Cluster on Amazon EC2 Step by Step
Note: There’s a screencast of steps one through four at the end of step five below.
1) Generate a key pair in the EC2 section of the AWS Console
Click “Key Pairs” in the left nav and then click the “Create Key Pair” button. The commands in this tutorial use a key pair named courseexample.
Download the resulting key pair PEM file.
2) Create a new AWS user named courseuser and download the file which includes the User Name, Access Key Id, and Secret Access Key. We need the Access Key Id and the Secret Access Key.
3) Set your environment variables according to the key and id from the previous step. For me, that meant running the following from the command line:
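The exact commands are not reproduced here. As a minimal sketch, assuming the spark-ec2 script reads credentials from the standard AWS environment variables, setting them would look something like the following. The values shown are placeholders, not real credentials.

```shell
# Placeholder values (assumptions): substitute the Access Key Id and
# Secret Access Key from the credentials file downloaded in step 2.
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY_ID"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_ACCESS_KEY"
```

These exports only last for the current terminal session, so run the spark-ec2 commands from the same window.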
4) Open a terminal window and go to the root dir of your Spark distribution. Then, copy the PEM file from the first step in this tutorial to the root of the Spark home dir.
5) From Spark home dir, run:
ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example
I received errors about the PEM file permissions, so I changed them as the error message recommended and re-ran the script.
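The exact error text is not recorded here. A common fix for key files that are “too open” is to make the file readable by its owner only. The sketch below assumes that 400 is an acceptable mode (follow whatever your own error message recommends), and lock_down_key is a hypothetical helper name used only for illustration.

```shell
# Hypothetical helper (illustration only): make a key file readable
# by its owner and nobody else.
lock_down_key() {
    chmod 400 "$1"
}

# Apply it to the PEM file from step 1 if it is in the current directory.
if [ -f courseexample.pem ]; then
    lock_down_key courseexample.pem
fi
```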
Then, you may receive permission errors from Amazon. If so, update the permissions of courseuser in AWS and try again.
You may receive an error about zone availability such as:
Your requested instance type (m1.large) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1e, us-east-1d, us-east-1a.
If so, just update the script zone argument and re-run:
ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example
Cluster creation takes approximately 10 minutes and prints all kinds of output, including deprecation warnings and possibly errors starting Ganglia. Ganglia errors are fine if you are just experimenting. If you do need Ganglia, try a different Spark version or tweak the PHP settings on your cluster.
Here’s a screencast example of me creating an Apache Spark Cluster on EC2
6) After the cluster creation succeeds, you can verify it by opening the master web UI at http://<your-ec2-hostname>.amazonaws.com:8080/
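If you prefer checking from the terminal, a rough sketch using curl is shown below. The hostname is the same placeholder as in the URL above, and master_ui_status is a hypothetical helper name. A 200 status would suggest the master web UI is reachable.

```shell
# Placeholder (assumption): replace with your master's public hostname.
MASTER_HOST="<your-ec2-hostname>.amazonaws.com"

# Hypothetical helper: print the HTTP status code returned by the
# master web UI on port 8080, giving up after 10 seconds.
master_ui_status() {
    curl -s -o /dev/null -w "%{http_code}" --max-time 10 "http://$1:8080/"
}

# Example call:
# master_ui_status "$MASTER_HOST"
```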
7) You can also verify from the Spark shell in Scala or Python:
bin/spark-shell --master spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077
IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077
At first, both of these will likely fail and eventually report an “ERROR OneForOneStrategy: java.lang.NullPointerException”:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/17 07:30:28 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/01/17 07:30:28 ERROR OneForOneStrategy: java.lang.NullPointerException
8) This is an Amazon permission issue: port 7077 on the master is not open to your machine. You need to open port 7077 by adding an Inbound Rule to the master's security group. Here’s a screencast on how to create an Inbound Rule in EC2:
After creating this inbound rule, everything should work from both the IPython notebook and the Spark shell.
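The same inbound rule can also be sketched with the AWS CLI instead of the console. This assumes the AWS CLI is installed and configured, that spark-ec2 named the master's security group “<cluster-name>-master”, and that the cluster runs in EC2-Classic or a default VPC (otherwise you would need --group-id rather than --group-name). The helper name open_spark_port and the example address are hypothetical.

```shell
# Assumption: spark-ec2 created a security group named "<cluster-name>-master".
CLUSTER_NAME=spark-cluster-example
MASTER_GROUP="${CLUSTER_NAME}-master"

# Hypothetical helper: allow inbound TCP traffic on port 7077 from one
# CIDR range to the master's security group.
open_spark_port() {
    aws ec2 authorize-security-group-ingress \
        --group-name "$MASTER_GROUP" \
        --protocol tcp \
        --port 7077 \
        --cidr "$1"
}

# Example call (203.0.113.10/32 is a placeholder; use your own public IP):
# open_spark_port 203.0.113.10/32
```

Restricting the rule to your own address, rather than 0.0.0.0/0, keeps the master port from being open to everyone.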
Hope this helps you configure a Spark Cluster on EC2. Let me know in the page comments if I can help. Once you are finished with your EC2 instances, make sure to destroy them using the following command:
ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem destroy spark-cluster-example
For a list of additional resources and tutorials, see the Spark tutorials page.
Spark EC2 Tutorial Featured Image Credit: https://flic.kr/p/g19ivQ