How To: Apache Spark Cluster on Amazon EC2 Tutorial

Spark Cluster on EC2

How do you set up and run an Apache Spark cluster on EC2? This tutorial walks you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. It includes each step I took, regardless of whether it failed or succeeded. While your experience may not match mine exactly, I hope these steps are helpful as you set up your own Apache Spark cluster on Amazon EC2. There are screencasts throughout the steps.

Assumptions

This post assumes you have already signed up for and verified an AWS account. If not, sign up at https://aws.amazon.com/. It also assumes you are familiar with running a Spark standalone cluster and with deploying applications to a Spark cluster.

Approach

I’m going to go through this step by step and show some screenshots and screencasts along the way. For example, there is a screencast that covers steps 1 through 5 below.

Spark Cluster on Amazon EC2 Step by Step

Note: There’s a screencast of steps one through four at the end of step five below.

1) Generate a Key Pair in the EC2 section of the AWS Console

Click “Key Pairs” in the left nav and then click the Create Key Pair button.

Download the resulting key pair PEM file.
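
If you prefer the command line, the same key pair can be created with the AWS CLI. This is a minimal sketch, assuming the AWS CLI is installed and configured, and using the key name courseexample to match the rest of this tutorial:

aws ec2 create-key-pair --key-name courseexample --query 'KeyMaterial' --output text > courseexample.pem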

2) Create a new AWS user named courseuser and download the credentials file, which includes the User Name, Access Key ID, and Secret Access Key. We need the Access Key ID and Secret Access Key.
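
As an aside, the same user and access key can also be created from the AWS CLI; a sketch, assuming your CLI is configured with an account that has IAM permissions:

aws iam create-user --user-name courseuser
aws iam create-access-key --user-name courseuser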

3) Set your environment variables using the Access Key ID and Secret Access Key from the previous step. For me, that meant running the following from the command line:

export AWS_SECRET_ACCESS_KEY=F9mKN6obfusicatedpBrEVvel3PEaRiC

export AWS_ACCESS_KEY_ID=AKIAobfusicatedPOQ7XDXYTA
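
These exports only last for the current shell session. To make them persist across sessions, one option is to append them to your shell profile; a sketch assuming bash (substitute your own values):

echo 'export AWS_ACCESS_KEY_ID=<your-access-key-id>' >> ~/.bash_profile
echo 'export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>' >> ~/.bash_profile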

4) Open a terminal window and go to the root directory of your Spark distribution. Then, copy the PEM file from the first step of this tutorial to the root of your Spark home directory.
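
For example, something like the following, assuming the PEM file was saved to your Downloads directory and that you are already in the Spark home directory (both assumptions; adjust the paths to your setup):

cp ~/Downloads/courseexample.pem .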

5) From the Spark home directory, run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example

I received errors about the PEM file permissions, so I changed the permissions according to the recommendation in the error message and re-ran the script.
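
In my case, that meant restricting the key so it is readable only by my user, which is what SSH expects for private keys:

chmod 400 courseexample.pem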

You may then receive permission errors from Amazon; if so, update the permissions of the courseuser user in the AWS Console and try again.

You may receive an error about zone availability such as:

Your requested instance type (m1.large) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1e, us-east-1d, us-east-1a.

If so, just update the script’s --zone argument and re-run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example

Cluster creation takes approximately 10 minutes and produces all kinds of output, including deprecation warnings and possibly errors starting GANGLIA. The GANGLIA errors are fine if you are just experimenting; to resolve them, you can try a different Spark version or tweak the PHP settings on your cluster.
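
The launch defaults can also be tuned from the command line. The spark-ec2 script accepts options such as --slaves, --instance-type, and --spark-version; here is a hedged example (option names can vary by Spark release, so confirm with ec2/spark-ec2 --help):

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --slaves=1 --instance-type=m1.large --zone=us-east-1d launch spark-cluster-example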

Here’s a screencast example of me creating an Apache Spark cluster on EC2:

Set up an Apache Spark Cluster on Amazon EC2 Part 1

6) After the cluster creation succeeds, you can verify it by opening the master web UI at http://<your-ec2-hostname>.amazonaws.com:8080/
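
If you did not note the master hostname during launch, the spark-ec2 script can print it for you via its get-master action (I believe this works the same way across recent script versions; check ec2/spark-ec2 --help if not):

ec2/spark-ec2 get-master spark-cluster-example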

7) You can also verify it from the Spark console, in either Scala or Python.

Scala example:

bin/spark-shell --master spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077

Python example:

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077

At first, both of these will likely have issues which eventually lead to an “ERROR OneForOneStrategy: java.lang.NullPointerException”:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/17 07:30:28 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/01/17 07:30:28 ERROR OneForOneStrategy: 
java.lang.NullPointerException

8) This is an Amazon permissions issue: port 7077 is not open. You need to open up port 7077 via an Inbound Rule. Here’s a screencast on how to create an Inbound Rule in EC2:

Setting up an Apache Spark Cluster on Amazon EC2 Part 2
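
If you would rather script the rule, the AWS CLI can add the same inbound rule; a sketch, assuming spark-ec2 named the master security group spark-cluster-example-master (it names security groups after the cluster) and that opening the port to the world is acceptable for a short-lived experiment:

aws ec2 authorize-security-group-ingress --group-name spark-cluster-example-master --protocol tcp --port 7077 --cidr 0.0.0.0/0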

After creating this inbound rule, everything should work from both the IPython notebook and the Spark shell.
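
Once connected, a quick smoke test confirms the cluster is actually executing work. A minimal sketch from the pyspark prompt, using the SparkContext (sc) that the shell creates for you:

sc.parallelize(range(1000)).count()  # spreads 0..999 across the workers; should return 1000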

Conclusion

I hope this helps you configure a Spark cluster on EC2. Let me know in the page comments if I can help. Once you are finished with your EC2 instances, make sure to destroy the cluster using the following command:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem destroy spark-cluster-example
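
If you only want to pause the cluster rather than tear it down, spark-ec2 also has stop and start actions (be aware that anything on ephemeral disks is lost when instances stop):

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem stop spark-cluster-example
ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem start spark-cluster-example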

Resources

For a list of additional resources and tutorials, see the Spark tutorials page.

Spark EC2 Tutorial Featured Image Credit: https://flic.kr/p/g19ivQ

15 thoughts on “How To: Apache Spark Cluster on Amazon EC2 Tutorial”

  1. Hi Supergloo,

    Firstly, I would like to thank you for the detailed tutorial. I am new to Spark and would like to move into it. I have a question from following your tutorial and couldn’t find the way out after googling it. If possible, could you help me point out where I might have gone wrong?

    In step 7, I can verify from the Spark console in Scala. However, when I try
    >> IPYTHON_OPTS="notebook" ./bin/pyspark –master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077

    I get the following error:
    Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:242)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:241)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
    at org.apache.spark.launcher.Main.main(Main.java:86)

    Could you help me out? Thanks!!

    1. Glad this tutorial is helping you so far. What operating system are you on: Windows, Mac, or Linux? It looks like Mac or Linux. What happens when you run it without the IPYTHON_OPTS, such as "./bin/pyspark --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077"?

    2. Also, I’m wondering about the encoding of the hyphen characters... wondering if one is being converted into a dash. Did you copy and paste from this tutorial, by chance? If so, try typing in the command instead of copying and pasting. Let us know how it goes.

      1. Hi, I am using Mac, and it’s indeed a hyphen and dash issue. I successfully opened the IPython notebook after correcting it.
        Original: IPYTHON_OPTS="notebook" ./bin/pyspark -–master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077
        Corrected: IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-52-91-57-24.compute-1.amazonaws.com:7077
        (the second hyphen had been converted to a dash)

        I appreciate your reply and the help, thanks!

  2. I am a CS student. I’m new to using Spark and EC2, and I’m kind of struggling with all these commands since I’m working on Windows.
    So, my question is: how can I run these commands on Windows?
    Like this one, for example:
    ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example
    Thanks.

    1. Hi Nourhan, install Python and try running ec2/spark_ec2.py instead of ec2/spark-ec2. If you open ec2/spark-ec2 in a text editor, you’ll see it just calls the Python spark_ec2.py script. Let us know how it goes.

  3. Thank you for your patience.
    I’ve changed the environment variables, and then from spark\ec2 I tried running spark_ec2.py on its own, but nothing happened. Then I ran:
    spark_ec2.py --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example
    (with the key pair name changed to the one I am using), and nothing happens:
    no errors, no messages, nothing.

  4. Hi, I am getting the same error as Johny, but I’ve tried correcting the hyphen/dash issue with no success. Any clues as to where I am faltering?

    I am running on VMware with Ubuntu (Spark 1.6.1 / Java 1.7 / Scala 2.11.8 / Python 2.7), with NO Hadoop installed.

    IPYTHON_OPTS="notebook" /opt/spark-1.6.1/bin/pyspark -master spark://ec2-XX-XX-XX-XX.compute-1.amazonaws.com:7077
    Exception in thread "main" java.lang.IllegalArgumentException: pyspark does not support any application options.
    at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:242)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildPySparkShellCommand(SparkSubmitCommandBuilder.java:241)
    at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:117)
    at org.apache.spark.launcher.Main.main(Main.java:86)

    Regards

  5. Thank you for this detailed tutorial. Everything works as described. I created my AWS EC2 cluster using the spark-ec2 script and also managed to connect my IPython notebook to the cluster at the AWS master node:7077. However, when I tried to run a simple word count program (one that runs faultlessly in local standalone mode), it goes into an indefinite wait, saying “WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources”.

    I checked this out on Stack Overflow (where else?) and, as advised by some, changed the number of cores to 1 and the memory to 512 MB, but this does not help.

    Then I came across another Stack Overflow thread [ http://stackoverflow.com/questions/25176197/with-spark-how-to-connect-master-or-solve-an-errorwarn-taskschedulerimpl-init ] which quite clearly states that this script sets up the cluster in standalone mode and hence will NEVER accept a remote submit... hence useless!

    Finally, I note that the latest version of the Spark documentation does not refer to this script at all, even though the script is present in the distribution.

    Would appreciate any advice in this regard.

  6. Thanks for the explanation. I have tried this on VirtualBox with Ubuntu. I did get zone errors, and they were resolved as per your suggestion.
    But now I am getting this error: InvalidKeyPair.NotFound: The key pair ‘courseexample’ does not exist.
    I executed the command below:
    ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example
    I do have the courseexample.pem file in the Spark folder.

  7. Hi,
    I am using a Windows 10 machine. I have installed Python 3 and am trying to create a Spark cluster.
    Where can I find the spark-ec2 file mentioned in step 5?

  8. How do you install packages on all the nodes in the cluster? I know it’s possible by logging into each node individually. Is it possible some other way?
