Apache Spark Cluster Part 1: Run Standalone


Running an Apache Spark cluster on your local machine is a natural, early step towards Apache Spark proficiency.  Let's start understanding Spark cluster options by running a cluster on a local machine.  Running a cluster with Spark's built-in cluster manager is called "standalone" mode.  This post describes pitfalls to avoid, shows how to run a Spark cluster locally and deploy to it, explains fundamental cluster concepts such as Masters and Workers, and sets the stage for more advanced cluster options.

Let’s begin

1. Start the Master from a command prompt *

./sbin/start-master.sh

You should see something like the following:

starting org.apache.spark.deploy.master.Master, logging to /Users/toddmcgrath/Development/spark-1.1.0-bin-hadoop2.4/sbin/../logs/spark-toddmcgrath-org.apache.spark.deploy.master.Master-1-todd-mcgraths-macbook-pro.local.out

Open this log file to check things out.  From it, you should be able to determine that the Master web UI is now available at http://localhost:8080:

[Screenshot: Spark Master web UI]

The Spark Master is responsible for brokering resource requests from applications by finding a suitable set of Workers to run them on.
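
To make that concrete, here is a minimal sketch (my own illustration, not from the original post) of how a driver program points at the Master using the Spark 1.x Scala API.  The app name is a hypothetical placeholder, and the master URL should be whatever your Master log and web UI report:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name; substitute the spark://host:port URL your Master reports
val conf = new SparkConf()
  .setAppName("StandaloneSmokeTest")
  .setMaster("spark://todd-mcgraths-macbook-pro.local:7077")

val sc = new SparkContext(conf)
// ... define RDDs and run jobs against the standalone cluster ...
sc.stop()  // release the application's resources on the cluster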


2. Start a Worker

todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077


Gotcha Warning: I tried to shortcut starting a Spark Worker by assuming some defaults.  I made my first screencast of the attempt here: http://youtu.be/pUB620wcqm0

Three failed attempts:

1. bin/spark-class org.apache.spark.deploy.worker.Worker

2. bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077

3. bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077

Finally, I tried using the exact spark:// URL reported in the Master's log and web console, which worked:

toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077


Verify the Worker by refreshing http://localhost:8080.  You should see the Worker listed:

[Screenshot: Worker registered in the Spark Master web UI]

Spark Workers are responsible for processing requests sent from the Spark Master: each Worker launches executors, which perform the actual work of a Spark application.
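
As an aside (my own sketch, not part of the original steps), once the spark-shell from step 3 below is connected, you can watch this division of labor: each task runs in an executor launched by a Worker and can report the hostname it ran on.

// Run from the connected spark-shell of step 3; on a one-machine cluster every
// partition reports the same host, but the tasks execute in the Worker's
// executor JVM, not in the driver.
val hosts = sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  Iterator((java.net.InetAddress.getLocalHost.getHostName, iter.size))
}.collect()
hosts.foreach { case (host, n) => println(host + " processed " + n + " elements") }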


3. Connect the REPL to the Spark Cluster (KISS Principle)

todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ ./bin/spark-shell --master spark://todd-mcgraths-macbook-pro.local:7077

If all goes well, you should see something similar to the following:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/06 12:44:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2014-12-06 12:44:33.306 java[22811:1607] Unable to load realm info from SCDynamicStore
14/12/06 12:44:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
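
Before moving on, a quick smoke test (my own addition) confirms the cluster actually executes work.  From the REPL:

// Trivial job: the filter and count run in the Worker's executor, and the
// completed job appears in the Master web UI at http://localhost:8080.
val evens = sc.parallelize(1 to 100000).filter(_ % 2 == 0).count()
println("Counted " + evens + " even numbers")  // expect 50000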


And there you are.  Ready to proceed.  For more detailed analysis of standalone configuration options and scripts, see https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

This example of running a Spark cluster locally ensures we're ready to take on more difficult concepts, such as using cluster managers like YARN and Mesos.  We'll also cover configuring a Spark cluster on Amazon.

Before we move on to more advanced Spark cluster setups, though, we'll cover deploying and running a driver program on a Spark cluster.


* This post will use a Mac, so translate to your OS accordingly.
