Apache Spark Cluster Part 1: Run Standalone

Spark console

Running an Apache Spark Cluster on your local machine is a natural and early step towards Apache Spark proficiency.  As you are probably already aware, you can use a YARN-based Spark Cluster running in Cloudera, Hortonworks or MapR, and there are numerous options for running a Spark Cluster in Amazon, Google or Azure as well.  But running a standalone cluster on your own machine is a great way to understand some of the mechanics of Spark Clusters.

In this post, let’s start understanding our Spark cluster options by running a Spark cluster on a local machine.  Running a local cluster this way is called “standalone” mode.  This post will describe pitfalls to avoid, show how to run a Spark Cluster locally and deploy to it, describe fundamental cluster concepts like Masters and Workers, and finally set the stage for more advanced cluster options.

If you are new to Apache Spark or want to learn more, you are encouraged to check out the Spark with Scala tutorials or Spark with Python tutorials.

Spark Cluster Standalone Setup Requirements

This tutorial assumes you have already downloaded Apache Spark from http://spark.apache.org.

After download, this Spark cluster tutorial assumes you have expanded the downloaded file (unzipped, untarred, etc.) into a directory in your environment.

Finally, it assumes you have opened a terminal or command prompt and are at the root of the expanded Spark directory.

For example, if you downloaded a file called spark-2.4.0-bin-hadoop2.7.tgz and expanded it into a directory such as /dev/spark-2.4.0-bin-hadoop2.7/, your terminal prompt should be at /dev/spark-2.4.0-bin-hadoop2.7/.

Spark Cluster Standalone Steps

Let’s begin.

1. Start the Spark Master from your command prompt *

./sbin/start-master.sh

You should see something like the following:

starting org.apache.spark.deploy.master.Master, logging to /Users/toddmcgrath/Development/spark-1.1.0-bin-hadoop2.4/sbin/../logs/spark-toddmcgrath-org.apache.spark.deploy.master.Master-1-todd-mcgraths-macbook-pro.local.out

Open this log file to check things out.  From it, you should be able to determine that http://localhost:8080 is now available for viewing:

Spark UI

As a refresher, the Spark Master is responsible for brokering resource requests by finding a suitable set of workers to run Spark applications.
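The master’s host and ports can also be pinned explicitly so its URL is predictable.  A minimal sketch, assuming the --host, --port, and --webui-port flags documented in the standalone guide (the spark:// URL below is illustrative):

```shell
# Assumption: run from the root of the expanded Spark directory.
# Pinning host and port makes the master URL predictable instead of
# defaulting to the machine's hostname:
#
#   ./sbin/start-master.sh --host localhost --port 7077 --webui-port 8080
#
# The worker and spark-shell would then connect with:
MASTER_URL="spark://localhost:7077"
echo "${MASTER_URL}"
```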

2. Start a Spark Worker

todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077

Note: I tried a shortcut to starting a Spark Worker by assuming some defaults.  I made my first screencast of the attempts here: http://youtu.be/pUB620wcqm0

Three failed attempts at starting a Spark worker:

1. bin/spark-class org.apache.spark.deploy.worker.Worker

2. bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077

3. bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077

Finally, I tried using the URL shown in the Spark console, which worked:

toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077

———

Verify the Worker by viewing http://localhost:8080.  You should see the worker:

spark-worker-running

As a refresher, Spark Workers are responsible for launching executors that process the tasks of Spark applications, as directed by the Spark Master.
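Instead of invoking spark-class directly, Spark also ships a wrapper script, sbin/start-slave.sh (renamed start-worker.sh in Spark 3.x).  Whichever you use, the lesson from the failed attempts above is that the master advertises the machine’s hostname, not localhost, so the worker must be given that exact URL.  A hedged sketch:

```shell
# Assumption: the master is already running on the default port 7077
# and advertises the machine's hostname (as shown at
# http://localhost:8080). Building the URL from `hostname` avoids
# hard-coding localhost or 127.0.0.1, which is why attempts 2 and 3
# above failed:
MASTER_URL="spark://$(hostname):7077"
echo "${MASTER_URL}"

#   ./sbin/start-slave.sh "${MASTER_URL}"   # start-worker.sh in Spark 3.x
```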

3.  Connect REPL to Spark Cluster

todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ ./bin/spark-shell --master spark://todd-mcgraths-macbook-pro.local:7077

If all goes well, you should see something similar to the following:

Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/06 12:44:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2014-12-06 12:44:33.306 java[22811:1607] Unable to load realm info from SCDynamicStore
14/12/06 12:44:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.

And there you go.  You are ready to proceed.
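A quick smoke test from the REPL confirms the shell really is talking to the cluster.  This is a sketch of what a session might look like; sc is the SparkContext that spark-shell creates for you, and the exact master URL will match whatever you passed to --master:

```scala
// Typed at the spark-shell prompt (requires the running cluster above):
//
//   scala> sc.master
//   res0: String = spark://todd-mcgraths-macbook-pro.local:7077
//
//   scala> sc.parallelize(1 to 100).reduce(_ + _)
//   res1: Int = 5050
//
// The distributed reduce returns the same value as a plain local sum:
val expected = (1 to 100).reduce(_ + _)
println(expected)  // 5050
```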

Spark Cluster Conclusion

The next Spark Cluster tutorial is Part 2 Deploy Scala Program to Spark Cluster.

For a more detailed analysis of standalone configuration options and scripts, see https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts

This example of running a Spark cluster locally ensures we’re ready to take on more advanced concepts, such as the YARN and Mesos cluster managers.  We’ll also cover configuring EC2-based Spark clusters on Amazon.

Also, before we move on to more advanced Spark cluster setups, we’ll cover deploying and running a driver program on a Spark cluster and deploying third-party jars with Spark Scala.

* This post will use a Mac, so translate to your OS accordingly.
