Running an Apache Spark cluster on your local machine is a natural early step toward Apache Spark proficiency. As you may already know, you can use a YARN-based Spark cluster running in Cloudera, Hortonworks, or MapR, and there are numerous options for running a Spark cluster in Amazon, Google, or Azure as well. But running a standalone cluster on its own is a great way to understand some of the mechanics of Spark clusters.
In this post, let’s start understanding our Spark cluster options by running a Spark cluster on a local machine. Running a local cluster is called “standalone” mode. This post will describe pitfalls to avoid, show how to run a Spark cluster locally, deploy to a locally running Spark cluster, describe fundamental cluster concepts such as Masters and Workers, and finally set the stage for more advanced cluster options.
If you are new to Apache Spark or want to learn more, you are encouraged to check out the Spark with Scala tutorials or Spark with Python tutorials.
Spark Cluster Standalone Setup Requirements
This tutorial assumes you have already downloaded Apache Spark from http://spark.apache.org
After downloading, this Spark cluster tutorial assumes you have expanded the downloaded file (unzipped, untarred, etc.) into a directory in your environment.
Finally, it assumes you have opened a terminal or command prompt and are at the root of the expanded Spark directory.
For example, if you downloaded a file called spark-2.4.0-bin-hadoop2.7.tgz and expanded it into a directory such as /dev/spark-2.4.0-bin-hadoop2.7/, your terminal prompt is at /dev/spark-2.4.0-bin-hadoop2.7/
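On macOS or Linux, the steps above can be sketched as follows; adjust the file name and directory to match the version you actually downloaded:

```shell
# Expand the downloaded Spark archive and move into its root directory.
# The file name here matches the example above; substitute your own download.
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
cd spark-2.4.0-bin-hadoop2.7
```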
Spark Cluster Standalone Steps
1. Start the Spark Master from your command prompt *
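The Master is started with the launch script that ships in Spark’s sbin directory (per the standalone documentation); a minimal invocation from the root of the Spark directory looks like this:

```shell
# Start a standalone Spark Master on this machine.
# The script backgrounds the process and prints the log file location.
sbin/start-master.sh
```

The spark:// URL the Master binds to appears in the log file and on the Master web UI.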
You should see something like the following:
starting org.apache.spark.deploy.master.Master, logging to /Users/toddmcgrath/Development/spark-1.1.0-bin-hadoop2.4/sbin/../logs/spark-toddmcgrath-org.apache.spark.deploy.master.Master-1-todd-mcgraths-macbook-pro.local.out
Open this log file to check things out. You should be able to determine that the Master web UI is now available at http://localhost:8080:
As a refresher, the Spark Master is responsible for brokering resource requests by finding a suitable set of Workers to run Spark applications.
2. Start a Spark Worker
todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077
Note: I tried a shortcut to starting a Spark Worker by assuming some defaults. I made my first screencast of the attempts here: http://youtu.be/pUB620wcqm0
Three failed attempts at starting a Spark Worker:
1. bin/spark-class org.apache.spark.deploy.worker.Worker
2. bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
3. bin/spark-class org.apache.spark.deploy.worker.Worker spark://127.0.0.1:7077
Finally, I tried the exact URL shown in the Spark Master console, which worked. The Worker must register using the same spark:// URL the Master bound to, so localhost or 127.0.0.1 will not match a Master bound to the machine’s hostname:
toddmcgrath$ bin/spark-class org.apache.spark.deploy.worker.Worker spark://todd-mcgraths-macbook-pro.local:7077
Verify the Worker by viewing http://localhost:8080. You should see the worker:
As a refresher, Spark Workers are responsible for launching executors that process the work sent from the Spark Master.
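The Worker launch command also accepts options to cap the resources it offers to the cluster; the flags below come from the standalone Worker’s usage, and the Master URL and values are from the example above — adjust them for your machine:

```shell
# Start a Worker offering 2 cores and 2 GB of memory,
# registering against the Master URL shown in the Master web UI.
bin/spark-class org.apache.spark.deploy.worker.Worker \
  --cores 2 --memory 2g \
  spark://todd-mcgraths-macbook-pro.local:7077
```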
3. Connect REPL to Spark Cluster
todd-mcgraths-macbook-pro:spark-1.1.0-bin-hadoop2.4 toddmcgrath$ ./bin/spark-shell --master spark://todd-mcgraths-macbook-pro.local:7077
If all goes well, you should see something similar to the following:
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.1.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_65)
Type in expressions to have them evaluated.
Type :help for more information.
14/12/06 12:44:32 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2014-12-06 12:44:33.306 java[22811:1607] Unable to load realm info from SCDynamicStore
14/12/06 12:44:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc.
And there you go. You are ready to proceed.
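To confirm the Workers are actually executing tasks, run a trivial job from the REPL. You can type sc.parallelize(1 to 1000).count() at the scala> prompt, or pipe it in non-interactively as sketched below (the Master URL is from the example above; use your own):

```shell
# Run a trivial distributed count against the standalone cluster.
# spark-shell reads the expression from stdin and exits when input ends.
echo 'sc.parallelize(1 to 1000).count()' | \
  ./bin/spark-shell --master spark://todd-mcgraths-macbook-pro.local:7077
```

The completed job also appears under the running application’s entry in the Master web UI at http://localhost:8080.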
Spark Cluster Conclusion
The next Spark Cluster tutorial is Part 2 Deploy Scala Program to Spark Cluster.
For a more detailed analysis of standalone configuration options and scripts, see https://spark.apache.org/docs/latest/spark-standalone.html#cluster-launch-scripts
This example of running a Spark cluster locally is to ensure we’re ready to take on more difficult concepts, such as using cluster managers like YARN and Mesos. We’ll also cover configuring Spark clusters on Amazon EC2.
Also, before we move on to more advanced Spark cluster setups, we’ll cover deploying and running a driver program on a Spark cluster and deploying 3rd party jars with Spark Scala.
* This post will use a Mac, so translate to your OS accordingly.