Apache Spark Cluster Part 2: Deploy Scala Program to Spark Cluster

How do you deploy a Scala program to a Spark Cluster? In this tutorial, we’ll cover how to build, deploy and run a Scala driver program on a Spark Cluster. The focus will be on a simple example in order to gain confidence and set the foundation for more advanced examples in the future. To keep things interesting, we’re going to add some SBT and the Sublime Text 3 editor for fun.

This post assumes some Scala and SBT experience. If you don’t have any, it’s a chance to gain further understanding of the Scala language and the simple build tool (SBT).

Requirements

You will need a running Spark cluster (this tutorial submits to a standalone master on port 7077), a local Spark distribution for its spark-submit script, and SBT installed.

Steps to Deploy Scala Program to Spark Cluster

1. Create a directory for the project: mkdir sparksample

2. Create some directories for SBT:

cd sparksample

mkdir project

mkdir -p src/main/scala

You should now be in the sparksample directory, which contains the project/ and src/ directories.

3. (Optional) We’re going to sprinkle this Spark tutorial with the Sublime Text 3 editor and an SBT plugin. This step isn’t necessary for deploying a Scala program to a Spark cluster, so skip it if you prefer.

In any text editor, create a plugins.sbt file in the project/ directory.

Add the Sublime plugin by following the instructions at https://github.com/orrsella/sbt-sublime

4. Create an SBT file in the root directory. For this tutorial, the root directory is sparksample/. Name the file “sparksample.sbt” and give it the following content:

name := "Spark Sample"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
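
If you would rather script steps 1, 2 and 4 than type them, here is a minimal sketch of how that could look. The function name bootstrap_project and the throwaway demo directory are assumptions for illustration only; the directory names and build file content are the ones from the steps above.

```shell
#!/bin/sh
# Sketch: create the project skeleton and build file from steps 1, 2 and 4.
# bootstrap_project is a hypothetical helper name, not part of SBT or Spark.
bootstrap_project() {
  base="${1:-.}"
  proj="$base/sparksample"
  # -p creates the missing parent directories (src/ and src/main/) in one call
  mkdir -p "$proj/project" "$proj/src/main/scala"
  # Write the build definition exactly as shown in step 4
  cat > "$proj/sparksample.sbt" <<'EOF'
name := "Spark Sample"

version := "1.0"

scalaVersion := "2.10.3"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
EOF
}

# Demo in a temporary directory so nothing in your workspace is touched
demo_dir="$(mktemp -d)"
bootstrap_project "$demo_dir"
(cd "$demo_dir" && find sparksample | sort)
```

To create the project in your current directory instead, call bootstrap_project with no argument.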

 

5. Create a file named SparkPi.scala in the src/main/scala directory. Because this is an introductory tutorial, let’s keep things simple and copy and paste this code from the Spark samples. The code is:

import scala.math.random

import org.apache.spark._

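/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Pi")
    val spark = new SparkContext(conf)
    // Number of partitions; taken from the first argument, defaulting to 2
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = 100000 * slices
    // Sample n random points in the square [-1, 1] x [-1, 1] and
    // count how many land inside the unit circle
    val count = spark.parallelize(1 to n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x*x + y*y < 1) 1 else 0
    }.reduce(_ + _)
    // The circle covers pi/4 of the square, so pi is about 4 * count / n
    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}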
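
If you are curious why the program prints 4.0 * count / n, the arithmetic can be tried without a cluster. The program samples points uniformly in a 2 x 2 square and counts those that fall inside the unit circle; the circle covers pi/4 of the square, so the hit fraction multiplied by 4 approximates pi. Below is a rough single-machine sketch of the same estimate using awk. It is not Spark code, and the function name estimate_pi is made up for this example.

```shell
#!/bin/sh
# Sketch: the same Monte Carlo estimate as SparkPi, run locally with awk.
# estimate_pi is a hypothetical helper; $1 is the number of points to sample.
estimate_pi() {
  awk -v n="$1" 'BEGIN {
    srand()                       # seed the generator from the current time
    count = 0
    for (i = 0; i < n; i++) {
      x = rand() * 2 - 1          # uniform in [-1, 1)
      y = rand() * 2 - 1
      if (x * x + y * y < 1) count++   # point landed inside the unit circle
    }
    printf "Pi is roughly %.5f\n", 4.0 * count / n
  }'
}

estimate_pi 200000
```

The Spark version does the same counting, but splits the n points across the slices of the parallelized range and adds up the partial counts with reduce.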

 

6. Start SBT from command prompt: sbt

Running sbt may trigger many downloads of third-party library jars. Whether it does depends on whether you have attempted something similar with SBT in the past and your local cache already has the files.

(If you want to continue with the Sublime example, run the ‘gen-sublime’ command from the SBT console and open the generated Sublime project. You can then edit the sample Scala code from step 5 in Sublime.)

7. In the SBT console, run ‘package’ to create a jar. The jar will be created under the target/ directory (in target/scala-2.10/ for this build). Note the name of the generated jar; if you followed the previous sparksample.sbt step exactly, the filename will be spark-sample_2.10-1.0.jar
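
The jar name follows from the settings in sparksample.sbt: the project name is lowercased with spaces turned into hyphens, followed by the Scala binary version and the project version. The sketch below approximates that rule so you can predict the filename for other settings. The helper names are hypothetical, and the normalization is a simplification of what SBT does, so treat the result as a guess to check against the target/ directory.

```shell
#!/bin/sh
# Sketch: predict the jar filename from the name, scalaVersion and version settings.
# All three function names are hypothetical helpers for this example.

# Lowercase the name and replace each run of non-alphanumeric characters with a hyphen
normalize_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' \
    | sed -e 's/[^a-z0-9][^a-z0-9]*/-/g' -e 's/^-//' -e 's/-$//'
}

# Scala binary version: the first two components of the full version (2.10.3 -> 2.10)
scala_binary_version() {
  printf '%s' "$1" | cut -d. -f1,2
}

# Usage: jar_name <name> <scalaVersion> <version>
jar_name() {
  printf '%s_%s-%s.jar\n' "$(normalize_name "$1")" "$(scala_binary_version "$2")" "$3"
}

jar_name "Spark Sample" "2.10.3" "1.0"   # prints spark-sample_2.10-1.0.jar
```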

8. Exit SBT, or switch to a different terminal window, and call the “spark-submit” script with the appropriate --master argument value. For example:

../spark-1.6.1-bin-hadoop2.4/bin/spark-submit --class "SparkPi" --master spark://todd-mcgraths-macbook-pro.local:7077 target/scala-2.10/spark-sample_2.10-1.0.jar

So, in this example, it’s safe to presume I have the following directory structure:

parentdir

-spark-1.6.1-bin-hadoop2.4

-sparksample

We can assume this because I’m running ../spark-1.6.1-bin-hadoop2.4/bin/spark-submit from the sparksample directory.
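
Because the Spark location, master URL and jar path differ from machine to machine, it can help to keep them in variables. The following is a small sketch that only prints the spark-submit command rather than running it. The variable names, the submit_cmd helper and the localhost master URL are assumptions for illustration; substitute your own values.

```shell
#!/bin/sh
# Sketch: assemble the spark-submit command line from a few variables.
# The defaults below are placeholders; spark://localhost:7077 is an assumed master URL.
SPARK_HOME="${SPARK_HOME:-../spark-1.6.1-bin-hadoop2.4}"
MASTER_URL="${MASTER_URL:-spark://localhost:7077}"
APP_JAR="${APP_JAR:-target/scala-2.10/spark-sample_2.10-1.0.jar}"

# submit_cmd is a hypothetical helper; $1 is the main class to run
submit_cmd() {
  printf '%s/bin/spark-submit --class "%s" --master %s %s\n' \
    "$SPARK_HOME" "$1" "$MASTER_URL" "$APP_JAR"
}

# Dry run: show the command that would be executed
submit_cmd SparkPi
```

Once the printed command looks right for your setup, you can run it by hand from the sparksample directory.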

9. You should see the output “Pi is roughly…”, and if you go to the Spark UI, you should see “Spark Pi” under completed applications:

Figure: Completed application after running in Spark Cluster
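
The result line can be easy to miss among the log messages. As a rough sketch, the filter below pulls just the number out of whatever text it is given. The helper name extract_pi and the sample text, including the value 3.14156, are invented for this example and are not real Spark output.

```shell
#!/bin/sh
# Sketch: extract the estimate from program output.
# extract_pi is a hypothetical helper that reads text on standard input.
extract_pi() {
  # Print only the number that follows "Pi is roughly"
  sed -n 's/^Pi is roughly \([0-9.][0-9.]*\).*$/\1/p'
}

# Made-up sample text standing in for a real run
sample_output='(placeholder log line)
Pi is roughly 3.14156
(placeholder log line)'

printf '%s\n' "$sample_output" | extract_pi
```

In practice you would pipe the spark-submit command from step 8 into extract_pi. With Spark's default logging setup the log messages typically go to standard error, so appending 2>/dev/null to the command should leave only the program's own output, though this depends on your logging configuration.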

Conclusion

That’s it. You’ve built, deployed and run a Scala driver program on a Spark Cluster. Simple, I know, but with this experience, you are in a good position to move to more complex examples and use cases. Let me know if you have any questions in the comments below.

Screencast

Here’s a screencast of the steps above:

Spark deploy jar to cluster example

 

Further Reference

  • http://spark.apache.org/docs/latest/submitting-applications.html
