In this Apache Spark cluster deploy tutorial, we’ll cover how to deploy Spark driver programs to a Spark cluster when the driver program utilizes third-party jars. In this case, we’re going to use code examples from previous Spark SQL and Spark Streaming tutorials.
At the end of this tutorial, there is a screencast of all the steps. Also, see the Reference section below for Apache Spark Cluster Deploy Part I and II, the source code, and links to the Spark SQL and Spark Streaming tutorials.
- Source code layout
- Install assembly plugin
- Review build.sbt file
- Package Spark Driver Program for Deploy
- Deploy to Spark Cluster
1. Source Code Layout
There is nothing extraordinary about our source code layout. We’re going to build with `sbt`, so there are the usual suspect directories and files: `src/main/scala`, `project`, `project/build.properties` and `build.sbt`.
2. Install SBT Assembly Plugin
In order to package our 3rd party dependencies into a convenient “fat jar”, we’re going to install and use the sbt-assembly plugin. The plugin describes itself this way: “The goal is simple: Create a fat JAR of your project with all of its dependencies”. Installing the plugin is simple: create a file called `assembly.sbt` in your project/ directory. In our case, the file contains one line:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
3. Review build.sbt file for Apache Spark Cluster Deploy
name := "spark-sql-examples"

version := "1.0"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

scalaVersion := "2.11.8"

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
  "com.databricks" %% "spark-csv" % "1.3.0",
  "mysql" % "mysql-connector-java" % "5.1.12"
)
Again, links to the source code are included below in the reference section. At the root of the spark-sql/ directory, there is the above `build.sbt` file. In this file, there are a couple of lines worth discussing. First, the line beginning with `assemblyOption` instructs the sbt-assembly plugin to leave the Scala library jars out of the “fat jar” we’re going to build for Spark deploy, because the Spark installation on the cluster already ships them.
Next, notice how we indicate the “spark-sql” library is already provided.
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
The `provided` scope tells sbt-assembly to leave this library out of the jar we’re going to assemble for deploy, since the cluster supplies Spark at runtime.
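If your driver program also uses other Spark modules, such as the Spark Streaming examples mentioned at the start of this tutorial, the same scope applies to each of them. Here is a minimal sketch of what that `build.sbt` fragment might look like. It assumes the `spark-streaming` module at the same 1.6.1 version, and the `sparkVersion` value is a name made up for illustration; neither appears in the original build file.

```scala
// build.sbt fragment (sketch only, not the build file from the source repo)

// Hypothetical helper value so every Spark module stays on one version.
val sparkVersion = "1.6.1"

libraryDependencies ++= Seq(
  // Supplied by the cluster at runtime, so kept out of the fat jar.
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // Assumption: added only to illustrate a second Spark module.
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
  // Third-party library with no "provided" scope, so it is bundled into the fat jar.
  "com.databricks" %% "spark-csv" % "1.3.0"
)
```

The rule of thumb in this sketch: anything the Spark installation already ships gets `provided`, and anything it does not ship goes into the fat jar.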
Everything else seems fairly standard to me, but let me know in the comments if you have any questions. Let’s keep going.
4. Package Spark Driver Program for Deploy
From your shell or editor, simply run `sbt assembly` to produce the fat jar. In this case, the file created will be target/scala-2.11/spark-sql-examples-assembly-1.0.jar. This is the jar we will deploy with `spark-submit`.
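Before deploying, it can be worth confirming that the `provided` scope and the `includeScala = false` option did what we expect. The following is a small sketch, not part of the original source code, that lists the entries of the assembled jar and reports any that sit under the Spark or Scala library packages. The object name `FatJarCheck` and its helper functions are made up for this example, and the default path is simply the jar name from this step.

```scala
import java.util.jar.JarFile
import scala.collection.JavaConverters._

object FatJarCheck {

  // Package prefixes we expect to be absent from the fat jar:
  // Spark itself ("provided") and the Scala library (includeScala = false).
  val excludedPrefixes: Seq[String] = Seq("org/apache/spark/", "scala/")

  /** Entry names that fall under any of the given package prefixes. */
  def bundledUnder(entryNames: Seq[String], prefixes: Seq[String]): Seq[String] =
    entryNames.filter(name => prefixes.exists(p => name.startsWith(p)))

  /** Reads every entry name from the jar at `jarPath`. */
  def entryNames(jarPath: String): Seq[String] = {
    val jar = new JarFile(jarPath)
    try jar.entries().asScala.map(_.getName).toList
    finally jar.close()
  }

  def main(args: Array[String]): Unit = {
    // Defaults to the jar name used in this tutorial; pass another path as the first argument.
    val jarPath = args.headOption
      .getOrElse("target/scala-2.11/spark-sql-examples-assembly-1.0.jar")
    val unexpected = bundledUnder(entryNames(jarPath), excludedPrefixes)
    if (unexpected.isEmpty) {
      println(s"OK: no Spark or Scala library classes found in $jarPath")
    } else {
      println(s"Found ${unexpected.size} entries that were expected to be excluded, for example:")
      unexpected.take(5).foreach(name => println("  " + name))
    }
  }
}
```

If you only want a quick look, `jar tf` on the same file prints the entry list and you can scan it by eye.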
5. Deploying Fat Jar to the Spark Cluster
Nothing out of the ordinary here. Just run `spark-submit` using the jar produced in Step 4. For example:
spark-1.6.1-bin-hadoop2.4/bin/spark-submit --class "com.supergloo.SparkSQLJDBCApp" --master spark://todd-mcgraths-macbook-pro.local:7077 ./target/scala-2.11/spark-sql-examples-assembly-1.0.jar
Here’s a screencast from the Apache Spark with Scala training course which performs the steps above.
Reference
- sbt-assembly plugin https://github.com/sbt/sbt-assembly
- Source code https://github.com/tmcgrath/spark-course/tree/master/spark-sql
- The source code is based on two previous Spark SQL tutorials:
  - Spark SQL CSV Examples
  - Spark SQL JDBC Example
- Apache Spark with Scala Cluster Part II Deploying
- Apache Spark Cluster on Amazon EC2 Tutorial
Featured image credit: https://flic.kr/p/qsyGca