Deploy Spark Scala Programs with 3rd Party JARs to a Cluster


Overview

In this Apache Spark cluster deploy tutorial, we’ll cover how to deploy Spark driver programs to a Spark cluster when the driver program utilizes third-party jars.  In this case, we’re going to use code examples from previous Spark SQL and Spark Streaming tutorials.

At the end of this tutorial, there is a screencast of all the steps.  Also, see the Resources section below for Apache Spark Cluster Deploy Part I and II, the source code, and links to the Spark SQL and Spark Streaming tutorials.

Steps

  1. Source code layout
  2. Install assembly plugin
  3. Review build.sbt file
  4. Package Spark Driver Program for Deploy
  5. Deploy to Spark Cluster

1. Source Code Layout

As you can see, there is nothing extraordinary about our source code layout.  We’re going to build with `sbt`, so the usual suspects are present: src/main/scala, the project/ directory, project/build.properties, and a build.sbt file.
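
For reference, the layout looks roughly like this; the package path under src/main/scala matches the `com.supergloo` class we deploy later, and the exact source file names will vary:

spark-sql/
├── build.sbt
├── project/
│   ├── assembly.sbt
│   └── build.properties
└── src/
    └── main/
        └── scala/
            └── com/
                └── supergloo/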

2. Install SBT Assembly Plugin

In order to package our 3rd party dependencies into a convenient “fat jar”, we’re going to install and use the sbt-assembly plugin.  The plugin’s stated goal is simple: “Create a fat JAR of your project with all of its dependencies”.  Installing it is easy: just create a file called `assembly.sbt` in your project/ directory.  In our case, the file contains one line:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
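
If you want a quick sanity check that sbt picked up the plugin, the built-in `plugins` task should list sbt-assembly among the enabled plugins:

sbt plugins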

3. Review build.sbt file for Apache Spark Cluster Deploy

name := "spark-sql-examples"

version := "1.0"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

scalaVersion := "2.11.8"

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "1.6.1" % "provided",
  "com.databricks" %% "spark-csv" % "1.3.0",
  "mysql" % "mysql-connector-java" % "5.1.12"
)

Again, links to the source code are included below in the Resources section.  At the root of the spark-sql/ directory is the above `build.sbt` file.  In this file, there are a couple of lines worth discussing.  First, the line beginning with `assemblyOption` tells the sbt-assembly plugin not to include the Scala library jars in the “fat jar” we’re going to build for the Spark deploy.


Next, notice how we indicate the “spark-sql” library is already provided.

"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"

This tells sbt-assembly not to include it in the jar we’re going to assemble for deploy, since the Spark cluster already provides the Spark SQL classes at runtime.
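
For context, here is a minimal sketch of the kind of driver code that depends on these third-party jars.  It is written in the spirit of the `SparkSQLJDBCApp` used later in this tutorial, but the file path, JDBC URL, and table name are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SparkSQLJDBCApp {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("spark-sql-examples")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // spark-csv is bundled in our fat jar, not provided by the cluster
    val csvDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .load("data/example.csv")                               // illustrative path

    // mysql-connector-java is also bundled in the fat jar
    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/sparksql")  // illustrative URL
      .option("driver", "com.mysql.jdbc.Driver")
      .option("dbtable", "example_table")                     // illustrative table
      .load()

    csvDF.show()
    jdbcDF.show()
  }
}

Because spark-sql is marked “provided”, the classes behind SQLContext come from the cluster at runtime, while the spark-csv and MySQL driver classes have to travel inside our application jar.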

Everything else seems fairly standard to me, but let me know if you have any questions in the comments.  Let’s keep going.

4. Package Spark Driver Program for Deploy

From your shell or editor, simply run `sbt assembly` to produce the fat jar.  In this case, the file created will be target/scala-2.11/spark-sql-examples-assembly-1.0.jar.  This is the jar we will deploy with `spark-submit`.
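
Before deploying, it can be worth double-checking that the third-party classes actually made it into the assembly while the Spark and Scala classes stayed out.  Listing the jar contents works for this; the grep patterns here are just illustrative:

jar tf target/scala-2.11/spark-sql-examples-assembly-1.0.jar | grep com/databricks
jar tf target/scala-2.11/spark-sql-examples-assembly-1.0.jar | grep mysql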

5. Deploying the Fat Jar to the Spark Cluster

Nothing out of the ordinary here.  Just run `spark-submit` with the jar produced in Step 4.  For example:

spark-1.6.1-bin-hadoop2.4/bin/spark-submit --class "com.supergloo.SparkSQLJDBCApp" --master spark://todd-mcgraths-macbook-pro.local:7077 ./target/scala-2.11/spark-sql-examples-assembly-1.0.jar

Screencast

Here’s a screencast from the Apache Spark with Scala training course that performs the steps above.

Resources

Featured image credit: https://flic.kr/p/qsyGca
