Spark Submit Command Line Arguments


The primary reason we want to use Spark submit command line arguments is to avoid hard-coding values into our code. Hard-coded values make our application rigid: changing them means editing, recompiling, and redeploying.

For example, let's assume we want to run our Spark job in both test and production environments. Let's further assume that our Spark job reads from a Cassandra database and that test and production use different Cassandra hosts. In this case, we should prefer passing dynamic configuration values when submitting the job to each environment. The alternative, hard-coding the Cassandra host in code and then recompiling and redeploying for each environment, wastes time.

Spark Submit Command Line Arguments Overview

So, how do we process Spark submit command line arguments in our Scala code? You would think this would be easy by now, but I've been surprised at how difficult it can be for people new to Scala and Spark.

Of course, one way to parse command line args is to inspect the String array passed to your `main` function. For example, perhaps your code looks similar to:

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("SparkCassandraExampleApp")
  if (args.length > 0) conf.setMaster(args(0)) // first arg: Spark master URL
  if (args.length > 1) conf.set("spark.cassandra.connection.host", args(1)) // second arg: Cassandra host

...

But this isn't good. It's brittle: we are not using default values, and the `args` values depend on a specific ordering. Let's address these two weaknesses in our solution.

In this tutorial, I'll present a simple example of a flexible and scalable way to process command-line args in your Scala-based Spark jobs. We're going to use the `scopt` library [1] and update our previous Spark with Cassandra example.

To update our previous Spark Cassandra example to use command-line arguments, we’re going to update two areas of the project: the SBT build file and our code.

Step 1: Update SBT Build File for Scopt Command Line Option Parsing Library

Easy. Essentially, we go from:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0"
)

to:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0",
   "com.github.scopt" %% "scopt" % "3.5.0"
)

Step 2: Update Spark Scala code to process command line args

There are multiple areas of the code we need to update. First, we're going to update our code to use a case class to represent the command-line options and use the `scopt` library to process them.

A case class to hold and reference our command line args:

case class CommandLineArgs(
  cassandra: String = "", // required
  keyspace: String = "gameofthrones", // default is gameofthrones
  limit: Int = 10 // default is 10
)

Next, let's update the code to use this case class and set the possible values from the command line:

val parser = new scopt.OptionParser[CommandLineArgs]("spark-cassandra-example") {
  head("spark-cassandra-example", "1.0")
  opt[String]('c', "cassandra").required().valueName("<cassandra-host>").
    action((x, c) => c.copy(cassandra = x)).
    text("Setting cassandra is required")
  opt[String]('k', "keyspace").action( (x, c) =>
    c.copy(keyspace = x) ).text("keyspace is a string with a default of `gameofthrones`")
  opt[Int]('l', "limit").action( (x, c) =>
    c.copy(limit = x) ).text("limit is an integer with default of 10")
}

So, there are a few interesting parts in this code. Can you tell which command line option is required? Can you tell which takes an Int vs. a String? Sure, you can. I have confidence in you.
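
To make the parser's behavior concrete, here is a quick, hypothetical check of it outside of Spark; the host value is just a placeholder. In `scopt`, `parse` returns an `Option` of our case class:

parser.parse(Seq("--cassandra", "127.0.0.1", "--limit", "5"), CommandLineArgs())
// => Some(CommandLineArgs("127.0.0.1", "gameofthrones", 5)); keyspace falls back to its default

parser.parse(Seq("--keyspace", "got"), CommandLineArgs())
// => None; the required --cassandra option is missing, so scopt prints an error and the usage text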

Ok, finally, we need to update the code to make a decision based on whether the command line arg parsing succeeded or not. We do that with a pattern match, as shown here:

parser.parse(args, CommandLineArgs()) match {
  case Some(config) =>
    // do stuff
  case None =>
    // failed
}

In this example, `config` is our `CommandLineArgs` case class, which is available on success. If the command line arg parsing succeeded, our code will enter the `Some(config)` match and "do stuff". There we can use the parsed values such as `config.keyspace`.

The alternative match is `None`. In this match, we know the command line arg parsing failed (`scopt` will have already printed an error and the usage text), so the code should exit.
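
To show how all the pieces fit together, here is a minimal sketch of a complete program, assuming the same Spark Cassandra setup as the earlier example. The `battles` table name and the simple `take`/`println` processing are placeholder assumptions for illustration, not the actual example code:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object SparkCassandraExampleApp {

  case class CommandLineArgs(
    cassandra: String = "",             // required
    keyspace: String = "gameofthrones", // default keyspace
    limit: Int = 10                     // default row limit
  )

  def main(args: Array[String]) {

    val parser = new scopt.OptionParser[CommandLineArgs]("spark-cassandra-example") {
      head("spark-cassandra-example", "1.0")
      opt[String]('c', "cassandra").required().valueName("<cassandra-host>").
        action((x, c) => c.copy(cassandra = x)).
        text("Setting cassandra is required")
      opt[String]('k', "keyspace").action((x, c) =>
        c.copy(keyspace = x)).text("keyspace is a string with a default of `gameofthrones`")
      opt[Int]('l', "limit").action((x, c) =>
        c.copy(limit = x)).text("limit is an integer with default of 10")
    }

    parser.parse(args, CommandLineArgs()) match {
      case Some(config) =>
        // parsing succeeded: build the SparkConf from the parsed values
        val conf = new SparkConf()
          .setAppName("SparkCassandraExampleApp")
          .set("spark.cassandra.connection.host", config.cassandra)
        val sc = new SparkContext(conf)
        // placeholder usage: read from the configured keyspace and apply the limit
        sc.cassandraTable(config.keyspace, "battles")
          .take(config.limit)
          .foreach(println)
        sc.stop()
      case None =>
        // arg parsing failed; scopt has already printed the error and usage text
        sys.exit(1)
    }
  }
}

When submitting, the application arguments simply go after the application jar; the master, class, and jar path below are placeholders for your own environment and build output:

spark-submit --master local[2] \
  --class SparkCassandraExampleApp \
  ./target/scala-2.10/spark-cassandra-example.jar \
  --cassandra 127.0.0.1 --keyspace gameofthrones --limit 20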

That's it. Hopefully, this simple example helps you move ahead with Spark command line argument parsing and usage in Scala. For more details and more options with `scopt`, check out the site via the reference link below, or let me know if you have any questions or comments.

For the complete updated example, see the updated source file at https://github.com/tmcgrath/spark-scala/blob/5f286e6543c87ff3d0cd64ade55f85ac32939007/got-battles/src/main/scala/com/supergloo/SparkCassandra.scala

Spark Submit Command Line References

[1] For more on `scopt`, check out https://github.com/scopt/scopt

Featured image credit https://flic.kr/p/QFCGLQ
