Spark Command Line Arguments in Scala Example

Spark Command Line Arguments in Scala

The primary reason why we want to use Spark command line arguments is to avoid hard-coding values into our code. As we know, hard-coding should be avoided because it makes our application more rigid and less flexible.

For example, let’s assume we want to run our Spark job in both test and production environments. Let’s further assume that our Spark job reads from Cassandra database and the databases between test and production are different. In this example, we should prefer using dynamic configuration values when submitting the job to test vs production environments.  The alternative is hard-coding the Cassandra host connection in code, recompiling and deploying.  This approach wastes time.

So, how do we process Spark command line arguments in your Scala code?  I would think this would be easy by now. But, I’ve been surprised at how difficult this can be when people are new to Scala and Spark.

Of course, one way to achieve command line arg parsing is querying the String Array which is part of your `main` function. For example, perhaps your code looks similar to:

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("SparkCassandraExampleApp")

  if (args.length > 0) conf.setMaster(args(0))

  if (args.length >= 1) conf.set("spark.cassandra.connection.host", args(1))

...

But this isn’t good.  It’s brittle because we are not using default values and `args` values depend on a specific ordering.  Let’s address these two weaknesses in our solution.

In this tutorial, I’ll present a simple example of a flexible and scalable way to process command-line args in your Scala based Spark jobs. We’re going to use `scopt` library [1] and update our previous Spark with Cassandra example.

To update our previous Spark Cassandra example to use command-line arguments, we’re going to update two areas of the project: the SBT build file and our code.

Step 1 Update SBT Build File for Scopt Command Line Option Parsing Library

Easy.  Essential from:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0"
)

to:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0",
   "com.github.scopt" %% "scopt" % "3.5.0"
)

 

Step 2 Update Spark Scala code to process command line args

There are multiple areas of the code we need to update.  First, we’re going to update our code to use a case class to represent command-line options and utilize the `scopt` library to process.

A case class to hold and reference our command line args:

case class CommandLineArgs (
  cassandra: String = "", // required
  keyspace: String = "gameofthrones", // default is gameofthrones
  limit: Int = 10
)

Next, let’s update the code to use this case class and set possible values from command line

val parser = new scopt.OptionParser[CommandLineArgs]("spark-cassandra-example") {
  head("spark-cassandra-example", "1.0")
  opt[String]('c', "cassandra").required().valueName("<cassandra-host>").
    action((x, c) => c.copy(cassandra = x)).
    text("Setting cassandra is required")
  opt[String]('k', "keyspace").action( (x, c) =>
    c.copy(keyspace = x) ).text("keyspace is a string with a default of `gameofthrones`")
  opt[Int]('l', "limit").action( (x, c) =>
    c.copy(limit = x) ).text("limit is an integer with default of 10")
}

So, there are a few interesting parts in this code.  Can you tell which command line variable is required?  Can you tell which requires an Int vs. String?  Sure, you can.  I have confidence in you.

Ok, finally we need to update the code to make a decision based on whether the command line arg parsing succeeded or not.  The way we do that is through a pattern match as shown here

parser.parse(args, CommandLineArgs()) match {

  case Some(config) =>
  // do stuff
  case None => // failed

}

In this example, `config` is our `CommandLineArgs` case class which is available on success.  If the command line arg parsing succeeded, our code will enter the `Some(config)` match and “do stuff”.  There we can use the vars such as `config.keyspace`.

The alternative match is `None`.  In this match, we know the command line arg parsing failed, so the code should exit.

 

That’s it.  Hopefully, this simple example helps you move ahead with Spark command line argument parsing and usage in Scala.  For more details and exploring more options using `scopt` check out the site via reference link below or let me know if you have any questions or comments.

For the complete updated example, see this commit or cut-and-paste https://github.com/tmcgrath/spark-scala/blob/5f286e6543c87ff3d0cd64ade55f85ac32939007/got-battles/src/main/scala/com/supergloo/SparkCassandra.scala

 

References

[1] For more on `scopt` checkout https://github.com/scopt/scopt

 

Featured image credit https://flic.kr/p/QFCGLQ

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.