How to Debug Scala Spark in IntelliJ

Have you struggled to configure debugging in IntelliJ for your Spark programs? Yeah, me too. Debugging Scala code was easy, but when I moved to Spark things didn’t work as expected. So, in this tutorial, let’s cover debugging Scala-based Spark programs in IntelliJ. We’ll go through a few examples and utilize the occasional help […]

Spark Broadcast and Accumulator Examples

On this site, we’ve learned about distributing processing tasks across a Spark cluster. But let’s go a bit deeper into a couple of approaches you may need when designing distributed tasks. I’d like to start with a question. What do we do when we need each Spark worker task to coordinate certain variables and values with […]

IntelliJ Scala and Apache Spark

In this tutorial, we’re going to review one way to set up IntelliJ for Scala and Spark development. The IntelliJ Scala combination is the best free setup for Scala and Spark development. And I have nothing against ScalaIDE (Eclipse for Scala) or using editors such as Sublime. I switched from […]

Apache Spark with Cassandra and Game of Thrones

Apache Spark with Cassandra is a powerful combination in data processing pipelines. In this tutorial, we will build a Scala application with Spark and Cassandra with battle data from Game of Thrones. Now, we’re not going to make any show predictions! But, we will show the most aggressive kings as well as kings which […]

Spark Scala with 3rd Party JARs Deploy to a Cluster

In this Apache Spark cluster deploy tutorial, we’ll cover how to deploy Spark driver programs to a Spark cluster when the driver program utilizes third-party jars. In this case, we’re going to use code examples from previous Spark SQL and Spark Streaming tutorials. At the end of this tutorial, there is a screencast of […]

Apache Spark Cluster Part 2: Deploy Scala Program

How do you deploy a Scala program to a Spark cluster? In this tutorial, we’ll cover how to build, deploy, and run a Scala driver program on a Spark cluster. The focus will be on a simple example in order to gain confidence and set the foundation for more advanced examples in the future. To […]

Apache Spark Cluster Part 1: Run Standalone

Running an Apache Spark Cluster on your local machine is a natural and early step towards Apache Spark proficiency.  As I imagine you are already aware, you can use a YARN-based Spark Cluster running in Cloudera, Hortonworks or MapR.  There are numerous options for running a Spark Cluster in Amazon, Google or Azure as well.  […]

Apache Spark Examples of Actions in Scala

When using Spark API “action” functions, a result is returned to the Spark Driver. Computing this result will trigger evaluation of any of the RDDs, DataFrames or DataSets needed in order to produce the result. Recall that Spark Transformations such as map, flatMap, and other transformations are used to create RDDs, DataFrames […]

Apache Spark Transformations in Scala Examples

Spark Transformations produce a new Resilient Distributed Dataset (RDD) or DataFrame or DataSet depending on your version of Spark. Resilient distributed datasets are Spark’s main and original programming abstraction for working with data distributed across multiple nodes in your cluster. RDDs are automatically parallelized across the cluster. In the Scala […]