Spark Broadcast Variables When and Why

Apache Spark broadcast variables are read-only values made available to all nodes in the cluster. They cache a value in memory on each node so that tasks running there can access it efficiently. For example, broadcast variables are useful when a large value is needed by every Spark task. By using […]

Spark FAIR Scheduler Example

Scheduling in Spark can be a confusing topic. When someone says “scheduling” in Spark, do they mean scheduling applications running on the same cluster? Or do they mean the internal scheduling of tasks within a Spark application? So, before we cover an example of using the Spark FAIR Scheduler, let’s make sure we’re on […]

Apache Spark Thrift Server Load Testing Example

Wondering how to perform stress tests with Apache Spark Thrift Server? This tutorial describes one way to do it. What is Apache Spark Thrift Server? Apache Spark Thrift Server is based on Apache HiveServer2 and was created to allow JDBC/ODBC clients to execute SQL queries using a Spark cluster. From my […]

Spark Thrift Server with Cassandra Example

With the Spark Thrift Server, you can do more than you might have thought possible. For example, want to use `joins` with Cassandra? Or help people familiar with SQL leverage your Spark infrastructure without having to learn Scala or Python? They can use the SQL-based tools they already know, such as Tableau or […]

Spark Submit Command Line Arguments

The primary reason to use Spark submit command line arguments is to avoid hard-coding values in our code. Hard-coding should be avoided because it makes an application rigid. For example, let’s assume we want to run our Spark job in both test and production environments. […]

Spark RDD – A Two Minute Guide for Beginners

What is Spark RDD? Spark RDD is short for Apache Spark Resilient Distributed Dataset. RDDs are a foundational component of the Apache Spark large-scale data processing framework. A Spark RDD is an immutable, fault-tolerant, and possibly distributed collection of data elements. RDDs may […]

Apache Spark Advanced Cluster Deploy Troubleshooting

In this Apache Spark cluster troubleshooting tutorial, we’ll review a few options for when your Scala Spark code does not deploy as anticipated. For example, does your Spark driver program rely on a third-party JAR that is only compatible with Scala 2.11, while your Spark cluster is based on Scala 2.10? Maybe your code relies on a […]

Apache Spark with Amazon S3 Examples

This post shows several ways to access files stored on Amazon S3 from Apache Spark. Examples of working with text files on Amazon S3 are shown in both Scala and Python, using the spark-shell for Scala and an IPython notebook for Python. To begin, you should know there are multiple ways to access S3 […]

How To: Apache Spark Cluster on Amazon EC2 Tutorial

How do you set up and run an Apache Spark cluster on EC2? This tutorial walks you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. The tutorial includes every step I took, regardless of whether it failed or succeeded. While your […]

What is Apache Spark?

Becoming productive with Apache Spark requires an understanding of a few fundamental elements. In this post, let’s explore the fundamentals, or building blocks, of Apache Spark, using descriptions and real-world examples. The intention is for you to understand basic Spark concepts. It assumes you are familiar with installing software […]