How to Use Spark Submit Command to Deploy

Spark Submit Command Tutorial

Running spark-submit to deploy your application to an Apache Spark cluster is a required step towards Apache Spark proficiency.  As covered elsewhere on this site, a spark-submit deploy can target a variety of cluster managers, such as a YARN-based Spark cluster running in Cloudera, Hortonworks, or MapR, or even Kubernetes.  There … Read more
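As a minimal sketch, the kind of application you might package and deploy looks like the snippet below; the class name, jar path, and master shown in the comments are assumptions for illustration only.

```scala
// Minimal application sketch for a spark-submit deploy.
// Class name, jar path, and master below are assumptions, not prescriptions.
import org.apache.spark.sql.SparkSession

object SparkSubmitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-submit-example")
      .getOrCreate()                 // the master is supplied by spark-submit, not hard-coded

    val counts = spark.sparkContext
      .parallelize(Seq("a", "b", "a"))
      .countByValue()
    counts.foreach(println)

    spark.stop()
  }
}

// Deployed, for example, with:
//   spark-submit --class SparkSubmitExample \
//     --master yarn --deploy-mode cluster \
//     target/scala-2.12/spark-submit-example.jar
```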

SparkSession, SparkContext, SQLContext in Spark [What’s the difference?]

How to choose between SparkContext, SQLContext and SparkSession

There have been some significant changes in the Apache Spark API over the years, and when folks new to Spark begin reviewing source code examples, they will see references to SparkSession, SparkContext, and SQLContext. Because this code looks so similar in design and purpose, users often ask questions such as “what’s the difference” and “why, … Read more
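As a quick sketch of how the three relate in Spark 2.x and later: SparkSession is the single entry point, and the older SparkContext and SQLContext remain reachable from it. The local master setting below is an assumption for demonstration.

```scala
// SparkSession vs. SparkContext vs. SQLContext, in one place.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("entry-point-comparison")
  .master("local[*]")                  // assumption: a local run for demonstration
  .getOrCreate()

val sc         = spark.sparkContext    // the underlying SparkContext (RDD API)
val sqlContext = spark.sqlContext      // legacy SQLContext, kept for backwards compatibility

// The same kind of work expressed through two entry points:
val rdd = sc.parallelize(Seq(1, 2, 3))   // SparkContext -> RDDs
val df  = spark.range(3)                 // SparkSession -> DataFrames/Datasets
println(s"RDD count: ${rdd.count()}, DataFrame count: ${df.count()}")
```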

What is Apache Spark? An Essential Overview

What is Apache Spark?

Apache Spark is an open-source data processing engine designed for fast, large-scale data processing. Originally developed at the University of California, Berkeley, in 2009 as an alternative to the Hadoop MapReduce batch processing framework, Spark quickly became one of the most popular frameworks in big data analytics. Spark’s main advantage lies in its ability to … Read more

Spark FAIR Scheduler Example

Spark FAIR Scheduler Example

Scheduling in Spark can be a confusing topic.  When someone says “scheduling” in Spark, do they mean scheduling applications running on the same cluster?  Or, do they mean the internal scheduling of Spark tasks within the Spark application?  So, before we cover an example of utilizing the Spark FAIR Scheduler, let’s make sure we’re on … Read more
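For the second sense of “scheduling,” a minimal sketch of switching the within-application job scheduler from the default FIFO mode to FAIR looks like the following; the pool name `pool_a` is an assumption, and pools are normally defined in a fairscheduler.xml file.

```scala
// Enabling the FAIR scheduler for jobs inside a single Spark application.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fair-scheduler-example")
  .master("local[*]")                          // assumption: local run
  .config("spark.scheduler.mode", "FAIR")      // default is FIFO
  .getOrCreate()

// Jobs submitted from this thread are assigned to the named pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_a")
spark.range(1000000L).count()

// Clearing the property returns subsequent jobs to the default pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
```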

Apache Spark Thrift Server Load Testing Example

Spark Thrift Server Stress Test Tutorial

Wondering how to perform stress tests with Apache Spark Thrift Server?  This tutorial will describe one way to do it. What is Apache Spark Thrift Server?   Apache Spark Thrift Server is based on Apache HiveServer2, which was created to allow JDBC/ODBC clients to execute SQL queries using a Spark cluster.  From my … Read more
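The basic unit of such a test is a single JDBC query against the Thrift Server; a load test would run many of these concurrently. The host, port, and table name below are assumptions for illustration.

```scala
// One JDBC query against the Spark Thrift Server (HiveServer2 protocol).
import java.sql.DriverManager

Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
try {
  val stmt = conn.createStatement()
  val rs   = stmt.executeQuery("SELECT COUNT(*) FROM some_table")   // hypothetical table
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
```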

Spark Thrift Server with Cassandra Example

Spark Thrift Server with Cassandra

With the Spark Thrift Server, you can do more than you might have thought possible.  For example, want to use `joins` with Cassandra?  Or, help people familiar with SQL leverage your Spark infrastructure without having to learn Scala or Python?  They can use SQL-based tools they already know, such as Tableau or … Read more
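A minimal sketch of the idea, using the spark-cassandra-connector to expose a Cassandra table to Spark SQL so that joins and queries can be run from plain SQL clients; the connection host, keyspace, and table name are assumptions.

```scala
// Expose a Cassandra table as a Spark SQL view via the spark-cassandra-connector.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cassandra-sql-example")
  .master("local[*]")                                        // assumption: local run
  .config("spark.cassandra.connection.host", "127.0.0.1")    // assumption
  .getOrCreate()

val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "demo", "table" -> "users"))    // hypothetical keyspace/table
  .load()

// Registered as a temp view, the Cassandra data can be joined with other sources
// in plain SQL, which is what JDBC/ODBC clients behind the Thrift Server rely on.
users.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users LIMIT 10").show()
```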

Spark Submit Command Line Arguments

Spark Command Line Arguments in Scala

The primary reason we want to use Spark submit command line arguments is to avoid hard-coding values into our code. Hard-coding should be avoided because it makes our application rigid and less flexible. For example, let’s assume we want to run our Spark job in both test and production environments. … Read more
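As a minimal sketch, arguments placed after the application jar on the spark-submit line are passed straight to `main()`; the environment and input-path arguments below are hypothetical.

```scala
// Read application arguments instead of hard-coding values.
import org.apache.spark.sql.SparkSession

object ArgsExample {
  def main(args: Array[String]): Unit = {
    require(args.length >= 2, "usage: ArgsExample <environment> <inputPath>")
    val environment = args(0)
    val inputPath   = args(1)

    val spark = SparkSession.builder()
      .appName(s"args-example-$environment")
      .getOrCreate()

    val df = spark.read.text(inputPath)   // path supplied at submit time, not hard-coded
    println(s"[$environment] line count: ${df.count()}")
    spark.stop()
  }
}

// For example (hypothetical jar and path):
//   spark-submit --class ArgsExample app.jar test /data/input.txt
```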

Spark RDD – A 2 Minute Guide for Beginners

Spark RDD

What is a Spark RDD? RDD is short for Resilient Distributed Dataset, so a Spark Resilient Distributed Dataset is often shortened to simply Spark RDD.  RDDs are a foundational component of the Apache Spark large-scale data processing framework. A Spark RDD is an immutable, fault-tolerant, and potentially distributed collection of data elements.  RDDs … Read more
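A minimal sketch of working with an RDD, assuming an existing SparkSession named `spark` (for example, from spark-shell):

```scala
// Create an RDD, apply lazy transformations, then trigger work with an action.
val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))

// Transformations are lazy and return new, immutable RDDs...
val doubled = rdd.map(_ * 2)
val evens   = doubled.filter(_ % 4 == 0)

// ...while actions trigger the actual distributed computation.
println(evens.collect().mkString(", "))
```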

Apache Spark with Amazon S3 Examples

Apache Spark with Amazon S3 setup

This post will show ways and options for accessing files stored on Amazon S3 from Apache Spark.  Examples of text file interaction with Amazon S3 will be shown in both Scala and Python, using the spark-shell for Scala and an IPython notebook for Python. To begin, you should know there are multiple ways to access S3 … Read more
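One of those ways, sketched minimally below, is the `s3a` connector with credentials taken from environment variables; the bucket and key path are placeholders, and the hadoop-aws package is assumed to be on the classpath.

```scala
// Read a text file from S3 over the s3a connector.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("s3-read-example")
  .master("local[*]")                    // assumption: local run
  .getOrCreate()

// Credentials can also come from instance profiles or other providers.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

val lines = spark.read.textFile("s3a://some-bucket/some-prefix/file.txt")  // hypothetical path
println(lines.count())
```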