Spark Tutorials


Spark Tutorials with Scala and Python

You may wish to jump directly to the list of tutorials, or keep reading if you are new to Apache Spark.

What is Apache Spark?

Apache Spark is an open-source big data processing framework written primarily in Scala that runs on the Java Virtual Machine (JVM). Spark is known for its speed, ease of use, and sophisticated analytics. It was originally developed in 2009 at UC Berkeley’s AMPLab and open sourced in 2010. It was donated to the Apache Software Foundation in 2013.

Apache Spark provides an interface to a data structure called the Resilient Distributed Dataset (RDD). RDDs provide an abstraction over a diverse set of possible data sources, including structured, semi-structured, and unstructured data. Examples of possible datasets include any Hadoop-compatible input source, text files, graph data, relational databases, JSON, CSV, and NoSQL databases, as well as real-time streaming data from providers such as Kafka and Amazon Kinesis.

Providing a consistent interface to a multitude of input sources is one of the features that makes Spark attractive. It’s especially beneficial to organizations attempting to find value in large and inconsistent data sets. Additional features and benefits will be covered later in this tutorial.

At the end of this tutorial, readers will have an understanding of what Spark is, why it is gaining popularity in big data processing, how to use it, and when it may be an appropriate solution.

Why Spark?

Spark is an evolutionary step in how we address the challenges of big data. An earlier step in big data processing was Hadoop, whose primary processing abstraction is MapReduce, typically written in Java. MapReduce was an advance in big data processing. A MapReduce program reads input data from disk, applies a `map` function across the input data, applies a `reduce` to the results of the map, and finally stores the reduction results on disk.
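To make the `map` and `reduce` steps concrete, here is a minimal sketch of a word count written in plain Python. It is not Hadoop code and uses no Hadoop or Spark API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are hypothetical, and the grouping step between map and reduce is included here only so the example runs end to end. The input is held in memory as an assumption, whereas a real MapReduce job would read from and write to disk.

```python
from collections import defaultdict
from functools import reduce


def map_phase(lines):
    """Map step: emit a (word, 1) pair for every word in every input line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key so each key can be reduced independently."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse the list of values for each key into one count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three steps in order: map, group by key, reduce."""
    return reduce_phase(shuffle(map_phase(lines)))


# Hypothetical in-memory input standing in for files read from disk.
sample_lines = [
    "spark builds on ideas from hadoop",
    "hadoop popularized map reduce",
]
counts = word_count(sample_lines)
```

In this sketch `counts` is a dictionary mapping each word to the number of times it appears across the sample lines.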

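Spark was developed in response to the limitations of the MapReduce paradigm. In particular, reading from and writing to disk for every job is costly for multi-step and iterative workloads, whereas Spark can keep intermediate data in memory between steps.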

Fundamentals of Apache Spark

A core construct of Spark is the data abstraction layer called the Resilient Distributed Dataset (RDD). Developers, engineers, data scientists, and RDD-compatible tool vendors work with RDDs through two categories of Spark API functions: Transformations and Actions. A transformation lazily describes a new RDD derived from an existing one, while an action triggers the computation and returns a result.

RDDs, Spark transformations, and actions should be understood first. From there, the learning journey may continue with hands-on practice, with architectural questions and possible solutions, or both.
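The split between transformations and actions can be illustrated with a small toy class in plain Python. This is only a sketch of the idea and not Spark's implementation: `ToyDataset` and `square` are hypothetical names, and nothing here is distributed or fault tolerant. The method names `map`, `filter`, `collect`, and `count` were chosen to mirror methods of the same names on Spark RDDs. Transformations only record what should happen, and work is done only when an action is called.

```python
class ToyDataset:
    """Hypothetical stand-in for an RDD; not part of any Spark API."""

    def __init__(self, source, steps=()):
        self._source = source  # underlying data; must be re-iterable
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return ToyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return ToyDataset(self._source, self._steps + (("filter", predicate),))

    def _run(self):
        """Replay the recorded steps over the source as a lazy pipeline."""
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: execute the recorded steps and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = ToyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # transformations alone ran nothing
result = even_squares.collect()   # the action triggers the computation
```

Because the toy keeps no cached results, calling another action such as `even_squares.count()` replays the recorded steps from the source.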

The following tutorial articles focus on these fundamentals

Transformation and action functions are accessible through the Java, Scala, and Python APIs. This choice of languages is another attractive feature of the framework. The next section covers the breakdown of the Spark API by language.

How to Be Productive with Spark

Spark APIs are available for Java, Scala, and Python. The language to choose is highly dependent on the skills of your engineering teams and possibly on corporate standards or guidelines. Many data engineering teams choose Scala or Java for their type safety, performance, and functional capabilities. Python is a great choice for data science teams looking to explore large datasets in an easier-to-use, more forgiving language.

Depending on your situation, you may even find yourself using a combination of Java, Scala, and Python in Spark environments. In any case, we have you covered with the following tutorials, organized by language.

Python and Scala

Apache Spark Ecosystem Components

In addition to the previously described features and benefits, Spark is gaining popularity because of its vibrant ecosystem of components, which augment Spark Core. The following components are available for Spark:

Spark vs. Hadoop?

A common misconception is that Spark and Hadoop are competitors. If the conversation is about whether Spark and MapReduce are competing approaches to processing big data, then, yeah, the answer could easily be yes. MapReduce and Spark approach the problem of processing large amounts of data differently. So, in one sense, they compete.

But it’s just as important to know that the relationship between Spark and Hadoop is symbiotic. Spark is able to leverage existing Hadoop-based infrastructure. First and foremost, Spark can utilize YARN and HDFS, which are the heart of most Hadoop environments. In addition, Spark is able to integrate with Hive and HBase without jumping through a million hoops.

As you learn more about Apache Spark, Hadoop-related questions will eventually arise. While Spark presents an alternative to MapReduce, Hadoop constructs such as YARN and HDFS are still valuable in Spark-based solutions today and will be for the foreseeable future.
