Spark FAIR Scheduler Example

Scheduling in Spark can be a confusing topic.  When someone says “scheduling” in Spark, do they mean scheduling applications running on the same cluster?  Or do they mean the internal scheduling of Spark tasks within a Spark application?  So, before we cover an example of utilizing the Spark FAIR Scheduler, let’s make sure we’re on the same page with regard to Spark scheduling.

In this tutorial, we’re going to cover an example of how we schedule certain processing within our application with higher priority and potentially more resources.

What is the Spark FAIR Scheduler?

By default, Spark’s internal scheduler runs jobs in FIFO fashion.  When we use the term “jobs” in describing the default scheduler, we are referring to internal Spark jobs within the Spark application.  The word “jobs” is often used interchangeably for a Spark application and a Spark job, but applications and jobs are two very different constructs.  “Oyy yoy yoy” as my grandma used to say when things became more complicated.  Sometimes it’s difficult to translate Spark terminology.  We are talking about jobs in this post.

Anyhow, as we know, jobs are divided into stages and the first job gets priority on all available resources. Then, the second job gets priority, etc.  As a visual review, the following diagram shows what we mean by jobs and stages.  

Spark Fair Scheduler
Spark Internals

Notice how there are multiple jobs.  We can confirm this from the “Jobs” tab in the Spark UI as well.

If the jobs at the head of the queue are long-running, then later jobs may be delayed significantly.

This is where the Spark FAIR scheduler comes in…

The FAIR scheduler supports the grouping of jobs into pools.  It also allows setting different scheduling options (e.g. weight) for each pool. This can be useful to create high priority pools for some jobs vs others.  This approach is modeled after the Hadoop Fair Scheduler.

 

How do we UTILIZE the Spark FAIR Scheduler?

Let’s run through an example of configuring and implementing the Spark FAIR Scheduler.  The following are the steps we will take:

  • Run a simple Spark Application and review the Spark UI History Server
  • Create a new Spark FAIR Scheduler pool in an external XML file
  • Set the `spark.scheduler.pool` to the pool created in external XML file
  • Update code to use threads to trigger use of FAIR pools and rebuild
  • Re-deploy the Spark Application with
    • `spark.scheduler.mode` configuration variable set to FAIR
    • `spark.scheduler.allocation.file` configuration variable pointing to the XML file
  • Run and review Spark UI History Server

 

Here’s a screencast of me running through all these steps:

How to Spark FAIR scheduler

Also, for more context, I’ve outlined all the steps below.

Run a simple Spark Application with default FIFO settings

In this tutorial on Spark FAIR scheduling, we’re going to use a simple Spark application.  The code reads in a bunch of CSV files (about 850MB total), calls `count`, and prints out values.  In the screencast above, I was able to verify the use of pools in the regular Spark UI, but if your verification application completes quickly, you may want to utilize the Spark History Server to review the metrics afterwards.  (By the way, see the Spark Performance Monitor with History Server tutorial for more information on the History Server.)
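
If you haven’t set up the History Server yet, the short version is that the application needs event logging enabled and the History Server needs to be started.  A minimal sketch (the event log directory below is just an example; both the application and the History Server must point at the same location):

# in $SPARK_HOME/conf/spark-defaults.conf (or passed via --conf on spark-submit)
spark.eventLog.enabled  true
spark.eventLog.dir      file:///tmp/spark-events

# then start the History Server, which reads file:/tmp/spark-events by default
$SPARK_HOME/sbin/start-history-server.sh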

Create a new Spark FAIR Scheduler pool

There is more than one way to create FAIR pools.  In this example, we will create a new file with the following content

<?xml version="1.0"?>

<allocations>
  <pool name="fair_pool">
    <schedulingMode>FAIR</schedulingMode>
    <weight>2</weight>
    <minShare>4</minShare>
  </pool>
  <pool name="a_different_pool">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>

Save this file to the file system so we can reference it later.

A note about the file options.  Hopefully it’s obvious, but we configure pools in the `pool` nodes and give each one a name.  Then we have three options for each pool:

  • `schedulingMode` — which is either FAIR or FIFO
  • `weight` — Controls this pool’s share of the cluster relative to other pools. Pools have a weight of 1 by default. Giving a specific pool a weight of 2, for example, means it will get 2x the resources of other active pools
  • `minShare` — A minimum share of CPU cores the pool is guaranteed to be allocated

 

Update code to utilize the new FAIR pools

The code in use can be found on my work-in-progress Spark 2 repo
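
If you’d rather not dig through the repo, here’s a minimal sketch of the idea (assuming an existing SparkContext named `sc`; the CSV paths are placeholders).  Because `spark.scheduler.pool` is a thread-local property, jobs submitted from separate threads can land in separate pools:

import org.apache.spark.SparkContext

// hypothetical helper: run a count in a named FAIR pool from its own thread
def countInPool(sc: SparkContext, pool: String, path: String): Thread = {
  val t = new Thread {
    override def run(): Unit = {
      // jobs triggered from this thread are assigned to the named pool
      sc.setLocalProperty("spark.scheduler.pool", pool)
      println(s"$pool count: " + sc.textFile(path).count())
    }
  }
  t.start()
  t
}

val threads = Seq(
  countInPool(sc, "fair_pool", "data/set-one/*.csv"),
  countInPool(sc, "a_different_pool", "data/set-two/*.csv")
)
threads.foreach(_.join())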

Set Scheduler Configuration During Spark-Submit

We’re going to add two configuration variables when we re-run our application (an example spark-submit is shown after the list):

  • `spark.scheduler.mode` configuration variable to FAIR
  • `spark.scheduler.allocation.file` configuration variable to point to the previously created XML file
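
For example, the spark-submit might look something like the following (the class name, application jar, and allocation file path are placeholders for your own):

$SPARK_HOME/bin/spark-submit --class com.example.SimpleFairApp --master spark://<spark-master>:7077 --conf spark.scheduler.mode=FAIR --conf spark.scheduler.allocation.file=/path/to/fair-pools.xml your-app-assembly.jar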

 

Verify Pools are being utilized

Let’s go back to the Spark UI and review while the updated application with new spark-submit configuration variables is running.  We can now see the pools are in use!  Just in case you had any doubt along the way, I did believe we could do it.  Never doubted it.

 

Conclusion

I hope this simple tutorial on using the Spark FAIR Scheduler was helpful.  If you have any questions or suggestions, please let me know in comments section below.


 

 

Featured image credit https://flic.kr/p/qejeR3

 

Apache Spark Thrift Server Load Testing Example

Spark Thrift Server Stress Test Tutorial

Wondering how to perform stress tests against Apache Spark Thrift Server?  This post will describe one way to do it.

What is Apache Spark Thrift Server?  

Apache Spark Thrift Server is based on the Apache HiveServer2 which was created to allow JDBC/ODBC clients to execute SQL queries using a Spark Cluster.  From my experience, these “clients” are typically business intelligence tools such as Tableau and they are most often only a portion of the overall Spark architecture.  In other words, the Spark cluster is primarily used for streaming and batch aggregation jobs and any JDBC/ODBC client access via Thrift Server to the cluster is secondary at best.

For more information on Apache Spark Thrift Server and an example use case, see the previous Apache Spark Thrift Server with Cassandra post.

Apache Spark Thrift Server Load Testing Example Overview

How do we simulate anticipated load on our Apache Spark Thrift Server?  In this post, we are going to use an open source tool called Gatling.  Check out the References section at the bottom of this post for links to Gatling.

At a high level, this Spark Thrift with Gatling tutorial will run through the following steps:
  1. Confirm our environment (Spark, Cassandra, Thrift Server)
  2. Compile our Gatling based load testing code
  3. Run a sample Spark Thrift load test

 

Setup and configure our environment of Spark, Cassandra, and Thrift Server

If you are at the point of load testing Apache Spark Thrift Server, I’m going to assume you are already familiar with the setup of Spark and Cassandra, or some other backend such as Hive or Parquet.  Therefore, I’m going to just run through the steps to start everything up in my local environment.  Adjust the following to best match your environment.

Confirm Environment
1. Start Cassandra

For this tutorial, we’re going to use the killrweather sample keyspace and queries created in the previous Apache Spark Thrift Server with Cassandra post.  You need to go through that tutorial first.  This post assumes you have already created and loaded the data, so all we need to do now is start Cassandra if it is not already running.

`$CASSANDRA_HOME/bin/cassandra`

2. Start your Spark Master and at least one Worker

If your Spark cluster is not already running, then start it up

`$SPARK_HOME/sbin/start-master.sh`

`$SPARK_HOME/sbin/start-slave.sh spark://<spark-master>:7077`

3. Start the Thrift Server and set configuration for Cassandra

`$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077`

 

Obtain Sample Apache Spark Thrift Server Load Tests

Clone the repo https://github.com/tmcgrath/gatling-sql

This repo contains a Gatling extension I wrote.  This extension is what will allow us to load test the Spark Thrift Server.

See the src/main/resources/application.conf file for default Spark Thrift connection settings and adjust as needed.

The simulation to run is src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala; how to run it is covered in the next section.

Run Load Tests

You can run the src/test/scala/io/github/gatling/sql/example/ThriftServerSimulation.scala simulation by invoking the Maven `test` task, or by building a jar and using the included `launch.sh` script.
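
For example, a minimal run via Maven might look like this (assuming Maven is installed and the defaults in application.conf point at your Thrift Server):

git clone https://github.com/tmcgrath/gatling-sql.git
cd gatling-sql
mvn test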

Conclusion

Hopefully, this tutorial on load testing Apache Spark Thrift Server helps get you started.

If you have any questions or ideas for corrections, let me know in the comments below.

 

References

 

Featured Image credit https://flic.kr/p/e5hWaC

 

Spark Thrift Server with Cassandra Example

Want to use `joins` with Cassandra?  Or help people already familiar with SQL leverage your Spark infrastructure?  They can use SQL-based tools they already know, such as Tableau or even MS Excel.  Maybe both?  This post provides answers.  It describes how to configure the Apache Spark Thrift Server with Cassandra.

First, some quick background on Apache Spark Thrift Server.  

Apache Spark Thrift Server is a port of Apache HiveServer2 which allows JDBC/ODBC clients to execute SQL queries.  From my experience, these “clients” are typically business intelligence (BI) tools such as Tableau or even MS Excel or direct SQL access using their query tool of choice such as Toad, DBVisualizer, SQuirrel SQL Client.

Ok, great, but why do we care?  

The Spark Thrift Server allows clients to utilize the in-memory, distributed architecture of Spark SQL.  In addition, the Thrift Server may be deployed as a Spark application, which allows clients to share the same SparkContext and take advantage of caching across different JDBC/ODBC requests.  Finally, through the Thrift Server and Spark SQL, we may be able to provide relational database concepts such as SQL JOINs in environments in which they are not natively supported, such as Cassandra.

APACHE SPARK THRIFT SERVER with Cassandra Setup
  1. Start Cassandra and Load Sample Data
  2. Optional Start Spark Cluster
  3. Start Spark Thrift Server with Cassandra configuration
  4. Configure Hive Metastore
  5. Verify our environment
  6. Run example SQL
  7. Check out the Spark UI
  8. Small celebration

 

I demo all these steps in a screencast in case it helps.  See the Screencast section below to view.

Overview

If you have a Spark cluster or have downloaded Apache Spark to your laptop, you already have everything you need to run Spark Thrift Server.  Thrift Server is included with Spark.

Also, I assume you are familiar with the setup of Spark and Cassandra.  I don’t walk through these steps.  Instead, I just run through the steps to start everything up in my local environment.  You will need to make some adjustments to the following to best match your environment.  For example, you will need to modify any references to `$SPARK_HOME` in the steps below to match the appropriate directory for your setup.  Ok, enough chit chat, let’s go…

 

1. START CASSANDRA and load sample database

This tutorial assumes you have Cassandra running locally, but it isn’t a requirement.  Your Cassandra cluster can be running someplace other than local.  If Cassandra is running locally, the next step is to load a sample keyspace and some data.  In this tutorial, we’re going to use the CDM tool found at https://github.com/riptano/cdm-java.  Using CDM is really simple.  Just download the pre-built binary, update the permissions to make it executable (i.e. `chmod 755 cdm`) and run `cdm install killrweather`.  I show an example of running it in the screencast below.

2. START YOUR SPARK MASTER AND AT LEAST ONE WORKER (OPTIONAL)

A running Spark cluster is optional in this Spark Thrift tutorial.  If you want to run a minimal cluster with one worker on your laptop, you can perform something similar to the following

`$SPARK_HOME/sbin/start-master.sh`

`$SPARK_HOME/sbin/start-slave.sh spark://<spark-master>:7077`

I bet you already knew this.  Anyhow, movin on…

3. START THE THRIFT SERVER AND SET CONFIGURATION FOR CASSANDRA

If you did not perform step 2 or do not have an available Spark cluster, then run

`$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1`

This will run Thrift Server in local[*] mode which is fine for this quick start.

Alternatively, if you do have a Spark cluster, you can also pass in the --master arg:

`$SPARK_HOME/sbin/start-thriftserver.sh --packages com.datastax.spark:spark-cassandra-connector_2.11:2.0.2 --conf spark.cassandra.connection.host=127.0.0.1 --master spark://<spark-master>:7077`

Notice how we are passing in the Spark Cassandra Connector and the `cassandra.connection.host` config here?  Of course you do, because you do not let details like this get past you.  You are a stickler for details, as they say.

4. Configure Hive Metastore with beeline client

Next, we’re going to update the Hive metastore.  The Hive metastore is what Spark Thrift Server uses to track the connection parameters of registered data sources.  In this example, the data source is Cassandra, of course.

To update the metastore, we’re going to use the Apache Beeline client.  Beeline is a command shell that works with HiveServer2 using JDBC.  It is based on the SQLLine CLI.

`$SPARK_HOME/bin/beeline`

`beeline> !connect jdbc:hive2://localhost:10000`

In my setup, I can disregard the username and password prompts… just hit the enter key at both prompts.

`jdbc:hive2://localhost:10000> CREATE TABLE raw_weather_data USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'isd_weather_data', table 'raw_weather_data');`

and then

`jdbc:hive2://localhost:10000> CREATE TABLE weather_station USING org.apache.spark.sql.cassandra OPTIONS (keyspace 'isd_weather_data', table 'weather_station');`

We registered two tables because we are going to use SQL JOINs in later sections of this Spark Thrift tutorial.  It’s probably noteworthy to mention these tables may be cached as well.  I’m not going to go deep into caching in this example, but leave a comment if you have any questions.
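
If you do want to experiment with caching, a one-liner from the beeline prompt along these lines should do it (this uses Spark SQL’s CACHE TABLE statement):

`jdbc:hive2://localhost:10000> CACHE TABLE raw_weather_data;`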

5. VERIFY OUR SPARK, CASSANDRA AND THRIFT SERVER SETUP

Before running some more complex SQL, let’s just verify our setup with a smoke test.  Still within the connected beeline client, issue a simple SQL statement:

`jdbc:hive2://localhost:10000> select * from raw_weather_data limit 10;`

You should see some results.  10 rows to be precise you cheeky bastard.

6. SQL with Spark Examples

Let’s show more complex SQL examples and take things another step.  I’m going to use SQuirreL SQL client, but the concepts apply to any SQL client.  If you want more detail on setting up SQuirrel with Thrift, see the Reference section below.

Joins?  Joins you say!  No problem

`SELECT ws.name, raw.temperature
FROM raw_weather_data raw
JOIN weather_station ws
ON raw.wsid=ws.id
WHERE raw.wsid = '725030:14732'
AND raw.year = 2008 AND raw.month = 12 AND raw.day = 31;`

Well, well, special, special.  Let’s turn it to 11…

`select weather_station.id,
weather_station.call_sign,
weather_station.country_code,
weather_station.name,
ranked_temp.avg_temp from
(
SELECT wsid, year, month, day, daily.avg_temp,
dense_rank() OVER (PARTITION BY wsid order by avg_temp desc) as rank
FROM
(select wsid, year, month, day, avg(temperature) as avg_temp
FROM raw_weather_data
group by wsid, year, month,day ) daily
) ranked_temp
JOIN weather_station on ranked_temp.wsid = weather_station.id
where rank <= 5;`

 

7. JDBC / ODBC server tab in SPARK UI

Once again, the Spark UI is a valuable resource for us.  Now that we’ve run a few queries, let’s take a look at what we can see…

Open `http://localhost:4040/` in your browser.  You should be redirected to /jobs

If you started Spark Thrift Server with a --master argument and pointed it at a cluster, then you can open http://localhost:8080 and get to this screen as well.

Spark Thrift Server in the Spark UI

Notice the JDBC/ODBC tab.  That’s the Thrift Server part.  You did that.  Check things out.  You can see the query plans, details of the Spark jobs such as stages and tasks, how shuffling went, etc.

8. Small Celebration

Let’s sing

“Let’s celebrate, it’s all right… we gonna have a good time tonight”  — Kool and the Gang

Sing it with me.

CONCLUSION

Hopefully, this Apache Spark Thrift Server tutorial helps get you started.  I’ve also put together a quick screencast of me going through these steps above in case it helps.  See the Screencast section below.

If you have any questions or ideas for corrections, let me know in the comments below.

Update November 2017

I intertwined two different data sources in the above post.  When I originally wrote this, I had a keyspace called `isd_weather_data` from the killrweather reference application.  But, in this tutorial, I used `cdm` to create and load a keyspace called `killrweather`.  You can see this in how the tables are created above, where they reference `isd_weather_data`.  This shouldn’t have been there.  When using `cdm` as described in this tutorial, replace `isd_weather_data` with `killrweather`.  And not only that… the data appears to differ between the two sources.  So, the `join` examples produce results when using isd_weather_data and none when using killrweather.  I’m presuming differences between https://github.com/killrweather/killrweather/blob/master/data/weather_stations.csv and https://github.com/killrweather/killrweather-data/blob/master/data/weather_station.csv are the root cause of no results with joins.  For example, one CSV contains weather_station id `725030:14732`.  Sorry if this causes any confusion.

SCREENCAST

Spark Thrift Server with Cassandra Example. A How-to love story.

 

REFERENCES
  • Spark Thrift Server https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
  • CDM https://github.com/riptano/cdm-java
  • Apache Beeline https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients
  • Killrweather Sample query inspiration https://github.com/killrweather/killrweather/wiki/7.-Spark-and-Cassandra-Exercises-for-KillrWeather-data
  • SQuirrel SQL client setup with Thrift https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-sql-thrift-server.html#SQuirreL-SQL-Client

 

Image credit https://flic.kr/p/4ykKnp

Spark Command Line Arguments in Scala Example

Spark Command Line Arguments in Scala

The primary reason why we want to use Spark command line arguments is to avoid hard-coding values into our code. As we know, hard-coding should be avoided because it makes our application more rigid and less flexible.

For example, let’s assume we want to run our Spark job in both test and production environments.  Let’s further assume that our Spark job reads from a Cassandra database and that the databases for test and production are different.  In this example, we should prefer using dynamic configuration values when submitting the job to test vs production environments.  The alternative is hard-coding the Cassandra host connection in code, recompiling and redeploying.  This approach wastes time.

So, how do we process Spark command line arguments in our Scala code?  I would think this would be easy by now.  But, I’ve been surprised at how difficult this can be for people new to Scala and Spark.

Of course, one way to achieve command line arg parsing is querying the String Array which is part of your `main` function. For example, perhaps your code looks similar to:

def main(args: Array[String]) {

  val conf = new SparkConf().setAppName("SparkCassandraExampleApp")

  if (args.length > 0) conf.setMaster(args(0))

  if (args.length > 1) conf.set("spark.cassandra.connection.host", args(1))

...

But this isn’t good.  It’s brittle because we are not using default values and `args` values depend on a specific ordering.  Let’s address these two weaknesses in our solution.

In this tutorial, I’ll present a simple example of a flexible and scalable way to process command-line args in your Scala based Spark jobs. We’re going to use `scopt` library [1] and update our previous Spark with Cassandra example.

To update our previous Spark Cassandra example to use command-line arguments, we’re going to update two areas of the project: the SBT build file and our code.

Step 1 Update SBT Build File for Scopt Command Line Option Parsing Library

Easy.  Essentially, we change the libraryDependencies from:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0"
)

to:

libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-sql" % "1.6.1",
   "com.datastax.spark" %% "spark-cassandra-connector" % "1.5.0",
   "com.github.scopt" %% "scopt" % "3.5.0"
)

 

Step 2 Update Spark Scala code to process command line args

There are multiple areas of the code we need to update.  First, we’re going to update our code to use a case class to represent command-line options and utilize the `scopt` library to process them.

A case class to hold and reference our command line args:

case class CommandLineArgs (
  cassandra: String = "", // required
  keyspace: String = "gameofthrones", // default is gameofthrones
  limit: Int = 10
)

Next, let’s update the code to use this case class and set possible values from the command line:

val parser = new scopt.OptionParser[CommandLineArgs]("spark-cassandra-example") {
  head("spark-cassandra-example", "1.0")
  opt[String]('c', "cassandra").required().valueName("<cassandra-host>").
    action((x, c) => c.copy(cassandra = x)).
    text("Setting cassandra is required")
  opt[String]('k', "keyspace").action( (x, c) =>
    c.copy(keyspace = x) ).text("keyspace is a string with a default of `gameofthrones`")
  opt[Int]('l', "limit").action( (x, c) =>
    c.copy(limit = x) ).text("limit is an integer with default of 10")
}

So, there are a few interesting parts in this code.  Can you tell which command line variable is required?  Can you tell which requires an Int vs. String?  Sure, you can.  I have confidence in you.

Ok, finally we need to update the code to make a decision based on whether the command line arg parsing succeeded or not.  The way we do that is through a pattern match as shown here

parser.parse(args, CommandLineArgs()) match {

  case Some(config) =>
  // do stuff
  case None => // failed

}

In this example, `config` is our `CommandLineArgs` case class which is available on success.  If the command line arg parsing succeeded, our code will enter the `Some(config)` match and “do stuff”.  There we can use the vars such as `config.keyspace`.

The alternative match is `None`.  In this match, we know the command line arg parsing failed, so the code should exit.
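
Putting it together, a minimal `main` might look something like the sketch below (assuming the `parser` and `CommandLineArgs` defined above are in scope; the app name and how the parsed values are used are just for illustration):

import org.apache.spark.{SparkConf, SparkContext}

def main(args: Array[String]): Unit = {
  parser.parse(args, CommandLineArgs()) match {
    case Some(config) =>
      // build the SparkConf from parsed, validated values instead of raw args
      val conf = new SparkConf()
        .setAppName("spark-cassandra-example")
        .set("spark.cassandra.connection.host", config.cassandra)
      val sc = new SparkContext(conf)
      // ... use config.keyspace and config.limit in the rest of the job
    case None =>
      // scopt has already printed the usage/error text, so just exit
      sys.exit(1)
  }
}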

 

That’s it.  Hopefully, this simple example helps you move ahead with Spark command line argument parsing and usage in Scala.  For more details and exploring more options using `scopt` check out the site via reference link below or let me know if you have any questions or comments.

For the complete updated example, see this commit or cut-and-paste https://github.com/tmcgrath/spark-scala/blob/5f286e6543c87ff3d0cd64ade55f85ac32939007/got-battles/src/main/scala/com/supergloo/SparkCassandra.scala

 

References

[1] For more on `scopt` checkout https://github.com/scopt/scopt

 

Featured image credit https://flic.kr/p/QFCGLQ

Learning Spark PDF

learning spark pdf

So, I’ve noticed “Learning Spark PDF” is a search term which happens on this site.  Can someone help me understand what people are looking for when using this phrase?

Are readers looking for the Learning Spark: Lightning-Fast Big Data Analysis book from O’Reilly?

Perhaps looking for the new Apache Spark with Scala Tutorial book? It’s available in PDF and is much more hands-on than the Learning Spark PDF book.

Or, maybe… people are looking for the Summary of Learning Spark book  🙂

Well, I would bet people are searching for the O’Reilly version, but maybe, just maybe, people are looking for the Summary of Learning Spark book.

Did you know the Summary of Learning Spark book is currently available for free for Kindle Unlimited subscribers?  That’s better than a Learning Spark PDF.

More information on the summary book:

The summary guide will help readers become more confident and productive in Apache Spark quickly.  Apache Spark core fundamentals and ecosystem components are presented succinctly.

The guide is ideal for hands-on engineers, managers, and architects who need an in-depth understanding of Spark quickly.  Readers do not need to know Java, Scala or Python in detail.


The guide will answer questions such as: “does Spark warrant any further investment?” and “how does Spark integrate with Hadoop, Hive, HBase, HDFS, YARN, etc.?”

The key benefit is saving your time by providing the basic principles presented in the original book.

For more information check out the summary.

 

Featured image credit https://flic.kr/p/7En8U

Spark RDD – A Two Minute Guide for Beginners

spark rdd

What is Spark RDD?

Spark RDD is short for Apache Spark Resilient Distributed Dataset.  A Spark Resilient Distributed Dataset is often shortened to simply RDD.  RDDs are a foundational component of the Apache Spark large scale data processing framework.

Spark RDDs are immutable, fault-tolerant, and possibly distributed collections of data elements.  RDDs may be operated on in parallel across a cluster of computer nodes.  To operate in parallel, RDDs are divided into logical partitions.  Partitions are computed on different nodes of the cluster through Spark Transformation APIs.  RDDs may contain any type of Python, Java, or Scala object, including user-defined classes.

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value by performing a computation on the RDD.

How are Spark RDDs created?

Spark RDDs are created through the use of Spark Transformation functions.  Transformation functions create new RDDs from a variety of sources; e.g. the textFile function reading from a local filesystem, Amazon S3 or Hadoop’s HDFS.  Transformation functions may also be used to create new RDDs from previously created RDDs.  For example, an RDD of all the customers from only North America could be constructed from an RDD of all customers throughout the world.
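
As a minimal sketch in spark-shell (assuming an existing SparkContext named `sc` and a hypothetical customers.csv whose fourth column holds a region code):

// create an RDD from a text file
val customers = sc.textFile("customers.csv")
// derive a new RDD containing only North America customers
val northAmerica = customers.filter(_.split(",")(3) == "NA")
// an action such as count triggers the actual computation
println(northAmerica.count())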

In addition to loading text files from file systems, RDDs may be created from external storage systems such as JDBC databases (e.g. MySQL), HBase, Hive, Cassandra, or any data source compatible with a Hadoop InputFormat.

RDDs are also created and manipulated when using Spark modules such as Spark Streaming and Spark MLlib.

Why Spark RDD?

Spark makes use of data abstraction through RDDs to achieve faster and more efficient performance than Hadoop’s MapReduce.

RDDs support in-memory processing.  Accessing data from memory is 10 to 100 times faster than accessing data from a network or disk.  Data access from disk often occurs in Hadoop’s MapReduce-based processing.

In addition to performance gains, working through an abstraction layer provides a convenient and consistent way for developers and engineers to work with a variety of data sets.

When to use Spark RDDs?

RDDs are utilized to perform computations through Spark Actions such as `count` or `reduce` when answering questions such as “how many times did xyz happen?” or “how many times did xyz happen by location?”

Often, RDDs are transformed into new RDDs in order to better prepare datasets for future processing downstream in the processing pipeline.  To reuse a previous example, let’s say you want to examine North America customer data and you have an RDD of all worldwide customers in memory.  It could be beneficial from a performance perspective to create a new RDD for North America only customers instead of using the much larger RDD of all worldwide customers.

Depending on the Spark operating environment and RDD size, RDDs should be cached (via the cache function) or persisted to disk when there is an expectation that the RDD will be utilized more than once.
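
As a quick sketch (again assuming a SparkContext named `sc` and the hypothetical customers.csv from the earlier example):

import org.apache.spark.storage.StorageLevel

val northAmerica = sc.textFile("customers.csv").filter(_.split(",")(3) == "NA")
northAmerica.cache()                              // keep the RDD in memory once computed
// or, if it might not fit in memory:
// northAmerica.persist(StorageLevel.MEMORY_AND_DISK)
println(northAmerica.count())  // first action computes and caches the RDD
println(northAmerica.count())  // second action reuses the cached result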

 

Conclusion and Resources

Learning Spark book

Scala Transformation API examples

Python Transformation API examples

Hadoop Input Format API docs

 

Featured Image credit https://flic.kr/p/7TqgUV

Apache Spark Advanced Cluster Deploy Troubleshooting

spark cluster deploy troubleshooting

In this Apache Spark example tutorial, we’ll review a few options when your Scala Spark code does not deploy as anticipated.  For example, does your Spark driver program rely on a 3rd party jar only compatible with Scala 2.11, but your Spark Cluster is based on Scala 2.10?  Maybe your code relies on a newer version of a 3rd party jar also used by Apache Spark?  Or maybe you want your code to use the Spark version of a particular jar instead of the jar specified by your code.

In any of these cases, your deploy to the Spark cluster will not be smooth.  So, in this post, we’ll explore how to address all three of these issues.

Overview

In this post, we’re going to address three specific issues that can come up when deploying to a Spark cluster:

  1. Using Apache Spark with Scala 2.11
  2. Overriding jars used by Spark with newer versions
  3. Excluding jars from your code in order to use the Spark version instead

All of these issues will be addressed based on the spark streaming code used in a previous Spark Streaming tutorial.  Links to source code download and screencasts are available at the end of this post.

Challenge 1 Apache Spark with Scala 2.11

I had no idea things could go so wrong with the following build.sbt file.

name := "spark-streaming-example"

version := "1.0"

assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

scalaVersion := "2.11.8"

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming" % "1.6.1" % "provided",
  "org.scalaj" %% "scalaj-http" % "2.3.0",
  "org.jfarcand" % "wcs" % "1.5"
)

Although it doesn’t stand out, the issue is going to be with the “wcs” WebSocket client library, which does not work with an Apache Spark cluster compiled against Scala 2.10.

Here is the error:

Exception in thread "Thread-28" java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
    at scalaj.http.HttpConstants$.liftedTree1$1(Http.scala:637)

An error such as this indicates a Scala version incompatibility issue.

wcs is compiled to Scala 2.11 and I couldn’t find another WebSocket client to use in this project, so I explored compiling Spark to Scala 2.11 compatibility.  It turns out this isn’t a big deal.

Build Apache Spark with Scala 2.11 Steps

  1. Download Source (screencast below)
  2. Run the script to change the Scala version to 2.11
  3. Run make-distribution.sh script

Screencast of these three steps:

Building a Spark Scala 2.11 Distribution

Commands run in Screencast:

./dev/change-scala-version.sh 2.11

./make-distribution.sh --name spark-1.6.1-scala-2.11 --tgz -Dscala-2.11 -Pyarn -Phadoop-2.4

 

After creating the distribution, I started the Spark master and worker and optimistically tried to deploy again.  That’s how I ran into the next challenge.

Challenge 2 Incompatible Jars between Spark and Scala program – Use Your Jar

When I tried deploying the assembly jar to the new Spark cluster custom built for Scala 2.11, I ran into another issue as shown in this screencast:

Spark advanced deploy 2

As you see in the screencast, there were issues with SSL in Netty based HttpClient.

After asking Dr. Googs (see reference links below), I determined Spark uses Akka Actor for RPC and messaging, which in turn uses Netty.  And it turns out the Spark/Akka version of Netty is incompatible with the version needed by the scalaj-http library used in this project.

Recall from the “sbt assembly” command, the following was assembled:

~/Development/spark-course/spark-streaming $ sbt assembly
[info] Loading project definition from /Users/toddmcgrath/Development/spark-course/spark-streaming/project
[info] Set current project to spark-streaming-example (in build file:/Users/toddmcgrath/Development/spark-course/spark-streaming/)
[info] Including from cache: wcs-1.5.jar
[info] Including from cache: slf4j-api-1.7.12.jar
[info] Including from cache: scalaj-http_2.11-2.3.0.jar
[info] Including from cache: async-http-client-1.9.28.jar
[info] Including from cache: netty-3.10.3.Final.jar
[info] Checking every *.class/*.jar file's SHA-1.

 

I needed a way to configure Spark to use my netty-3.10.3.Final.jar instead of the older version used in Akka.

The “spark.driver.userClassPathFirst” configuration variable provided the answer.  This variable is described as

"Whether to give user-added jars precedence over Spark's own jars when loading classes in the the driver. This feature can be used to mitigate conflicts between Spark's dependencies and user dependencies. It is currently an experimental feature. This is used in cluster mode only."

So, I tried deploying again with this conf variable set as shown in the following screencast:

netty issue deploy

Challenge 3 Incompatible Jars between Spark and Scala program – Use Spark Jar

What if you want to use Spark’s version of a jar instead of the one bundled with your code?  Essentially, this is the opposite of the previously described Challenge 2.

I reviewed the output from “sbt assembly” and saw slf4j was included in the assembly.  Well, from the logs, we can see that Spark is already using slf4j, and now our driver program is attempting to spawn another instance.  Let’s use Spark’s already instantiated slf4j instead.

To remove or exclude certain jars from being included in the fat jar, hook into the sbt-assembly plugin’s “excludedJars” setting.

Update build.sbt with the following:

excludedJars in assembly <<= (fullClasspath in assembly) map { cp =>
  cp filter {
                i => i.data.getName == "slf4j-api-1.7.12.jar"
            }
}

Then re-run “sbt assembly” and you’ll be ready to try another Spark deploy.

Everything worked as you can see in this screencast

Spark Deploy with two slf4j jars

Conclusion

This post presented three challenges and solutions when troubleshooting Apache Spark with Scala deploys to a Spark cluster.  We covered three scenarios:

  • Apache Spark with Scala 2.11
  • Giving Apache Spark’s jars precedence over yours by excluding them from your assembly
  • Giving your jar(s) precedence over the comparable Apache Spark version of the jar(s).

 

Further References

http://stackoverflow.com/questions/23330449/how-does-spark-use-netty/23333955#23333955

https://issues.apache.org/jira/browse/SPARK-4738

http://spark.apache.org/docs/latest/configuration.html

http://spark.apache.org/docs/latest/building-spark.html#building-for-scala-211

 

If you are interested in more Apache Spark with Scala training and source code examples, make sure to check out our Apache Spark with Scala training course.

 

Featured Image credit https://flic.kr/p/jWwHNq

Free Apache Spark Training

Free Apache Spark Training

Hi everyone,

I just released a free, 3 day Apache Spark training course and curious for your feedback. You can sign up on the Free Apache Spark Training page.

In all, the course is over an hour long across three videos.

The format of the course is:

Day 1) Show example project of using Apache Spark to analyze data
Day 2) Fundamentals of Apache Spark using short descriptions and diagrams
Day 3) Spend 30 minutes covering Spark Transformations and Actions

This course is ideal for people new to Apache Spark who are not sure if they want to commit to or pay for a larger course. This course offers a no-risk chance to experience my teaching style.

This initial release is version one and I’m happy to make updates based on student feedback. So, sign up for the free Apache Spark course and let me know your thoughts.

Thanks in advance,
Todd

 

Featured image credit: https://flic.kr/p/itMqCz

Apache Spark with Amazon S3 Examples of Text Files Tutorial

Apache Spark with Amazon S3 setup

This post will show ways and options for accessing files stored on Amazon S3 from Apache Spark.  Examples of text file interaction on Amazon S3 will be shown from both Scala and Python, using spark-shell for Scala and an ipython notebook for Python.

To begin, you should know there are multiple ways to access S3 based files.  The options depend on a few factors such as:

  • Which version of Spark you are using, because the version of Hadoop matters
  • How were the files created on S3? Were they written to S3 from Spark or Hadoop, or by some other 3rd party tool?

All these examples are based on scala console or pyspark, but they may be translated to different driver programs relatively easily.  If you run into any issues, just leave a comment at the bottom of this page and I’ll try to help you out.

Apache Spark with Amazon S3 Scala Examples

Example Load file from S3 Written By Third Party Amazon S3 tool

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • File on S3 was created from Third Party –  See Reference Section below for specifics on how the file was created
scala> sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "AKIAJJRUVasdfasdf")
scala> sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "LmuKE77fVLXJfasdfasdfxK2vj1nfA0Bp")
scala> val myRDD = sc.textFile("s3n://supergloospark/baby_names.csv")
myRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
scala> myRDD.count
res2: Long = 35218

Note how this example is using s3n instead of s3 in setting security credentials and protocol specification in textFile call.  At the time of this writing, there are three different S3 options.  See Reference section in this post for links for more information.

You could also store the AWS credentials outside your code.  For example, here’s how to set them using environment variables:

~/Development/spark-1.4.1-bin-hadoop2.4 $ export AWS_ACCESS_KEY_ID=AKIAJJRUasdfasdfasdf33HPA
~/Development/spark-1.4.1-bin-hadoop2.4 $ export AWS_SECRET_ACCESS_KEY=LmuKE7afdasdfxK2vj1nfA0Bp

And then, if we restart the spark console, we don’t have to set the AWS security credentials in code.  All we have to do is call textFile with appropriate protocol specifier:

scala> val myRDD = sc.textFile("s3n://supergloospark/baby_names.csv")
myRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21

scala> myRDD.count
res0: Long = 35218

Example Load Text File from S3 Written from Hadoop Library

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • File on S3 was created from Hadoop
  • Amazon S3 credentials stored as environment variables before starting spark-shell
scala> val subset = myRDD.map(line => line.split(",")).map(n => (n(1), n(4)))
subset: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[21] at map at <console>:23

scala> subset.saveAsTextFile("s3://supergloospark/baby_names_s3_not_s3n.csv")
                                                                                
scala> sc.textFile("s3://supergloospark/baby_names_s3_not_s3n.csv").count()
res13: Long = 35218

Notice how s3 instead of s3n is used.  Also, we’re not setting any AWS credentials because we set them as environment variables before starting spark-shell.  See the previous example where AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID were set.  These vars will work for either s3 or s3n.

 

S3 from Spark Text File Interoperability

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • Run both Spark with Scala S3 examples above
  • Amazon S3 credentials stored as environment variables before starting spark-shell
scala> // the following will error, because using s3 instead of s3n
scala> sc.textFile("s3://supergloospark/baby_names.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://supergloospark/baby_names.csv
...
scala> sc.textFile("s3n://supergloospark/baby_names.csv").count()
res16: Long = 35218
scala> sc.textFile("s3://supergloospark/baby_names_s3_not_s3n.csv").count()
res19: Long = 35218                                                             
scala> sc.textFile("s3n://supergloospark/baby_names_s3_not_s3n.csv").count()
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3n://supergloospark/baby_names_s3_not_s3n.csv
...

Note how a file is only readable via the protocol it was written with: the file created by the third-party tool can be read via s3n but not s3, while the file saved from Spark with s3 can be read via s3 but not s3n.  At the time of this writing, there are three different S3 options.  See the Reference section in this post for links with more information.

 

Apache Spark with Amazon S3 Python Examples

Python Example Load File from S3 Written By Third Party Amazon S3 tool

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • File on S3 was created from Third Party –  See Reference Section below for specifics on how the file was created
>>> sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "AKIAnotrealPLUQGVOJWQ")
>>> sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "+Uu9E4psBLnotrealyi+e7i1Z6kJANKt")
>>> myRDD = sc.textFile("s3n://supergloospark/baby_names.csv").count()
35218

There are three different S3 options.  Note how this example is using s3n instead of s3 in setting security credentials and protocol specification in textFile call.  See Reference section in this post for links for more information.

You can store the AWS credentials outside your code.  For example, here’s how to set using environment variables:

~/Development/spark-1.4.1-bin-hadoop2.4 $ export AWS_ACCESS_KEY_ID=AKIAJJRUasdfasdfasdf33HPA
~/Development/spark-1.4.1-bin-hadoop2.4 $ export AWS_SECRET_ACCESS_KEY=LmuKE7afdasdfxK2vj1nfA0Bp

And then, if we restart the ipython notebook, we don’t have to set the AWS security credentials in code.  All we have to do is call textFile with appropriate protocol specifier:

>>> sc.textFile("s3n://supergloospark/baby_names.csv").count()
35218

Python Example Load Text File from S3 Written from Hadoop Library

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • File on S3 was created from Hadoop
>>> myRDD = sc.textFile("s3n://supergloospark/baby_names.csv")
>>> subset = myRDD.filter(lambda line: "MICHAEL" in line)
>>> sc._jsc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "AKIAI74O5KPLUQGVOJWQ")
>>> sc._jsc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey", "+Uu9E4psBLJgJPNFeV2GdevWcyi+e7i1Z6kJANKt")
>>> subset.saveAsTextFile("s3://supergloospark/python_example_baby_names_s3_not_s3n.csv")
>>> sc.textFile("s3://supergloospark/python_example_baby_names_s3_not_s3n.csv").count()
206

Note how this example is using s3 instead of s3n in setting security credentials and the protocol specification in the textFile call.  Unlike the comparable Scala example above, we are setting the AWS keys again because we are using s3 instead of s3n.  We can avoid having to set either if we store these values in external environment vars as noted above.

Python with S3 from Spark Text File Interoperability

Requirements:

  • Spark 1.4.1 pre-built using Hadoop 2.4
  • Run both Spark with Python S3 examples above
>>>  sc.textFile("s3n://supergloospark/baby_names.csv").count()
35218

>>> sc.textFile("s3://supergloospark/python_example_baby_names_s3_not_s3n.csv").count()
206

>>> sc.textFile("s3://supergloospark/baby_names.csv").count()
...
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: s3://supergloospark/baby_names.csv
...

Again, note how the protocol must match the one the file was written with: the third-party-created file is readable via s3n, while the file saved with s3 is only readable via s3.  At the time of this writing, there are three different S3 options.  See the Reference section in this post for links with more information.

 

References

1)  To create the files on S3 outside of Spark/Hadoop, I used a client called Forklift.  But, Forklift isn’t a requirement as there are many S3 clients available.  Here’s a screencast example of configuring Amazon S3 and copying the file up to the S3 bucket.

Apache Spark with Amazon S3 setup

2) For more information on different S3 options, see Amazon S3 page on Hadoop wiki http://wiki.apache.org/hadoop/AmazonS3

3) ipython notebook file available on github

 

 

Featured Image Credit https://flic.kr/p/nvoqRm

Connecting ipython notebook to an Apache Spark Cluster Quick Start

This post will cover how to connect ipython notebook to two kinds of Spark Clusters: Spark Cluster running in Standalone mode and a Spark Cluster running on Amazon EC2.

Requirements

You need to have a Spark Standalone cluster or a Spark cluster running on Amazon EC2 to complete this tutorial.  See the Background section of this post for further information and helpful references.

Connecting ipython notebook to an Apache Spark Standalone Cluster

Connecting to the Spark cluster from ipython notebook is easy.  Simply set the IPYTHON_OPTS environment variable and pass the --master argument when calling pyspark, for example:

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://todd-mcgraths-macbook-pro.local:7077

Run `sc.version` or some other function off of `sc`.  There’s really no foolproof way I know of to programmatically determine that we are truly running ipython notebook against the Spark cluster.  But, we can verify from the Spark Web UI:
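
That said, printing the configured master from within the notebook is a quick sanity check (the Web UI remains the definitive confirmation):

print sc.version   # the Spark version the notebook is talking to
print sc.master    # should show spark://todd-mcgraths-macbook-pro.local:7077, not local[*]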

Connecting ipython notebook to an Apache Spark Cluster running on EC2

Using pyspark against a remote cluster is just as easy.  Just pass in the appropriate URL to the --master argument.

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077

Conclusion

As you saw in this tutorial, connecting to a standalone cluster or a Spark cluster running on EC2 is essentially the same.  It’s easy.  The difficult part of connecting to a Spark cluster happens beforehand.  Check the next section on background information for help setting up your Apache Spark cluster and/or connecting ipython notebook to a Spark cluster.

Background Information or Possibly Helpful References

1) How to use ipython notebook with Spark: Apache Spark and ipython notebook – The Easy Way

2) In the Apache Spark Cluster in Standalone mode tutorial, you learned how to run a Spark Standalone cluster and how to connect the Scala console to utilize it.

3) Running an Apache Spark Cluster on EC2

 

Featured Image: https://flic.kr/p/5dBco

How To: Apache Spark Cluster on Amazon EC2 Tutorial

Spark Cluster on EC2

How do you set up and run an Apache Spark cluster on EC2?  This post will walk you through each step to get an Apache Spark cluster up and running on EC2. The cluster consists of one master and one worker node. It includes each step I took, regardless of whether it failed or succeeded.  While your experience may not match exactly, I’m hoping these steps could be helpful as you attempt to run an Apache Spark cluster on Amazon EC2.  There are screencasts throughout the steps.

Overview

The basis for this tutorial is the ec2 scripts provided with Spark.  It wouldn’t hurt to spend a few minutes reading http://spark.apache.org/docs/latest/ec2-scripts.html to get an idea of what this Apache Spark Cluster on EC2 tutorial will cover.

Assumptions

This post assumes you have already signed-up and have a verified AWS account.  If not, sign up here https://aws.amazon.com/

Approach

I’m going to go through step by step and also show some screenshots and screencasts along the way.  For example, there is a screencast that covers steps 1 through 5 below.

Spark Cluster on Amazon EC2 Step by Step

Note: There’s a screencast of steps one through four at end of step five below.

1) Generate Key/Pair in EC2 section of AWS Console

Click “Key Pairs” in the left nav and then the Create Key Pair button.

 

Download the resulting key/pair PEM file.

2) Create a new AWS user named courseuser and download the file which includes the User Name, Access Key Id, Secret Access Key.  We need the Key Id and Secret Access Key.

3) Set your environment variables according to the key and id from the previous step.  For me, that meant running the following from the command line:

export AWS_SECRET_ACCESS_KEY=F9mKN6obfusicatedpBrEVvel3PEaRiC

export AWS_ACCESS_KEY_ID=AKIAobfusicatedPOQ7XDXYTA

4) Open a terminal window and go to the root dir of your Spark distribution.  Then, copy the PEM file from the first step in this tutorial to the root of the Spark home dir.

5) From Spark home dir, run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem launch spark-cluster-example

I received errors about the PEM file permissions, so I changed the file permissions according to the error message’s recommendation and re-ran the spark-ec2 script.

Then, you may receive permission errors from Amazon, so update the permissions of courseuser in AWS and try again.

You may receive an error about zone availability such as:

Your requested instance type (m1.large) is not supported in your requested Availability Zone (us-east-1b). Please retry your request by not specifying an Availability Zone or choosing us-east-1c, us-east-1e, us-east-1d, us-east-1a.

If so, just update the script zone argument and re-run:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem --zone=us-east-1d launch spark-cluster-example

The cluster creation takes approximately 10 minutes, with all kinds of output including deprecation warnings and possibly errors starting GANGLIA.  GANGLIA errors are fine if you are just experimenting; if you want to fix them, try a different Spark version or tweak the PHP settings on your cluster.

Here’s a screencast example of me creating an Apache Spark Cluster on EC2

Set up an Apache Spark Cluster on Amazon EC2 Part 1

6) After the cluster creation succeeds, you can verify by going to the master web UI at http://<your-ec2-hostname>.amazonaws.com:8080/

7) And you can verify from the Spark console in Scala or Python

Scala example:

bin/spark-shell --master spark://ec2-54-145-64-173.compute-1.amazonaws.com:7077

Python example

IPYTHON_OPTS="notebook" ./bin/pyspark --master spark://ec2-54-198-139-10.compute-1.amazonaws.com:7077

At first, both of these will likely have issues which eventually lead to an “ERROR OneForOneStrategy: java.lang.NullPointerException”:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
16/01/17 07:30:28 ERROR SparkDeploySchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
16/01/17 07:30:28 ERROR OneForOneStrategy: 
java.lang.NullPointerException

 

8) This is an Amazon permission issue related to port 7077 not being open.  You need to open up port 7077 via an Inbound Rule.  Here’s a screencast on how to create an Inbound Rule in EC2:

Setting up an Apache Spark Cluster on Amazon EC2 Part 2

After creating this inbound rule, everything will work from both ipython notebook and the spark shell.

Conclusion

Hope this helps you configure a Spark Cluster on EC2.  Let me know in the page comments if I can help.  Once you are finished with your EC2 instances, make sure to destroy them using the following command:

ec2/spark-ec2 --key-pair=courseexample --identity-file=courseexample.pem destroy spark-cluster-example

Featured Image Credit: https://flic.kr/p/g19ivQ

Apache Spark and ipython notebook – The Easy Way

ipython-notebook-spark

Using ipython notebook with Apache Spark couldn’t be easier.  This post will cover how to use ipython notebook (jupyter) with Spark and why it is the best choice when using Python with Spark.

Requirements

This post assumes you have downloaded and extracted Apache Spark and you are running on a Mac or *nix.  If you are on Windows see http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/

ipython notebook with Apache Spark

I recommend using the Python 2.7 Anaconda Python distribution, which can be downloaded here: https://www.continuum.io/downloads.  It contains more than 300 of the most popular Python packages for science, math, engineering, and data analysis.  Also, future Python Spark tutorials and Python Spark examples will use this distribution.

After you have Anaconda installed, you should make sure that ipython notebook (Jupyter) is up to date. Run the following commands in the Terminal (Mac/Linux) or Command Prompt (Windows):

conda update conda
conda update ipython

Ref: http://ipython.org/install.html in the “I am getting started with Python” section

Launching ipython notebook with Apache Spark

1) In a terminal, go to the root of your Spark install and enter the following command

IPYTHON_OPTS="notebook" ./bin/pyspark

A browser tab should launch, and you will see various output in your terminal window depending on your logging level.

What’s going on here with the IPYTHON_OPTS variable passed to pyspark?  Well, you can look at the source of bin/pyspark in a text editor.  This section:

# Determine the Python executable to use for the driver:
if [[ -n "$IPYTHON_OPTS" || "$IPYTHON" == "1" ]]; then
  # If IPython options are specified, assume user wants to run IPython
  # (for backwards-compatibility)
  PYSPARK_DRIVER_PYTHON_OPTS="$PYSPARK_DRIVER_PYTHON_OPTS $IPYTHON_OPTS"
  PYSPARK_DRIVER_PYTHON="ipython"
elif [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
  PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"$DEFAULT_PYTHON"}"
fi

Hopefully, this snippet makes sense.  If IPYTHON_OPTS is present, use ipython.

Verify Spark with ipython notebook

At this point, you should be able to create a new notebook and execute some python using the provided SparkContext.  For example:

print sc

or

print sc.version

 

Here’s a screencast of running ipython notebook with pyspark on my laptop

ipython notebook spark example

 

In this screencast, pay special attention to your terminal window log statements.  At the default log level of INFO, you should see no errors in the pyspark output.  Also, when you start a new notebook, the terminal should show the SparkContext sc being available for use, such as

INFO SparkContext: Running Spark version

Why use ipython notebook with Spark?

1) Same reasons you use ipython notebook without Spark such as convenience, easy to share and execute notebooks, etc.

2) Code completion.  As the screencast shows, a Python Spark developer can hit the tab key to see available functions, also known as code completion.

 

Hope this helps, let me know if you have any questions.

Learn Apache Spark Online with Limited Time 85% Discount

Image credit https://flic.kr/p/hycr1n

Learn Apache Spark with Online Training

Our Learn Apache Spark with Scala online course is currently available on Udemy.  To celebrate, we’re offering an 85% discount coupon from supergloo.com.

This course will prepare both new and seasoned developers to be productive, confident and valuable with Apache Spark.

Short on time and need to know the core essentials quickly in order to prepare for a job interview, converse with your boss or co-workers or maybe you need to determine if Spark might be a viable solution for your use case?

This is the right course for you.  It teaches Apache Spark core concepts through Scala source code examples and programs. It provides the examples for download so you can run and tweak them to your needs.

The course doesn’t waste your time watching the instructor type the code or providing long overviews of the features and benefits of Spark.  The course respects your time by being concise and focused.

Learn Apache Spark through over 50 hands-on Scala examples and more, including:

  • Learn the fundamentals of Spark’s Resilient Distributed Datasets
  • Scala examples of Spark Actions and Transformations
  • Run Spark in a Cluster
  • Deploy a Scala application to Spark running in a Cluster
  • Convenient links to download all source code
  • Reinforce your understanding through multiple quizzes and lecture recap

By the end of this course, you’ll be conversant, knowledgeable and confident with Apache Spark.  You’ll also have your own environment and source code to continue your Spark adventures.

Our Spark Training is designed for developers, data scientists and engineering managers.  It is based on our real-world experience using Apache Spark in healthcare and finance.  Although Spark may be used from Java, Python and Scala, the focus of this Spark training will be on Scala.  So, if you are interested in learning more Scala and build tools such as SBT, this course provides the opportunity.

The course content is online and may be completed at the student’s own pace.

Spark Training Requirements

A computer and an open mind.  Students will have a much greater chance of success and overall benefit if they have previous programming and database (relational, document, columnar) experience, but it’s not an absolute requirement.

Training is conducted in English.

Early Apache Spark Training Reviews

A few of our course participants have provided feedback such as:

“The instructor speaks slowly and provides thorough coverage on the topics.  I appreciate the attention to detail”

“I like how I could learn Apache spark in bite sized chunks on my own time”

“So grateful the examples were not word count.  I had tried studying Spark on my own before the course, but the examples were always word count.”

 

Join the Course

Again, the retail price is currently $59, but for a limited time, you can access the Apache Spark – Become Productive and Confident Rapidly course at an 85% discount with this link or by using the coupon code SGNINE during registration.

By joining now, you receive all future updates to the course for free.  The next updates will include in-depth presentations and examples of Spark SQL, Streaming and ML (machine learning)!

 

Learn Spark
Learn Apache Spark

 

What is Apache Spark? Deconstructing the Building Blocks Part 1

Becoming productive with Apache Spark requires an understanding of just a few fundamental elements.  In this post, the building blocks of Apache Spark will be covered quickly and backed with real-world examples.

The intention is for readers to understand basic Spark concepts.  It assumes you are familiar with installing software and unzipping files.  Later posts will dive deeper into Apache Spark and example use cases.

Spark computations can be called via Scala, Python or Java.  This post will use Scala on a Mac, but the examples can be easily translated to other languages and operating systems.

If you have any questions or comments, please leave a comment at the bottom of this post.

 

I. Overview

Let’s dive into code and working examples first.  Then, we’ll describe Spark concepts and tie back to the examples.

Requirements

* Java 6 or newer installed

* Download NY State Baby Names in CSV format from: http://www.healthdata.gov/dataset/baby-names-beginning-2007.  (I renamed the csv file to baby_names.csv)

 

Code Examples

The CSV file we will use has the following structure:

Year,First Name,County,Sex,Count
2012,DOMINIC,CAYUGA,M,6
2012,ADDISON,ONONDAGA,F,14
2012,JULIA,ONONDAGA,F,15

 

Let’s start our Spark and Scala journey with a working example.

Steps

  1. Download from http://spark.apache.org/downloads.html.  Select the “Pre-built package for Hadoop 2.4”
  2. Unpack it.  (tar, zip, etc.)
  3. From terminal, run the spark-shell via: bin/spark-shell

Let’s run some code

scala> val babyNames = sc.textFile("baby_names.csv")
babyNames: org.apache.spark.rdd.RDD[String] = baby_names.csv MappedRDD[1] at textFile at <console>:12

scala> babyNames.count
res0: Long = 35218

scala> babyNames.first()
res177: String = Year,First Name,County,Sex,Count

So, we know there are 35218 rows in the CSV file.

scala> val rows = babyNames.map(line => line.split(","))
rows: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[17] at map at <console>:14

scala> rows.map(row => row(2)).distinct.count
res22: Long = 62

We converted each line of the CSV to an Array of Strings and then determined there are 62 unique NY State counties over the years of data collected in the CSV.

 

scala> val davidRows = rows.filter(row => row(1).contains("DAVID"))
davidRows: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[24] at filter at <console>:16

scala> davidRows.count
res32: Long = 136

There are 136 rows containing the name “DAVID”.

scala> davidRows.filter(row => row(4).toInt > 10).count()
res41: Long = 89

This is the number of rows where the name DAVID has a “Count” greater than 10.

scala> davidRows.filter(row => row(4).toInt > 10).map( r => r(2) ).distinct.count
res57: Long = 17

There are 17 unique counties which have had the name DAVID over 10 times in a given year.

scala> val names = rows.map(name => (name(1),1))
names: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[342] at map at <console>:16

scala> names.reduceByKey((a,b) => a + b).sortBy(_._2).foreach ( println _)

This shows the number of lines each name appears on in the file, because of how names was created with rows.map(name => (name(1), 1)).  Jacob appears most often, but is actually not the most popular name by total count.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
filteredRows: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[546] at map at <console>:14

scala> filteredRows.map ( n => (n(1), n(4).toInt)).reduceByKey((a,b) => a + b).sortBy(_._2).foreach (println _)

The output from the last foreach loop isn’t shown here, but Michael (9187 times), followed by Matthew (7891) and Jaden (7807), were the top 3 most popular names in NY from 2007 through 2012.  Also, the first row of the CSV needed to be discarded in order to avoid a NumberFormatException when converting the “Count” column to an Int.

If your results do not show Michael 9187 times, don’t worry.  The println _ function called from foreach only prints from wherever each partition of the RDD is processed, so you may not see the complete, sorted output in your console.  To ensure the entire RDD is printed in order, send the results through collect first:

filteredRows.map ( n => (n(1), n(4).toInt)).reduceByKey((a,b) => a + b).sortBy(_._2).collect.foreach (println _)

Spark Core Concepts

The remainder of this post will cover Spark Core concepts.  Spark Core is what makes all other aspects of the Spark ecosystem possible.  Later posts will cover other aspects of the Spark ecosystem, including Spark SQL, Spark Streaming, MLlib and GraphX.

 

II. Spark Context and Resilient Distributed Datasets

The way to interact with Spark is via a SparkContext.  The example above used the Spark shell, which provides a SparkContext automatically.  Did you notice these lines when the REPL started?

06:32:25 INFO SparkILoop: Created spark context..
Spark context available as sc.

That’s how we’re able to use sc from within our example.
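Outside of the shell, you have to construct a SparkContext yourself before any of the examples will run.  Here is a minimal sketch of what that could look like in a standalone Scala application; the application name and local master setting are illustrative assumptions, not taken from the example above:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical standalone setup; the shell does this for you and exposes it as sc
val conf = new SparkConf()
  .setAppName("BabyNamesExample")  // illustrative application name
  .setMaster("local[*]")           // run locally using all available cores
val sc = new SparkContext(conf)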

After obtaining a SparkContext, developers interact with Spark via Resilient Distributed Datasets.

Resilient Distributed Datasets (RDDs) are immutable, distributed collections of elements.  These collections may be parallelized across a cluster.  RDDs are loaded from an external data set or created via a SparkContext.  We’ll cover both of these scenarios.

In the previous example, we created an RDD from an external data set via:

scala> val babyNames = sc.textFile("baby_names.csv")
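The other scenario, creating an RDD directly from an in-memory collection through the SparkContext, uses parallelize.  A minimal sketch, with made-up values rather than the baby names data:

// Hypothetical example: distribute a local Scala collection as an RDD
val counties = sc.parallelize(Seq("CAYUGA", "ONONDAGA", "KINGS"))
counties.count  // returns 3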

We also created RDDs in other ways, which we’ll cover a bit later.

When utilizing Spark, you will be doing one of two primary interactions: creating new RDDs or transforming existing RDDs to compute a result.  The next section describes these two Spark interactions.

 

III. Actions and Transformations

When working with Spark RDDs, there are two kinds of operations available: actions and transformations.  An action is an execution which produces a result.  Examples of actions in the code above are count and first.

Example Spark Actions

babyNames.count() // number of lines in the CSV file
babyNames.first() // first line of CSV

Transformations create new RDDs using existing RDDs.  We created a variety of RDDs in our example:

scala> val rows = babyNames.map(line => line.split(","))

scala> val davidRows = rows.filter(row => row(1).contains("DAVID"))
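Transformations can also be chained together; a value is only produced once an action is called on the resulting RDD.  Here is a small sketch that reuses the rows RDD from the examples above:

// Chain two transformations, then trigger computation with an action
val davidCounties = rows
  .filter(row => row(1).contains("DAVID"))  // transformation: keep rows for the name DAVID
  .map(row => row(2))                       // transformation: project the county column
davidCounties.distinct.count                // action: the number of distinct counties with the name DAVID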

 

IV. Conclusion and Looking Ahead

In this post, we covered the fundamentals for being productive with Apache Spark.  As you witnessed, there are just a few Spark concepts to know before you can be productive.  What do you think of Scala?  To many, the use of Scala is more intimidating than Spark itself.  Sometimes choosing to use Spark is a way to bolster your Scala-fu as well.

In later posts, we’ll write and run code outside of the REPL.  We’ll dive deeper into Apache Spark and pick up some more Scala along the way as well.

Comments welcome.