Spark Kinesis Example – Moving Beyond Word Count


If you are looking for a Spark with Kinesis example, you are in the right place.  This Spark Kinesis tutorial intends to help you become better at integrating the two.  In this post, I’m going to provide a custom Spark Kinesis code example and a screencast of running it.  We’re going to cover running, configuring, sending sample data and AWS setup.  Finally, I’m going to list some links to content which helped me become more comfortable with Spark Kinesis code and configuration.  If you have questions or suggestions, please let me know in the comment form below.

Spark Kinesis Tutorial Example Overview

In this example, we’re going to simulate sensor devices recording their temperature to a Kinesis stream.  Our Spark Scala program will read this Kinesis stream every 2 seconds and notify us of two things:

  1. If a sensor’s temperature is above 100
  2. The top two sensors’ temps over the previous 20 seconds

So, nothing too complicated, but close enough to a possible real-world scenario of reading and analyzing stream(s) of data and acting on certain results.

In this tutorial, here’s how we’re going to cover things, in the following order:

  1. Check assumptions
  2. Present the code (Scala) and configuration example
  3. Go through AWS Kinesis setup and also Amazon Kinesis Data Generator
  4. Run in IntelliJ
  5. How to build and deploy outside of IntelliJ
  6. Rejoice, savor the moment, and thank the author of this tutorial with a $150 PayPal donation

Sound good? I hope so, let’s begin.

Spark with Kinesis TUTORIAL Assumptions

I’m making the following assumptions about you when writing this tutorial.

  1. You have a basic understanding of Amazon Kinesis
  2. You have experience writing and deploying Apache Spark programs
  3. You have an AWS account and understand that using AWS costs money.  (AKA: moolah, cash money).  Not my money, your money.  It costs you or your company money to run in AWS.
  4. You have set your AWS access key ID and secret access key appropriately for your environment.

If any of these assumptions are incorrect, you are probably going to struggle with this Spark Kinesis integration tutorial.

SPARK with KINESIS EXAMPLE SCALA CODE

Let’s start with the code…

// Imports below are my assumption of what the full project pulls in: the Spark 2.x
// spark-streaming-kinesis-asl artifact, the AWS SDK v1 and Typesafe Config.  Note that the
// Logging trait lives in org.apache.spark.internal in Spark 2.x; if it is not visible in your
// Spark version, swap in your own logging trait.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.regions.RegionUtils
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import com.typesafe.config.ConfigFactory
import org.apache.spark.SparkConf
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kinesis.KinesisUtils
import org.apache.spark.streaming.{Milliseconds, Minutes, Seconds, StreamingContext}

object SparkKinesisExample extends Logging {

  def main(args: Array[String]) {

    val conf = new SparkConf().setAppName("Kinesis Read Sensor Data")
    conf.setIfMissing("spark.master", "local[*]")

    // Typesafe config - load external config from src/main/resources/application.conf
    val kinesisConf = ConfigFactory.load.getConfig("kinesis")

    val appName = kinesisConf.getString("appName")
    val streamName = kinesisConf.getString("streamName")
    val endpointUrl = kinesisConf.getString("endpointUrl")

    val credentials = new DefaultAWSCredentialsProviderChain().getCredentials()
    require(credentials != null,
      "No AWS credentials found. See http://docs.aws.amazon.com/AWSSdkDocsJava/latest/DeveloperGuide/credentials.html")
    val kinesisClient = new AmazonKinesisClient(credentials)
    kinesisClient.setEndpoint(endpointUrl)
    val numShards = kinesisClient.describeStream(streamName).getStreamDescription().getShards().size

    val numStreams = numShards
    val batchInterval = Milliseconds(2000)
    val kinesisCheckpointInterval = batchInterval

    // Get the region name from the endpoint URL to save Kinesis Client Library metadata in
    // DynamoDB of the same region as the Kinesis stream
    val regionName = RegionUtils.getRegionByEndpoint(endpointUrl).getName()

    val ssc = new StreamingContext(conf, batchInterval)

    // Create the Kinesis DStreams
    val kinesisStreams = (0 until numStreams).map { i =>
      KinesisUtils.createStream(ssc, appName, streamName, endpointUrl, regionName,
        InitialPositionInStream.LATEST, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_2)
    }

    // Union all the streams (in case numStreams > 1)
    val unionStreams = ssc.union(kinesisStreams)

    val sensorData = unionStreams.map { byteArray =>
      val Array(sensorId, temp, status) = new String(byteArray).split(",")
      SensorData(sensorId, temp.toInt, status)
    }

    val hotSensors: DStream[SensorData] = sensorData.filter(_.currentTemp > 100)

    hotSensors.print(1) // remove me if you want... this is just to spit out timestamps

    println(s"Sensors with Temp > 100")
    hotSensors.map { sd =>
      println(s"Sensor id ${sd.id} has temp of ${sd.currentTemp}")
    }

    // Hottest sensors over the last 20 seconds
    hotSensors.window(Seconds(20)).foreachRDD { rdd =>
      val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
      import spark.implicits._

      val hotSensorDF = rdd.toDF()
      hotSensorDF.createOrReplaceTempView("hot_sensors")

      val hottestOverTime = spark.sql("select * from hot_sensors order by currentTemp desc limit 5")
      hottestOverTime.show(2)
    }

    // To make sure data is not deleted by the time we query it interactively
    ssc.remember(Minutes(1))

    ssc.start()
    ssc.awaitTermination()
  }
}
case class SensorData(id: String, currentTemp: Int, status: String)

(Note: The entire code is available from my Github repo.  See links in the Resources section below.)

The first 20 or so lines of the `main` function are just setting things up.  For example, we are reading the particulars of the Kinesis stream (streamName, endpointURL, etc.) from a config file.  You will need to change the config variables in the file `src/main/resources/application.conf` to values appropriate for your Kinesis setup.
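For reference, here’s a minimal sketch of what that `application.conf` might contain.  The stream name and endpoint below are assumptions; use the values from your own Kinesis setup.

kinesis {
  appName = "KinesisReadSensorData"
  streamName = "sensor-stream"
  endpointUrl = "https://kinesis.us-west-2.amazonaws.com"
}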

Then, we start utilizing the code provided by Amazon.  You’ll see a reference to `spark-streaming-kinesis-asl` in the build.sbt file in the Github repo.

Next, we create a dynamic number of streams `kinesisStreams` based on the number of shards configured in our Kinesis stream.  Make sure to view the screencast below for further insight on this subject.  In the screencast, I go over setting up Kinesis and running this program.

We utilize the `KinesisUtils` object’s `createStream` method (it comes from the spark-streaming-kinesis-asl library rather than the AWS SDK) to register each stream.  We set the initial position in the stream to the very latest record.  In other words, don’t worry about anything that has already been added to the stream.

(Some of you might be wondering at this point… this doesn’t look like Spark Structured Streaming?  And then you might be thinking, we should be using Structured Streaming for anything new, right?  I’d say you’re right.  But, at the time of this writing, Structured Streaming for Kinesis is not available in Spark outside of Databricks.  Relevant links on this subject are below in the Resources section.)

After we create the streams, we perform a `union` so we can analyze all the streams as one.  After the union, we convert our stream from `Array[Byte]` to `DStream[SensorData]`.  If you set up your Kinesis stream with 1 shard as I did, there will only be one stream.

From here, we perform our pseudo business logic.  We are looking for sensors which might be running hot (e.g. over 100).  If this was a real application, our code might trigger an event based on this temperature.

This example also gives us the opportunity to perform some Spark Streaming windowing.  Windowing allows us to analyze and consider data that previously arrived in the stream and not only the data present at compute time (the current iteration of the micro-batch).  For example, we might want to determine the hottest two sensors over the past 20 seconds and not just the sensor data in a particular batch.  To see this, consider the code starting at the `hotSensors.window` block.  Within the block, notice the import for implicits.  This allows us to convert our RDDs of `SensorData` to DataFrames later in the block.
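If you’d rather stay in the DStream API, here’s a minimal alternative sketch (not from the repo, and assuming the same `hotSensors` stream and 2-second batch interval from above) of the same idea using `reduceByKeyAndWindow`:

// Alternative sketch: hottest reading per sensor over a sliding 20-second window,
// then take the top two.  Window and slide durations must be multiples of the batch interval.
hotSensors
  .map(sd => (sd.id, sd.currentTemp))
  .reduceByKeyAndWindow((a: Int, b: Int) => math.max(a, b), Seconds(20), Seconds(2))
  .foreachRDD { rdd =>
    rdd.sortBy(_._2, ascending = false).take(2).foreach { case (id, temp) =>
      println(s"Sensor $id peaked at $temp over the window")
    }
  }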

That’s it.  I think the code speaks for itself, but as I mentioned above, let me know if you have any questions or suggestions for improvement.

AWS KINESIS SETUP

Setting up Kinesis requires an AWS account.  In this example, it’s nothing fancy.  I created the stream in the us-west-2 region because that’s where the Kinesis Data Generator is, but I don’t think it really matters.  The Kinesis stream is just 1 shard (aka partition) with default settings otherwise.  I did need to modify security policies to make it work, which I show in the screencast.

Amazon Kinesis Generator SETUP

I like being lazy sometimes.  So, when I found that Amazon provides a service to send fake data to Kinesis, I jumped all over it.  I mean, sure, I could write my own Java, Python or Scala program to do it, but using this Kinesis Generator was easier and faster.  I like how it has integrated Faker in order to provide dynamic data.
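If you’d rather not use the generator, here’s a hedged sketch of pushing a single fake reading yourself with the AWS SDK.  The stream name and endpoint below are assumptions; match them to your own setup.

import java.nio.ByteBuffer

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.model.PutRecordRequest

object SendFakeSensorData extends App {
  val client = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain().getCredentials)
  client.setEndpoint("https://kinesis.us-west-2.amazonaws.com")  // assumption: same region as your stream

  val request = new PutRecordRequest()
    .withStreamName("sensor-stream")                                  // assumption: your stream name
    .withPartitionKey("sensor-1")
    .withData(ByteBuffer.wrap("sensor-1,105,WARN".getBytes("UTF-8"))) // the id,temp,status format the Spark code expects
  client.putRecord(request)
}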

RUN Spark Kinesis Scala code in IntelliJ

I’m going to run this in IntelliJ because it simulates how I work.  I like to develop and test in IntelliJ first before building and deploying a jar.  So, check out the screencast for some running-in-intellij-fun.  In order to run in IntelliJ, I’ve customized my `build.sbt` file and updated the Run/Debug configuration in IntelliJ to use the `intellijRunner` classpath.  Also, as previously mentioned, you need to update the configuration settings in the `src/main/resources/application.conf` file.

Finally, as previously mentioned, I assume you have your AWS security credentials (access key ID and secret key) set up and confirmed working.  I don’t show how I set mine up in the screencast, but I include a link for more information in the Resources section below.

BUILD and Deploy Spark Kinesis Example

The entire project is configured with the SBT assembly plugin.  See `project/assembly.sbt`.  To build a deployable jar, run the `assembly` sbt task.  Nothing fancy in the deploy then…just deploy with `spark-submit` and reference the `com.supergloo.SparkKinesisExample` class.
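For example, something along these lines; the jar name, Scala version and master URL are assumptions, so adjust for your build and cluster:

spark-submit --class com.supergloo.SparkKinesisExample \
  --master spark://your-spark-master:7077 \
  target/scala-2.11/spark-kinesis-example-assembly-1.0.jar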

Cash Money Time

I like money.  This is the part where you send me $150.   I’ll wait here until you send it.  Thanks in advance.

Screencast

While I’m waiting for you to send me money, check out the screencast

Spark Kinesis Example

I’m still waiting for you to send the cash $$$ by the way.

Resources

Featured image credit: https://flic.kr/p/6vfaHV

Spark Streaming with Kafka Example

Spark Streaming with Kafka

Spark Streaming with Kafka is becoming so common in data pipelines these days, it’s difficult to find one without the other.   This tutorial will present an example of streaming Kafka from Spark.  In this example, we’ll be feeding weather data into Kafka and then processing this data from Spark Streaming in Scala.  As the data is processed, we will save the results to Cassandra.

Before we dive into the example, let’s look at a little background on Spark Kafka integration, because there are multiple ways to integrate and it may be confusing.

Kafka and Spark Background

There are two ways to use Spark Streaming with Kafka: Receiver and Direct.  The Receiver option is similar to other unreliable sources such as text files and sockets.  Similar to those receivers, data received from Kafka is stored in Spark executors and processed by jobs launched by the Spark Streaming context.

This approach can lose data under failures, so it’s recommended to enable Write Ahead Logs (WAL) in Spark Streaming (introduced in Spark 1.2).  The WAL synchronously saves all the received Kafka data into logs on a distributed file system (e.g. HDFS, S3, DSEFS), so that all of the data can be recovered after a failure.  Another way of saying this is duplication.  We duplicate the data in order to be resilient.  If you do not like the sound of this, then please keep reading.

In our example, we want zero-data loss, but not the overhead of write ahead logs.  We are going to go with an approach referred to as “Direct”.

Direct Spark Streaming from Kafka was introduced in Spark 1.3.

This approach periodically queries Kafka for the latest offsets in each topic + partition and subsequently defines the offset ranges to process in each batch.

This approach has the following advantages:

  • Parallelism: No need to create multiple input Kafka streams and union them, as was often done with the Receiver approach.
  • Efficiency: No need for a Write Ahead Log (WAL), which caused processing overhead and duplication.  As long as you have a sufficient retention window in Kafka, the messages can be recovered from Kafka by Spark Streaming.
  • Exactly-once semantics: We use the Kafka API and not Zookeeper.  Offsets are tracked within Spark Streaming checkpoints (if enabled).

Even with these three advantages, you might be wondering if there are any disadvantages to the Kafka direct approach.  Well, yes, there is one.

Because the direct approach does not update offsets in Zookeeper, Kafka monitoring tools based on Zookeeper will not show progress.  As a possible workaround, you can access the offsets processed by this approach in each batch and update Zookeeper yourself.
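For reference, here is a minimal sketch of grabbing those offsets from the direct stream; the `directKafkaStream` name is illustrative, and actually writing the offsets back to ZooKeeper with your ZK client of choice is left out:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // RDDs produced by the direct stream carry the Kafka offset ranges they were built from
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"${o.topic} partition ${o.partition} offsets ${o.fromOffset} -> ${o.untilOffset}")
  }
}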

In sum, I believe it’s important to note you may see examples of the Receiver based approach in code examples and documentation, but it is not the recommended approach for Kafka Spark integration.

Ok, with this background in mind, let’s dive into the example.

Spark Streaming with Kafka Example

With this history of Kafka Spark Streaming integration in mind, it should be no surprise we are going to go with the direct integration approach.

All the following code is available for download from Github, linked in the Resources section below.  We’re going to move fast through these steps.  There is also a screencast demo of this tutorial listed in the Resources section.

Step 1 Spark Streaming with Kafka Build Setup

Update your build file to include the required Spark Kafka library.  In the provided example, we’re going to use SBT and add the following line:

"org.apache.spark" %% "spark-streaming-kafka" % "1.6.2"

As you’ll notice, the build.sbt file also includes other libraries and configuration related to the `assembly` plugin.  We use the `assembly` plugin to help build fat jars for deployment.

Step 2 Spark Streaming with Kafka Scala Code

Next, we’re going to write the Scala code.  The entire Scala code is found in `com.supergloo.WeatherDataStream`.  We won’t go over line by line here.  Instead, let’s focus on the highlights.

val topics: Set[String] = kafkaTopicRaw.split(",").map(_.trim).toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> kafkaBroker)

val rawWeatherStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

We pass our StreamingContext, Kafka config map and the topics to query to the `createDirectStream` function.  The type parameters are the type of the Kafka message key (String), the type of the Kafka message value (String), and the key and value message decoders (StringDecoder).

The `createDirectStream` function returns a DStream of (Kafka message key, message value) pairs.
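To give a feel for what happens next in `WeatherDataStream`, the message values are parsed into a case class and the resulting stream is saved to Cassandra with the spark-cassandra-connector.  The sketch below is illustrative only; the class, column and keyspace/table names are assumptions, not the exact ones from the repo.

import com.datastax.spark.connector.streaming._

case class RawWeather(wsid: String, year: Int, month: Int, day: Int, hour: Int, temperature: Double)

val parsedWeatherStream = rawWeatherStream.map { case (_, line) =>
  val cols = line.split(",")
  RawWeather(cols(0), cols(1).toInt, cols(2).toInt, cols(3).toInt, cols(4).toInt, cols(5).toDouble)
}

// assumes a matching Cassandra keyspace/table created from the cql directory
parsedWeatherStream.saveToCassandra("isd_weather_data", "raw_weather_data")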

Step 3 Spark Streaming with Kafka Build Fat Jar

In SBT, build the fat jar with `sbt assembly` or just `assembly` if in the SBT REPL.

Step 4 Spark Streaming with Kafka Download and Start Kafka

Next, let’s download and install bare-bones Kafka to use for this example.  We can follow the quick step guide found here https://kafka.apache.org/quickstart

You’ll need to update your path appropriately for the following commands depending on where you installed Kafka; i.e. where your Kafka `bin` dir is.  For example, my Kafka `bin` dir is `/Users/toddmcgrath/Development/kafka_2.11-0.10.1.1/bin`

a) Start Zookeeper `bin/zookeeper-server-start.sh config/zookeeper.properties`

b) Start Kafka `bin/kafka-server-start.sh config/server.properties`

c) Create Kafka topic `bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic raw_weather`

Again, make note of the path for Kafka `bin` as it is needed in later steps.
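To double-check the topic exists before moving on, you can list topics from your Kafka root directory:

bin/kafka-topics.sh --list --zookeeper localhost:2181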

Step 5 Cassandra Setup

Make sure you have a Cassandra instance running and the schema has been created.  The schema is included in the source of this tutorial in the `cql` directory.  To create it, just start up `cqlsh` and source the `create-timeseries.cql` file.  For example:

tm@tmcgrath-rmbp15 spark-1.6.3-bin-hadoop2.6 $ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.0.9.1346 | DSE 5.0.3 | CQL spec 3.4.0 | Native protocol v4]
Use HELP for help.
cqlsh> source 'create-timeseries.cql';

Of course, you’ll need to adjust for the location of your `create-timeseries.cql` file.

Step 6 Spark Streaming with Kafka Deploy

Make sure the Spark master is running and has available worker resources.  Then, deploy with `spark-submit`.  I assume you are familiar with this already, but here’s an example:

~/Development/spark-1.6.3-bin-hadoop2.6/bin/spark-submit --class "com.supergloo.WeatherDataStream" --master spark://todd-mcgraths-macbook-pro.local:7077 target/scala-2.10/kafka-streaming-assembly-1.0.jar

Step 7 Spark Streaming with Kafka Send Data, Watch Processing, Be Merry

Final step, let’s see this baby in action.  Let’s load some data into the appropriate Kafka topic.  There is a CSV file available in the project’s `data/load/` directory.

~/Development/kafka_2.11-0.10.1.1/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic raw_weather < ny-2008.csv

So, this assumes I’m running the `kafka-console-producer.sh` script from the `data/load/` directory, because there is no explicit path given for the `ny-2008.csv` file.

If you go back to where you started the driver, you should see the data flowing through.  Also, if you check Cassandra, you will see the data saved.  If you don’t believe me, check out the screencast below where I demo most of these steps.

Conclusion

This Spark Kafka tutorial provided an example of streaming data from Kafka through Spark and saving to Cassandra.  I hope you found it helpful.  Let me know if you have any questions or suggestions in comments below.

 

References and Resources

Much thanks to the Killrweather application.  Inspiration and portions of the app’s source code were used for this tutorial.  https://github.com/killrweather/killrweather

All the source code, SBT build file, the whole shebang can be found here (use the `kafka-streaming` directory): https://github.com/tmcgrath/spark-scala

Latest Spark Kafka documentation starting point

I also recorded a screencast of this tutorial seen here

Spark Streaming with Kafka Example

 

Look Ma, I’m on YouTube.

Featured image credit https://flic.kr/p/7867Jz

Spark Streaming Testing with Scala Example

Spark Streaming Testing


How do you create and automate tests of Spark Streaming applications?  In this post, we’ll show an example of one way in Scala.  This post is heavy on code examples and has the added bonus of using a code coverage plugin.

Are the tests in this tutorial unit tests?  Or are they integration tests?  Functional tests?  I don’t know; you tell me in the comments below if you have an opinion.  If I had to choose, I’d say unit tests, because we are stubbing the streaming provider.

Pre-requisites

As I’m sure you can guess, you will need some Spark Streaming Scala code to test.  We’re going to use the code from our Spark Streaming from Slack example in this post.  So, check that out first if you need some streaming Scala code to use.  It’s not required to use that code, though.  You should be able to take the concepts presented and apply them to your own code if desired.  All the testing code and Spark Streaming example code is available to pull from Github anyhow.

We’re going to use `sbt` to build and run tests and create coverage reports.  So, if you are not using `sbt` please translate to your build tool accordingly.

Overview

In order to write automated tests for Spark Streaming, we’re going to use a third party library called ScalaTest.  Also, we’re going to add an sbt plugin called sbt-scoverage.  Then, with these tools in hand, we can write some Scala test code and create test coverage reports.

Steps

  1. Pull Spark Streaming code example from github
  2. Describe Updates to build.sbt
  3. Create project/plugins.sbt
  4. Write Scala code
  5. Execute tests and coverage reports

Pull Spark Streaming Code Example from Github

If you don’t want to copy-and-paste code, you can pull it from github.  Just pull the spark-course repo from https://github.com/tmcgrath/spark-course; the project we are working from is in the spark-streaming-tests directory.

Updates to Previous build.sbt

build.sbt should be updated to include a new command alias as well as the scalatest 3rd party lib as seen below:

  scalaVersion := "2.11.8"
  
  +addCommandAlias("sanity", ";clean ;compile ;coverage ;test; coverageReport")
  
  resolvers += "jitpack" at "https://jitpack.io"
 @@ -19,5 +21,6 @@ libraryDependencies ++= Seq(
  // comment above line and uncomment the following to run in sbt
  // "org.apache.spark" %% "spark-streaming" % "1.6.1",
    "org.scalaj" %% "scalaj-http" % "2.3.0",
 -  "org.jfarcand" % "wcs" % "1.5" 
 +  "org.jfarcand" % "wcs" % "1.5",
 +  "org.scalatest" %% "scalatest" % "2.2.6" % "test"
  )

Notice how we add “test” to the end of the scalatest dependency to indicate the library is only needed for tests.

Create project/plugins.sbt

Add a new line for the sbt-coverage plugin as seen here:

addSbtPlugin("org.scoverage" % "sbt-scoverage" % "1.3.5")

Write Scala Tests

Actually, before we write the actual tests, we’re going to update our previous SlackStreamingApp’s `main` method to facilitate automated tests.  I know, I know, if we would have written SlackStreamingApp with TDD, then we wouldn’t have to do this, right?  😉

Anyhow, it’s not a huge change.

 object SlackStreamingApp {
 - 
 +
    def main(args: Array[String]) {
      val conf = new SparkConf().setMaster(args(0)).setAppName("SlackStreaming")
      val ssc = new StreamingContext(conf, Seconds(5))
      val stream = ssc.receiverStream(new SlackReceiver(args(1)))
      stream.print()
 -    if (args.length > 2) {
 -      stream.saveAsTextFiles(args(2))
 -    }
 +
 +    processStream(args, stream)
 +
      ssc.start()
      ssc.awaitTermination()
    }
 - 
 +
 +  def processStream(args: Array[String], stream: DStream[String]): Unit = {
 +    args match {
 +      case Array(_, _, path, _*) => stream.saveAsTextFiles(args(2))
 +      case _ => return
 +    }
 +
 +
 +  }
 +

As you can hopefully see, we just needed to extract the code looking for a command-line arg into a new function called `processStream`.  Also, we need to add one more line to the imports at the top:

import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}

Next, we write the testing code.  To start, we need to create new directories to store the test code.  Create src/test/scala/com/supergloo directories.  Next, we add test code to this directory by creating the following Scala file: src/test/scala/com/supergloo/SlackStreamingTest.scala

package com.supergloo

import com.supergloo.SlackStreamingApp._
import org.apache.hadoop.mapred.InvalidInputException
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{ClockWrapper, Seconds, StreamingContext}
import org.scalatest.concurrent.Eventually
import org.scalatest.{BeforeAndAfter, FlatSpec, Matchers}

import scala.collection.mutable
import scala.concurrent.duration._
import scala.language.postfixOps
import scala.reflect.io.Path
import scala.util.Try

class SlackStreamingTest extends FlatSpec with Matchers with Eventually with BeforeAndAfter {

  private val master = "local[1]"
  private val appName = "spark-streaming-test"
  private val filePath: String = "target/testfile"

  private var ssc: StreamingContext = _

  private val batchDuration = Seconds(1)

  var clock: ClockWrapper = _

  before {
    val conf = new SparkConf()
      .setMaster(master).setAppName(appName)
      .set("spark.streaming.clock", "org.apache.spark.streaming.util.ManualClock")

    ssc = new StreamingContext(conf, batchDuration)
    clock = new ClockWrapper(ssc)
  }

  after {
    if (ssc != null) {
      ssc.stop()
    }
    Try(Path(filePath + "-1000").deleteRecursively)
  }

  "Slack Streaming App " should " store streams into a file" in {
    val lines = mutable.Queue[RDD[String]]()
    val dstream = ssc.queueStream(lines)

    dstream.print()
    processStream(Array("", "", filePath), dstream)


    ssc.start()

    lines += ssc.sparkContext.makeRDD(Seq("b", "c"))
    clock.advance(1000)

    eventually(timeout(2 seconds)){
      val wFile: RDD[String] = ssc.sparkContext.textFile(filePath+ "-1000")
      wFile.count() should be (2)
      wFile.collect().foreach(println)
    }

  }

  "Slack Streaming App " should " store empty streams if no data received" in {
    val lines = mutable.Queue[RDD[String]]()
    val dstream = ssc.queueStream(lines)

    dstream.print()
    processStream(Array("", "", filePath), dstream)


    ssc.start()

    clock.advance(1000)

    eventually(timeout(1 seconds)){
      val wFile: RDD[String] = ssc.sparkContext.textFile(filePath+ "-1000")
      wFile.count() should be (0)
      wFile.collect().foreach(println)
    }

  }

  "Slack Streaming App " should " not store streams if argument is not passed" in {
    val lines = mutable.Queue[RDD[String]]()
    val dstream = ssc.queueStream(lines)

    dstream.print()
    processStream(Array("", ""), dstream)

    val wFile: RDD[String] = ssc.sparkContext.textFile(filePath+ "-1000")

    ssc.start()

    lines += ssc.sparkContext.makeRDD(Seq("b", "c"))
    clock.advance(2000)

    eventually(timeout(3 seconds)){
      a [InvalidInputException] should be thrownBy {
        wFile.count() should be (0)
      }
    }
  }
}

Next, we need to create additional directories and add ClockWrapper.scala to src/test/scala/org/apache/spark/streaming/.  More on this class later.

package org.apache.spark.streaming

import org.apache.spark.util.ManualClock

/**
  * This class is defined in this package as the ManualClock is
  * private in the "spark" package
  */
class ClockWrapper(ssc: StreamingContext) {

  def getTimeMillis(): Long = manualClock().getTimeMillis()

  def setTime(timeToSet: Long) = manualClock().setTime(timeToSet)

  def advance(timeToAdd: Long) = manualClock().advance(timeToAdd)

  def waitTillTime(targetTime: Long): Long = manualClock().waitTillTime(targetTime)

  private def manualClock(): ManualClock = {
    ssc.scheduler.clock.asInstanceOf[ManualClock]
  }
}

(By the way, ClockWrapper is taken from an approach I saw on Spark unit testing.  See the “Additional Resources” section below for the link.)

Ok, we’re ready to execute now.

Execute Scala tests and coverage reports

In the spark-streaming-tests directory, we can now issue `sbt sanity` from command-line.  You should see all three tests pass:

[info] SlackStreamingTest:
[info] Slack Streaming App 
[info] - should store streams into a file
[info] Slack Streaming App 
[info] - should store empty streams if no data received
[info] Slack Streaming App 
[info] - should not store streams if argument is not passed
[info] Run completed in 4 seconds, 436 milliseconds.
[info] Total number of tests run: 3
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 3, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

 

To review coverage reports, simply open target/scala-2.11/scoverage-report/index.html in a browser.

 

Conclusion

Hopefully, this Spark Streaming unit test example helps start your Spark Streaming testing approach.  We covered a code example, how to run the tests, and how to view the test coverage results.  If you have any questions or comments, let me know.  Also, subscribe to the Supergloo YouTube channel for an upcoming screencast from this post.

 

Additional Resources

 

Featured image credit https://flic.kr/p/dgSbYM

Spark Streaming Example – How to Stream from Slack

Spark Streaming Example

Let’s write a Spark Streaming example in Scala, which streams from Slack.  This post will show how to write, configure and execute the code, first.  Then, the source code will be examined in detail.  If you don’t have a Slack team,  you can set one up for free.   We’ll cover that too.

Let’s start with a big picture overview of the steps we will take.

Spark Streaming Example Overview

  1. Setup development environment for Scala and SBT
  2. Write code
  3. Configure Slack for stream access
  4. Start Apache Spark in Standalone mode
  5. Run the Spark Streaming app
  6. Revisit code to describe the fundamental concepts.

So, our initial target is running code.  Then, we’ll examine the source code in detail.

1. Setup Spark Streaming Development Environment for Scala and SBT

Let’s follow SBT directory conventions.  Create a new directory to start.  I’m going to call mine spark-streaming-example.  The following are commands to create the directory, but you can use a window manager if you wish as well.  If this directory structure doesn’t make sense to you or you haven’t compiled Scala code with SBT before, this post probably isn’t the best for you.  Sorry, I had to write that.  I don’t mean it as a personal shot against you.  I’m sure you are a wonderful and interesting person.  This post isn’t super advanced, but I just want to be upfront and honest with you.  It’s better for both of us in the long run.

Anyhow, where were we, you Scala-compiling maestro?  Oh yeah, directory structure.

mkdir spark-streaming-example
cd spark-streaming-example
mkdir src
mkdir src/main
mkdir src/main/scala
mkdir src/main/scala/com
mkdir src/main/scala/com/supergloo

Next, create a build.sbt file in the root of your dev directory.  Ready for a surprise?  Surprise!  My build.sbt will be in the spark-streaming-example/ directory.

The build.sbt I’m using is:

name := "spark-streaming-example"

version := "1.0"

scalaVersion := "2.11.8"

resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-streaming_2.11" % "1.6.1",
  "org.scalaj" %% "scalaj-http" % "2.2.1",
  "org.jfarcand" % "wcs" % "1.5"
)

You see what’s happening, right?  I said, RIGHT!  Hope you didn’t jump out of your chair there.  I wasn’t yelling, but just want to make sure you’re still with me.

In a nutshell: I’m going to use Scala 2.11.8 and grab a few dependencies such as Spark Streaming 2.11, scalaj-http and wcs.  There are links to these and more descriptions later in this post.  In short, we need `wcs` to make a WebSocket connection to Slack and `scalaj-http` for an HTTP client.  Remember, our first goal is working code and then we’ll come back to more detailed descriptions.  Stay with me.

2. Write Scala Code

I called this step “write Scala code”, but the more I think about it, this isn’t entirely accurate.  In fact, I’m going to write the code and you can copy-and-paste.  Lucky you.  See how much I care about you.

You need two files:

In the src/main/scala/com/supergloo directory, a file called SlackReceiver.scala with following contents:

package com.supergloo

import org.apache.spark.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.jfarcand.wcs.{TextListener, WebSocket}

import scala.util.parsing.json.JSON
import scalaj.http.Http

/**
* Spark Streaming Example Slack Receiver from Slack
*/
class SlackReceiver(token: String) extends Receiver[String](StorageLevel.MEMORY_ONLY) with Runnable with Logging {

  private val slackUrl = "https://slack.com/api/rtm.start"

  @transient
  private var thread: Thread = _

  override def onStart(): Unit = {
     thread = new Thread(this)
     thread.start()
  }

  override def onStop(): Unit = {
     thread.interrupt()
  }

  override def run(): Unit = {
     receive()
   }

  private def receive(): Unit = {
     val webSocket = WebSocket().open(webSocketUrl())
     webSocket.listener(new TextListener {
       override def onMessage(message: String) {
         store(message)
       }
     })
  }

  private def webSocketUrl(): String = {
    val response = Http(slackUrl).param("token", token).asString.body
    JSON.parseFull(response).get.asInstanceOf[Map[String, Any]].get("url").get.toString
  }

}

And you’ll need another file in the same directory called SlackStreamingApp.scala with following contents:

package com.supergloo

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Spark Streaming Example App
  */
object SlackStreamingApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster(args(0)).setAppName("SlackStreaming")
    val ssc = new StreamingContext(conf, Seconds(5))
    val stream = ssc.receiverStream(new SlackReceiver(args(1)))
    stream.print()
    if (args.length > 2) {
      stream.saveAsTextFiles(args(2))
    }
    ssc.start()
    ssc.awaitTermination()
  }

}

Ok, at this point, “we” are finished with code.  And by “we”, I mean you.

I think it would be a good idea to make sure SBT is happy.

So, try `sbt compile`.  For my environment, I’m going to run this from the command-line in the spark-streaming-example folder.  In the Resources section of this post, there is a link to a YouTube screencast of me running this.  Maybe that could be helpful for you too.  I don’t know.  You tell me.  Actually, don’t tell me if it worked.  Let me know in the page comments what didn’t work.  It works on my machine.  Ever hear that one before?

3. Configure Slack for API access

You need an OAuth token for API access to Slack and to run this Spark Streaming example.  Luckily for us, Slack provides test tokens that do not require going through all the OAuth redirects.  That token will be perfect for this example.

To get a token, go to https://api.slack.com/docs/oauth-test-tokens to list the Slack teams you have joined.  Here’s what mine looks like (without the blue arrow):

 

Spark Streaming Example From Slack

I greyed some out to protect the innocent.  The point is, you should see a green box for “Create Token”.  Look again at the screenshot above and where the blue arrow points.  You should have this option.  And if you don’t, there is another option for you.

It’s easy to set up your own, free, Slack team site.  And when you do, by default, the new team will have API access enabled.  So, create a new team if you don’t have a Create Token button on any of your existing teams.  Start here: https://slack.com/create.

Once you have a new team set up or whenever you have a “Create Token” button available on the previously mentioned OAuth test token page, click it to generate a token.  Save that token, because we, and by “we”, I mean you, will need it soon.  But first, “we” need to start Spark so we can run this example.  We are in this together, you and me.  Here we go.

4. Start Apache Spark in Standalone mode

I presume you have an Apache Spark environment to use.  If you don’t, you might be a bit ahead of yourself with a Spark Streaming tutorial like this one.   If you are ahead of yourself, I like your style.  Nothing like jumping into the deep end first.  But, this pool might be empty and you could get hurt.  I don’t mean hurt literally.  It’s more mental than physical.

There are plenty of resources on this site to help get setup running a Spark Cluster Standalone.  As I said, you need that.  But, let’s continue if you want.

For this Spark Streaming tutorial, I’m going to go with the most simple Spark setup possible.  That means we’re going to run Spark in Standalone mode.   You see, I get to make the decisions around here.  I’m a big shot blogger.  Ok, ok, I know, not really a big shot.  But, a guy can dream.  And I actually do not dream of becoming a big shot blogger.  I dream of taking my kids on adventures around the world.  I dream of watching movies.  I sometimes dream of watching movies while my kids are someplace on the other side of the world.

Anyhow, if you are a big shot with your own Spark Cluster running, you can run this example code on that too.  Your call.  Evidently, you are the boss around here.

Ok, boss, start a Spark Standalone Master from command-line:

~/Development/spark-1.5.1-bin-hadoop2.4 $ sbin/start-master.sh

You should call start-master.sh or your Windows equivalent from the location appropriate for your environment.  For me, that’s the spark-1.5.1-bin-hadoop2.4 directory.  You knew that by looking at the example though didn’t you?

Next start a worker:

~/Development/spark-1.5.1-bin-hadoop2.4 $ sbin/start-slave.sh spark://todd-mcgraths-macbook-pro.local:7077

You do not want to use `spark://todd-mcgraths-macbook-pro.local:7077` when starting up your Spark worker.  That’s mine.  todd-mcgraths-macbook-pro.local is my laptop, not yours.  Set the master URL to something appropriate for your machine.

Ok, you should be able to tell if everything is ok with Spark startup.  If not, you are definitely in trouble with this tutorial.  You probably need to slow down a bit there speedy.  But, you are the boss.

You may need to open another command window to run the next step.

5. Run the Spark Streaming app

Scala and Spark fan, here we go.  Listen, I know sbt can be a bear sometimes.  It takes some time for it to become a `simple build tool`.  But, I’m not going to go over that here.  Ok?    

1) Start SBT in the directory where build.sbt is located.

~/Development/spark-streaming-example $ sbt

2) In your sbt console:

run local[5] <your-oauth-token> output

What you should see:

After the SlackStreamingApp starts, you will see JSON retrieved from Slack. Holy moly, let me repeat: JSON from Slack.  We did it!   Dora might yell Lo Hicimos! at this point.  Or maybe Boots would say that.  I can’t remember and don’t care.  You don’t either.  

Depending on your log settings, things might scroll through your console pretty fast.

You can verify by adding messages to the Slack team the OAuth token has access to.  You’ll also be streaming messages for Slack events such as joining and leaving channels, bots, etc.

Wow, we actually did it.  You and me, kid.  I had confidence in you the whole time.  I believed in you when no one else did.  Well, honestly, not really.  This is the Internet after all.  But, every once and while, I’m pleasantly surprised.  I still think you’re pretty neat.

6. Revisit Spark Streaming Code – Describe Key Concepts

Ok, let’s revisit the code and start with external dependencies.  As briefly noted in the build.sbt section, we connect to Slack over a WebSocket.  To make a WebSocket connection and parse the incoming JSON data, we use three things: an external WebSocket Scala library (wcs), an external HTTP client library (scalaj-http) and the native JSON parser in Scala.  Again, links to the external libraries in use are located in the Resources section below.  We see all three of these in action in two SlackReceiver functions.

  private def receive(): Unit = {
    val webSocket = WebSocket().open(webSocketUrl())
    webSocket.listener(new TextListener {
      override def onMessage(message: String) {
        store(message)
      }
    })
  }

  private def webSocketUrl(): String = {
    val response = Http(slackUrl).param("token", token).asString.body
    JSON.parseFull(response).get.asInstanceOf[Map[String, Any]]
                                      .get("url").get.toString
  }

The webSocketUrl function uses the OAuth token we passed as the second command-line argument to `run`.  More on that soon.  Note the parsing of the incoming response data as JSON with JSON.parseFull.  We sent the OAuth token from SlackStreamingApp when we initialized the SlackReceiver:

val stream = ssc.receiverStream(new SlackReceiver(args(1)))

Also, we see in the `webSocketUrl` function that we are expecting JSON with a schema of key/value pairs, which is why we cast to Map[String, Any].
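For context, the part of the rtm.start response we care about looks roughly like this (abridged and illustrative):

{ "ok": true, "url": "wss://..." }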

Ok, that covers the external libraries in use.  Let’s keep going.

Recall from earlier Spark Streaming tutorials on this site (links in Resources below) that Spark Streaming can be thought of as a micro-batch system.  Instead of waiting and processing streaming data one record at a time, Spark Streaming discretizes the streaming data into micro-batches.  Or, in other words, Spark Streaming’s Receivers accept data in parallel and buffer it in the memory of Spark’s worker nodes.

Micro-batches poll stream sources at specified timeframes.  What is the poll frequency for this example app?  It’s every 5 seconds as declared in the SlackStreaming app code:

    val ssc = new StreamingContext(conf, Seconds(5))

And what about StreamingContext?  The StreamingContext is a type of context which is specific to Spark Streaming.  Surprised?  Of course, you are not.  You could just tell by the name StreamingContext, right?  I said RIGHT!?  Did you jump out of your chair that time?  I hope so.  You need a StreamingContext when you are building streaming apps.

Back to the SlackReceiver class now.  Extending `Receiver` is what we do when building a custom receiver for Spark Streaming.  And if you haven’t guessed by now, let me tell you, we built a custom receiver for Slack.  Well, would you look at us.  We built a custom receiver.  Somebody get us a trophy.  Or a ribbon.  Or a ribbon trophy.

The class declaration:

class SlackReceiver(token: String) extends Receiver[String](StorageLevel.MEMORY_ONLY) with Runnable with Logging {

There are a few things to note about this declaration.  First, the `Runnable` trait usage is for convenience to run this sample.  I thought it would make things easier to run from SBT.

We’re setting StorageLevel to memory only

StorageLevel.MEMORY_ONLY

This is the default.  Nothing fancy here.  This stores RDDs as deserialized objects in the JVM.  If storage needs grow beyond what’s available, the data will not spill to disk and will need to be recomputed each time it is needed and is not in memory.  Again, we don’t need anything more in this example.  Check out other levels such as MEMORY_AND_DISK, MEMORY_ONLY_SER, DISK_ONLY and others if you want more info on storage levels.  This is a Spark Streaming post, dang it.

 

Finally, when extending Receiver we override three functions.  (Or, you might see some examples of calling `run` from within `onStart`, but we’re not going to do that here.  Why?  Because I’m the big shot boss and a “visionary”.  On second thought, I’m a big shot visionary or BSV as they say in the biznass.  <– That’s business spelled wrong for my international peeps.  <– That’s people spelled wrong for my non-slang speaking audience…. dang it, now I’m side tracked.)

Where were we?!   Don’t let me get off track, partner.  We need to override two Receiver functions, because Receiver is an abstract class:

...
  override def onStart(): Unit = {
    thread = new Thread(this)
    thread.start()
  }

  override def onStop(): Unit = {
    thread.interrupt()
  }
...

`onStart` is spawning a new thread to receive the stream source.  This triggers a call to our overridden Thread `run` function which calls the previously described `receive` function.

`onStop` is there to ensure any spawned threads are stopped when the receiver is stopped.

(Not shown, but exceptions while receiving can be handled either by restarting the receiver with `restart` or stopping it completely with `stop`.  See the Receiver docs for more information.)

So, that’s the code.  But, let’s also consider how this example was invoked.  One important detail is the use of “5”:

run local[5] <your-oauth-token> output

Why 5?  If we use “local” or “local[1]” as the master URL, only one thread will be used for running tasks.  When using an input DStream based on a receiver, a single thread will be used to run the receiver, which leaves no thread for processing the received data.  So, always use “local[n]” as the master URL, where n > number of receivers to run.

When running on a Spark Cluster outside of Standalone mode, the number of cores allocated to the Spark Streaming application must be more than the number of receivers.

Finally, we’ll close with a fairly insignificant detail.  The last argument is “output” which you can see from the SlackStreamingApp is used here:

    if (args.length > 2) {
      stream.saveAsTextFiles(args(2))
    }

This third argument is optional and specifies whether the stream data should be saved to disk, and the path prefix to use.

 

Conclusion

So, all joking aside, I hope this Spark Streaming example helps you.  I like to help nice people who try.  I hope you are one of those types of people.

You might enjoy signing up for the mailing list, following Twitter and subscribing on YouTube.  You have options.  I think the links to these sites are on the bottom of each page.  To be honest, I’m not entirely sure I want you to follow or subscribe, but I don’t think I can actually prevent you from doing so.  So, you’re in charge boss.   See?  I did it again.  Just having fun.

Take care and let me know if you have any questions or suggestions for this post in the comments below.

Here is a screencast of me running through most of these steps above.

Spark Streaming Example – How to Stream from Slack

Look Ma, I’m on YouTube!  You can subscribe to the supergloo YouTube channel if you want.  My Ma did.  I think.  At least, she told me she did.  (I call her Ma, not Mom, get over it.  I’m from Minnesota.)

 

Resources for this Spark Streaming Example Tutorial

  1. WCS – Asynchronous WebSocket Connector – https://github.com/jfarcand/WCS
  2. HttpClient – https://github.com/scalaj/scalaj-http
  3. Check this site for “Spark Streaming Example Part 1”.  This post is, loosely, Part 2.  It’s probably listed in the “Related Posts” section below.

 

Featured image credit: https://flic.kr/p/oMquYF

How-To Apache Spark Streaming with Scala Part 1

Spark Streaming with Scala

Let’s start Apache Spark Streaming by building up our confidence with small steps.  These small steps will create the forward momentum needed when learning new skills.  The quickest way to gain confidence and momentum in learning new software development skills is executing code that performs without error.

In this post, we’re going to setup and run Apache Spark Streaming with Scala code.  Then, we will be confident taking the next step to Part 2 of learning Apache Spark Streaming.

Before we begin though, I assume you already have a high-level understanding of Apache Spark Streaming at this point, but if not, here’s a quick two-minute read on Spark Streaming (opens in new window) from the Learning Apache Spark Summary book.

Overview

Spark comes with some great examples and convenient scripts for running Streaming code.  Let’s make sure you can run these examples.  In case it helps, I made a screencast of me running through these steps.  Link to the screencast below.

Running the NetworkWordCount example out-of-the-box

  1. Open a shell or command prompt on Windows and go to your Spark root directory.
  2. Start Spark Master:  sbin/start-master.sh  **
  3. Start a Worker: sbin/start-slave.sh spark://todd-mcgraths-macbook-pro.local:7077
  4. Start netcat on port 9999: nc -lk 9999  (*** Windows users: https://nmap.org/ncat/  Let me know in page comments if this works well on Windows)
  5. Run network word count using handy run-example script: bin/run-example streaming.NetworkWordCount localhost 9999

** Windows users, please adjust accordingly; i.e. sbin/start-master.cmd instead of sbin/start-master.sh

Here’s a screencast of me running these steps

Apache Spark Streaming with Scala Part 1

Making and Running Our Own NetworkWordCount

Ok, that’s good.  We’ve succeeded in running the Scala Spark Streaming NetworkWordCount example, but what about running our own Spark Streaming program in Scala?  Let’s take another step towards that goal.  In this step, we’re going to set up our own Scala/SBT project, compile, package and deploy a modified NetworkWordCount.  Again, I made a screencast of the following steps with a link to the screencast below.

  1. Choose or create a new directory for a new Spark Streaming Scala project.
  2. Make dirs to make things convenient for SBT: src/main/scala
  3. Create Scala object code file called NetworkWordCount.scala in src/main/scala directory
  4. Copy-and-paste NetworkWordCount.scala code from Spark examples directory to your version created in previous step
  5. Remove or comment out package and StreamingExamples references
  6. Change AppName to “MyNetworkWordCount”
  7. Create a build.sbt file (source code below)
  8. sbt compile to smoke test
  9. Deploy: ~/Development/spark-1.5.1-bin-hadoop2.4/bin/spark-submit --class "NetworkWordCount" --master spark://todd-mcgraths-macbook-pro.local:7077 target/scala-2.11/streaming-example_2.11-1.0.jar localhost 9999
  10. Start netcat on port 9999: nc -lk 9999  and start typing
  11. Check things out in the Spark UI

Apache Spark Streaming with Scala Part 2

build.sbt source

name := "streaming-example"

version := "1.0"

scalaVersion := "2.11.4"

libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.5.1",
    "org.apache.spark" %% "spark-streaming" % "1.5.1"
)

If you watched the video, notice this has been corrected to “streaming-example” and not “steaming-example” 🙂
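For reference, here’s roughly what the modified NetworkWordCount.scala ends up looking like after steps 3 through 6 above; it’s the stock Spark example with the package and StreamingExamples references removed and the app name changed:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println("Usage: NetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("MyNetworkWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(1))

    // Create a DStream that connects to hostname:port, e.g. localhost:9999
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}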

Spark Streaming With Scala Part 1 Conclusion

At this point, I hope you were successful in running both Spark Streaming examples in Scala.  If so, you should be more confident when we continue to explore Spark Streaming in Part 2.   If you have any questions, feel free to add comments below.

 

 

Featured image credit https://flic.kr/p/bVJF32