Supergloo

Mastering PySpark: Most Popular PySpark Tutorials

January 12, 2024 by Todd M

As the demand for data processing and analytics continues to soar, PySpark has emerged as a powerful tool in the data streaming landscape. Here on supergloo.com, a hub for PySpark tutorials, there are insights to help users harness the full potential of PySpark. In this blog recap post, let’s explore the top five PySpark tutorials … Read more

Navigating Compatibility: A Guide to Kafka Broker API Versions

January 11, 2024 by Todd M

Apache Kafka, renowned for its distributed streaming capabilities, relies on a well-defined set of APIs to facilitate communication between clients and brokers. Understanding the compatibility between Kafka clients and broker API versions is crucial for maintaining a stable and efficient streaming environment. In this blog post, we’ll delve into the realm of Kafka Broker API … Read more

How to Determine Kafka Connect License or Not?

January 4, 2024 by Todd M

Determining whether a Kafka Connect source or sink connector requires a license to be purchased usually depends on the specific connector and its vendor. Kafka Connect itself is an open-source framework, but individual connectors may have different licensing models. Put another way, as a developer and/or operator or anyone trying to build and create … Read more

Multi Tenant Kafka [4 Requirements, 1 Optional]

January 3, 2024 by Todd M

To implement a multi-tenant Kafka architecture, several requirements need to be addressed in order to increase your chances of success. In this post, we will list and describe four requirements in multi-tenant Kafka architectures which can lead to one optional configuration benefit. The final benefit will only be interesting depending on your unique … Read more

Kafka Terraform Integration: Simplifying Stream Processing Infrastructure Deployment

January 2, 2024 by Todd M

Apache Kafka has become a cornerstone in data processing and streaming architectures, offering robust publish-subscribe capabilities for handling real-time data. It’s well-regarded for its high throughput, durability, and scalability, which are essential for modern applications that rely on fast, reliable data streaming. Yet, managing Kafka clusters and their associated infrastructure can be complex, necessitating tools that … Read more

StreamConnectors.com Launches with Exclusive Kafka Connect Kinesis Source Connector – No Strings Attached!

December 31, 2023 by Todd M

We’re thrilled to announce the official launch of StreamConnectors.com, a data integration marketplace platform designed exclusively for streaming software developers and operators. At the heart of StreamConnectors.com is a commitment to providing a seamless experience for data integration, offering a diverse array of source and destination connectors, along with expert services to support your streaming … Read more

Spark Python [A Comprehensive Guide to Apache Spark with Python]

July 14, 2023 by Todd M

Spark Python is a data processing framework that has gained significant popularity in recent years. It is available both as an open-source project and through commercial providers such as Databricks, and provides a unified analytics engine for large-scale data processing. Spark Python is built on top of the Apache Spark project and provides a … Read more

PySpark JSON: A Comprehensive Guide to Working with JSON Data in PySpark

July 11, 2023 by Todd M

One of PySpark’s many strengths is its ability to handle JSON data. JSON, or JavaScript Object Notation, is a popular data format used for web applications and APIs. With PySpark, users can easily load, manipulate, and analyze JSON data in a distributed computing environment. This PySpark JSON tutorial will show numerous code examples of how … Read more

Kafka Topic Operations with kafka-topics.sh [4 Examples]

June 29, 2023 by Todd M

How do Kafka administrators perform administrative and diagnostic collection actions on Kafka topics? This post explores a Kafka topic admin tool called kafka-topics.sh. This command-line tool is included in Apache Kafka distributions. There are other examples of both open source and 3rd party tools which can also be used for Kafka topic administrative tasks, but for … Read more

Data Streaming 101 (and Real-Time Data Processing)

June 27, 2023 by Todd M

Data streaming is the process of continuously transmitting data from a source to a destination in real time. It can be a method for transmitting large amounts of data quickly and efficiently, versus the more traditional method of accumulating data over time and then transmitting it in scheduled batches. As with most options in software architecture, there … Read more

Spark S3 Integration: A Comprehensive Guide

Updated June 26, 2023 (published June 25, 2023) by Todd M

Apache Spark is an open-source distributed computing system providing fast and general-purpose cluster-computing capabilities for big data processing. Amazon Simple Storage Service (S3) is a scalable, cloud storage service originally designed for online backup and archiving of data and applications on Amazon Web Services (AWS), but it has evolved into the basis of object storage for … Read more

Apache Spark with Cassandra Example with Game of Thrones

Updated September 5, 2023 (published May 27, 2023) by Todd M

Spark Cassandra is a powerful combination of two open-source technologies that offer high performance and scalability. Spark is a fast and flexible big data processing engine, while Cassandra is a highly scalable and distributed NoSQL database. Together, they provide a robust platform for real-time data processing and analytics. One of the key benefits of using … Read more

PySpark Quick Start [Introduction to Apache Spark for Python Developers]

Updated May 24, 2023 (published May 22, 2023) by Todd M

In this PySpark quick start, let’s cover Apache Spark with Python fundamentals to get you started and feeling comfortable about using PySpark. The intention is for readers to understand basic PySpark concepts through examples. Later posts will dive deeper into Apache Spark fundamentals and example use cases. Apache Spark is a distributed computing framework widely used … Read more

Spark Read JSON: A Quick Guide in Scala

May 4, 2023 by Todd M

Spark Read JSON is a powerful capability allowing developers to read and query JSON files using Apache Spark. JSON, or JavaScript Object Notation, is a lightweight data-interchange format commonly used for data transfer. With Spark read JSON, users can easily load JSON data into Spark DataFrames, which can then be manipulated using Spark’s powerful APIs. … Read more

How to Use Spark Submit Command to Deploy

May 2, 2023 by Todd M

Running spark submit to deploy your application to an Apache Spark Cluster is a required step towards Apache Spark proficiency. As covered elsewhere on this site, spark submit command deploys can target a variety of orchestration components, such as a YARN-based Spark Cluster running in Cloudera, Hortonworks, or MapR, or even Kubernetes. There … Read more

Spark Read JDBC Examples with mySQL

Updated May 5, 2023 (published May 1, 2023) by Todd M

In this Spark Read JDBC tutorial, we will cover using Spark SQL with a MySQL database. Spark’s read JDBC methods allow us to read data and create DataFrames from a relational database supporting JDBC connectivity. It is useful for a variety of reasons including leveraging Spark’s distributed computing capabilities for processing data stored in a … Read more

Beginning Spark Actions in Scala [9 Popular Examples]

Updated October 20, 2023 (published April 27, 2023) by Todd M

Spark actions are the operations which trigger a Spark job to compute and return a result to the Spark driver program or write data to an external storage system. Unlike Spark transformations, which only define a computation path but do not actually execute, actions force Spark to compute and produce a result. In this Spark … Read more
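
The difference between defining a computation path and executing it can be sketched in plain Python. This is only an analogy under stated assumptions: `LazyDataset` is a hypothetical toy class, not part of Spark, and it mimics the idea that transformations are recorded while actions such as `collect` and `count` trigger the work.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Step = Callable[[Iterator[int]], Iterator[int]]


class LazyDataset:
    """Hypothetical toy stand-in for an RDD: records steps, runs nothing until an action."""

    def __init__(self, source: Iterable[int], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps = steps or []

    # "Transformations": return a new dataset with one more recorded step.
    def map(self, fn: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(self._source, self._steps + [lambda it: (x for x in it if pred(x))])

    def _run(self) -> Iterator[int]:
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return it

    # "Actions": execute the recorded steps and return a result to the caller.
    def collect(self) -> List[int]:
        return list(self._run())

    def count(self) -> int:
        return sum(1 for _ in self._run())


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # track when the function actually runs
    return x * 2


dataset = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has executed yet
result = dataset.collect()        # the action forces the computation
total = dataset.count()           # a second action recomputes the chain
```

In this toy model `double` is never called until `collect` runs, and calling `count` afterwards runs the whole chain again, loosely similar to an uncached lineage being recomputed per action.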

SparkSession, SparkContext, SQLContext in Spark [What’s the difference?]

April 27, 2023 by Todd M

There have been some significant changes in the Apache Spark API over the years and when folks new to Spark begin reviewing source code examples, they will see references to SparkSession, SparkContext and SQLContext. Because this code looks so similar in design and purpose, users often ask questions such as “what’s the difference” and “why, … Read more

Begin Apache Spark Transformations in Scala [15 Examples]

Updated April 27, 2023 (published April 26, 2023) by Todd M

Spark Transformations produce a new Resilient Distributed Dataset (RDD), DataFrame, or DataSet, depending on your version of Spark. Knowing Spark transformations is a requirement to be productive with Apache Spark. This is true whether you are using Scala or Python. The best way to become productive and confident in anything is to actually … Read more

Spark Read CSV with Scala: A Comprehensive Guide

Updated May 3, 2023 (published April 22, 2023) by Todd M

In this Spark Read CSV in Scala tutorial, we will create a DataFrame from a CSV source and query it with Spark SQL. Both simple and advanced examples will be explored and cover topics such as inferring schema from the header row of a CSV file. ** Updated April 2023 ** Starting in Spark … Read more

Timing in R: Best Practices for Accurate Measurements

Updated May 3, 2023 (published April 20, 2023) by Todd M

Timing in R is a requirement for data analysis for R developers. It is required for optimizing your R code. In this tutorial we will cover all your options and describe pros and cons of each. Knowing your options can be the difference between success and failure if your code doesn’t perform well. … Read more

Kafka Namespaces Today [Options and 2 Examples]

April 7, 2023 by Todd M

Kafka namespaces are not directly supported in Apache Kafka, but there are two ways to implement namespace-like capability in Kafka. In this Kafka namespaces tutorial, we’ll cover both examples, history, options, why you might need namespaces, and much more. Let’s go. A quick note on how this tutorial is configured. In the beginning, I am … Read more

Kafka Quotas Simplified (Why and How)

April 4, 2023 by Todd M

Kafka quotas provide the ability to govern and control the broker resources used by Kafka clients. More broadly, with Kafka quotas, you can limit the resources Kafka clients use across the entire Kafka cluster. Kafka quotas are used primarily for two reasons: 1) prevent misbehaving client(s) from unintentionally or intentionally attempting to adverse … Read more

Streaming Analytics in 2023 – What, Why, and How

March 21, 2023 by Todd M

Streaming analytics continues to become more important because it lets businesses learn new things and make decisions in almost real time. This is especially relevant in fields like finance, health care, and manufacturing where the amount of time needed to make decisions is very critical. By the way, when measuring time in streaming analytics, you’ll … Read more

Kafka Configuration with kafka-configs.sh [Tutorial with 4 Examples]

Updated April 15, 2023 (published March 20, 2023) by Todd M

Apache Kafka includes a command-line tool named kafka-configs.sh used to obtain configuration values for various types of entities such as topics, clients, users, brokers, and loggers. But using this tool to determine current configuration values at runtime can be more difficult than you might expect. This can be especially true if you want … Read more

Best Ways to Determine Apache Kafka Version [1 Right and 2 Wrong Ways]

Updated April 15, 2023 (published March 19, 2023) by Todd M

Knowing the Kafka version you are using may not be as straightforward as you might think. For example, if you search for “kafka version” in your favorite search engine or chatbot, there are all kinds of results. But, I took a look at many of the top results and became concerned because the answers provided … Read more

Easy Kafka ACL (How To Implement Kafka Authorization)

Updated April 15, 2023 (published March 18, 2023) by Todd M

Kafka Access Control Lists (ACLs) are a way to secure your Kafka cluster by specifying which users or client applications have access to which Kafka resources (topics, clusters, etc.). The process of authorizing or refusing access to particular resources or functions within a software application is referred to as “authorization” in software. It is the … Read more

Spark Streaming with Scala: Getting Started Guide

Updated August 31, 2023 (published March 9, 2023) by Todd M

Spark Streaming enables scalable, fault-tolerant processing of real-time data streams from sources such as Kafka and Kinesis. Spark Streaming is an extension of the core Spark API that provides high-throughput processing of live data streams. Scala is a programming language that is designed to run on the Java Virtual Machine (JVM). It is a statically-typed language that … Read more

PySpark MySQL [Hands-on Example with JDBC]

Updated July 17, 2023 (published January 24, 2023) by Todd M

In order to use PySpark with MySQL, we must first establish a connection between the two systems. This can be done using a JDBC (Java Database Connectivity) driver, which allows PySpark to interact with MySQL and transfer data between the two systems. Once this connection is established, PySpark can extract data from MySQL, perform transformations … Read more

PySpark Read CSV with SQL Examples

Updated June 26, 2023 (published January 22, 2023) by Todd M

In this PySpark read CSV tutorial, we will use Spark SQL with a CSV input data source using the Python API. We will continue to use the Uber CSV source file as used in the Getting Started with Spark and Python tutorial presented earlier. Also, this Spark SQL CSV tutorial assumes you are familiar with using … Read more

PySpark Transformations Tutorial [14 Examples]

Updated May 8, 2023 (published January 14, 2023) by Todd M

PySpark Transformation Key functions

Spark Broadcast Variables When, Why, Examples, and Alternatives

Updated April 13, 2023 (published January 3, 2023) by Todd M

Apache Spark broadcast variables are available to all nodes in the cluster. They are used to cache a value in memory on all nodes, so it can be efficiently accessed by tasks running on those nodes. For example, broadcast variables are useful with large values needing to be used in each Spark task. By using … Read more

Python Kafka in Two Minutes. Maybe Less.

December 30, 2022 by Todd M

Although Apache Kafka is written in Java, there are Python Kafka clients available for use with Kafka. In this tutorial, let’s go through examples of Kafka with Python Producer and Consumer clients. Let’s consider this a “Getting Started” tutorial. After completing this, you will be ready to proceed to more complex examples. But we need … Read more

Open Source Change Data Capture in 2023

December 29, 2022 by Todd M

Let’s consider three open source change data capture (CDC) options ready for production in the year 2023. Before we begin, let’s confirm we all see the CDC trend. To me, it seems everywhere you look these days, it is all about change data capture. From my perspective that wasn’t the case for many years. Do you … Read more

Kafka and Dead Letter Queues Support? Yes and No

Updated April 15, 2023 (published December 28, 2022) by Todd M

In this post, let’s answer the question of Kafka and Dead Letter Queues. But first, let’s start with an overview. A dead letter queue (DLQ) is a queue, or a topic in Kafka, used to hold messages which cannot be processed successfully. DLQs originated in traditional messaging systems which were popular before … Read more

Deep dive into PySpark SQL Functions

December 28, 2022 by Todd M

PySpark SQL functions are available for use in the SQL context of a PySpark application. These functions allow us to perform various data manipulation and analysis tasks such as filtering and aggregating data, performing inner and outer joins, and conducting basic data transformations in PySpark. PySpark functions and PySpark SQL functions are not the same … Read more

PySpark DataFrames by Example

Updated July 21, 2023 (published December 23, 2022) by Todd M

A PySpark DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external … Read more

Why Kafka Connect and Why Not?

December 22, 2022 by Todd M

Apache Kafka Connect is a development framework for data integration between Apache Kafka and other systems. It facilitates moving data between Kafka and other systems, such as databases, message brokers, and file systems. A connector which moves data INTO Kafka is called a “Source”, while a connector which moves data OUT OF Kafka is called … Read more

Learn PySpark withColumn in Code [4 Examples]

Updated April 5, 2023 (published December 21, 2022) by Todd M

The PySpark withColumn function is used to add a new column to a PySpark DataFrame or to replace the values in an existing column. To execute the PySpark withColumn function you must supply two arguments. The first argument is the name of the new or existing column. The second argument is the desired value to … Read more

PySpark UDFs Demystified: Learn with Step-by-Step Examples

Updated July 16, 2023 (published December 12, 2022) by Todd M

A PySpark UDF, or PySpark User Defined Function, is a powerful and flexible tool in PySpark. UDFs allow users to define their own custom functions and then use them in PySpark operations. PySpark UDFs can provide a level of flexibility, customization, and control not possible with built-in PySpark SQL API functions. They can allow developers … Read more

Kafka Authentication Tutorial (with 5 Examples)

Updated December 31, 2023 (published December 11, 2022) by Todd M

Kafka provides multiple authentication options. In this tutorial, we will describe and show the authentication options and then configure and run a demo example of Kafka authentication. There are two primary goals of this tutorial: There are a few key subjects which must be considered when building a multi-tenant cluster, but it all starts with … Read more

Mastering PySpark Filter: A Step-by-Step Guide through Examples

Updated July 17, 2023 (published November 28, 2022) by Todd M

In PySpark, the DataFrame filter function filters rows based on conditions applied to specified columns. For example, with a DataFrame containing website click data, we may wish to group together all the platform values contained in a certain column. This would allow us to determine the most popular browser type used in website requests. Solutions like this may … Read more

PySpark groupBy Made Simple: Learn with 4 Real-Life Scenarios

Updated July 16, 2023 (published November 26, 2022) by Todd M

In PySpark, the DataFrame groupBy function groups data together based on specified columns, so aggregations can be run on the collected groups. For example, with a DataFrame containing website click data, we may wish to group together all the browser type values contained in a certain column, and then determine an overall count by each browser … Read more

What is Apache Spark? An Essential Overview

Updated June 28, 2023 (published November 26, 2022) by Todd M

Apache Spark is an open-source data processing engine designed for fast, large-scale data processing. Originally developed at the University of California, Berkeley, in 2009 as an alternative to the Hadoop MapReduce batch processing framework, Spark quickly became one of the most popular frameworks in big data analytics. Spark’s main advantage lies in its ability to … Read more

Kafka Connect REST API Essentials

November 19, 2022 by Todd M

The Kafka Connect REST API endpoints are used to administer both Kafka Connectors (Sinks and Sources) and the Kafka Connect service itself. In this tutorial, we will explore the Kafka Connect REST API with examples. Before we dive into specific examples, we need to set the context with an overview of Kafka Connect … Read more

Kafka Consumer Groups with kafka-consumer-groups.sh

Updated June 28, 2023 (published November 17, 2022) by Todd M

How do Kafka administrators perform administrative and diagnostic collection actions on Kafka Consumer Groups? This post explores a Kafka Groups operations admin tool called kafka-consumer-groups.sh. This popular command-line tool is included in Apache Kafka distributions. There are other examples of both open source and 3rd party tools not included with Apache Kafka which can also be … Read more

PySpark Joins with SQL

Updated November 26, 2022 (published November 11, 2022) by Todd M

Use PySpark joins with SQL to compare, and possibly combine, data from two or more data sources based on matching field values. This is simply called “joins” in many cases and usually the data sources are tables from a database or flat file sources, but more often than not, the data sources are becoming Kafka topics. Regardless … Read more

PySpark Join Examples with DataFrame join function

Updated June 26, 2023 (published October 26, 2022) by Todd M

PySpark joins are used to combine data from two or more DataFrames based on a common field between them. There are many different types of joins. The specific join type used is usually based on the business use case as well as what is optimal for performance. Joins can be an expensive operation in distributed systems … Read more

What You Need to Know About Debezium

October 16, 2022 by Todd M

If you’re looking for a change data capture application that offers speed, durability, and a significant history in production deployments across a variety of use cases, then Debezium may be for you. This open-source platform provides streaming from a wide range of both relational and NoSQL based databases to Kafka or Kinesis. There are many advantages … Read more

Streaming Data Engineer Use Cases

October 11, 2022 by Todd M

As streaming data engineers, we face many data integration challenges such as “How do we integrate this SaaS with this internal database?”, “Will a particular integration be real-time or batch?”, “How does the system we design recover from possible failures?” and “If anyone has ever addressed a situation similar to mine before, how did … Read more

Schema Registry in Data Streaming [Options, Choices, Comparisons]

Updated August 30, 2023 (published October 5, 2022) by Todd M

A schema registry in data streaming use cases such as micro-service integration, streaming ETL, event driven architectures, log ingest stream processing, etc., is not a requirement, but there are numerous reasons for implementing one. The reasons for schema registries in data streaming architectures are plentiful and have been covered extensively already. I’ve included some of … Read more

How To Generate Kafka Streaming Join Test Data By Example

Updated November 14, 2022 (published September 27, 2022) by Todd M

Why “Joinable” Streaming Test Data for Kafka? When creating streaming join applications in KStreams, ksqlDB, Spark, Flink, etc. with source data in Kafka, it would be convenient to generate fake data with cross-topic relationships; i.e. a customer topic and an order topic with a value attribute of customer.id. In this example, we might want to … Read more

Spark RDD – A 2 Minute Guide for Beginners

Updated September 1, 2023 (published May 24, 2022) by Todd M

Spark RDD is short for Apache Spark Resilient Distributed Dataset. A Spark Resilient Distributed Dataset is often shortened to simply Spark RDD. RDDs are a foundational component of the Apache Spark large scale data processing framework. What is a Spark RDD? Spark RDDs are an immutable, fault-tolerant, and possibly distributed collection of data elements. RDDs … Read more

Spark Streaming Example – How to Stream from Slack

Updated August 31, 2023 (published March 22, 2022) by Todd M

Let’s write a Spark Streaming example in Scala which streams from Slack. This tutorial will show how to write, configure and execute the code, first. Then, the source code will be examined in detail. If you don’t have a Slack team, you can set one up for free. Let’s cover that too. Sound fun? Let’s … Read more

Deploy PySpark to a Spark Cluster with spark-submit [3 Examples]

Updated July 18, 2023 (published July 18, 2020) by Todd M

When your PySpark application is ready to deploy to production or to a pre-prod testing environment, how do we do it? How do we deploy Python programs to a Spark Cluster? The short answer is, it depends on the complexity of your PySpark application. Does the logic of your Python application reside all in one .py … Read more

Spark Streaming Testing with Scala by Example

Updated September 5, 2023 (published June 9, 2020) by Todd M

Stream processing applications built with Apache Spark Streaming provide organizations the ability to ingest and analyze real-time data from sources like Kafka, Kinesis, and more. However, like any complex distributed system, Spark Streaming applications require thorough testing to ensure correct functionality and prevent bugs or errors from causing issues in production. Comprehensive Spark Streaming testing … Read more

Kafka vs Amazon Kinesis: Choosing the Right Streaming Platform

Updated September 5, 2023 (published June 8, 2020) by Todd M

Kafka and Kinesis are two popular streaming data platforms that enable real-time data processing. Kafka is an open-source distributed streaming platform that is used for building real-time data pipelines and streaming applications. It is designed to handle high-volume data streams and provides features such as fault-tolerance and scalability. Kinesis, on the other hand, is a … Read more

Spark Structured Streaming with Kafka Example – Part 1

Updated April 14, 2023 (published June 3, 2020) by Todd M

In this post, let’s explore an example of updating an existing Spark Streaming application to newer Spark Structured Streaming. We will start simple and then move to more advanced Kafka Spark Structured Streaming examples. My original Kafka Spark Streaming post is three years old now. On the Spark side, the data abstractions have evolved … Read more

Running Kafka Connect [Standalone vs Distributed Mode Examples]

Updated May 9, 2023 (published May 9, 2020) by Todd M

One of the many benefits of running Kafka Connect is the ability to run single or multiple workers in tandem. This is referred to as running Kafka Connect in Standalone or Distributed mode. Running multiple workers in distributed mode provides a way for horizontal scale-out which leads to increased capacity, automated resiliency, or both. For … Read more

GlobalKTable vs KTable in Kafka Streams

Updated August 30, 2023 (published May 4, 2020) by Todd M

Kafka Streams presents two options for materialized views in the form of GlobalKTables and KTables. We will describe the meaning of “materialized views” in a moment, but for now, let’s just agree there are pros and cons to GlobalKTable vs KTables. GlobalKTable vs. KTable Three Essential Factors The essential three factors in your decision of … Read more

What and Why Event Logs?

Updated April 15, 2023 (published April 28, 2020) by Todd M

Before we begin diving into event logs, let’s start with a quote from one of my software heroes. “The idea of structuring data as a stream of events is nothing new, and it is used in many different fields. Even though the underlying principles are often similar, the terminology is frequently inconsistent across different fields, which can … Read more

Azure Kafka Connect Example – Blob Storage

Updated August 30, 2023 (published April 23, 2020) by Todd M

In this Azure Kafka tutorial, let’s describe and demonstrate how to integrate Kafka with Azure’s Blob Storage with existing Kafka Connect connectors. Let’s get a little wacky and cover writing to Azure Blob Storage from Kafka as well as reading from Azure Blob Storage to Kafka. In this case, “wacky” is a good thing, I … Read more

Stream Processing

Updated April 13, 2023 (published April 21, 2020) by Todd M

We choose Stream Processing as a way to process data more quickly than traditional approaches. But, how do we do Stream Processing? Is Stream Processing different than Event Stream Processing? Why do we need it? What are a few examples of event streaming patterns? How do we implement it? Let’s get into these questions. As … Read more

Kafka Certification Tips for Developers

Updated August 30, 2023 (published April 10, 2020) by Todd M

If you are considering Kafka Certification, this page describes what I did to pass the Confluent Certified Developer for Apache Kafka Certification exam. You may see it shortened to “ccdak confluent certified developer for apache kafka tests”. Good luck and hopefully this page is helpful for you! There are many reasons why you may wish … Read more

GCP Kafka Connect Google Cloud Storage Examples

Updated August 30, 2023 (published April 8, 2020) by Todd M

In this GCP Kafka tutorial, I will describe and show how to integrate Kafka Connect with GCP’s Google Cloud Storage (GCS). We will cover writing to GCS from Kafka as well as reading from GCS to Kafka. Descriptions and examples will be provided for both Confluent and Apache distributions of Kafka. I’ll document the steps … Read more

Kafka Test Data Generation Examples

Updated November 14, 2022 (published April 7, 2020) by Todd M

After you start working with Kafka, you will soon find yourself asking the question, “how can I generate test data into my Kafka cluster?” Well, I’m here to show you that you have many options for generating test data in Kafka. In this post and demonstration video, we’ll cover a few of the ways you can generate … Read more

Kafka Connect S3 Examples

Updated August 30, 2023 (published April 4, 2020) by Todd M

In this Kafka Connect S3 tutorial, let’s demo multiple Kafka S3 integration examples. We’ll cover writing to S3 from one topic and also multiple Kafka source topics. Also, we’ll see an example of an S3 Kafka source connector reading files from S3 and writing to Kafka. Examples will be provided for both … Read more

PySpark Examples of Actions

Updated May 8, 2023 (published January 15, 2020) by Todd M

PySpark actions return a computed value to the Spark driver program. This is different from PySpark transformation functions, which produce RDDs, DataFrames or DataSets as results. For example, an action function such as count will return a result to the Spark driver, while a transformation function such as map will not. These may seem easy … Read more

Kafka Streams – Transformations Examples

Updated August 30, 2023 (published February 13, 2019) by Todd M

Kafka Streams Transformations provide the ability to perform actions on Kafka Streams such as filtering and updating values in the stream. Kafka Streams transformations contain operations such as `filter`, `map`, `flatMap`, etc. and have similarities to functional combinators found in languages such as Scala. And, if you are coming from Spark, you will also notice … Read more

Stream Processor Windows

Updated April 14, 2023 (published January 31, 2019) by Todd M

When moving to stream processing architecture or building stream processors, you will soon face two choices. Will you process streams on an individual, per event basis? Or, will you collect and buffer multiple events/messages first, and then apply a function or join results to this collection of events? Examples of single event processing might be … Read more
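
The two choices can be contrasted with a small plain-Python sketch. It is an illustration under assumptions, not code from any stream processor: the `(timestamp, value)` event shape, the function names, and the fixed-size (tumbling) window are all hypothetical choices made for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = Tuple[int, float]  # (timestamp in seconds, value); hypothetical event shape


def process_per_event(events: List[Event], fn: Callable[[float], bool]) -> List[bool]:
    """Choice 1: apply a function to each event on its own, as it arrives."""
    return [fn(value) for _, value in events]


def tumbling_window_sums(events: List[Event], size_seconds: int) -> Dict[int, float]:
    """Choice 2: buffer events into fixed, non-overlapping windows, then aggregate."""
    windows: Dict[int, float] = defaultdict(float)
    for timestamp, value in events:
        window_start = (timestamp // size_seconds) * size_seconds
        windows[window_start] += value
    return dict(sorted(windows.items()))


events = [(0, 1.0), (3, 2.0), (9, 4.0), (10, 8.0), (14, 16.0), (21, 32.0)]

# Per-event: flag each value above a threshold without looking at its neighbours.
flags = process_per_event(events, lambda value: value > 5.0)

# Windowed: sum the values that fall into each 10-second window.
sums = tumbling_window_sums(events, size_seconds=10)
```

Here the per-event path needs no state at all, while the windowed path has to hold a running total per window, which is the kind of buffering trade-off the windowing decision involves.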

Kafka Producer in Scala

Updated December 30, 2022 (published January 29, 2019) by Todd M

Kafka Producers are one of the options to publish data events (messages) to Kafka topics. Kafka Producers are custom coded in a variety of languages through the use of Kafka client libraries. The Kafka Producer API allows messages to be sent to Kafka topics asynchronously, so they are built for speed, but also Kafka Producers have the ability … Read more

Kafka Consumer in Scala

Updated December 28, 2022 (published January 27, 2019) by Todd M

In this Kafka Consumer tutorial, we’re going to demonstrate how to develop and run an example of Kafka Consumer in Scala, so you can gain the confidence to develop and deploy your own Kafka Consumer applications. At the end of this Kafka Consumer tutorial, you’ll have both the source code and screencast of how to … Read more

Kafka Consumer Groups by Example

April 7, 2023 | January 25, 2019 by Todd M

Kafka Consumer Groups are the way to horizontally scale out event consumption from Kafka topics… with failover resiliency. “With failover resiliency” you say!? That sounds interesting. Well, hold on, let’s leave out the resiliency part for now and just focus on scaling out. We’ll come back to resiliency later. When designing for horizontal scale-out, let’s … Read more

Kafka Streams Joins Examples

August 30, 2023 | January 22, 2019 by Todd M

Performing Kafka Streams Joins presents interesting design options when implementing streaming processor architecture patterns. There are numerous applicable scenarios, but let’s consider an application that might need to access multiple database tables or REST APIs in order to enrich a topic’s event record with context information. For example, perhaps we could augment records in a topic with sensor … Read more

Kafka Streams Testing with Scala Part 1

August 30, 2023 | January 7, 2019 by Todd M

After experimenting with Kafka Streams with Scala, I started to wonder how one goes about Kafka Streams testing in Java or Scala. How does one create and run automated tests for Kafka Streams applications? How does it compare to Spark Streaming testing? In this tutorial, I’ll describe what I’ve learned so far. Also, if you … Read more

Kafka Streams Tutorial with Scala for Beginners Example

August 30, 2023 | January 2, 2019 by Todd M

If you’re new to Kafka Streams, here’s a Kafka Streams with Scala tutorial which may help jumpstart your efforts. My plan is to keep updating the sample project, so let me know if you would like to see anything in particular with Kafka Streams with Scala. In this example, the intention is to 1) provide an SBT project you … Read more

Apache Kafka Architecture – Delivery Guarantees

April 5, 2023 | December 11, 2018 by Todd M

Apache Kafka offers message delivery guarantees between producers and consumers. For more background on Kafka mechanics such as producers and consumers, please see the Kafka Tutorial page. Kafka delivery guarantees can be divided into three groups which include “at most once”, “at least once” and “exactly once”. Which option sounds the most appealing? … Read more
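
As a rough illustration of the first two groups, here is a plain Python simulation. It does not use any Kafka client, and the `consume` function, its parameters and the single simulated crash are all assumptions for the sake of the sketch. The idea is that the difference comes down to whether a consumer records its position before or after it handles a message. “Exactly once” needs additional coordination and is not covered here.

```python
def consume(messages, commit_first, crash_at):
    """Simulate a consumer that crashes once while handling offset `crash_at`,
    then restarts from the last committed offset. Returns the messages processed."""
    processed = []
    committed = 0    # offset the consumer resumes from after a restart
    crashed = False
    offset = committed
    while offset < len(messages):
        if commit_first:
            committed = offset + 1           # commit, then process
            if offset == crash_at and not crashed:
                crashed = True               # crash before processing: message is skipped
                offset = committed
                continue
            processed.append(messages[offset])
        else:
            processed.append(messages[offset])  # process, then commit
            if offset == crash_at and not crashed:
                crashed = True               # crash before commit: message is replayed
                offset = committed
                continue
            committed = offset + 1
        offset += 1
    return processed

messages = ["a", "b", "c"]
at_most_once = consume(messages, commit_first=True, crash_at=1)
at_least_once = consume(messages, commit_first=False, crash_at=1)

print(at_most_once)   # ['a', 'c']
print(at_least_once)  # ['a', 'b', 'b', 'c']
```

Committing first can lose a message but never repeats one, while processing first never loses a message but can repeat one, so downstream handling would need to tolerate duplicates.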

How to Debug Scala Spark in IntelliJ

October 20, 2023 | December 7, 2018 by Todd M

Have you struggled to configure debugging in IntelliJ for your Spark programs? Yeah, me too. Debugging with Scala code was easy, but when I moved to Spark things didn’t work as expected. So, in this tutorial, let’s cover debugging Scala-based Spark programs in IntelliJ. We’ll go through a few examples and utilize the occasional help … Read more

Change Data Capture – What Is It? How Does it Work?

April 14, 2023 | December 4, 2018 by Todd M

Change Data Capture is a mechanism to capture the changes in databases so they may be processed someplace other than the database or application(s) which made the change. This article will explain what change data capture (CDC) is, how it works, and why it’s important for businesses. Why? Why would we want to capture changes … Read more

Kafka Connect mySQL Examples

August 30, 2023 | November 14, 2018 by Todd M

In this Kafka Connect mySQL tutorial, we’ll cover reading from mySQL to Kafka and reading from Kafka and writing to mySQL. Let’s run this in your environment. Now, it’s just an example and we’re not going to debate operations concerns such as running in standalone or distributed mode. The focus will be keeping it simple and getting it working. We … Read more

Spark Kinesis Example – Moving Beyond Word Count

May 17, 2023 | October 20, 2017 by Todd M

If you are looking for a Spark with Kinesis example, you are in the right place. This Spark Streaming with Kinesis tutorial intends to help you become better at integrating the two. In this tutorial, we’ll examine some custom Spark Kinesis code and also show a screencast of running it. In addition, we’re going to cover … Read more

Spark Performance Monitoring Tools – A List of Options

April 16, 2019 | September 18, 2017 by Todd M

Which Spark performance monitoring tools are available to monitor the performance of your Spark cluster? In this tutorial, we’ll find out. But, before we address this question, I assume you already know Spark includes monitoring through the Spark UI? And, in addition, you know Spark includes support for monitoring and performance debugging through the Spark … Read more

Spark FAIR Scheduler Example

October 14, 2022 | September 15, 2017 by Todd M

Scheduling in Spark can be a confusing topic. When someone says “scheduling” in Spark, do they mean scheduling applications running on the same cluster? Or, do they mean the internal scheduling of Spark tasks within the Spark application? So, before we cover an example of utilizing the Spark FAIR Scheduler, let’s make sure we’re on … Read more

Spark Performance Monitoring with History Server

April 13, 2023 | September 14, 2017 by Todd M

Spark Tutorial Perf Metrics with History Server

In this Apache Spark History Server tutorial, we will explore the performance monitoring benefits when using the Spark History server. This Spark tutorial will review a simple Spark application without the History server and then revisit the same Spark app with the History server. We will explore all the necessary steps to configure Spark History … Read more

Apache Spark Thrift Server Load Testing Example

November 28, 2022 | September 11, 2017 by Todd M

Spark Thrift Server Stress Test Tutorial

Wondering how to perform stress tests with Apache Spark Thrift Server? This tutorial will describe one way to do it. What is Apache Spark Thrift Server? Apache Spark Thrift Server is based on the Apache HiveServer2 which was created to allow JDBC/ODBC clients to execute SQL queries using a Spark Cluster. From my … Read more

Spark Thrift Server with Cassandra Example

April 13, 2023 | August 4, 2017 by Todd M

With the Spark Thrift Server, you can do more than you might have thought possible. For example, want to use `joins` with Cassandra? Or, help people familiar with SQL leverage your Spark infrastructure without having to learn Scala or Python? They can use their existing SQL based tools they already know such as Tableau or … Read more

Spark Streaming with Kafka Example

April 13, 2023 | March 30, 2017 by Todd M

Spark Streaming with Kafka is becoming so common in data pipelines these days, it’s difficult to find one without the other. This tutorial will present an example of streaming Kafka from Spark. In this example, we’ll be feeding weather data into Kafka and then processing this data from Spark Streaming in Scala. As the data … Read more

Spark Submit Command Line Arguments

May 5, 2023 | March 29, 2017 by Todd M

The primary reason why we want to use Spark submit command line arguments is to avoid hard-coding values into our code. As we know, hard-coding should be avoided because it makes our application more rigid and less flexible. For example, let’s assume we want to run our Spark job in both test and production environments. … Read more
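
As a small sketch of the idea, here is how a job written in Python might read its settings from command line arguments instead of hard-coding them. The `--env` and `--input-path` flags, the `parse_job_args` helper and the example path are hypothetical names chosen for illustration; the sketch relies only on Python’s standard `argparse` module and on the fact that arguments placed after the application file on the spark-submit command line are passed through to the application.

```python
import argparse

def parse_job_args(argv):
    """Parse hypothetical job settings instead of hard-coding them in the job."""
    parser = argparse.ArgumentParser(description="Hypothetical job settings")
    parser.add_argument("--env", choices=["test", "production"], default="test")
    parser.add_argument("--input-path", required=True)
    return parser.parse_args(argv)

# In a real job this would typically be parse_job_args(sys.argv[1:]).
args = parse_job_args(["--env", "production", "--input-path", "/data/events"])

print(args.env)         # production
print(args.input_path)  # /data/events
```

With this in place, the same code could be run against test and production simply by changing the arguments supplied at submit time.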

Spark Performance Monitoring with Metrics, Graphite and Grafana

April 14, 2023 | November 17, 2016 by Todd M

Spark is distributed with the Metrics Java library which can greatly enhance your abilities to diagnose issues with your Spark jobs. In this tutorial, we’ll cover how to configure Metrics to report to a Graphite backend and view the results with Grafana for Spark Performance Monitoring purposes. Spark Performance Monitoring Background If you already … Read more

Spark Broadcast and Accumulators by Examples

April 27, 2023 | July 12, 2016 by Todd M

Spark Shared Variables Broadcast and Accumulators

What do we do when we need each Spark worker task to coordinate certain variables and values with each other? This is when Spark Broadcast and Spark Accumulators may come into play. Think about it. Imagine we want each task to know the state of variables or values instead of simply independently returning action results back to the … Read more

IntelliJ Scala and Apache Spark Happy Together

October 20, 2023 | June 15, 2016 by Todd M

In this tutorial, we’re going to review one way to set up IntelliJ for Scala and Spark development. The IntelliJ Scala combination is the best, free setup for Scala and Spark development. And I have nothing against ScalaIDE (Eclipse for Scala) or using editors such as Sublime. I switched from Eclipse years ago and haven’t looked … Read more

Apache Spark Advanced Cluster Deploy Troubleshooting

August 31, 2023 | May 18, 2016 by Todd M

In this Apache Spark cluster troubleshooting tutorial, we’ll review a few options when your Scala Spark code does not deploy as anticipated. For example, does your Spark driver program rely on a 3rd party jar only compatible with Scala 2.11, but your Spark Cluster is based on Scala 2.10? Maybe your code relies on a … Read more