Apache Kafka includes a command-line tool named kafka-configs.sh used to obtain configuration values for various types of entities such as topics, clients, users, brokers, and loggers. But using this tool to determine current configuration values at runtime can be more difficult than you might expect. This can be especially true if you want […]
What is Apache Kafka?
Apache Kafka is an open-source, distributed messaging system that provides the foundation for moving data between systems in near real-time. Apache Kafka is often described as a distributed commit log which is partitioned and possibly replicated. It provides a messaging system that is fast, scalable, durable, distributed by design, and fault-tolerant. For large-scale data systems, it is many teams' preferred choice for communication between systems.
To better understand Kafka, let's first review some of the most common Apache Kafka terminology.
First off, let’s discuss the concept of “topics”. To send a message to Kafka from a Producer, you must send it to a specific topic. The same holds true for reading: a Consumer reads messages from one or more specific topics.
You can think of it like this:
Messages are published by Kafka Producers into various Kafka “topics”. A topic is similar to a database table in that it organizes data into logical buckets. Those who subscribe to topics, on the other hand, are called Kafka Consumers.
Producers and Consumers communicate with Apache Kafka over a simple, high-performance, language-agnostic TCP protocol, which handles all communication between clients and servers.
Producers and Consumers are “loosely coupled” which means they are completely unaware of each other. This makes Apache Kafka an attractive option for integration between various producers and consumers within modern architectures. It also provides an ideal access point for building stream processing applications.
Topics and Logs
A topic is a key abstraction in Kafka. A topic can be thought of as a feed name or category to which messages are published. Each topic is further subdivided into partitions, which allow Kafka brokers to spread a topic’s data across different nodes. Each message within a partition is assigned a sequential ID number known as the offset.
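To make the topic/partition/offset relationship concrete, here is a toy pure-Python model (the class and names are illustrative only, not Kafka's actual internals): a topic holds several independent append-only logs, and each appended message receives the next offset in its partition.

```python
# Toy model of a Kafka topic: each partition is an independent,
# append-only log, and an offset is just a position in that log.
class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        log = self.partitions[partition]
        offset = len(log)   # next sequential position in this partition
        log.append(message)
        return offset

topic = Topic("orders", num_partitions=3)
print(topic.append(0, "order-1"))   # offset 0 in partition 0
print(topic.append(0, "order-2"))   # offset 1 in partition 0
print(topic.append(1, "order-3"))   # offset 0 in partition 1 (independent)
```

Note that offsets are only meaningful within a single partition; partition 0 and partition 1 each count from zero on their own.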
All published messages can be retained in the cluster, regardless of whether they have been consumed. For instance, if the configuration states that a topic’s messages are available for one week, they are retained for that period only. Once the period has lapsed, they are automatically deleted, freeing space in the system.
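The retention behavior can be sketched in a few lines of plain Python (this is a conceptual stand-in, not Kafka's log-cleaner code; the function name and one-week constant are assumptions mirroring a setting like `retention.ms=604800000`):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # one week

def enforce_retention(log, now=None):
    """log is a list of (timestamp, message) pairs; drop expired entries.
    Whether a message was consumed plays no role in the decision."""
    now = time.time() if now is None else now
    return [(ts, msg) for ts, msg in log if now - ts <= RETENTION_SECONDS]

now = 10 * 24 * 3600                       # pretend "now" is day 10
log = [(0, "expired"), (now - 60, "kept")]  # day 0 vs. one minute ago
print(enforce_retention(log, now=now))     # only the recent message survives
```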
On a per-consumer basis, the only metadata retained is the consumer’s position in the log, technically known as the offset. The consumer has complete control over this position. Messages are usually consumed linearly, but a consumer can go back and reprocess older messages, precisely because it controls its own offset. This replay-ability is one of the things that makes Kafka different from traditional message queues.
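A toy consumer makes the replay idea tangible (again a conceptual sketch, not the real client API; the real `KafkaConsumer` exposes similar ideas through methods like `poll` and `seek`):

```python
# Toy consumer: the only per-consumer state is an offset into the log,
# and the consumer is free to rewind it and reprocess.
class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0            # the one piece of per-consumer metadata

    def poll(self):
        if self.offset < len(self.log):
            msg = self.log[self.offset]
            self.offset += 1
            return msg
        return None                # caught up; nothing new yet

    def seek(self, offset):
        self.offset = offset       # rewind (or skip ahead)

log = ["a", "b", "c"]
c = Consumer(log)
assert [c.poll(), c.poll()] == ["a", "b"]
c.seek(0)                          # replay from the beginning
assert c.poll() == "a"             # the messages are still in the log
```

Because consuming a message only advances this consumer's own offset, nothing is removed from the log and other consumers are unaffected.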
Also, Kafka consumers are lightweight and cheap, and they do not have a significant impact on cluster performance. This differs from more traditional messaging systems. Every consumer maintains its own offset, so its actions do not affect other consumers.
Why partitions? Partitions serve two purposes. The first is scale: by splitting a topic across nodes, the topic can grow beyond the size that fits on a single node. The second is parallelism, or performance tuning: multiple consumers can read a topic’s messages simultaneously in concurrent threads. This will be discussed further later on. Partitions can also be configured to replicate to other nodes, usually called brokers, in the Kafka cluster for resiliency and failover.
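One detail worth knowing here: when a message has a key, the partition is chosen deterministically from that key, so all messages for the same key stay in order on the same partition. The sketch below uses CRC32 as a simple stand-in hash; Kafka's actual default partitioner hashes the serialized key with murmur2, but the idea is the same:

```python
import zlib

# Simplified stand-in for Kafka's default partitioner: hash the key,
# take it modulo the partition count. Same key -> same partition,
# which is how per-key ordering is preserved.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("customer-42", 3)
p2 = partition_for("customer-42", 3)
assert p1 == p2   # every message for customer-42 lands on one partition
```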
Distribution Across Brokers
Partitions may be configured for replication to provide fault tolerance. Every partition, whether replicated or not, has a single designated leader, which handles all read and write requests. There may also be “followers”: zero or more other nodes in the Kafka cluster that replicate the leader’s log. If the leader fails, one of the followers immediately assumes leadership. We’ll cover more on Kafka fault tolerance in later tutorials.
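The leader/follower relationship can be sketched as follows (a deliberately tiny model, not Kafka's actual controller or ISR logic; class and broker names are invented for illustration):

```python
# Toy failover model for one partition: the first replica in the list
# acts as leader; when it fails, the next follower is promoted.
class PartitionReplicas:
    def __init__(self, brokers):
        self.replicas = list(brokers)

    @property
    def leader(self):
        return self.replicas[0]

    def fail(self, broker):
        self.replicas.remove(broker)
        if not self.replicas:
            raise RuntimeError("all replicas lost; partition unavailable")

p = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
assert p.leader == "broker-1"
p.fail("broker-1")                # the leader crashes...
assert p.leader == "broker-2"     # ...and a follower takes over
```

With three replicas, the partition stays available through two broker failures, which lines up with the N-1 tolerance described in the guarantees below.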
Apache Kafka is released open source under an Apache License 2.0. For source code see Apache Kafka source code repo.
When compared to conventional messaging systems, one of Kafka’s benefits is its ordering guarantees. In a traditional queue, messages are stored on the server in the order they arrive and pushed out in that same order, but there is no timing guarantee on delivery. With multiple consumers, this means the order in which messages arrive at different consumers can effectively be random. Hold the phone. Did you just write “could be random”? Think about that for 5 seconds. That may be fine in some use cases, but not others.
Kafka is often deployed as the foundational element in Data Streaming Architectures because it can facilitate loose coupling between various components in architectures such as databases, micro-services, Hadoop, data warehouses, distributed file systems, search applications, etc.
In a Kafka messaging system, we talk about various guarantees:
1. Messages sent by a producer to a particular topic partition are appended to the Kafka log in the order they are sent.
2. Consumers see messages in the order they are stored in the partition’s log.
3. For a topic with replication factor N, Kafka will tolerate up to N-1 server failures without losing messages previously committed to the log.
An often confused and debated concept revolves around the delivery guarantees between producers and consumers: at-most-once, at-least-once, and exactly-once. We will cover this in more depth in the tutorials found below.
Kafka Tutorials and Examples
- Kafka Delivery Guarantees
- Kafka Topic Internals
- Kafka Brokers
- Kafka Consumer in Scala
- Kafka Consumer Groups
- Kafka Producer in Scala
- Python Kafka Producer and Consumer Examples
Apache Kafka Operations
- Monitoring coming soon
- Kafka Authentication
More coming soon, but to start us off:
- Kafka vs Kinesis
- Kafka vs Pulsar
- Kafka vs. Traditional Message Queues – What makes Kafka different?
Apache Kafka Ecosystem
The Apache Kafka ecosystem also includes key components such as Kafka Connect and Kafka Streams.
Kafka Connect is a framework for building connectors that stream data into or out of a Kafka cluster. For more information, see the Kafka Connect tutorials section of this site.
Kafka Streams is a client library for building stream processing applications whose input and output data live in Kafka topics. For more information, see the Kafka Streams tutorials section of this site.
Featured image adapted from https://flic.kr/p/bGR8bZ
Best Ways to Determine Apache Kafka Version [1 Right and 2 Wrong Ways]
Knowing the Kafka version you are using may not be as straightforward as you might think. For example, if you search for “kafka version” in your favorite search engine or chatbot, there are all kinds of results. But I took a look at many of the top results and became concerned because the answers provided […]
Easy Kafka ACL (How To Implement Kafka Authorization)
Kafka Access Control Lists (ACLs) are a way to secure your Kafka cluster by specifying which users or client applications have access to which Kafka resources (topics, clusters, etc.). The process of authorizing or refusing access to particular resources or functions within a software application is referred to as “authorization” in software. It is the […]
Python Kafka in Two Minutes. Maybe Less.
Although Apache Kafka is written in Java, there are Python Kafka clients available for use with Kafka. In this tutorial, let’s go through examples of Kafka with Python Producer and Consumer clients. Let’s consider this a “Getting Started” tutorial. After completing this, you will be ready to proceed to more complex examples. But we need […]
Kafka and Dead Letter Queues Support? Yes and No
In this post, let’s answer the question of Kafka and Dead Letter Queues. But first, let’s start with an overview. A dead letter queue (DLQ) is a queue, or a topic in Kafka, used to hold messages which cannot be processed successfully. DLQs originated in the traditional messaging systems which were popular before […]
Effortless Kafka Authentication Tutorial (with 5 Examples)
Kafka provides multiple authentication options. In this tutorial, we will describe and show the authentication options and then configure and run a demo example of Kafka authentication. There are two primary goals of this tutorial: There are a few key subjects which must be considered when building a multi-tenant cluster, but it all starts with […]
Kafka Consumer Groups with kafka-consumer-groups.sh
How do Kafka administrators perform administrative and diagnostic collection actions on Kafka Consumer Groups? This post explores a Kafka Consumer Groups admin tool called kafka-consumer-groups.sh, a popular command-line tool included in Apache Kafka distributions. There are other examples of both open source and 3rd party tools not included with Apache Kafka which can also be […]
How To Generate Kafka Streaming Join Test Data By Example
Why “Joinable” Streaming Test Data for Kafka? When creating streaming join applications in KStreams, ksqldb, Spark, Flink, etc. with source data in Kafka, it would be convenient to generate fake data with cross-topic relationships; i.e. a customer topic and an order topic with a value attribute of customer.id. In this example, we might want to […]
Kafka Certification Tips for Developers
If you are considering Kafka Certification, this page describes what I did to pass the Confluent Certified Developer for Apache Kafka Certification exam. You may see it shortened to “ccdak confluent certified developer for apache kafka tests“. Good luck and hopefully this page is helpful for you! There are many reasons why you may wish […]
Kafka Test Data Generation Examples
After you start working with Kafka, you will soon find yourself asking the question, “how can I generate test data into my Kafka cluster?” Well, I’m here to show you the many options you have for generating test data in Kafka. In this post and demonstration video, we’ll cover a few of the ways you can generate […]
Kafka Producer in Scala
Kafka Producers are one of the options for publishing data events (messages) to Kafka topics. Kafka Producers are custom coded in a variety of languages through the use of Kafka client libraries. The Kafka Producer API allows messages to be sent to Kafka topics asynchronously, so Producers are built for speed, but they also have the ability […]
Kafka Consumer in Scala
In this Kafka Consumer tutorial, we’re going to demonstrate how to develop and run an example of Kafka Consumer in Scala, so you can gain the confidence to develop and deploy your own Kafka Consumer applications. At the end of this Kafka Consumer tutorial, you’ll have both the source code and screencast of how to […]
Kafka Consumer Groups by Example
Kafka Consumer Groups are the way to horizontally scale out event consumption from Kafka topics… with failover resiliency. “With failover resiliency” you say!? That sounds interesting. Well, hold on, let’s leave out the resiliency part for now and just focus on scaling out. We’ll come back to resiliency later. When designing for horizontal scale-out, let’s […]
Apache Kafka Architecture – Delivery Guarantees
Apache Kafka offers message delivery guarantees between producers and consumers. For more background on Kafka mechanics such as producers and consumers, please see the Kafka Tutorial page. Kafka delivery guarantees fall into three groups: “at most once”, “at least once”, and “exactly once”. At most once, which can lead to […]
Kafka vs Amazon Kinesis – How do they compare?
Apache Kafka vs. Amazon Kinesis The question of Kafka vs Kinesis often comes up. Let’s start with Kinesis. *** Updated Spring 2020 *** Since this original post, AWS has released MSK. I think this tells us everything we need to know about Kafka vs Kinesis. Also, since the original post, Kinesis has been separated into […]