Apache Kafka includes a command-line tool named kafka-configs.sh used to obtain configuration values for various types of entities such as topics, clients, users, brokers, and loggers. But using this tool to determine current configuration values at runtime can be more difficult than you might expect. This can be especially true if you want […]
What is Apache Kafka?
Apache Kafka is an open-source, distributed messaging system that provides the foundation for moving data between systems in near real-time. Apache Kafka is often described as a distributed commit log which is partitioned and possibly replicated. It provides a messaging system that is fast, scalable, durable, distributed by design, and fault-tolerant. For large-scale data systems, it is many teams' preferred choice for communication between systems.
To better understand Kafka, let's first review some of the most common Apache Kafka terminology.
First off, let’s discuss the concept of “topics”. To send a message to Kafka from a Producer, you must send it to a specific topic. The same holds true for reading: a Consumer reads messages from one or more specific topics.
You can think of it like this:
Messages are published by Kafka Producers into various Kafka “topics”. A topic is similar to a database table in that it organizes data into logical buckets. Those who subscribe to topics, on the other hand, are called Kafka Consumers.
Producers and Consumers communicate with Apache Kafka over a simple, high-performance, language-agnostic TCP protocol, which handles all communication between clients and servers.
Producers and Consumers are “loosely coupled” which means they are completely unaware of each other. This makes Apache Kafka an attractive option for integration between various producers and consumers within modern architectures. It also provides an ideal access point for building stream processing applications.
Topics and Logs
A topic is a key abstraction in Kafka. A topic can be thought of as a feed name or category to which messages are published. Each topic is further subdivided into partitions, which allow Kafka brokers to spread a topic’s data across different nodes. Each message within a partition is assigned a sequential ID number known as the offset.
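To make the topic/partition/offset relationship concrete, here is a toy pure-Python model (the class and names are illustrative only, not Kafka's actual internals): a topic holds several independent append-only logs, and each appended message receives the next offset in its partition.

```python
# Toy model of a Kafka topic: each partition is an independent,
# append-only log, and an offset is just a position in that log.
class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, message):
        log = self.partitions[partition]
        offset = len(log)   # next sequential position in this partition
        log.append(message)
        return offset

topic = Topic("orders", num_partitions=3)
print(topic.append(0, "order-1"))   # offset 0 in partition 0
print(topic.append(0, "order-2"))   # offset 1 in partition 0
print(topic.append(1, "order-3"))   # offset 0 in partition 1 (independent)
```

Note that offsets are only meaningful within a single partition; partition 0 and partition 1 each count from zero on their own.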
All published messages can be retained in the cluster, regardless of whether they have been consumed. For instance, if the configuration states that a topic’s messages are available for one week, they are retained for that period only. Once the period has lapsed, they are automatically deleted, freeing space in the system.
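The retention behavior can be sketched in a few lines of plain Python (this is a conceptual stand-in, not Kafka's log-cleaner code; the function name and one-week constant are assumptions mirroring a setting like `retention.ms=604800000`):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # one week

def enforce_retention(log, now=None):
    """log is a list of (timestamp, message) pairs; drop expired entries.
    Whether a message was consumed plays no role in the decision."""
    now = time.time() if now is None else now
    return [(ts, msg) for ts, msg in log if now - ts <= RETENTION_SECONDS]

now = 10 * 24 * 3600                       # pretend "now" is day 10
log = [(0, "expired"), (now - 60, "kept")]  # day 0 vs. one minute ago
print(enforce_retention(log, now=now))     # only the recent message survives
```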
On a per-consumer basis, the only metadata retained is the consumer’s position in the log, technically known as the offset. The consumer has complete control over this position. Messages are usually consumed linearly, but a consumer can go back and reprocess older messages, precisely because it controls its own offset. This replay-ability is one of the things that makes Kafka different from traditional message queues.
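A toy consumer makes the replay idea tangible (again a conceptual sketch, not the real client API; the real `KafkaConsumer` exposes similar ideas through methods like `poll` and `seek`):

```python
# Toy consumer: the only per-consumer state is an offset into the log,
# and the consumer is free to rewind it and reprocess.
class Consumer:
    def __init__(self, log):
        self.log = log
        self.offset = 0            # the one piece of per-consumer metadata

    def poll(self):
        if self.offset < len(self.log):
            msg = self.log[self.offset]
            self.offset += 1
            return msg
        return None                # caught up; nothing new yet

    def seek(self, offset):
        self.offset = offset       # rewind (or skip ahead)

log = ["a", "b", "c"]
c = Consumer(log)
assert [c.poll(), c.poll()] == ["a", "b"]
c.seek(0)                          # replay from the beginning
assert c.poll() == "a"             # the messages are still in the log
```

Because consuming a message only advances this consumer's own offset, nothing is removed from the log and other consumers are unaffected.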
Also, Kafka consumers are lightweight and cheap, and they do not have a significant impact on cluster performance. This differs from more traditional messaging systems. Every consumer maintains its own offset, so its actions do not affect other consumers.
Why partitions? Partitions serve two purposes. The first is scale: by splitting a topic across nodes, the topic can grow beyond the size that fits on a single node. The second is parallelism, or performance tuning: multiple consumers can read a topic’s messages simultaneously in concurrent threads. This will be discussed further later on. Partitions can also be configured to replicate to other nodes, usually called brokers, in the Kafka cluster for resiliency and failover.
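One detail worth knowing here: when a message has a key, the partition is chosen deterministically from that key, so all messages for the same key stay in order on the same partition. The sketch below uses CRC32 as a simple stand-in hash; Kafka's actual default partitioner hashes the serialized key with murmur2, but the idea is the same:

```python
import zlib

# Simplified stand-in for Kafka's default partitioner: hash the key,
# take it modulo the partition count. Same key -> same partition,
# which is how per-key ordering is preserved.
def partition_for(key: str, num_partitions: int) -> int:
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("customer-42", 3)
p2 = partition_for("customer-42", 3)
assert p1 == p2   # every message for customer-42 lands on one partition
```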
Distribution Across Brokers
Partitions may be configured for replication to provide fault tolerance. Every partition, whether replicated or not, has a single designated leader, which handles all read and write requests. There may also be “followers”: zero or more other nodes in the Kafka cluster that replicate the leader’s log. If the leader fails, one of the followers immediately assumes leadership. We’ll cover more on Kafka fault tolerance in later tutorials.
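The leader/follower relationship can be sketched as follows (a deliberately tiny model, not Kafka's actual controller or ISR logic; class and broker names are invented for illustration):

```python
# Toy failover model for one partition: the first replica in the list
# acts as leader; when it fails, the next follower is promoted.
class PartitionReplicas:
    def __init__(self, brokers):
        self.replicas = list(brokers)

    @property
    def leader(self):
        return self.replicas[0]

    def fail(self, broker):
        self.replicas.remove(broker)
        if not self.replicas:
            raise RuntimeError("all replicas lost; partition unavailable")

p = PartitionReplicas(["broker-1", "broker-2", "broker-3"])
assert p.leader == "broker-1"
p.fail("broker-1")                # the leader crashes...
assert p.leader == "broker-2"     # ...and a follower takes over
```

With three replicas, the partition stays available through two broker failures, which lines up with the N-1 tolerance described in the guarantees below.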
Apache Kafka is released open source under an Apache License 2.0. For source code see Apache Kafka source code repo.
When compared to conventional messaging systems, one of Kafka’s benefits is its ordering guarantees. In a traditional queue, messages are stored on the server in the order they arrive and pushed out in that same order, but there is no timing guarantee on delivery. With multiple consumers, this means the order in which messages arrive at different consumers can effectively be random. Hold the phone. Did you just write “could be random”? Think about that for 5 seconds. That may be fine in some use cases, but not others.
Kafka is often deployed as the foundational element in Data Streaming Architectures because it can facilitate loose coupling between various components in architectures such as databases, micro-services, Hadoop, data warehouses, distributed file systems, search applications, etc.
In a Kafka messaging system, we talk about various guarantees:
1. Messages sent by a producer to a particular topic partition are appended to the Kafka log in the order they are sent.
2. Consumers see messages in the order they are stored in the partition’s log.
3. For a topic with replication factor N, Kafka will tolerate up to N-1 server failures without losing messages previously committed to the log.
An often confused and debated concept revolves around the delivery guarantees between producers and consumers: at-most-once, at-least-once, and exactly-once. We will cover this in more depth in the tutorials found below.
Kafka Tutorials and Examples
- Kafka Delivery Guarantees
- Kafka Topic Internals
- Kafka Brokers
- Kafka Consumer in Scala
- Kafka Consumer Groups
- Kafka Producer in Scala
- Python Kafka Producer and Consumer Examples
Apache Kafka Operations
- Monitoring coming soon
- Kafka Authentication
More coming soon, but to start us off:
- Kafka vs Kinesis
- Kafka vs Pulsar
- Kafka vs. Traditional Message Queues – What makes Kafka different?
Apache Kafka Ecosystem
The Apache Kafka ecosystem also includes key components such as Kafka Connect and Kafka Streams.
Kafka Connect is a framework for building connectors that stream data into or out of a Kafka cluster. For more information, see the Kafka Connect tutorials section of this site.
Kafka Streams is a client library for building stream processing applications whose input and output data live in Kafka topics. For more information, see the Kafka Streams tutorials section of this site.
Featured image adapted from https://flic.kr/p/bGR8bZ
Best Ways to Determine Apache Kafka Version [1 Right and 2 Wrong Ways]
Knowing the Kafka version you are using may not be as straightforward as you might think. For example, if you search for “kafka version” in your favorite search engine or chatbot, there are all kinds of results. But I took a look at many of the top results and became concerned because the answers provided […]
Easy Kafka ACL (How To Implement Kafka Authorization)
Kafka Access Control Lists (ACLs) are a way to secure your Kafka cluster by specifying which users or client applications have access to which Kafka resources (topics, clusters, etc.). The process of authorizing or refusing access to particular resources or functions within a software application is referred to as “authorization” in software. It is the […]
Python Kafka in Two Minutes. Maybe Less.
Although Apache Kafka is written in Java, there are Python Kafka clients available for use with Kafka. In this tutorial, let’s go through examples of Kafka with Python Producer and Consumer clients. Let’s consider this a “Getting Started” tutorial. After completing this, you will be ready to proceed to more complex examples. But we need […]
Kafka and Dead Letter Queues Support? Yes and No
In this post, let’s answer the question of Kafka and Dead Letter Queues. But first, let’s start with an overview. A dead letter queue (DLQ) is a queue, or a topic in Kafka, used to hold messages which cannot be processed successfully. DLQs originated in the traditional messaging systems which were popular before […]
Effortless Kafka Authentication Tutorial (with 5 Examples)
Kafka provides multiple authentication options. In this tutorial, we will describe and show the authentication options and then configure and run a demo example of Kafka authentication. There are two primary goals of this tutorial: There are a few key subjects which must be considered when building a multi-tenant cluster, but it all starts with […]
Kafka Consumer Groups with kafka-consumer-groups.sh
How do Kafka administrators perform administrative and diagnostic collection actions on Kafka Consumer Groups? This post explores a Kafka Consumer Groups admin tool called kafka-consumer-groups.sh, a popular command-line tool included in Apache Kafka distributions. There are other examples of both open source and 3rd party tools not included with Apache Kafka which can also be […]
How To Generate Kafka Streaming Join Test Data By Example
Why “Joinable” Streaming Test Data for Kafka? When creating streaming join applications in KStreams, ksqldb, Spark, Flink, etc. with source data in Kafka, it would be convenient to generate fake data with cross-topic relationships; i.e. a customer topic and an order topic with a value attribute of customer.id. In this example, we might want to […]
Kafka Certification Tips for Developers
If you are considering Kafka Certification, this page describes what I did to pass the Confluent Certified Developer for Apache Kafka Certification exam. You may see it shortened to “ccdak confluent certified developer for apache kafka tests“. Good luck and hopefully this page is helpful for you! There are many reasons why you may wish […]
Kafka Test Data Generation Examples
After you start working with Kafka, you will soon find yourself asking the question, “how can I generate test data into my Kafka cluster?” Well, I’m here to show you the many options you have for generating test data in Kafka. In this post and demonstration video, we’ll cover a few of the ways you can generate […]
Kafka Producer in Scala
Kafka Producers are one of the options for publishing data events (messages) to Kafka topics. Kafka Producers are custom coded in a variety of languages through the use of Kafka client libraries. The Kafka Producer API allows messages to be sent to Kafka topics asynchronously, so Producers are built for speed, but they also have the ability […]
Kafka Consumer in Scala
In this Kafka Consumer tutorial, we’re going to demonstrate how to develop and run an example of Kafka Consumer in Scala, so you can gain the confidence to develop and deploy your own Kafka Consumer applications. At the end of this Kafka Consumer tutorial, you’ll have both the source code and screencast of how to […]
Kafka Consumer Groups by Example
Kafka Consumer Groups are the way to horizontally scale out event consumption from Kafka topics… with failover resiliency. “With failover resiliency” you say!? That sounds interesting. Well, hold on, let’s leave out the resiliency part for now and just focus on scaling out. We’ll come back to resiliency later. When designing for horizontal scale-out, let’s […]
Apache Kafka Architecture – Delivery Guarantees
Apache Kafka offers message delivery guarantees between producers and consumers. For more background on Kafka mechanics such as producers and consumers, please see the Kafka Tutorial page. Kafka delivery guarantees fall into three groups: “at most once”, “at least once”, and “exactly once”. At most once, which can lead to […]
Kafka vs Amazon Kinesis – How do they compare?
Apache Kafka vs. Amazon Kinesis The question of Kafka vs Kinesis often comes up. Let’s start with Kinesis. *** Updated Spring 2020 *** Since this original post, AWS has released MSK. I think this tells us everything we need to know about Kafka vs Kinesis. Also, since the original post, Kinesis has been separated into […]