Kafka Streams presents two options for materialized views in the form of GlobalKTable and KTable. We will describe the meaning of “materialized views” in a moment, but for now, let’s just agree there are pros and cons to GlobalKTable vs KTable. Need to learn more about Kafka Streams in Java? Here’s a pretty good option […]
Kafka Streams Tutorial
Kafka Streams is a client library for processing and analyzing data stored in Kafka. Developers use the library to build stream processor applications when both the stream input and stream output are Kafka topic(s).
Some of you may be wondering, why Kafka Streams over Kafka Connect or writing your own Kafka Consumer or Kafka Producer? What makes it different?
Well, to answer those questions, we should note one key difference right from the start.
As mentioned, Kafka Streams is used to write stream processors where the input and output are Kafka topics. Visually, an example of a Kafka Streams architecture may look like the following.
As seen above, both the input and output of Kafka Streams applications are Kafka topics.
From this image, it appears the Kafka Consumer and Kafka Producer APIs are being used. Is either of these APIs used in Kafka Streams-based apps? Well, the answer is yes and no. Kafka Streams builds upon the Kafka Producer and Consumer libraries, so it is able to leverage the native capabilities of Kafka such as fault tolerance, distributed coordination, and parallelism. But it doesn’t expose those APIs directly. Instead, input and output to Kafka topics in Kafka Streams apps are available via higher-level abstractions: the Streams DSL and the Processor API. You’ll learn more about these in the tutorial section below.
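To make the Streams DSL concrete, here is a minimal word-count topology sketch in Scala. It assumes the kafka-streams-scala library is on the classpath (roughly version 2.6+, where the implicit serdes live under `org.apache.kafka.streams.scala.serialization`), and the topic names `lines-in` and `word-counts-out` are hypothetical:

```scala
import org.apache.kafka.streams.Topology
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.serialization.Serdes._

// Build a topology that reads lines from "lines-in", counts words,
// and writes the running counts to "word-counts-out".
def buildTopology(): Topology = {
  val builder = new StreamsBuilder
  builder
    .stream[String, String]("lines-in")                      // consume from a Kafka topic
    .flatMapValues(_.toLowerCase.split("\\W+").toSeq)        // stateless transformation
    .groupBy((_, word) => word)                              // repartition by word
    .count()                                                 // stateful aggregation (a KTable)
    .toStream
    .mapValues(_.toString)
    .to("word-counts-out")                                   // produce to a Kafka topic
  builder.build()
}
```

Note that nothing in this sketch touches the Producer or Consumer APIs directly; the DSL translates the chained calls into a Topology of source, processor, and sink nodes, which the Kafka Streams runtime then executes on top of the Consumer and Producer clients.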
Ok, but why not configure and run a Kafka Connect based app? Good question, but as I understand it, Kafka Connect is essentially the E and L in ETL. Support for transformations is minimal, as seen in the Kafka Connect Transformations documentation. In addition, the transformations operate on a per-message basis (single message transforms) rather than across a stream.
KAFKA STREAMS FEATURES AND BENEFITS
Some of the factors I find appealing include:
- Lightweight – no external dependencies other than Kafka itself of course, yet you can still scale out horizontally
- Testable – it’s easy to write tests (see the tutorial section for links on how to test Kafka Streams)
- JVM – I don’t have to learn a new language; I can continue to use Scala, just as I can in Spark.
- Exactly-once processing guarantees. This is really powerful and shows how Kafka has evolved over the years. I’m a big fan, but believe we need to be careful when considering processing vs delivery guarantees. For more, see my thoughts on exactly once in Kafka
- Abstraction DSL – as you can see from the tutorials, the code is very readable.
- Ability to perform both stateless transformations and stateful aggregations and joins
- Supports common stream processing constructs such as the differing meanings of time, and windowing
- API is familiar coming from Apache Spark, so the learning curve is low
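The time and windowing bullet above can be illustrated without Kafka at all: a tumbling window simply assigns each event timestamp to a fixed-size, non-overlapping bucket. The helper functions below are my own illustration, not Kafka Streams API, but the bucketing is the same as what the DSL performs when you window a grouped stream with `TimeWindows`:

```scala
// Illustration only: these helpers are not part of the Kafka Streams API.
// A tumbling window assigns each event timestamp to a fixed-size,
// non-overlapping bucket, identified by the bucket's start time.
def tumblingWindowStart(timestampMs: Long, windowSizeMs: Long): Long =
  timestampMs - (timestampMs % windowSizeMs)

// Count events per window, keyed by window start time.
def countPerWindow(timestampsMs: Seq[Long], windowSizeMs: Long): Map[Long, Int] =
  timestampsMs
    .groupBy(tumblingWindowStart(_, windowSizeMs))
    .view.mapValues(_.size).toMap
```

For example, with a 60-second window, events at 1s, 2s, and 61s land in two buckets: two events in the window starting at 0ms, one in the window starting at 60000ms.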
Why Kafka Streams?
To answer the question of why, I believe we need to understand the role of Kafka Streams in a larger software architecture paradigm change from batch-oriented to stream-oriented. We need to speak of the emergence of stream processors and streaming architectures. For me, the person who influenced my thinking most on this topic was Martin Kleppmann. When I downloaded the freely available “Making Sense of Stream Processing” book, I had already experimented with and deployed Kafka to production, and had worked through Lambda and Kappa architectures, the CAP theorem, and microservices, so I wasn’t expecting the book to be that impactful. But it was.
At the time of this writing, if you search for “Making Sense of Stream Processing”, you’ll find numerous sources where you can download this book for free.
If I could attempt to summarize visually, I would try to show that the older approach of tightly coupling components in an overall architectural stack has more drawbacks than advantages. A tightly coupled architecture such as this
should be stopped
and a distributed, ordered log such as Kafka should be introduced to provide the backbone of streaming architectures and processors. It may look something like this
Now, what this last image does not show is the rise of stream processors. Stream processors provide value by delivering curated results to Kafka topics. For example, a stream processor may pull from one or more distributed log inputs (i.e., Kafka topics), perform transformations, aggregations, filtering, etc. across messages in the stream, and then write the results back to a distributed log output (i.e., a Kafka topic). The results a stream processor writes back into the stream are intended for consumption somewhere downstream. In other words, this is an example of developing and deploying a stream processor. And that, girls and boys, is why-we-are-here, as they say in the business. Kafka Streams is one option for creating stream processors when the input and output are both Kafka topics.
As noted on the Stream Processors and Streaming Architecture overview page, stream processors build upon key concepts such as the meaning of time, aggregations and windowing, stateless vs stateful processing, and processing vs delivery guarantees. Kafka Streams is no exception to these requirements and provides varying degrees of support for each one.
WHEN TO USE?
Hopefully, it’s obvious by now, but the appropriate time to consider Kafka Streams is when you are building stream processors where both the input and output are Kafka topics.
Comparisons or Alternatives
Remember, Kafka Streams is designed for building Kafka-based stream processors where the stream input is a Kafka topic and the stream processor output is a Kafka topic. Keep this distinction in mind when considering other mechanisms for producing to and consuming from Kafka. For example, you could also build a stream processor with Spark Streaming and Kafka.
If your use case is only producing messages to Kafka, or only consuming messages from Kafka, then a Kafka Streams based stream processor may not be the right choice; the plain Producer or Consumer API would likely be simpler. Likewise, if you need to build stream processors that work with more than just Kafka, such as Kinesis, Pulsar, or Google Pub/Sub, you may wish to consider alternatives such as Spark Streaming, Apache Flink, or Apache Beam.
Kafka Streams Tutorial Examples
Kafka Streams – Transformations Examples
Kafka Streams Transformations provide the ability to perform actions on Kafka Streams such as filtering and updating values in the stream. Kafka Streams’ transformations contain operations such as `filter`, `map`, `flatMap`, etc. and have similarities to functional combinators found in languages such as Scala. And, if you are coming from Spark, you will also notice […]
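The similarity to Scala’s collection combinators is easy to see side by side. The records below are hypothetical, modeled as a plain Scala List, but the same filter/flatMap/map shape carries over directly to a KStream:

```scala
// A stream of (key, value) records, modeled here as a plain Scala List.
val records = List("user1" -> "hello world", "user2" -> "", "user3" -> "kafka streams")

val transformed = records
  .filter { case (_, value) => value.nonEmpty }          // drop empty values, like KStream#filter
  .flatMap { case (key, value) =>                        // one record in, many out, like KStream#flatMap
    value.split("\\s+").map(word => key -> word).toList
  }
  .map { case (key, word) => key -> word.toUpperCase }   // update values, like KStream#mapValues
```

Here `transformed` contains `("user1","HELLO")`, `("user1","WORLD")`, `("user3","KAFKA")`, and `("user3","STREAMS")`; the empty record for `user2` is filtered out, just as a KStream filter would drop it.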
Kafka Streams Joins Examples
Performing Kafka Streams Joins presents interesting design options when implementing streaming processor architecture patterns. There are numerous applicable scenarios, but let’s consider an application that might need to access multiple database tables or REST APIs in order to enrich a topic’s event record with context information. For example, perhaps we could augment records in a topic with sensor […]
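The enrichment scenario can be sketched without Kafka at all: a KStream-KTable join behaves much like looking each event’s key up in a table. The sensor IDs and locations below are hypothetical:

```scala
// Event stream: (sensorId, reading), plus a table of sensor metadata,
// analogous to a KStream joined against a KTable keyed by sensorId.
val readings = List("s1" -> 20.5, "s2" -> 18.0, "s9" -> 99.9)
val sensorLocations = Map("s1" -> "warehouse", "s2" -> "loading-dock")

// Inner join: readings without matching metadata are dropped,
// just as in a KStream#join against a KTable.
val enriched = readings.flatMap { case (id, reading) =>
  sensorLocations.get(id).map(location => (id, reading, location))
}
```

The reading for `s9` has no matching table row, so it disappears from the result; a left join (KStream#leftJoin) would instead keep it with a null/None location. Swapping between those two behaviors is one of the design options the joins tutorial explores.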
Kafka Streams Testing with Scala Part 1
After experimenting with Kafka Streams with Scala, I started to wonder how one goes about Kafka Streams testing in Java or Scala. How does one create and run automated tests for Kafka Streams applications? How does it compare to Spark Streaming testing? In this tutorial, I’ll describe what I’ve learned so far. Also, if you […]
Kafka Streams Tutorial with Scala for Beginners Example
If you’re new to Kafka Streams, here’s a Kafka Streams with Scala tutorial which may help jumpstart your efforts. My plan is to keep updating the sample project, so let me know if you would like to see anything in particular with Kafka Streams with Scala. In this example, the intention is to 1) provide an SBT project you […]