Kafka Streams is another entry in the stream processing framework category, with the option to use it from either Java or Scala. In this post, we’ll describe what Kafka Streams is, its features and benefits, when to consider it, Kafka Streams how-to tutorials, and external references. Ultimately, the goal of this post is to answer the question: why should you care?
What is Kafka Streams?
Kafka Streams is a client library for processing and analyzing data stored in Kafka. Developers use the Kafka Streams library to build stream processor applications when both the stream input and stream output are Kafka topic(s). We’ll cover stream processors and stream architectures throughout this tutorial.
Some of you may be wondering, why Kafka Streams over Kafka Connect or writing your own Kafka Consumer or Kafka Producer? What makes Kafka Streams different?
Well, to answer those questions, we should note one key difference right from the start.
As mentioned, Kafka Streams is used to write stream processors where the input and output are Kafka topics. Visually, an example of a Kafka Streams architecture may look like the following.
As seen above, both the input and output of Kafka Streams applications are Kafka topics.
From this image, it appears Kafka Consumer and Kafka Producer APIs are being used. Is either of these APIs used in Kafka Streams based apps? Well, the answer is yes and no. Kafka Streams builds upon Kafka Producer and Consumer libraries, so it is able to leverage the native capabilities of Kafka such as fault tolerance, distributed coordination, and parallelism. But, it doesn’t expose the APIs directly. Instead, input and output to Kafka topics in Kafka Streams apps are available via Consumer and Producer abstractions such as the Kafka Streams DSL and Processor API. You’ll learn more about these in the Kafka Streams tutorial section below.
Ok, but why not configure and run a Kafka Connect based app? Good question, but as I understand it, Kafka Connect is essentially the E and L in ETL. Support for transformations is minimal, as seen in the Kafka Connect Transformations documentation. In addition, the transformations are applied on a per-message basis (Single Message Transforms) rather than across a stream. Finally, and somebody please correct me if I’m wrong here, but are there options for using Kafka topics as both the source and the sink in Kafka Connect?
Kafka Streams Features and Benefits
Some of the factors I find appealing about Kafka Streams include:
- Lightweight – no external dependencies other than Kafka itself of course, yet you can still scale out horizontally
- Testable – it’s easy to write Kafka Streams tests (see the tutorial section for links on how to test Kafka Streams)
- JVM – I don’t have to learn a new language. Can continue to use Scala as I can in Spark.
- Exactly-once processing guarantees. This is really powerful and shows how Kafka has evolved over the years. I’m a big fan, but I believe we need to be careful to distinguish processing guarantees from delivery guarantees. For more, see my thoughts on exactly once in Kafka.
- Abstraction DSL – as you can see from the Kafka tutorials, the code is very readable.
- Ability to perform both stateless transformations and stateful aggregations and joins
- Supports common stream processing constructs such as the different notions of time (for example, event time vs processing time) and windowing
- API is familiar coming from Apache Spark, so the learning curve is low
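To make the stateless vs stateful and windowing points above a bit more concrete, here is a minimal sketch in plain Java. To be clear, this is not the Kafka Streams API; it only models the idea. The Event class, the method names, and the 5-second window size are all assumptions made up for illustration. The filter looks at one event at a time (stateless), while the count has to remember what it has already seen for each key and window (stateful).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a windowed count; NOT the Kafka Streams API.
final class TumblingWindowCount {

    // Hypothetical event type: a key plus an event-time timestamp in milliseconds.
    static final class Event {
        final String key;
        final long timestampMs;

        Event(String key, long timestampMs) {
            this.key = key;
            this.timestampMs = timestampMs;
        }
    }

    // Stateless step: each event is judged on its own, no memory required.
    static boolean hasKey(Event e) {
        return e.key != null && !e.key.isEmpty();
    }

    // Start of the fixed-size (tumbling) window containing the timestamp.
    // Assumes non-negative timestamps.
    static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    // Stateful step: counts per (key, window) are remembered across events.
    static Map<String, Long> countPerWindow(List<Event> events, long windowSizeMs) {
        Map<String, Long> counts = new TreeMap<>();
        for (Event e : events) {
            if (!hasKey(e)) {
                continue; // dropped by the stateless filter
            }
            String slot = e.key + "@" + windowStart(e.timestampMs, windowSizeMs);
            counts.merge(slot, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("alice", 1_000L),
                new Event("bob", 2_000L),
                new Event("alice", 4_000L),
                new Event("", 4_500L),       // no key, filtered out
                new Event("alice", 6_000L)); // lands in the next 5-second window
        // prints {alice@0=2, alice@5000=1, bob@0=1}
        System.out.println(countPerWindow(events, 5_000L));
    }
}
```

In a real Kafka Streams app, the map above would roughly correspond to a fault-tolerant state store managed by the library rather than an in-memory TreeMap, which is a large part of what the library buys you.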
Why Kafka Streams?
To answer the question of why Kafka Streams, I believe we need to understand the role of Kafka Streams in a larger software architecture paradigm change from batch-oriented to stream-oriented. We need to speak of the emergence of stream processors and streaming architectures. For me, the person who influenced my thinking most on this topic was Martin Kleppmann. When I downloaded the freely available “Making Sense of Stream Processing” book, I had already experimented with, and deployed to production, Kafka, Lambda and Kappa architectures, and microservices, and I was familiar with the CAP theorem, so I wasn’t expecting the book to be that impactful. But it was.
At the time of this writing, if you search for “Making Sense of Stream Processing”, you’ll find numerous sources where you can download this book for free.
If I could attempt to summarize visually, I would try to show that older architectures, which tightly couple the components of an overall architectural stack, have more drawbacks than advantages. A tightly coupled architecture should be retired, and a distributed ordered log like Kafka should be introduced to provide the backbone of streaming architectures and processors.
Now, what this last image does not show is the rise of stream processors. Stream processors provide value by writing curated results to Kafka topics. For example, a stream processor may pull from one or more distributed log inputs (i.e. Kafka topics), perform some transformations, aggregations, filtering, etc. across messages in the stream, and then write the results back to a distributed log output (i.e. a Kafka topic). The results a stream processor writes back to the stream are intended for consumption somewhere downstream. In other words, this is an example of developing and deploying a stream processor. And that, girls and boys, is why-we-are-here as they say in the business. Kafka Streams is one option for creating stream processors when the input and output are both Kafka topics.
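The pull, filter, transform, aggregate, and write-back loop described above can be modeled in a few lines of plain Java. This is only a sketch under some loud assumptions: ordinary lists stand in for the input and output Kafka topics, the "customer,amountInCents" message format is invented for the example, and OrderTotalsProcessor is a hypothetical name, not anything from the Kafka Streams library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of a stream processor; lists stand in for Kafka topics.
final class OrderTotalsProcessor {

    // State kept between messages: running total per customer.
    private final Map<String, Long> totals = new HashMap<>();

    // Handles one input message of the assumed form "customer,amountInCents"
    // and appends the customer's updated total to the output "topic".
    void process(String message, List<String> outputTopic) {
        String[] parts = message.split(",");
        if (parts.length != 2) {
            return; // filtering: skip malformed messages
        }
        long amount;
        try {
            amount = Long.parseLong(parts[1].trim());
        } catch (NumberFormatException ex) {
            return; // filtering: skip non-numeric amounts
        }
        String customer = parts[0].trim().toLowerCase();           // transformation
        long updated = totals.merge(customer, amount, Long::sum);  // aggregation
        outputTopic.add(customer + "=" + updated);                 // write downstream
    }

    // Feeds every message of the input "topic" through one processor instance.
    static List<String> run(List<String> inputTopic) {
        OrderTotalsProcessor processor = new OrderTotalsProcessor();
        List<String> outputTopic = new ArrayList<>();
        for (String message : inputTopic) {
            processor.process(message, outputTopic);
        }
        return outputTopic;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Alice,500", "bob,250", "garbage", "alice,100");
        // prints [alice=500, bob=250, alice=600]
        System.out.println(run(input));
    }
}
```

Notice the output is a stream of updates rather than a single final answer: every valid input message produces a fresh total for downstream consumers, which is the general shape of a curated results topic.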
As noted on the Stream Processors and Streaming Architecture overview page, stream processors build upon key concepts such as the meaning of time, aggregations and windowing, stateless vs stateful, and processing vs delivery guarantees. Kafka Streams is no exception to these requirements and provides varying support for each one.
When to Use Kafka Streams?
Hopefully, it’s obvious by now, but the appropriate time to consider Kafka Streams is when you are building stream processors where both the input and output are Kafka topics.
Kafka Streams Tutorials
This Kafka Streams overview will be fine for those of you looking to obtain a high-level understanding of Kafka Streams. But, for developers looking to gain hands-on experience with Kafka Streams, be sure to check out the Kafka Streams tutorials section of this site. There you will be able to experiment with all kinds of Kafka Streams use cases such as quick starts, automated testing, joining streams, etc.
Comparisons or Alternatives to Kafka Streams
Remember, Kafka Streams is designed for building Kafka based stream processors where the stream input is a Kafka topic and the stream processor output is a Kafka topic. Keep this requirement in mind when comparing it with other mechanisms for producing to and consuming from Kafka. For example, you could also build a stream processor with Spark Streaming and Kafka.
If your use case only involves consuming messages from Kafka and producing messages back to Kafka, then a Kafka Streams based stream processor may be the right choice. However, if you need to write your own code to build stream processors for more than just Kafka, such as Kinesis, Pulsar, or Google Pub/Sub, you may wish to consider alternatives such as Spark Streaming, Apache Flink, or Apache Beam.