If you’re new to Kafka Streams, here’s a Kafka Streams Tutorial with Scala tutorial which may help jumpstart your efforts. My plan is to keep updating the sample project, so let me know if you would like to see anything in particular with Kafka Streams with Scala. In this example, the intention is to 1) provide an SBT project you can pull, build and run 2) describe the interesting lines in the source code.
The project is available to clone at https://github.com/tmcgrath/kafka-streams
Kafka Streams Assumptions
This example assumes you’ve already downloaded Open Source or Confluent Kafka. It’s run on a Mac in a bash shell, so translate as necessary.
The screencast below also assumes some familiarity with IntelliJ.
Kafka Streams Tutorial with Scala Quick Start
Let’s run the example first and then describe it in a bit more detail.
1. Start up Zookeeper
For example ~/dev/confluent-5.0.0/bin/zookeeper-server-start ./etc/kafka/zookeeper.properties
2. Start Kafka
3. Create a topic
<KAFKA_HOME>/bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic text_lines
4. Create some sample content
echo -e "doo dooey do dodah\ndoo dooey do dodah\ndoo dooey do dodah" > words.txt
5. Send the content to a Kafka topic
cat words.txt | ./bin/kafka-console-producer --broker-list localhost:9092 --topic text_lines
6. Run it like you mean it. I mean put some real effort into it now. In screencast (below), I run it from IntelliJ, but no one tells you what to do. You do it the way you want to… in SBT or via `kafka-run-class`
7. Verify the output like you just don’t care. Yeah.
bin/kafka-console-consumer --bootstrap-server localhost:9092 \ --topic word_count_results \ --from-beginning \ --formatter kafka.tools.DefaultMessageFormatter \ --property print.key=true \ --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \ --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
Kafka Streams ScreenCast
If the steps above left you feeling somewhat unsatisfied and putting you in a wanting-more kind of groove, a screencast is next. Let’s run through the steps above in the following Kafka Streams Scala with IntelliJ example. Prepare yourself.
So, why Kafka Streams? My first thought was it looks like Apache Spark. The code itself doesn’t really offer me any compelling reason to switch.
But it is cool that Kafka Streams apps can be packaged, deployed, etc. without a need for a separate processing cluster. Also, it was nice to be able to simply run in a debugger without any setup ceremony required when running cluster based code like Spark.
I’m intrigued by the idea of being able to scale out by adding more instances of the app. In other words, this example could horizontally scale out by simply running more than one instance of `WordCount`. Maybe I’ll explore that in a later post.
Kafka Streams Tutorial with Scala Source Code Breakout
When I started exploring Kafka Streams, there were two areas of the Scala code that stood out: the SerDes import and the use of KTable vs KStreams.
Kafka SerDes with Scala
This sample utilizes implicit parameter support in Scala. This makes the code easier to read and more concise. As shown in the above screencast, the ramifications of not importing are shown. This is part of the Scala library which we set as a dependency in the SBT build.sbt file. Serdes._ will bring `Grouped`, `Produced`, `Consumed` and `Joined` instances into scope.
KTable and KStreams
The second portion of the Scala Kafka Streams code that stood out was the use of KTable and KStream.
I wondered what’s the difference between KStreams vs KTable? Why would I use one vs the other?
KStreams are useful when you wish to consume records as independent, append-only inserts. Think of records such as page views or in this case, individual words in text. Each word, regardless of past or future, can be thought of as an insert.
KStreams has operators that should look familiar to functional combinators in Apache Spark Transformations such as map, filter, etc.
KTable, on the other hand, represents each data record as an update rather than an insert. In our example, we want an update on the count of words. If a word has been previously counted to 2 and it appears again, we want to update the count to 3.
KTable operators will look familiar to SQL constructs… groupBy various Joins, etc.
Kafka Streams with Scala post image credit https://pixabay.com/en/whiskey-bar-alcohol-glass-scotch-315178/