Kafka Streams Tutorial with Scala for Beginners Example

If you’re new to Kafka Streams, here’s a Kafka Streams with Scala tutorial which may help jumpstart your efforts.  My plan is to keep updating the sample project, so let me know if you would like to see anything in particular with Kafka Streams with Scala.  In this example, the intention is to 1) provide an SBT project you can pull, build, and run, and 2) describe the interesting lines in the source code.

The project is available to clone at https://github.com/tmcgrath/kafka-streams
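If you’d rather wire things up yourself, the key dependency is the kafka-streams-scala library.  A minimal build.sbt might look like the sketch below; the project name and version numbers here are illustrative, and the actual repo may differ:

name := "kafka-streams"

scalaVersion := "2.12.8"

libraryDependencies += "org.apache.kafka" %% "kafka-streams-scala" % "2.0.0"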

Assumptions

This example assumes you’ve already downloaded open-source Apache Kafka or the Confluent distribution.  It’s run on a Mac in a bash shell, so translate as necessary.

The screencast below also assumes some familiarity with IntelliJ.

Kafka Streams Tutorial with Scala Quick Start

Let’s run the example first and then describe it in a bit more detail.

1. Start up Zookeeper

<KAFKA_HOME>/bin/zookeeper-server-start ./etc/kafka/zookeeper.properties

For example, from the Confluent install directory (so the relative ./etc path resolves): ~/dev/confluent-5.0.0/bin/zookeeper-server-start ./etc/kafka/zookeeper.properties

2. Start Kafka

<KAFKA_HOME>/bin/kafka-server-start ./etc/kafka/server.properties

3. Create a topic

<KAFKA_HOME>/bin/kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic text_lines

4. Create some sample content

echo -e "doo dooey do dodah\ndoo dooey do dodah\ndoo dooey do dodah" > words.txt

5. Send the content to a Kafka topic

cat words.txt | <KAFKA_HOME>/bin/kafka-console-producer --broker-list localhost:9092 --topic text_lines

6. Run it like you mean it.  I mean put some real effort into it now.  In the screencast (below), I run it from IntelliJ, but no one tells you what to do.  You do it the way you want to… in SBT or via `kafka-run-class`
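If you go the SBT route, something like the following from the project root should do it (assuming `WordCount` is the project’s only main class; if there are several, sbt will ask you to pick one):

sbt run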

7. Verify the output like you just don’t care.  Yeah.

<KAFKA_HOME>/bin/kafka-console-consumer --bootstrap-server localhost:9092 \
        --topic word_count_results \
        --from-beginning \
        --formatter kafka.tools.DefaultMessageFormatter \
        --property print.key=true \
        --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
        --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
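For what it’s worth, given the three identical lines in words.txt, the counts should eventually settle at the values below, in some order.  Depending on the commit interval and record caching, you may see intermediate counts first:

do      3
dodah   3
doo     3
dooey   3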


Screencast

If the steps above left you feeling somewhat unsatisfied and put you in a wanting-more kind of groove, a screencast is next.  Let’s run through the steps above in the following Kafka Streams with Scala in IntelliJ example.  Prepare yourself.

Kafka Streams with Scala for Beginners Example


Kafka Streams

So, why Kafka Streams?  My first thought was that it looks like Apache Spark.  The code itself doesn’t really offer me a compelling reason to switch.

But it is cool that Kafka Streams apps can be packaged, deployed, etc. without the need for a separate processing cluster.  Also, it was nice to be able to simply run in a debugger without any of the setup ceremony required when running cluster-based code.

I’m intrigued by the idea of being able to scale out by adding more instances of the app.  In other words, this example could horizontally scale out by simply running more than one instance of `WordCount`.  Maybe I’ll explore that in a later post.
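For the curious, that scale-out behavior hinges on the application.id configuration: instances that share the same id join the same consumer group, and Kafka balances the input topic’s partitions across them.  Here’s a minimal sketch of the relevant settings (the id value is illustrative, not necessarily what the sample project uses).  Note that with the single-partition topic we created above, a second instance would sit idle until you add partitions:

import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
// Instances sharing the same application.id split the input topic's partitions between them.
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-scala-example")
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")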


Kafka Streams Tutorial with Scala Source Code Breakout

When I started exploring Kafka Streams, there were two areas of the Scala code that stood out: the SerDes import and the use of KTable vs. KStream.

Kafka SerDes with Scala

This sample utilizes implicit parameter support in Scala, which makes the code easier to read and more concise.  As shown in the screencast above, leaving the import out has visible ramifications: without the implicit serde instances in scope, the code won’t compile.  The import comes from the kafka-streams-scala library, which we set as a dependency in the SBT build.sbt file.  `Serdes._` brings implicit serde instances (for `String`, `Long`, and so on) into scope, from which the `Grouped`, `Produced`, `Consumed` and `Joined` instances the DSL needs are derived.

import Serdes._
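In the kafka-streams-scala DSL, that line typically appears alongside `ImplicitConversions._`, which does the actual derivation.  A minimal sketch of how the imports combine (the topic name matches our example; a full topology follows in the next section):

import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.ImplicitConversions._ // derives Consumed, Produced, etc. from serdes in scope
import org.apache.kafka.streams.scala.Serdes._              // implicit Serde[String], Serde[Long], ...
import org.apache.kafka.streams.scala.kstream._

// With the implicits above in scope, no explicit serdes are needed here:
val textLines: KStream[String, String] =
  new StreamsBuilder().stream[String, String]("text_lines")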
KTable and KStream

The second portion of the Scala Kafka Streams code that stood out was the use of KTable and KStream.

I wondered, what’s the difference between KStream and KTable?  Why would I use one vs. the other?

A KStream is useful when you wish to consume records as independent, append-only inserts.  Think of records such as page views or, in this case, individual words in text.  Each word, regardless of past or future, can be thought of as an insert.

KStream has operators that should look familiar to anyone who has used functional combinators in Apache Spark: `map`, `filter`, etc.

A KTable, on the other hand, represents each data record as an update rather than an insert.  In our example, we want an update on the count of words.  If a word has previously been counted to 2 and it appears again, we want to update the count to 3.

KTable operators will look familiar to anyone who knows SQL constructs: `groupBy`, various joins, etc.  The sketch below shows both types working together.
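Here is a minimal sketch of a word-count topology along these lines; topic names match the console commands above, but the sample project’s actual code may differ slightly:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala._
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.kstream._

object WordCountSketch extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-scala-example") // illustrative id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new StreamsBuilder()

  // KStream: every line of text arrives as an independent, append-only record.
  val textLines: KStream[String, String] = builder.stream[String, String]("text_lines")

  // KTable: the per-word count is an ever-updating value, not an insert.
  val wordCounts: KTable[String, Long] = textLines
    .flatMapValues(line => line.toLowerCase.split("\\W+"))
    .groupBy((_, word) => word)
    .count()

  // Stream the table's changelog out to the results topic.
  wordCounts.toStream.to("word_count_results")

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()

  sys.ShutdownHookThread { streams.close() }
}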


References

https://docs.confluent.io/3.1.1/streams/concepts.html#duality-of-streams-and-tables

https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#kafka-streams-dsl-for-scala

https://github.com/tmcgrath/kafka-streams


Kafka Streams with Scala post image credit https://pixabay.com/en/whiskey-bar-alcohol-glass-scotch-315178/
