Azure Kafka Connect Example – Blob Storage


In this Azure Kafka tutorial, let’s describe and demonstrate how to integrate Kafka with Azure Blob Storage using existing Kafka Connect connectors.  Let’s get a little wacky and cover both writing to Azure Blob Storage from Kafka and reading from Azure Blob Storage back into Kafka.  In this case, “wacky” is a good thing, I hope.

Two types of references are available for your pleasure.  Sure, “for your pleasure” is usually just a saying, and people don’t often mean it when they say it.  Well, I mean it, and I hope you find this Kafka with Azure Blob Storage tutorial valuable.

This Azure Kafka tutorial contains step-by-step command references, sample configuration files for the sink and source connectors, as well as screencast videos of me demonstrating the setup and execution of the examples.

If you have questions, comments or suggestions for additional content, let me know in the comments below.

Note: It is expected that you have at least some working knowledge of Apache Kafka at this point, but you may not be an expert yet.

Overview

The overall goal here is to focus on Azure Blob Storage Kafka integration through the simplest possible examples.

Lastly, we are going to demonstrate the examples using Apache Kafka included in Confluent Platform instead of standalone Apache Kafka because the Azure Blob Storage sink and source connectors are commercial offerings from Confluent.

Here we go, let’s boogie.

Requirements

  1. An Azure account with enough permissions to be able to create storage accounts and containers (more on this below)
  2. Azure CLI installed (Link in the Resources section below)
  3. Apache Kafka
  4. Download and install the Sink and Source Connectors into your Apache Kafka cluster (Links in the Resources section below)

Azure Blob Storage Setup

I’m going to paste the commands I ran to set up the Storage Container in Azure.  You will need to update the command variable values for your environment wherever appropriate.  At a minimum, you need to change tmcgrathstorageaccount and todd, since those values are mine.  You may wish to change other settings, such as the location variable, as well.

1. az login

2. Create a resource group
az group create \
--name todd \
--location centralus

3. Create a storage account
az storage account create \
--name tmcgrathstorageaccount \
--resource-group todd \
--location centralus \
--sku Standard_LRS

For more on SKU types, see https://docs.microsoft.com/en-us/rest/api/storagerp/srp_sku_types

4. Create a container
az storage container create \
--account-name tmcgrathstorageaccount \
--name kafka-connect-example \
--auth-mode login

5. For our Kafka Connect examples shown below, we need one of the two keys from the following command’s output.
az storage account keys list \
--account-name tmcgrathstorageaccount \
--resource-group todd \
--output table

Azure Blob Storage with Kafka Overview

When showing examples of connecting Kafka with Blob Storage, this tutorial assumes some familiarity with Apache Kafka, Kafka Connect, and Azure, as previously mentioned, but if you have any questions, just let me know.

Because both the Azure Blob Storage Sink and Source connectors are only available with a Confluent subscription or Confluent Cloud account, demonstrations will be conducted using Confluent Platform running on my laptop.  The goal of this tutorial is to keep things as simple as possible and provide a working example with the least amount of work for you.

Again, we will cover two types of Azure Kafka Blob Storage examples, so this tutorial is organized into two sections.  Section one covers writing to Azure Blob Storage from Kafka with the Azure Blob Storage Sink Kafka Connector, and section two covers reading from Azure Blob Storage back into Kafka.

Kafka Connect Azure Blob Storage Examples

Let’s kick things off with a demo.  In this demo, I’ll run through both the Sink and Source examples.

Now that we’ve seen working examples, let’s go through the commands that were run and the configurations used.

Kafka Connect Azure Blob Storage Sink Example

In the screencast, I showed how to configure and run Kafka Connect with the Confluent distribution of Apache Kafka as mentioned above.  After seeing a working example, I’ll document each of the steps in case you would like to try it yourself.

As you saw if you watched the video, the demo assumes you’ve downloaded the Confluent Platform already. I downloaded the tarball and have my $CONFLUENT_HOME variable set to /Users/todd.mcgrath/dev/confluent-5.4.1

The demo uses an environment variable called AZURE_ACCOUNT_KEY for the Azure Blob Storage Key when using the Azure CLI.

You will need the key1 or key2 value from Step 5 in the Azure Blob Storage setup section above and will need to set it in your .properties files.
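
If it helps, here’s a minimal sketch of how you might capture that key into the AZURE_ACCOUNT_KEY environment variable with the Azure CLI.  The --query path assumes the default JSON output of the keys list command, so double-check it against your own output.

# Export key1 from the storage account for use in the
# az storage blob list commands shown later in this tutorial
export AZURE_ACCOUNT_KEY=$(az storage account keys list \
--account-name tmcgrathstorageaccount \
--resource-group todd \
--query '[0].value' \
--output tsv)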

Steps in screencast

  1. confluent local start
  2. Show sink connector already installed (I previously installed with confluent-hub install confluentinc/kafka-connect-azure-blob-storage:1.3.2)
  3. Note how I copied over the azure-blob-storage-sink.properties file from my Github repo.  The link to the Github repo can be found below, and a rough sketch of this file appears after these steps.
  4. Show updates needed for this file
  5. Show empty Azure Blob Storage container named kafka-connect-example with a command adjusted for your key az storage blob list --account-name tmcgrathstorageaccount --container-name kafka-connect-example --output table --account-key $AZURE_ACCOUNT_KEY
  6. Generate 10 events of Avro test data with `ksql-datagen quickstart=orders format=avro topic=orders maxInterval=100 iterations=10`.  See the previous post on test data in Kafka for reference on ways to generate test data into Kafka.
  7. confluent local load azure-bs-sink -- -d azure-blob-storage-sink.properties
  8. az storage blob list --account-name tmcgrathstorageaccount --container-name kafka-connect-example --output table --account-key $AZURE_ACCOUNT_KEY
  9. confluent local unload azure-bs-sink
  10. The second example is JSON output, so edit the azure-blob-storage-sink.properties file
  11. Generate some different test data with confluent local config datagen-pageviews -- -d ./share/confluent-hub-components/confluentinc-kafka-connect-datagen/etc/connector_pageviews.config (Again, see the link in the References section below for the previous post on generating test data in Kafka)
  12. Start the sink connector back up with confluent local load azure-bs-sink -- -d azure-blob-storage-sink.properties
  13. List out the new JSON objects landed into Azure with `az storage blob list --account-name tmcgrathstorageaccount --container-name kafka-connect-example --output table --account-key $AZURE_ACCOUNT_KEY`
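
For reference, here is a minimal sketch of roughly what my azure-blob-storage-sink.properties file contains.  Treat the connector class, property names, and placeholder values as assumptions to verify against the Confluent connector documentation and the actual file in the Github repo linked below.

name=azure-bs-sink
connector.class=io.confluent.connect.azure.blob.AzureBlobStorageSinkConnector
tasks.max=1
topics=orders
azblob.account.name=tmcgrathstorageaccount
azblob.account.key=<key1 or key2 from Step 5 above>
azblob.container.name=kafka-connect-example
format.class=io.confluent.connect.azure.blob.format.avro.AvroFormat
flush.size=3
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1

For the JSON example in steps 10 through 13, the main edits are pointing topics at the newly generated topic and swapping format.class to the connector’s JSON format class (for example, io.confluent.connect.azure.blob.format.json.JsonFormat).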

Azure Kafka Connect Blob Storage Source Example

If you made it through the Blob Storage Sink example above, you may be thinking the Source example will be pretty easy.  And depending on when you are reading this, that might be true.  However, if you are reading this in Spring 2020 or so, it’s not exactly straightforward, but it’s not a huge deal either.  I’ll show you what to do.

First, the Azure Blob Storage Source connector is similar to the source connectors covered in the Amazon Kafka S3 and GCP Kafka Cloud Storage tutorials.  They are similar in a couple of ways.  One, if you are also using the associated sink connector to write from Kafka to S3 or GCS and you attempt to read that data back into Kafka, you may run into an infinite loop: whatever is written back to Kafka gets written to cloud storage, then back to Kafka, and so on.  This means you should either use the Azure Kafka Blob Storage Source connector independently of the sink connector or use an SMT to rename the topic when writing back to Kafka.  I’ll cover both of these below.
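
As a quick illustration of the SMT approach, here is a hedged sketch using the RegexRouter transform that ships with Apache Kafka to prefix the destination topic name, so the source connector does not write back into the very topic the sink connector reads from.  The transform alias and the copy_of_ prefix are just illustrative choices of mine.

transforms=AddPrefix
transforms.AddPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
transforms.AddPrefix.regex=.*
transforms.AddPrefix.replacement=copy_of_$0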

Another similarity is that the Azure Kafka Connector for Blob Storage requires a Confluent license after 30 days.  This means we will use the Confluent Platform in the following demo.

Please note: as warned above, at the time of this writing, I needed to remove some jar files from the source connector in order to proceed.  See the “Workaround” section below.

Steps in screencast

  1. confluent local start (I had already installed the Source connector and made the updates described in the “Workaround” section below)
  2. I copied over the azure-blob-storage-source.properties file from my Github repo.  Link below.
  3. Show no existing topics with kafka-topics --list --bootstrap-server localhost:9092
  4. az storage blob list --account-name tmcgrathstorageaccount --container-name kafka-connect-example --output table --account-key $AZURE_ACCOUNT_KEY which shows existing data on Azure Blob Storage from the previous Sink tutorial
  5. Load the source connector confluent local load azure-bs-source -- -d azure-blob-storage-source.properties
  6. kafkacat -b localhost:9092 -t orders -s avro -r http://localhost:8081
  7. In this example, the destination topic did not exist, so let’s simulate the opposite.  What would we do if the destination topic does exist?
  8. Modify the azure-blob-storage-source.properties file.  Uncomment the SMT transformation section.  A rough sketch of this file follows these steps.
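
For reference, here is a minimal sketch of roughly what my azure-blob-storage-source.properties file contains.  As with the sink example, the connector class, format class, and placeholder values are assumptions to double-check against the Confluent connector documentation and the actual file in the Github repo.

name=azure-bs-source
connector.class=io.confluent.connect.azure.blob.storage.AzureBlobStorageSourceConnector
tasks.max=1
azblob.account.name=tmcgrathstorageaccount
azblob.account.key=<key1 or key2 from Step 5 above>
azblob.container.name=kafka-connect-example
format.class=io.confluent.connect.azure.blob.storage.format.avro.AvroFormat
confluent.topic.bootstrap.servers=localhost:9092
confluent.topic.replication.factor=1
# SMT transformation section -- uncomment for step 8 above
# (see the RegexRouter sketch earlier in this tutorial)
# transforms=AddPrefix
# transforms.AddPrefix.type=org.apache.kafka.connect.transforms.RegexRouter
# transforms.AddPrefix.regex=.*
# transforms.AddPrefix.replacement=copy_of_$0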

Workaround (Spring 2020)

When attempting to use the kafka-connect-azure-blob-storage-source:1.2.2 connector
with Confluent 5.4.1, the connector fails with the following:

Caused by: java.lang.ClassCastException: io.netty.channel.kqueue.KQueueEventLoopGroup cannot be cast to io.netty.channel.EventLoopGroup

It can be resolved if the Azure Blob Storage Source’s Netty libs are removed; i.e.
rm -rf ./share/confluent-hub-components/confluentinc-kafka-connect-azure-blob-storage-source/lib/netty-*

Kafka Connect Azure Blob Storage Source Example with Apache Kafka

The Azure Blob Storage Kafka Connect Source is a commercial offering from Confluent as described above, so let me know in the comments below if you find an option more suitable for self-managed Apache Kafka.  Thanks.

Azure Kafka Examples with Azure Blob Storage Helpful Resources

Featured image https://pixabay.com/photos/barrel-kegs-wooden-heritage-cask-52934/
