What is Kafka Connect?
Kafka Connect is a development framework for building Apache Kafka connectors. The framework provides a Java-based API for building two types of connectors: Sinks and Sources. A “Sink” connector streams data out of Apache Kafka and pushes it to a destination such as an object store (S3, HDFS), a database (relational, NoSQL, columnar), or a search index such as Elastic, while a “Source” connector reads data from an external system such as a file system, object store, database, or SaaS application and streams it into an Apache Kafka cluster. Building with the Connect framework provides a convenient and consistent approach to the development, deployment, and management of Kafka connectors.
Kafka Connect Features
Connectors built within the framework benefit from numerous features, including a consistent approach to offset management, horizontal scalability, management through a REST interface, and monitoring. These features matter because they provide standardization in situations where workloads run multiple connectors, which is a common scenario. After all, a key reason to deploy a Kafka cluster is the separation it provides between multiple producers and consumers.
There are hundreds of Connect connectors available under a variety of open-source and proprietary licenses. Depending on your use case, you may be able to leverage an existing connector for ingress or egress to and from Apache Kafka without writing any code.
The Connect API framework was added in the Kafka 0.9.0.0 release. It leverages the Producer and Consumer APIs internally.
Kafka Connect Key Concepts
There are a few key terms to know when learning Kafka Connect: workers, tasks, plugins, converters, transformations, and dead letter queues. Let’s briefly describe each of these terms now, keeping things high level so we have a basic, shared understanding when they are used later; we’ll get into more specifics as we progress.
A Connector is either a Sink or a Source written within the framework, and it coordinates the desired data flow through Tasks.
A Task is the code implementation of how data is streamed to or from Kafka.
A Worker is a running process responsible for executing Connectors and Tasks.
A Converter supports a particular data format, such as Avro, Protobuf, or ByteArray, when pushing to or pulling from Kafka. It may be helpful to think of it as a translator.
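As a sketch, converters are declared in the Worker configuration (and can be overridden per connector). The classes below ship with the Apache distribution; the `schemas.enable` setting applies to the JsonConverter:

```properties
# Worker-level converter defaults (can be overridden in a connector's config)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Whether JSON messages carry an embedded schema envelope
key.converter.schemas.enable=false
value.converter.schemas.enable=false
```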
A Transformation is simple logic applied to messages as they flow through connectors. Two examples are InsertField, which adds a field to each message, and ReplaceField, which can drop or rename fields.
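As a sketch, transformations are chained in a connector’s configuration. The snippet below is hypothetical (the field and value names are illustrative), using two transforms that ship with Apache Kafka:

```properties
# Hypothetical connector config snippet chaining two built-in SMTs
transforms=addSource,dropSecret
# InsertField: add a static field to every record value
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=orders-db
# ReplaceField: drop a field from every record value
# (the setting is named "blacklist" in older releases, "exclude" in newer ones)
transforms.dropSecret.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.dropSecret.blacklist=password
```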
A Dead Letter Queue is an approach to handling messages that must be rejected, for a variety of reasons, as they flow to or from Kafka. A Dead Letter Queue is often abbreviated to DLQ.
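For Sink connectors, Connect can route rejected records to a DLQ topic via error-handling settings. A minimal sketch, assuming a hypothetical topic name:

```properties
# Hypothetical sink connector snippet: tolerate bad records
# and route them to a DLQ topic instead of failing the task
errors.tolerance=all
errors.deadletterqueue.topic.name=my-connector-dlq
errors.deadletterqueue.topic.replication.factor=1
# Include failure context (error message, stack trace) as record headers
errors.deadletterqueue.context.headers.enable=true
```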
It might be confusing when the term Plugin is used. A plugin can mean a Connector, Converter, or Transformation. When configuring a Worker, you must specify the plugin classpath (the `plugin.path` setting) so any available connector, converter, or transformation is available to the Worker at runtime.
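In the Worker properties file this looks like the sketch below; the directories are illustrative and should point wherever your connector jars are installed:

```properties
# Comma-separated list of directories the Worker scans for
# connector, converter, and transformation jars at startup
plugin.path=/usr/local/share/kafka/plugins,/opt/connectors
```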
Kafka Connect can be run in a standalone, isolated process or distributed across multiple workers. It is often most straightforward to get started in Standalone mode, as it is the most widely covered mode in existing documentation. Standalone mode is simply the act of running a single Worker. Distributed mode, on the other hand, provides horizontal scale by coordinating and distributing work across multiple Workers. Workers run Connectors and Tasks in a one-to-many fashion; a particular Worker, regardless of mode, might be running (or have available to run) multiple Connectors, each with its associated Task(s).
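The configuration difference between the two modes can be sketched as below. Topic names and paths are illustrative; a standalone Worker is launched with `bin/connect-standalone.sh worker.properties connector.properties`, a distributed Worker with `bin/connect-distributed.sh worker.properties`:

```properties
# --- Standalone worker (e.g. connect-standalone.properties) ---
bootstrap.servers=localhost:9092
# Standalone mode tracks source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets

# --- Distributed worker (e.g. connect-distributed.properties) ---
bootstrap.servers=localhost:9092
# All Workers sharing a group.id form one Connect cluster
group.id=connect-cluster
# Distributed mode stores offsets, configs, and status in Kafka topics
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```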
For an example, see How To Run Kafka Connect in Standalone and Distributed Mode Examples.
As introduced earlier, scaling out is achieved by deploying additional Workers so connector tasks can be distributed and segmented across them. But some of you may be wondering: how do Workers coordinate with each other? In other words, how is the distribution of tasks orchestrated? Orchestration is a key construct in distributed systems, and multiple Connect Workers coordinating together is no different.
The answer is found in Kafka Consumer Group functionality: orchestration of Connect Workers is built upon the constructs provided by Kafka Consumer Groups. By the way, the same is true of Kafka Streams, which also utilizes Consumer Group behavior.
I’ve probably mentioned it before, but if you want to go deep in the Kafka ecosystem, you need to understand Kafka Consumer Group functionality.
A short example of Consumer Group functionality in Connect is the act of rebalancing a workload. Rebalancing occurs when a change in Connectors and their Workers’ tasks is detected; for example, a new Worker comes online, or a Worker and its associated Connectors’ tasks go offline. This is similar in concept to adding or removing Consumers from a Group. And much like Consumer Groups, rebalancing attempts to ensure each Worker has approximately the same amount of work.
Just like Consumer Groups, all Workers configured with the same `group.id` will be orchestrated together to distribute processing through the act of rebalancing.
Note: rebalancing applies to Workers only. If a particular task on a Worker fails, no rebalance is triggered, because a task failure is treated as an exceptional case. Failed tasks should be restarted via the REST API.
As mentioned above, there are hundreds of connectors available today under a variety of open-source and proprietary licenses. I’ve included examples of Sink and Source connectors below, and if you are looking for an example not covered, just let me know.
Management of connectors is a combination of configuration files used at startup and the REST interface available afterwards. After starting a Worker in either Standalone or Distributed mode, all connectors found in the aforementioned configurable `plugin.path` will be available for operational management, such as starting and stopping, via the REST interface.
For example, a REST GET call to list all the available plugins on a Worker might look like:
curl http://localhost:8083/connector-plugins | jq '.'
Or, here is an example of starting a particular Connector with a REST POST call:
curl -X POST -H "Accept:application/json" -H "Content-Type: application/json" --data @joinable.json http://localhost:8083/connectors | jq '.'
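The file passed with `--data` generally contains a connector name plus a `config` map. The sketch below writes a hypothetical payload (not the contents of the actual `joinable.json`) using the FileStreamSource connector that ships with Apache Kafka, validates it, and shows the POST, which is commented out since it requires a running Worker:

```shell
# Write a hypothetical connector-creation payload
cat > /tmp/file-source.json <<'EOF'
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "file-source-topic"
  }
}
EOF

# Sanity-check the JSON before posting
python3 -m json.tool /tmp/file-source.json > /dev/null && echo "payload OK"

# Requires a running Connect Worker on localhost:8083:
# curl -X POST -H "Content-Type: application/json" \
#   --data @/tmp/file-source.json http://localhost:8083/connectors
```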
Again, these are just examples of Connect management related REST API. Much more in-depth examples are provided in the specific connector tutorials below.
In addition to REST calls to determine status, such as RUNNING, PAUSED, or FAILED, Connect exposes performance metrics via JMX.
Each worker process has a variety of metrics and each connector and task have additional metrics. A worker process contains producer and consumer metrics in addition to metrics specific to Connect.
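A minimal sketch of enabling remote JMX, assuming the Apache distribution’s start scripts (which go through `kafka-run-class.sh` and pick up the `JMX_PORT` environment variable); the port is illustrative and your packaging may differ:

```shell
# Expose Worker metrics over remote JMX before starting the Worker
export JMX_PORT=9999

# Then start the Worker, e.g.:
# bin/connect-distributed.sh config/connect-distributed.properties
# and point a JMX client such as jconsole at localhost:9999
```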
Kafka Connect is available as a managed service similar to Kafka. Some examples of providers of Kafka Connect as a managed service include offerings from Confluent Cloud, Amazon MSK Connect, and Aiven.
Kafka Connect Examples
Kafka Connect REST API Essentials
The Kafka Connect REST API endpoints are used for both administration of Kafka Connectors (Sinks and Sources) as well as Kafka Connect service itself. In this tutorial, we will explore the Kafka Connect REST API with examples. Before we dive into specific examples, we need to set the context with an overview of Kafka Connect […]
What You Need to Know About Debezium
If you’re looking for an application for change data capture which includes speed, durability, significant history in production deployments across a variety of use cases, then Debezium may be for you. This open-source platform provides streaming from a wide range of both relational and NoSQL based databases to Kafka or Kinesis. There are many advantages […]
Running Kafka Connect – Standalone vs Distributed Mode Examples
One of the many benefits of running Kafka Connect is the ability to run single or multiple workers in tandem. Running multiple workers provides a path to horizontal scale-out, which leads to increased capacity and/or automated resiliency. For resiliency, this means answering the question, “what happens if a particular worker goes offline for any […]
Azure Kafka Connect Example – Blob Storage
In this Azure Kafka tutorial, let’s describe and demonstrate how to integrate Kafka with Azure’s Blob Storage with existing Kafka Connect connectors. Let’s get a little wacky and cover writing to Azure Blob Storage from Kafka as well as reading from Azure Blob Storage to Kafka. In this case, “wacky” is a good thing, I […]
GCP Kafka Connect Google Cloud Storage Examples
In this GCP Kafka tutorial, I will describe and show how to integrate Kafka Connect with GCP’s Google Cloud Storage (GCS). We will cover writing to GCS from Kafka as well as reading from GCS to Kafka. Descriptions and examples will be provided for both Confluent and Apache distributions of Kafka. I’ll document the steps […]
Kafka Connect S3 Examples
In this Kafka Connect S3 tutorial, let’s demo multiple Kafka S3 integration examples. We’ll cover writing to S3 from one topic and also from multiple Kafka source topics. Also, we’ll see an example of an S3 Kafka source connector reading files from S3 and writing to Kafka. Examples will be provided for both […]
Kafka Connect MySQL Examples
In this Kafka Connect MySQL tutorial, we’ll cover reading from MySQL into Kafka as well as reading from Kafka and writing to MySQL. You can run this in your own environment. Now, it’s just an example, and we’re not going to debate operational concerns such as running in standalone or distributed mode. The focus will be on keeping it simple and getting it working. We […]