What is Kafka Connect?
Kafka Connect is a development framework for building Apache Kafka connectors. The framework provides a Java-based API for building two types of connectors: Sinks and Sources. A “Sink” connector streams data out of Apache Kafka and pushes it to a destination such as an object store (S3, HDFS), a database (relational, NoSQL, columnar), or a search index such as Elastic, while a “Source” connector reads data from an external system such as a file system, object store, database, or SaaS application and streams it into an Apache Kafka cluster. Building with the Connect framework provides a convenient and consistent approach to the development, deployment, and management of Kafka connectors.
Kafka Connect Features
Connectors built within the framework benefit from numerous features, including a consistent approach to offset management, horizontal scalability, management through a REST interface, and monitoring. These features matter because they provide standardization in situations where workloads run multiple connectors, which is a common scenario. After all, a key reason to deploy a Kafka cluster is the separation it provides between multiple producers and consumers.
There are hundreds of Connect connectors available under a variety of open-source and proprietary licenses. Depending on your use case, you may be able to leverage an existing connector for ingress or egress to and from Apache Kafka without writing any code.
The Connect API framework was added in the Kafka 0.9.0.0 release. It leverages the Producer and Consumer APIs internally.
Kafka Connect Key Concepts
There are a few key terms to know when learning Kafka Connect: workers, tasks, plugins, converters, transformations, and dead letter queues. Let’s briefly describe each of these terms now, keeping things high level so we have a basic, shared understanding when they are used later; we’ll get into more specifics as we progress.
A Connector is either a Sink or a Source written within the framework, and it coordinates the desired data flow through Tasks.
A Task is the code implementation of how data is streamed to or from Kafka.
A Worker is a running process responsible for executing Connectors and Tasks.
A Converter supports a particular data format, such as Avro, Protobuf, or ByteArray, when pushing to or pulling from Kafka. It may be helpful to think of it as a translator.
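As a sketch, converters are declared in the Worker configuration (and can be overridden per connector). The classes below ship with the Apache distribution; the `schemas.enable` setting applies to the JsonConverter:

```properties
# Worker-level converter defaults (can be overridden in a connector's config)
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
# Whether JSON messages carry an embedded schema envelope
key.converter.schemas.enable=false
value.converter.schemas.enable=false
```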
A Transformation is simple logic applied to messages as they flow through connectors. Two examples are InsertField, which adds a field to each message, and ReplaceField, which can drop or rename fields.
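As a sketch, transformations are chained in a connector’s configuration. The snippet below is hypothetical (the field and value names are illustrative), using two transforms that ship with Apache Kafka:

```properties
# Hypothetical connector config snippet chaining two built-in SMTs
transforms=addSource,dropSecret
# InsertField: add a static field to every record value
transforms.addSource.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.addSource.static.field=data_source
transforms.addSource.static.value=orders-db
# ReplaceField: drop a field from every record value
# (the setting is named "blacklist" in older releases, "exclude" in newer ones)
transforms.dropSecret.type=org.apache.kafka.connect.transforms.ReplaceField$Value
transforms.dropSecret.blacklist=password
```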
A Dead Letter Queue is an approach to handling messages that must be rejected, for a variety of reasons, as they flow to or from Kafka. A Dead Letter Queue is often abbreviated to DLQ.
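For Sink connectors, Connect can route rejected records to a DLQ topic via error-handling settings. A minimal sketch, assuming a hypothetical topic name:

```properties
# Hypothetical sink connector snippet: tolerate bad records
# and route them to a DLQ topic instead of failing the task
errors.tolerance=all
errors.deadletterqueue.topic.name=my-connector-dlq
errors.deadletterqueue.topic.replication.factor=1
# Include failure context (error message, stack trace) as record headers
errors.deadletterqueue.context.headers.enable=true
```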
It might be confusing when the term Plugin is used. A plugin can mean a Connector, Converter, or Transformation. When configuring a Worker, you must specify the plugin classpath (the `plugin.path` setting) so any available connector, converter, or transformation is available to the Worker at runtime.
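In the Worker properties file this looks like the sketch below; the directories are illustrative and should point wherever your connector jars are installed:

```properties
# Comma-separated list of directories the Worker scans for
# connector, converter, and transformation jars at startup
plugin.path=/usr/local/share/kafka/plugins,/opt/connectors
```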
Kafka Connect can be run in a standalone, isolated process or distributed across multiple workers. It is often most straightforward to get started in Standalone mode, as it is the most widely covered mode in existing documentation. Standalone mode is simply the act of running a single Worker. Distributed mode, on the other hand, provides horizontal scale by coordinating and distributing work across multiple Workers. Workers run Connectors and Tasks in a one-to-many fashion; a particular Worker, regardless of mode, might be running (or have available to run) multiple Connectors, each with its associated Task(s).
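The configuration difference between the two modes can be sketched as below. Topic names and paths are illustrative; a standalone Worker is launched with `bin/connect-standalone.sh worker.properties connector.properties`, a distributed Worker with `bin/connect-distributed.sh worker.properties`:

```properties
# --- Standalone worker (e.g. connect-standalone.properties) ---
bootstrap.servers=localhost:9092
# Standalone mode tracks source offsets in a local file
offset.storage.file.filename=/tmp/connect.offsets

# --- Distributed worker (e.g. connect-distributed.properties) ---
bootstrap.servers=localhost:9092
# All Workers sharing a group.id form one Connect cluster
group.id=connect-cluster
# Distributed mode stores offsets, configs, and status in Kafka topics
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
```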
For an example, see How To Run Kafka Connect in Standalone and Distributed Mode Examples.
As introduced earlier, scaling out is achieved by deploying additional Workers so connector tasks can be distributed and segmented across them. But some of you may be wondering: how do Workers coordinate with each other? In other words, how is the distribution of tasks orchestrated? Orchestration is a key construct in distributed systems, and multiple Connect Workers coordinating together is no different.
The answer is found in Kafka Consumer Group functionality: orchestration of Connect Workers is built upon the constructs provided by Kafka Consumer Groups. By the way, the same is true of Kafka Streams, which also utilizes Consumer Group behavior.
I’ve probably mentioned it before, but if you want to go deep in the Kafka ecosystem, you need to understand Kafka Consumer Group functionality.
A short example of Consumer Group functionality in Connect is the act of rebalancing a workload. Rebalancing occurs when a change in Connectors and their Workers’ tasks is detected; for example, a new Worker comes online, or a Worker and its associated Connectors’ tasks go offline. This is similar in concept to adding or removing Consumers from a Group. And much like Consumer Groups, rebalancing attempts to ensure each Worker has approximately the same amount of work.
Just like Consumer Groups, all Workers configured with the same `group.id` will be orchestrated together to distribute processing through the act of rebalancing.
Note: rebalancing applies to Workers only. If a particular task on a Worker fails, no rebalance is triggered, because a task failure is treated as an exceptional case. Failed tasks should be restarted via the REST API.
As mentioned above, there are hundreds of connectors available today under a variety of open-source and proprietary licenses. I’ve included examples of Sink and Source connectors below, and if you are looking for an example not covered, just let me know.
Management of connectors is a combination of configuration files used at startup and the REST interface available afterwards. After starting a Worker in either Standalone or Distributed mode, all connectors found in the aforementioned configurable `plugin.path` will be available for operational management, such as starting and stopping, via the REST interface.
For example, a REST GET call to list all the available plugins on a Worker might look like:
curl http://localhost:8083/connector-plugins | jq '.'
Or, here is an example of starting a particular Connector with a REST POST call:
curl -X POST -H "Accept:application/json" -H "Content-Type: application/json" --data @joinable.json http://localhost:8083/connectors | jq '.'
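The file passed with `--data` generally contains a connector name plus a `config` map. The sketch below writes a hypothetical payload (not the contents of the actual `joinable.json`) using the FileStreamSource connector that ships with Apache Kafka, validates it, and shows the POST, which is commented out since it requires a running Worker:

```shell
# Write a hypothetical connector-creation payload
cat > /tmp/file-source.json <<'EOF'
{
  "name": "local-file-source",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
    "tasks.max": "1",
    "file": "/tmp/input.txt",
    "topic": "file-source-topic"
  }
}
EOF

# Sanity-check the JSON before posting
python3 -m json.tool /tmp/file-source.json > /dev/null && echo "payload OK"

# Requires a running Connect Worker on localhost:8083:
# curl -X POST -H "Content-Type: application/json" \
#   --data @/tmp/file-source.json http://localhost:8083/connectors
```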
Again, these are just examples of Connect management related REST API. Much more in-depth examples are provided in the specific connector tutorials below.
In addition to REST calls to determine status, such as RUNNING, PAUSED, or FAILED, Connect exposes performance metrics via JMX.
Each worker process has a variety of metrics and each connector and task have additional metrics. A worker process contains producer and consumer metrics in addition to metrics specific to Connect.
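A minimal sketch of enabling remote JMX, assuming the Apache distribution’s start scripts (which go through `kafka-run-class.sh` and pick up the `JMX_PORT` environment variable); the port is illustrative and your packaging may differ:

```shell
# Expose Worker metrics over remote JMX before starting the Worker
export JMX_PORT=9999

# Then start the Worker, e.g.:
# bin/connect-distributed.sh config/connect-distributed.properties
# and point a JMX client such as jconsole at localhost:9999
```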
Kafka Connect is available as a managed service similar to Kafka. Some examples of providers of Kafka Connect as a managed service include offerings from Confluent Cloud, Amazon MSK Connect, and Aiven.
Kafka Connect Examples
Kafka Connect REST API Essentials
The Kafka Connect REST API endpoints are used for both administration of Kafka Connectors (Sinks and Sources) as well as Kafka Connect service itself. In this tutorial, we will explore the Kafka Connect REST API with examples. Before we dive into specific examples, we need to set the context with an overview of Kafka Connect […]
What You Need to Know About Debezium
If you’re looking for an application for change data capture which includes speed, durability, significant history in production deployments across a variety of use cases, then Debezium may be for you. This open-source platform provides streaming from a wide range of both relational and NoSQL based databases to Kafka or Kinesis. There are many advantages […]
Running Kafka Connect – Standalone vs Distributed Mode Examples
One of the many benefits of running Kafka Connect is the ability to run single or multiple workers in tandem. Running multiple workers provides a path to horizontal scale-out, which leads to increased capacity and/or automated resiliency. For resiliency, this means answering the question, “what happens if a particular worker goes offline for any […]
Azure Kafka Connect Example – Blob Storage
In this Azure Kafka tutorial, let’s describe and demonstrate how to integrate Kafka with Azure’s Blob Storage with existing Kafka Connect connectors. Let’s get a little wacky and cover writing to Azure Blob Storage from Kafka as well as reading from Azure Blob Storage to Kafka. In this case, “wacky” is a good thing, I […]
GCP Kafka Connect Google Cloud Storage Examples
In this GCP Kafka tutorial, I will describe and show how to integrate Kafka Connect with GCP’s Google Cloud Storage (GCS). We will cover writing to GCS from Kafka as well as reading from GCS to Kafka. Descriptions and examples will be provided for both Confluent and Apache distributions of Kafka. I’ll document the steps […]
Kafka Connect S3 Examples
In this Kafka Connect S3 tutorial, let’s demo multiple Kafka S3 integration examples. We’ll cover writing to S3 from one topic and also from multiple Kafka source topics. Also, we’ll see an example of an S3 Kafka source connector reading files from S3 and writing to Kafka. Examples will be provided for both […]
Kafka Connect MySQL Examples
In this Kafka Connect MySQL tutorial, we’ll cover reading from MySQL into Kafka as well as reading from Kafka and writing to MySQL. You can run this in your own environment. Now, it’s just an example, and we’re not going to debate operational concerns such as running in standalone or distributed mode. The focus will be on keeping it simple and getting it working. We […]