Open Source Change Data Capture in 2023


Let’s consider three open source change data capture (CDC) options ready for production in the year 2023.

Before we begin, let’s confirm we all see the CDC trend. 

To me, it seems everywhere you look these days is all about change data capture. From my perspective that wasn’t the case for many years.

Do you see it too?

It’s all CDC this, Outbox pattern that.  Change data capture seems to have finally reached one of the two majority phases, early majority or late majority, in Moore’s adoption lifecycle.

What do you think of this trend?  Do you think it’s a positive trajectory or a race to the bottom?  That’s just great.  As usual, I agree with you.


Let’s recap CDC real quick…

Change data capture (CDC) is a technique used to identify and track changes to data in a source database and replicate these changes downstream. I covered the what and how of Change Data Capture a while ago.

CDC is useful for introducing a lightweight mechanism for real-time data synchronization between databases, for keeping track of changes to data over time, or for implementing event-driven architecture without risky “rip and replace” consequences.

CDC is often used to feed data warehouses and real-time analytics (stream processors), where separation between operational and analytical data stores is still commonplace.

How does it work?

CDC works by continuously monitoring the database’s transaction log and capturing a record of each change as it occurs. Each change record written to the particular database “change log” or “write ahead log” contains information about the type of change (e.g., insert, update, delete), the time of the change, and the particular data that changed.
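Here is a minimal sketch of that idea in Python. The `ChangeRecord` fields and the `apply_change` helper are names I made up for illustration; no particular database or CDC tool uses this exact layout. It just shows how a downstream copy can be rebuilt by replaying change records in order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRecord:
    """Hypothetical change record: type of change, time, and the data."""
    op: str               # "insert", "update", or "delete"
    ts_ms: int            # time of the change, epoch milliseconds
    table: str            # table the change belongs to
    key: int              # primary key of the affected row
    row: Optional[dict]   # new row values; None for deletes


def apply_change(replica: dict, change: ChangeRecord) -> None:
    """Replay one change record against an in-memory replica."""
    table = replica.setdefault(change.table, {})
    if change.op == "delete":
        table.pop(change.key, None)
    elif change.op in ("insert", "update"):
        table[change.key] = change.row
    else:
        raise ValueError(f"unknown op: {change.op}")


# A made-up change log, in commit order.
log = [
    ChangeRecord("insert", 1000, "customers", 1, {"name": "Ada"}),
    ChangeRecord("update", 2000, "customers", 1, {"name": "Ada L."}),
    ChangeRecord("insert", 3000, "customers", 2, {"name": "Grace"}),
    ChangeRecord("delete", 4000, "customers", 2, None),
]

replica = {}
for change in log:
    apply_change(replica, change)

print(replica)  # {'customers': {1: {'name': 'Ada L.'}}}
```

Order matters here. Replaying the same records in a different order could leave the replica in a different state, which is one reason an ordered log is such a natural fit for CDC.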

The name of the change log varies according to the source database: for example, the binlog in MySQL, the write-ahead log (WAL) in PostgreSQL, and the redo log in Oracle.

This change log is monitored and changes are replicated to downstream databases or stream storage logs such as Kafka. The alternative is continuously polling the database with SELECT statements and WHERE filters, which puts more of a processing burden on the database.
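For contrast, here is roughly what the polling alternative looks like. This is a sketch only, using an in-memory SQLite database so it runs anywhere. The `customers` table, the `updated_at` column, and the `poll_changes` function are all assumptions for illustration.

```python
import sqlite3


def poll_changes(conn, last_seen):
    """Return rows modified after `last_seen`, plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    new_mark = rows[-1][2] if rows else last_seen
    return rows, new_mark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 100)")
conn.execute("INSERT INTO customers VALUES (2, 'Grace', 105)")

# First poll picks up everything newer than the starting mark.
rows, mark = poll_changes(conn, 0)

# A later update only shows up because the application bumped updated_at.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 110 WHERE id = 1")
rows2, mark2 = poll_changes(conn, mark)

print(rows)
print(rows2)
```

Every poll is another query the source database has to answer, whether or not anything changed. It also depends on every writer remembering to maintain `updated_at`.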

Why?

There are several reasons why change data capture (CDC) can be useful, including near real-time data synchronization between data stores, introducing loosely coupled stream processors, and evolving to newer architectures.

Open Source based Change Data Capture in 2023

  • Airbyte
  • Debezium
  • Singer.io

Did I include all options such as Maxwell or Canal?  Nope.  Just three.  As I’ve mentioned here before, that’s one of the benefits of having your own big time blog like this.  I get to call the shots and you get to read for free.  Maybe I even teach a thing or two.  I know I learn by writing and hearing from you.

That reminds me.  I’ve got an open mind.  If you think I should have included others besides these three, let me know in the comments below.  I may or may not read them.  What!?  I may not read your comments!?  Well, you appreciate honesty right?  This is me being honest.

Anyhow, let’s get on to the 2023 Change Data Capture Open source list.

Airbyte

Airbyte is an open-source data integration platform that provides a range of tools and capabilities for extracting, transforming, and loading data from various sources including change data capture (CDC) based sources.

A key feature of Airbyte is its support for CDC.  As expected, Airbyte supports a wide range of databases and other data sources, including MySQL, PostgreSQL, and SQL Server. See https://docs.airbyte.com/understanding-airbyte/cdc/ for the latest.

In addition to CDC, Airbyte also provides a range of tools and features for data transformation and data cleansing.  It also has support for scheduling and automating data integration tasks.

Have you tried it?  Deployed it to production?  Convince me why I should deploy to production in the comments below.

Debezium

Ah, good old Debezium.  To me this is the OG of CDC.

Debezium is a tool used for change data capture (CDC).  Debezium has been covered here before.

Quick sales pitch: I have a Debezium course for folks who like to save time and become efficient quickly rather than searching all over hell and back for free tutorials.  If you prefer a guide in your Debezium journey, I’m here for you.

No money? Well, you are a problem solver. How about asking your company for reimbursement for the course?  Remember, as my Mom used to say, the answer is always no unless you ask.


Now, I know this may come as a complete shock, but Debezium is an open-source CDC platform designed to capture data changes from databases such as MySQL, PostgreSQL, MongoDB, Oracle and more. 

Can you believe it!? It recently released support for LogMiner-based Oracle CDC, which was a common ask in my experience over the years.

Also, here is another one which might seem like a stretch to you — Debezium uses a log-based approach to CDC where log files of the source database are read so change records can be extracted from them. 

Are you shocked yet!?  I bet you probably knew or could guess that already though because gosh darnit, people like you and they think you’re pretty smart.

Quick note: I understand sarcasm can be difficult in written form, but I’m a risk taker.

Debezium has been, and continues to be, a popular choice for CDC.
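To make the log-based approach a bit more concrete, here is a sketch that reads a Debezium-style change event. The sample below is trimmed and hand-written for illustration, not captured from a real connector. Real events carry more fields, and the exact layout depends on the connector and converter settings. The `summarize` helper is hypothetical.

```python
import json

# Hand-written, trimmed sample in the shape of a Debezium change event.
SAMPLE_EVENT = """
{
  "before": {"id": 1, "email": "old@example.com"},
  "after":  {"id": 1, "email": "new@example.com"},
  "source": {"connector": "mysql", "db": "inventory", "table": "customers"},
  "op": "u",
  "ts_ms": 1672531200000
}
"""

# Operation codes used in the "op" field.
OPS = {"c": "insert", "u": "update", "d": "delete", "r": "snapshot read"}


def summarize(event: dict) -> str:
    """Describe which columns changed between the before and after images."""
    source = event["source"]
    op = OPS.get(event["op"], "unknown")
    before = event.get("before") or {}   # null for inserts
    after = event.get("after") or {}     # null for deletes
    changed = sorted(
        k for k in set(before) | set(after) if before.get(k) != after.get(k)
    )
    return f"{op} on {source['db']}.{source['table']} changed {changed}"


summary = summarize(json.loads(SAMPLE_EVENT))
print(summary)  # update on inventory.customers changed ['email']
```

The before and after images are what you don't get from polling with SELECT statements: you see what the row looked like on both sides of the change.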

Things you don’t know about Debezium

Here’s a bit or two you probably don’t know about Debezium.  Seriously.

Debezium was originally built using the Kafka Connect framework, but it is becoming less and less dependent on Kafka.  Debezium can be deployed outside of a Kafka Connect cluster using the “Embedded Engine” which supports integration with Kinesis.

Next, keep an eye on “Debezium Server” which is considered to be in an incubating state at the time of this writing.  There are all kinds of options besides Kafka planned via Debezium Server.

Singer.io

Singer.io is an open-source tool for loading data from various sources. It is based on the Singer specification, which defines a standard way for extracting data from sources and transforming it into a common format.

I’ve never used it but have heard good things. Are they true?
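For the curious, the Singer specification is simple enough to sketch without installing anything. A "tap" writes JSON messages, one per line, of type SCHEMA, RECORD, and STATE, and a "target" reads them. The `tap_lines` function, the `customers` stream, and the bookmark value below are made up for illustration. This does not use any Singer library.

```python
import json


def tap_lines(stream, rows, bookmark):
    """Yield Singer-style SCHEMA, RECORD, and STATE messages as JSON lines."""
    # SCHEMA describes the records that follow for this stream.
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    })
    # One RECORD message per extracted row.
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})
    # STATE lets the next run resume from where this one stopped.
    yield json.dumps({"type": "STATE", "value": {stream: bookmark}})


lines = list(
    tap_lines(
        "customers",
        [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}],
        bookmark=2,
    )
)
for line in lines:
    print(line)
```

In a real pipeline the tap would print these lines to stdout and a target would read them from stdin, so the two can be connected with a plain shell pipe.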

Open Source Change Data Capture Pros and Cons

Pros

Open source means that it is free to use and modify. This can mean significant cost savings for organizations that need to build data pipelines but don’t want to invest in proprietary tools.

Open source also means extensible and customizable, so developers can build custom integrations that extract data from a wide variety of sources and load it into a destination system.

No vendor lock-in.  Yep, this is still a thing.

Cons

Open source change data capture can have limited documentation and support. This is the usual con for open source, and it can make these options more challenging for users who are new to the platform to get up and running quickly.

Complexity. It may require a more significant investment in time and resources to get up and running, especially for users who are new to the platform.

Limited out-of-the-box integrations means it can be challenging if you don’t have the resources or expertise to build custom integrations.

Proprietary change data capture (CDC) alternatives

  • Attunity Replicate is a commercial CDC option supporting a variety of database platforms including Oracle.
  • Informatica PowerCenter is an option for CDC from databases. It supports a wide range of databases, including Oracle.
  • Talend Data Integration is another commercial data integration platform that includes support for CDC.

For years, the main reason why a person would go with a proprietary option such as one of these or GoldenGate was for performant Oracle support.  Is that still true?

About Todd M

Todd has held multiple software roles over his 20-year career. For the last 5 years, he has focused on helping organizations move from batch to data streaming. In addition to the free tutorials, he provides consulting and coaching for Data Engineers, Data Scientists, and Data Architects. Feel free to reach out directly or to connect on LinkedIn.
