Change Data Capture – What, Why, How

Change Data Capture

Change Data Capture Overview

Change Data Capture (CDC) refers to the determination and tracking of changes made to the data in databases. CDC is utilized to extract and transport changes in online transactional processing (OLTP) data systems to downstream systems such as data warehouses, data lakes and/or stream processors.

In essence, CDC is implemented by writing to immutable transaction logs and then providing a mechanism to read from these logs. Recording transactions to log(s) minimize performance impact on the transaction database.  Turning on and configuring CDC also requires no change to tables or application.

Let’s consider a high-level example.  In an E-commerce application, all new, updated or deleted order transactions would be recorded via CDC to enable our streaming applications to consume and publish to downstream analytic systems, recommendation engines and/or fraud detection processors.  In our example, CDC is implemented without impacting the performance experience of the E-commerce end-user.

In the following sections, we’ll run through some more examples for emphasis.

Change Data Capture Vendor Examples

Change Data Capture is freely available out-of-the-box from database vendors such as Microsoft SQL Server, Oracle, PostgreSQL, and mySQL.

In Microsoft SQL Server, CDC records insert, edit and delete activities in the SQL Server table in a detailed format [1]. Column information, along with the metadata, is captured for the modified rows. Then, it is stored in append-only change tables. Table-valued functions enable users to have a systematic access to these change data tables.  Records within the change tables are immutable and therefore similar to the value of an immutable log.

Batch based Extraction, Transformation, and Loading (ETL) and/or streaming ELT applications may utilize the captured CDC data.  Let’s look at another example.

In Oracle, it is possible to capture and publish changed data in synchronous and asynchronous modes [2]. In synchronous mode, change data is captured as part of the transaction that modifies the source table. This mode uses triggers on the source database to capture change data. Change data is captured in real time on the source database. SYNC_SOURCE is a single, predefined synchronous change source that cannot be altered. Synchronous mode is cost-efficient, though it adds overhead to the source database at capture time.  In asynchronous mode, change data is captured from the database redo log files after changes have been made to the source database.

In mySQL, CDC is available as part of the binlog which was originally used in auditing and copying data to other mySQL systems [3]. But, the binlog may be utilized outside of mySQL for CDC events processing and saving to downstream analytics systems for example.

In PostgreSQL, change data capture is possible in either transaction logs or triggers. In transaction logs, all the write transactions (i.e., INSERT, UPDATE, DELETE, DDL’s) are written to the Write Ahead Logs (WAL) before the transaction result is sent to the user or client [4].  The WAL allows consumption of CDC events.

When to Consider Change Data Capture 

Change data capture plays a significant role in streaming data processing and pipelines. As the amount of data grows rapidly, the need for CDC techniques becomes crucial in order to handle data inflow for analytics and near real-time analytics such as machine learning and artificial intelligence (AI).

Problems arise when long-running, taxing analytic queries are introduced on OLTP systems which affect overall application performance. By using CDC-based technologies, it allows users to capture database mutations such as inserts, updates, and deletions automatically to process in streaming applications and/or to store in destinations suited for analytic queries.

CDC implementations contain the metadata necessary to understand the changes made. CDC technologies reduce cost and enable the improvement of data quality, accuracy, and provide a mechanism to create streaming architectures. CDC is a solution for continuous and accelerating growth in data volumes, reducing load time, resources and cost.

Change data capture summary list of advantages:

  • CDC implementation does not require application changes.
  • CDC only requires to configure database and objects.
  • CDC enables the user to identify change history.
  • CDC enables the user to add context information to every DML if required.
  • CDC has auto cleanup feature, deleting information automatically based on the retention period.

 

Change Data Capture has its own disadvantages:

  • CDC does not track change time.
  • CDC does not track security context.
  • CDC does not save how the data changed; it only tracks that a change was made.
  • CDC may add slight overhead to the system, depending on the number of changes.

 

Options to CONSUME Change Data Capture

Once you have configured CDC in Oracle, SQL Server, you may be wondering how can you consume and process the CDC events?  How can you implement a streaming architecture with CDC?  What are your options for building CDC stream processors?  etc.

At the time of this writing, here are a few options to consider and not listed in any particular order

  1. Debezium – Open Source.  Built on Kafka Connect framework. All of Debezium’s connectors are Kafka Connector source connectors so they have the pros and cons associated with Kafka Connect.
  2. StreamSets – Open Source.  Out of the box support for all CDC implementations described here as well as others.
  3. Attunity – Proprietary.

 

In addition, you may consider vendor-specific options as well such as

  1. Oracle Golden Gate
  2. Dynamo DB Streams

 

Conclusion

Change data capture is available out-of-the-box in many database systems such as MS SQL Server, Oracle, mySQL, Postgres, Mongo (opLog), Dynamo (DynamoDB Streams), etc.  CDC is prevalent in streaming architectures when implementing separation of concerns between transactions and other concerns such as analytics, search indexing, machine learning, AI, near real-time monitoring and alerting.

Hope this helps!

Let me know if you have any questions or concerns.

 

References

[1] SQL Server https://docs.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-2017

[2] Oracle CDC https://docs.oracle.com/cd/E11882_01/server.112/e25554/cdc.htm

[3] MySQL binlog https://dev.mysql.com/doc/internals/en/binary-log-overview.html

[4] PostgreSQL WAL https://www.postgresql.org/docs/11/runtime-config-wal.html

Image credit https://pixabay.com/en/deer-dream-animal-fantasy-1333814/

 

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.