Building an Enterprise CDC Solution

Introduction

This article is a follow-up to Data Platform: Building an Enterprise CDC Solution, where Miguel García and I described:

  • Several Change Data Capture (CDC) use cases and common scenarios in an enterprise platform
  • A proposal using Debezium (as log-based CDC) to capture data from the relational databases, and Kafka as a channel that enables several consumers to propagate data changes for different use cases.

One common scenario for this solution is data replication from an OLTP database to an OLAP database (from the operational database to the data warehouse).

Audit Database Changes with Debezium

In this article, we will explore Debezium to capture data changes. Debezium is a distributed open-source platform for change data capture. Point a Debezium connector at the database, and it starts listening for change data events (inserts, updates, and deletes) directly from the database transaction logs as other applications commit to your database.

Debezium is a collection of source connectors for Apache Kafka Connect. Debezium's log-based Change Data Capture (CDC) ingests changes directly from the database's transaction logs. Unlike other approaches, such as polling or dual writes, the log-based approach offers the following features.

Change Data Capture With Debezium: A Simple How-To, Part 1

One question always comes up as organizations move toward being cloud-native, twelve-factor, and stateless: How do you get an organization’s data to these new applications? There are many different patterns out there, but the one we will look at today is change data capture. This post is a simple how-to for building a change data capture solution using Debezium within an OpenShift environment. Future posts will build on this one and add capabilities.

What Is Change Data Capture?

Another Red Hatter, Sadhana Nandakumar, sums it up well in one of her posts around change data capture:

Change Data Capture From PostgreSQL to Azure Data Explorer Using Kafka Connect

This blog post demonstrates how you can use Change Data Capture to stream database modifications from PostgreSQL to Azure Data Explorer (Kusto) using Apache Kafka.

Change Data Capture (CDC) can be used to track row-level changes in database tables in response to create, update and delete operations. It is a powerful technique, but useful only when there is a way to leverage these events and make them available to other services.

Using PostgreSQL pgoutput Plugin for Change Data Capture With Debezium

Change Data Capture Architecture Using Debezium, Postgres, and Kafka was a tutorial on how to use Debezium to capture change data from Azure PostgreSQL and send the change events to Azure Event Hubs for Kafka; it used the wal2json output plugin.

What About the pgoutput Plugin?

This blog provides a quick walk-through of how to use the pgoutput plugin. I will not repeat many of the details, and I will use containerized versions (via Docker Compose) of Kafka Connect and Kafka (and Zookeeper) to keep things simple. So, the only thing you need is Azure PostgreSQL, which you can set up using a variety of options, including the Azure Portal, Azure CLI, Azure PowerShell, or an ARM template.
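As a rough sketch of what selecting pgoutput can look like, the snippet below builds the JSON payload that would be sent to the Kafka Connect REST API to register a Debezium PostgreSQL connector. The connector name, host, credentials, topic prefix, and table list are placeholders invented for illustration. The property names follow the Debezium PostgreSQL connector; `topic.prefix` is the Debezium 2.x name (older releases use `database.server.name`), so verify them against the version you run.

```python
import json


def build_pgoutput_connector(name, host, user, password, dbname, tables):
    """Build a Kafka Connect registration payload for a Debezium PostgreSQL
    connector that reads changes through the pgoutput logical decoding plugin."""
    return {
        "name": name,
        "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            # pgoutput ships with PostgreSQL 10+, so no extra server plugin is installed.
            "plugin.name": "pgoutput",
            "database.hostname": host,
            "database.port": "5432",
            "database.user": user,
            "database.password": password,
            "database.dbname": dbname,
            # Placeholder logical name used as the prefix of the Kafka topics.
            "topic.prefix": "myserver",
            "table.include.list": ",".join(tables),
        },
    }


# All values below are hypothetical placeholders.
payload = build_pgoutput_connector(
    name="inventory-connector",
    host="example-host.postgres.database.azure.com",
    user="debezium_user",
    password="change-me",
    dbname="inventory",
    tables=["public.orders", "public.customers"],
)
print(json.dumps(payload, indent=2))
```

The printed JSON is what one would typically POST to the `/connectors` endpoint of a Kafka Connect worker.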

Change Data Capture Architecture Using Debezium, Postgres, and Kafka

Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update, and delete operations. Different databases use different techniques to expose these change data events, for example, logical decoding in PostgreSQL and the binary log (binlog) in MySQL. This is a powerful capability, but it is useful only if there is a way to tap into these event logs and make them available to other services that depend on that information.

Debezium does just that! It is a distributed platform that builds on top of Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors which tap into row-level changes (using CDC) in database table(s) and convert them into event streams. These event streams are sent to Apache Kafka which is a scalable event streaming platform - a perfect fit! Once the change log events are in Kafka, they will be available to all the downstream applications.
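To make the idea of a downstream consumer concrete, here is a minimal sketch that applies change events to an in-memory copy of a table. It assumes the usual Debezium envelope fields (`op`, `before`, `after`) and a primary key column named `id`; the sample events are hand-written for illustration, not captured from a real connector, and reading from Kafka is left out.

```python
def apply_change_event(replica, event, key_field="id"):
    """Apply one Debezium-style change event to a dict keyed by primary key."""
    # Events may arrive wrapped as {"schema": ..., "payload": ...} or as the bare envelope.
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = payload["after"]
        replica[row[key_field]] = row
    elif op == "d":  # delete: the old row state is carried in "before"
        row = payload["before"]
        replica.pop(row[key_field], None)
    return replica


# Hand-written sample events (hypothetical data).
events = [
    {"op": "c", "before": None, "after": {"id": 1, "name": "alice"}},
    {"op": "u", "before": {"id": 1, "name": "alice"}, "after": {"id": 1, "name": "alicia"}},
    {"op": "c", "before": None, "after": {"id": 2, "name": "bob"}},
    {"op": "d", "before": {"id": 2, "name": "bob"}, "after": None},
]

replica = {}
for event in events:
    apply_change_event(replica, event)

print(replica)  # {1: {'id': 1, 'name': 'alicia'}}
```

Because every consumer replays the same ordered stream, each one can build its own view of the data without querying the source database.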

Change Data Capture (CDC) from MySQL Database to Kafka with Kafka Connect and Debezium

Introduction

Debezium is an open-source project developed by Red Hat that aims to simplify this process by allowing you to extract changes from various database systems (e.g., MySQL, PostgreSQL, MongoDB) and push them to Kafka.


Debezium Connectors

Debezium has a library of connectors that capture changes from a variety of databases and produce events with very similar structures, making it easier for applications to consume and respond to the events regardless of where the changes originated. Debezium currently has the following connectors:

How to Build your First Real-Time Streaming (CDC) System Part 1

Introduction

With the exponential growth of data and many businesses moving online, it has become imperative to design systems that can act in real time or near real time to support business decisions. After working on multiple backend projects over many years, I finally got to build a real-time streaming platform. While working on the project, I started experimenting with different tech stacks for this. I am sharing what I learned in a series of articles. Here is the first of them.

Target Audience

This post is aimed at engineers who are already familiar with microservices and Java and are looking to build their first real-time streaming pipeline. This POC is divided into four articles for readability. They are as follows:

Change Data Capture (CDC) With Embedded Debezium and SpringBoot

While working with data or replicating data sources, you have probably heard the term Change Data Capture (CDC). As the name suggests, “CDC” is a design pattern that continuously identifies and captures incremental changes to data. This pattern is used for real-time data replication from live databases to analytical data sources or read replicas. It can also be used to trigger events based on data changes, as in the Outbox pattern.

Most modern databases support CDC through transaction logs. A transaction log is a sequential record of all changes made to the database while the actual data is contained in a separate file.
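The following toy model illustrates why a sequential log suits CDC: a reader only needs to remember the last position it processed and can then pick up exactly the changes written since. The `TransactionLog` class and `capture_changes` function are hypothetical names invented for this sketch; real databases expose their logs through mechanisms such as logical decoding or the binlog rather than a Python list.

```python
class TransactionLog:
    """Toy append-only log: every entry gets a monotonically increasing position."""

    def __init__(self):
        self._entries = []

    def append(self, op, key, value=None):
        position = len(self._entries)
        self._entries.append((position, op, key, value))
        return position

    def read_from(self, position):
        """Return all entries at or after the given position, in write order."""
        return self._entries[position:]


def capture_changes(log, last_position):
    """Return entries written after last_position and the position to resume from."""
    entries = log.read_from(last_position + 1)
    new_position = entries[-1][0] if entries else last_position
    return entries, new_position


log = TransactionLog()
log.append("insert", 1, "alice")
log.append("update", 1, "alicia")

# First poll: nothing processed yet, so start before position 0.
first_batch, position = capture_changes(log, last_position=-1)

log.append("delete", 1)

# Second poll resumes where the first one stopped and sees only the delete.
second_batch, position = capture_changes(log, position)

print(first_batch)
print(second_batch)
```

Storing the resume position outside the reader is what lets a CDC process restart without missing or re-reading committed changes.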

Implementing the Outbox Pattern

Looking outside the box.

The Problem Statement

Microservices often publish events after performing a database transaction. Writing to the database and publishing an event are two separate operations, yet together they must behave atomically: either both happen or neither does. A failure to publish an event can mean a critical failure of the business process.
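One common way to address this, and the one the Outbox pattern takes, is to store the event in an outbox table inside the same database transaction as the business change, and let a separate relay (for example, a CDC connector) publish it afterwards. The sketch below shows only the transactional write, using SQLite from the Python standard library; the `orders` and `outbox` tables, their columns, and the `place_order` function are assumptions made up for this illustration.

```python
import json
import sqlite3


def place_order(conn, order_id, customer, event_type="OrderCreated"):
    """Insert the order and its outbox event in a single transaction."""
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.execute(
            "INSERT INTO orders (id, customer) VALUES (?, ?)",
            (order_id, customer),
        )
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, event_type, json.dumps({"id": order_id, "customer": customer})),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "aggregate_id INTEGER NOT NULL, "
    "event_type TEXT NOT NULL, "
    "payload TEXT NOT NULL)"
)

place_order(conn, 1, "alice")

# Simulate a failure while writing the event: event_type violates NOT NULL,
# so the order row inserted just before it is rolled back as well.
try:
    place_order(conn, 2, "bob", event_type=None)
except sqlite3.IntegrityError:
    pass

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
outbox_count = conn.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]
print(order_count, outbox_count)  # 1 1
```

Since the order and its event are committed together, a relay that reads the outbox table never sees an event for an order that was not saved, and never misses one for an order that was.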