Leveraging Change Data Capture for Fraud Detection using Arcion Cloud

During the height of the business intelligence (BI) craze earlier in my career, I worked with an internal reporting team to expose data for extract, transform, and load (ETL) processes that leveraged data structures inspired by Ralph Kimball. It was a new and exciting time in my life to understand how to optimize data for reporting and analysis. Honestly, the schema looked upside down to me, based on my experience with transaction-driven designs.

In the end, there were many moving parts and even some dependencies for the existence of a flat file to make sure everything worked properly. The reports ran quickly, but one key factor always bothered me: I was always looking at yesterday’s data.

How to Set Up and Run PostgreSQL Change Data Capture

The architecture of modern web applications consists of several software components such as dashboards, analytics, databases, data lakes, caches, search, etc.

The database is usually the core part of any application. Real-time data updates keep disparate data systems in continuous sync and respond quickly to new information. So how to keep your application ecosystem in sync? How do these other components get information about changes in the database? Change Data Capture or CDC refers to any solution that identifies new or changed data.

SaaS Galore: Integrating CockroachDB With Confluent Kafka, Fivetran, and Snowflake

Motivation

The problem this tutorial is trying to solve is the lack of a native Fivetran connector for CockroachDB. My customer has built their analytics pipeline based on Fivetran. Given there is no native integration, their next best guess was to set up a Postgres connector:

CockroachDB is PostgreSQL wire compatible, but it is not correct to assume it is 1:1. Let's attempt to configure the connector:

Audit Database Changes with Debezium

Debezium Logo

In this article, we will explore Debezium to capture data changes. Debezium is a distributed open-source platform for change data capture. Point the Debezium connector to the database and start listening to the change data events like inserts/updates/deletes right from the database transaction logs that other applications commit to your database.

Debezium is a collection of source connectors of Apache Kafka Connect. Debezium's log-based Change Data Capture ( ) allows ingesting the changes directly from the database's transaction logs. Unlike other approaches, such as polling or dual writes, the log-based approach brings the below features.

Change Data Capture With Debezium: A Simple How-To, Part 1

One question always comes up as organizations moving towards being cloud-native, twelve-factor, and stateless: How do you get an organization’s data to these new applications? There are many different patterns out there, but one pattern we will look at today is change data capture. This post is a simple how-to on how to build out a change data capture solution using Debezium within an OpenShift environment. Future posts will also add to this and add additional capabilities.

What Is Change Data Capture?

Another Red Hatter, Sadhana Nandakumar, sums it up well in one of her posts around change data capture:

Tracking Changes in MongoDB With Scala and Akka

Need for Real-Time Consistent Data

Many different databases are used at Adform, each tailored for specific requirements, but what is common for these use cases is the necessity for a consistent interchange of data between these data stores. It’s a tedious task to keep the origin of that data and its copies consistent manually, not to mention that with a sufficiently large number of multiplications the origin may not be the source of truth anymore. The need for having its own copy of data is also dictated by the necessity of loose coupling and performance. It wouldn’t be practical to be constantly impacted by every change made in the source system. The answer here is an event-based architecture which allows to keep every change consistent and provides us with the possibility of restoring the sequence of changes related to particular entities. For those reasons, the decision was made to use the publisher/subscriber model. MongoDB’s change streams saved the day, finally letting us say farewell to much more complex oplog tailing.

Change Streams

As of version 3.6 MongoDB offers change data capture implementation named as change streams. It allows us to follow every modification made to an entire database or chosen set of collections. Previous versions already offered some solution to that problem by means of oplog (operation log) mechanism but tailing it directly had serious drawbacks, especially huge traffic caused by iteration overall changes to all collections and lack of reliable API allowing to resume tracking after any interruption. Change streams solve these issues by hiding oplog’s nook and crannies from us behind refined API interoperable with reactive streams implementations.

Domain Events Versus Change Data Capture

The building of change data capture (CDC) and event based systems have recently come up several time in my discussions with people and in my online trawling. I sensed enough confusion around them that I figured this was worth talking about here.
CDC and event-based communication are two very different things which look similar to some extent, hence the confusion. Beware — confusing one for the other can lead to very difficult architectural situations.

What Are These Things?

Change Data Capture (CDC) typically alludes to a mechanism for capturing all changes happening to a system's data. The need for such a system is not difficult to imagine — audit for sensitive information, data replication across multiple DB instances or data centers, moving changes from transactional databases to data lakes/OLAP stores. Transaction management in ACID compliant databases is essentially CDC. A CDC system is a record of every change every made to an entity and the metadata of that change (changed by, change time etc).
You may also enjoy: Change Data Capture (CDC) With Embedded Debezium and Spring Boot
I have written about events on this blog before and have described them as announcements of something that has happened in the system domain, with relevant data about that occurrence. At a glance, this might seem to be the same as CDC — something changes in a system and this needs to be communicated to other systems — which is exactly what CDC is about.

However, there is a key distinction to be made here. Events are defined at a far higher level of abstraction than data changes because they are meaningful changes to the domain. Data representing an entity can change without it having any "business" impact on the overall entity that the data represents. There can be several sub-states of an order that an order management system might maintain internally but which do not matter to the outside world. 

An order moving to these states would not generate events but changes would be logged in the CDC system. On the other hand, there are be states that the rest of the world cares about (created, dispatched etc) and the order management system explicitly exposes to the outside world. Changes to or from these states would generate events.

Change Data Capture (CDC) With Embedded Debezium and SpringBoot

While working with data or replicating data sources, you probably have heard the term Change Data Capture (CDC). As the name suggests, “CDC” is a design pattern that continuously identifies and captures incremental changes to data. This pattern is used for real-time data replication across live databases to analytical data sources or read replicas. It can also be used to trigger events based on data changes, such as the OutBox pattern.

Most modern databases support CDC through transaction logs. A transaction log is a sequential record of all changes made to the database while the actual data is contained in a separate file.

Implementing the Outbox Pattern

Looking outside the box.


You may also like: Design Patterns for Microservices

The Problem Statement

Microservices often publish events after performing a database transaction. Writing to the database and publishing an event are two different transactions and they have to be atomic. A failure to publish an event can mean critical failure to the business process.