Real-Time Streaming ETL Using Apache Kafka, Kafka Connect, Debezium, and ksqlDB

As most of you already know, ETL stands for Extract, Transform, Load: the process of moving data from source systems to a target system, transforming it along the way. First, we will clarify why we need to transfer data from one point to another; second, we will look at traditional approaches; finally, we will describe how to build a real-time streaming ETL process using Apache Kafka, Kafka Connect, Debezium, and ksqlDB.

When we build our business applications, we design the data model around the application's functional requirements, not around operational or analytical reporting needs. A data model for reporting should be denormalized, whereas the data model backing an application's operations is mostly normalized. So, for reporting or any other analytical purpose, we need to convert our data into a denormalized form.
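To make this concrete, here is a minimal sketch of that denormalization step in ksqlDB. The orders and customers topics, their fields, and the join are hypothetical; the sketch assumes both topics already exist and carry JSON records.

    -- Register the (hypothetical) source topics as a stream and a table
    CREATE STREAM orders (order_id VARCHAR KEY, customer_id VARCHAR, amount DOUBLE)
      WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');

    CREATE TABLE customers (customer_id VARCHAR PRIMARY KEY, name VARCHAR, region VARCHAR)
      WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON');

    -- Continuously produce a denormalized stream: each order enriched with customer data
    CREATE STREAM orders_denormalized AS
      SELECT o.customer_id, o.order_id, o.amount, c.name AS customer_name, c.region
      FROM orders o
      JOIN customers c ON o.customer_id = c.customer_id
      EMIT CHANGES;

Every new order event is joined against the latest customer state as it arrives, so the reporting-friendly, denormalized stream stays continuously up to date instead of being rebuilt in batches.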

Apache Kafka Patterns and Anti-Patterns

Apache Kafka offers the operational simplicity of data engineers' dreams. A message broker that allows clients to publish and read streams of data — Kafka has an ecosystem of open-source components that, when combined, help store, process, and integrate data streams with other parts of your system in a secure, reliable, and scalable manner. This Refcard dives into select patterns and anti-patterns spanning Kafka Client APIs, Kafka Connect, and Kafka Streams, covering topics such as reliable messaging, scalability, error handling, and more.

Build a Data Pipeline on AWS With Kafka, Kafka Connect, and DynamoDB

There are many ways to stitch together data pipelines: open-source components, managed services, ETL tools, etc. In the Kafka world, Kafka Connect is the tool of choice for "streaming data between Apache Kafka and other systems." It has an extensive set of pre-built source and sink connectors as well as a common framework for Kafka connectors, which standardizes integration of other data systems with Kafka and makes it simpler to develop your own connectors, should there be a need to do so.
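As a taste of how little glue code this takes, here is a hedged sketch that registers a sink connector through ksqlDB's SQL interface (the same configuration could be POSTed to the Connect REST API directly). It assumes ksqlDB is attached to a Kafka Connect cluster with Confluent's JDBC sink plugin installed; the connection details and topic are hypothetical, and the series below uses a DynamoDB sink instead.

    -- Hypothetical sink: stream the 'orders' topic into a relational table
    CREATE SINK CONNECTOR orders_jdbc_sink WITH (
      "connector.class"     = 'io.confluent.connect.jdbc.JdbcSinkConnector',
      "connection.url"      = 'jdbc:postgresql://localhost:5432/analytics',
      "connection.user"     = 'etl',
      "connection.password" = 'etl-secret',
      "topics"              = 'orders',
      -- auto-create the target table on first write
      "auto.create"         = 'true'
    );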

This is a two-part blog series that provides a step-by-step walkthrough of data pipelines with Kafka and Kafka Connect. I will be using AWS for demonstration purposes, but the concepts apply to any equivalent options (e.g., running these locally using Docker). Along the way, I will call out the key AWS services I will be using.

Apache Kafka Essentials

Dive into Apache Kafka: Readers will review its history and fundamental components — Pub/Sub, Kafka Connect, and Kafka Streams. Key concepts in these areas are supplemented with detailed code examples that demonstrate producing and consuming data, using connectors for easy data streaming and transformation, performing common operations in KStreams, and more.

Top 5 Apache Kafka Use Cases for 2022

Apache Kafka and Event Streaming are two of the most relevant buzzwords in tech these days. Do you wonder about my predicted top 5 event streaming architectures and use cases for 2022 to set data in motion? Check out the following presentation and learn about the Kappa architecture, hyper-personalized omnichannel, multi-cloud deployments, edge analytics, and real-time cybersecurity. 

Some followers might notice that I did the same presentation a year ago about the top 5 event streaming use cases for 2021. My predictions for 2022 partly overlap with this session. That's fine. It shows that event streaming with Apache Kafka is a journey and evolution to set data in motion.

When To Use Reverse ETL and When It Is an Anti-pattern

Most enterprises store their massive volumes of transactional and analytics data at rest in data warehouses or data lakes. Sales, marketing, and customer success teams require access to these data sets. Reverse ETL is a buzzword for the concept of collecting data from existing data stores and making it available to business teams easily and quickly.

This blog post explores why software vendors (try to) introduce new solutions for Reverse ETL, when it is needed, and how it fits into the enterprise architecture. The involvement of event streaming with tools like Apache Kafka to process data in motion is a crucial piece of Reverse ETL for real-time use cases.
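For the real-time flavor, a streaming alternative can serve business teams directly from a continuously maintained view instead of batch-exporting the warehouse. Here is a minimal, hypothetical ksqlDB sketch, reusing the kind of orders stream sketched earlier:

    -- Materialize per-customer revenue, updated on every new order event
    CREATE TABLE customer_revenue AS
      SELECT customer_id, SUM(amount) AS total_revenue
      FROM orders
      GROUP BY customer_id
      EMIT CHANGES;

    -- Business applications can issue point-in-time pull queries against the view
    SELECT total_revenue FROM customer_revenue WHERE customer_id = '42';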

Change Data Capture (CDC) From a MySQL Database to Kafka With Kafka Connect and Debezium

Introduction

Debezium is an open-source project developed by Red Hat that aims to simplify change data capture by allowing you to extract changes from various database systems (e.g., MySQL, PostgreSQL, MongoDB) and push them to Kafka.


Debezium Connectors

Debezium has a library of connectors that capture changes from a variety of databases and produce events with very similar structures, making it easier for applications to consume and respond to the events regardless of where the changes originated. Debezium currently offers connectors for databases such as MySQL, PostgreSQL, MongoDB, SQL Server, Oracle, and Db2.
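For example, registering Debezium's MySQL connector can again be done through ksqlDB (or via the Connect REST API). This is a hedged sketch: the hostnames, credentials, and table are hypothetical, and the property names follow the Debezium 1.x MySQL connector documentation (Debezium 2.x renames some of them, e.g., topic.prefix replaces database.server.name).

    -- Hypothetical Debezium MySQL source: capture changes from inventory.customers
    CREATE SOURCE CONNECTOR inventory_cdc WITH (
      "connector.class"      = 'io.debezium.connector.mysql.MySqlConnector',
      "database.hostname"    = 'mysql',
      "database.port"        = '3306',
      "database.user"        = 'debezium',
      "database.password"    = 'dbz-secret',
      "database.server.id"   = '184054',
      "database.server.name" = 'inventory-server',
      "table.include.list"   = 'inventory.customers',
      -- Debezium stores the captured schema history in its own Kafka topic
      "database.history.kafka.bootstrap.servers" = 'kafka:9092',
      "database.history.kafka.topic"             = 'schema-changes.inventory'
    );

Each committed row change then appears as an event on a topic such as inventory-server.inventory.customers, ready for downstream consumers or ksqlDB queries.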

Apache Kafka, KSQL, and Apache PLC4X for Industrial IoT and Automation

Learn more about IIoT automation with Apache Kafka, KSQL, and Apache PLC4X

Data integration and processing are a huge challenge in Industrial IoT (IIoT, aka Industry 4.0 or Automation Industry) due to monolithic systems and proprietary protocols. Apache Kafka, its ecosystem (Kafka Connect, KSQL), and Apache PLC4X are a great open-source combination for implementing this IIoT integration end-to-end in a scalable, reliable, and flexible way.
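To give a flavor of the KSQL side, here is a minimal sketch in current ksqlDB syntax. It assumes machine telemetry already lands on a Kafka topic (e.g., via the PLC4X Kafka Connect integration); the topic name, fields, and threshold are all assumptions.

    -- Register the (hypothetical) telemetry topic fed by PLC4X
    CREATE STREAM machine_telemetry (machine_id VARCHAR, metric VARCHAR, reading DOUBLE)
      WITH (KAFKA_TOPIC='plc4x-telemetry', VALUE_FORMAT='JSON');

    -- Continuously flag machines reporting temperatures above a threshold
    CREATE STREAM overheat_alerts AS
      SELECT machine_id, reading AS temperature
      FROM machine_telemetry
      WHERE metric = 'temperature' AND reading > 90.0
      EMIT CHANGES;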

This blog post gives a high-level overview of the challenges and presents a good, flexible architecture to solve them. At the end, I share a video recording and the corresponding slide deck, which provide many more details and insights.