Change Data Capture From PostgreSQL to Azure Data Explorer Using Kafka Connect

This blog post demonstrates how you can use Change Data Capture to stream database modifications from PostgreSQL to Azure Data Explorer (Kusto) using Apache Kafka.

Change Data Capture (CDC) can be used to track row-level changes in database tables in response to create, update and delete operations. It is a powerful technique, but useful only when there is a way to leverage these events and make them available to other services.
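To make that concrete for this particular pipeline, here is a minimal sketch (Java 15+, using the JDK's built-in HTTP client and a text block) of registering a PostgreSQL source connector with the Kafka Connect REST API. It assumes Debezium as the CDC connector; the host names, credentials, database name, and topic prefix are placeholders, and exact property names vary between Debezium versions. The Azure Data Explorer side would be configured in the same way with the Kusto sink connector.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterPostgresCdcConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition for a Debezium PostgreSQL source.
        // Host, credentials, database, and topic prefix are placeholders;
        // older Debezium releases use "database.server.name" instead of "topic.prefix".
        String connectorJson = """
            {
              "name": "pg-cdc-source",
              "config": {
                "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                "plugin.name": "pgoutput",
                "database.hostname": "postgres",
                "database.port": "5432",
                "database.user": "postgres",
                "database.password": "postgres",
                "database.dbname": "orders_db",
                "topic.prefix": "retail"
              }
            }
            """;

        // POST the definition to the Kafka Connect REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

In practice the same JSON is usually submitted with curl or a deployment pipeline; the point is that a single REST call is all it takes to start streaming row-level changes into Kafka.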

Kafka for XML Message Integration and Processing

XML messages and XML Schema are not very common in the Apache Kafka and event streaming world! Why? Many people call XML legacy. It is complex, verbose, and often associated with the ugly WS-* hell (SOAP, WSDL, etc.). On the other hand, every company older than five years uses XML. It is well understood, provides a good structure, and is human- and machine-readable.

This post is not meant to start another flame war between XML and other technologies such as JSON (which now also has JSON Schema), Avro, or Protobuf. Instead, I will walk you through the three main approaches to integrating Kafka with XML messages, as there is still significant demand for this integration today (often to connect legacy applications and middleware).
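As a hedged illustration of the simplest of those approaches, custom glue code, the sketch below parses an XML message with the JDK's built-in DOM parser and forwards the extracted fields to a Kafka topic. The XML structure, field names, and topic name are invented for this example; the other approaches would push this work into a connector or middleware instead.

```java
import java.io.StringReader;
import java.util.Properties;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XmlToKafka {
    public static void main(String[] args) throws Exception {
        // Hypothetical XML order message; the structure is made up for this example.
        String xml = "<order><id>42</id><amount>19.99</amount></order>";

        // Parse with the JDK's built-in DOM parser.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        String id = doc.getElementsByTagName("id").item(0).getTextContent();
        String amount = doc.getElementsByTagName("amount").item(0).getTextContent();

        // Forward the extracted fields to Kafka, here as a simple JSON string.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String json = String.format("{\"id\": %s, \"amount\": %s}", id, amount);
            producer.send(new ProducerRecord<>("orders", id, json));
        }
    }
}
```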

Apache Kafka and SAP ERP Integration Options

A question I get every week from customers across the globe is, "How can I integrate my SAP system with Apache Kafka?" This post explores various alternatives, including connectors, third-party tools, and custom glue code, and discusses the trade-offs between them.

After exploring what SAP is, I will discuss several integration options between Apache Kafka and SAP systems.

Why You Should and How to Archive Your Kafka Data to Amazon S3

Over the last decade, the volume and variety of data and data storage technologies have been soaring. Businesses in every industry have been looking for cost-effective ways to store that data, and low-cost storage is one of the main requirements for long-term retention. With the shift to [near] real-time data pipelines and the growing adoption of Apache Kafka, companies must find ways to reduce the Total Cost of Ownership (TCO) of their data platform.

Kafka is typically configured with a short retention period, often around three days. Data older than the retention period is copied via streaming data pipelines to scalable external storage, such as Amazon S3 or the Hadoop Distributed File System (HDFS), for long-term use.
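Here is a minimal sketch of the retention half of that setup, using Kafka's AdminClient to cap a topic at three days of data. The topic name and the three-day value are examples; the actual copy to S3 would be handled by a sink connector (for example, an S3 sink running on Kafka Connect) or another streaming pipeline, not by this snippet.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.TimeUnit;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Keep only three days of data on the topic itself; older data is
            // assumed to have already been archived to S3 (or HDFS) by a sink
            // connector or other pipeline before the broker deletes it.
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            ConfigEntry retention = new ConfigEntry("retention.ms",
                    String.valueOf(TimeUnit.DAYS.toMillis(3)));
            AlterConfigOp op = new AlterConfigOp(retention, AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(op))).all().get();
        }
    }
}
```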

Kafka Connect on Kubernetes, the Easy Way!

This is a tutorial that shows how to set up and use Kafka Connect on Kubernetes using Strimzi, with the help of an example.

Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other systems using source and sink connectors. Although it's not too hard to deploy a Kafka Connect cluster on Kubernetes (just "DIY"!), I love the fact that Strimzi enables a Kubernetes-native way of doing this using the Operator pattern with the help of Custom Resource Definitions.

Change Data Capture Architecture Using Debezium, Postgres, and Kafka

Change Data Capture (CDC) is a technique used to track row-level changes in database tables in response to create, update, and delete operations. Different databases use different techniques to expose these change data events - for example, logical decoding in PostgreSQL and the binary log (binlog) in MySQL. This is a powerful capability, but useful only if there is a way to tap into these event logs and make them available to other services that depend on that information.

Debezium does just that! It is a distributed platform that builds on top of the Change Data Capture features available in different databases. It provides a set of Kafka Connect connectors that tap into row-level changes (using CDC) in database tables and convert them into event streams. These event streams are sent to Apache Kafka, a scalable event streaming platform - a perfect fit! Once the change log events are in Kafka, they are available to all downstream applications.
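As a small sketch of what a downstream application sees, the consumer below reads change events from one of those topics. The topic name follows Debezium's default prefix.schema.table convention and is only an example; with the default JSON converter, each record value carries the before, after, and op fields of the change event.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ChangeEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "cdc-demo");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Debezium topic names default to <prefix>.<schema>.<table>;
            // "retail.public.orders" is just an example.
            consumer.subscribe(List.of("retail.public.orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // With the default JSON converter, the value contains the
                    // "before", "after", and "op" fields of the change event.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```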

How to Build Your First Real-Time Streaming (CDC) System, Part 1

Introduction

With the exponential growth of data and many businesses moving online, it has become imperative to design systems that can act in real time or near real time to support business decisions. So, after working on multiple backend projects over many years, I finally got to build a real-time streaming platform. While working on the project, I experimented with different tech stacks to tackle this, and I am sharing my learnings in a series of articles. Here is the first of them.

Target Audience

This post is aimed at engineers who are already familiar with microservices and Java and are looking to build their first real-time streaming pipeline. For readability, this POC is divided into four articles.

Apache Kafka, KSQL, and Apache PLC4X for Industrial IoT and Automation

Learn more about IIoT automation with Apache Kafka, KSQL, and Apache PLC4X

Data integration and processing is a huge challenge in Industrial IoT (IIoT, aka Industry 4.0 or Automation Industry) due to monolithic systems and proprietary protocols. Apache Kafka, its ecosystem (Kafka Connect, KSQL), and Apache PLC4X are a great open-source choice to implement this IIoT integration end-to-end in a scalable, reliable, and flexible way.

This blog post gives a high-level overview of the challenges and a good, flexible architecture to solve them. At the end, I share a video recording and the corresponding slide deck, which provide many more details and insights.
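To give a flavor of the ingestion edge of such an architecture, here is a rough sketch that reads one value from a PLC with PLC4X and forwards it to a Kafka topic for downstream processing (for example, with KSQL). The S7 connection string, field address, and topic name are placeholders, and the PLC4X read-request builder methods have been renamed across releases (this follows the older addItem style), so treat it as illustrative rather than definitive.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.plc4x.java.PlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;

public class PlcToKafka {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The S7 connection string and field address below are placeholders.
        try (PlcConnection plc = new PlcDriverManager().getConnection("s7://192.168.0.10");
             KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

            PlcReadRequest request = plc.readRequestBuilder()
                    .addItem("temperature", "%DB1.DBD0:REAL")
                    .build();
            PlcReadResponse response = request.execute().get();

            // Forward the raw value to a Kafka topic for downstream KSQL processing.
            String value = String.valueOf(response.getObject("temperature"));
            producer.send(new ProducerRecord<>("machine-sensors", "plc-1", value)).get();
        }
    }
}
```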