Managing Data Drift With Apache Kafka® Connect and a Schema Registry

In today's businesses, data flows across many technologies, teams, and people. Because businesses keep growing and changing, the way we collect and share data changes all the time too. We need to know not only who owns a given piece of data but also what to do when that data changes. This problem is often referred to as "data drift".

Consider the scenario where a piece of data is modified at its source: what implications does this have for other systems that rely on it? How do we communicate the necessary changes to stakeholders? Conversely, how do we prevent changes that could disrupt those systems?
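
One way to make such changes visible before they break anything is to check every new schema against a schema registry. The snippet below is a minimal sketch, assuming a Confluent-compatible Schema Registry (such as Karapace) reachable at http://localhost:8081 and an existing subject named orders-value (both hypothetical): it posts a candidate schema to the registry's REST compatibility endpoint and reports whether the change would break existing consumers.

```python
import json
import requests  # any HTTP client works; requests is assumed here

REGISTRY_URL = "http://localhost:8081"  # hypothetical registry endpoint
SUBJECT = "orders-value"                # hypothetical subject name

# Candidate schema: suppose "amount" was renamed to "total_amount" at the source.
candidate_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "total_amount", "type": "double"},
    ],
}

# Ask the registry whether this schema is compatible with the latest
# registered version for the subject (Confluent-compatible REST API).
resp = requests.post(
    f"{REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(candidate_schema)}),
)
resp.raise_for_status()
print("Compatible change" if resp.json().get("is_compatible") else "Breaking change")
```

Wired into CI or a deployment pipeline, a check like this turns silent data drift into an explicit, reviewable event.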

Consistent Change Data Capture Across Multiple Tables

Change data capture (CDC) is a widely adopted pattern for moving data across systems. While the basic principle works well for small, single-table use cases, things get complicated when consistency must be preserved across information that spans multiple tables. In cases like this, creating multiple one-to-one CDC flows is not enough to guarantee a consistent view of the data in the database, because each table is tracked separately. Aligning the data with transaction boundaries becomes a hard, error-prone problem to solve once the data leaves the database.

This tutorial shows how to use PostgreSQL logical decoding, the outbox pattern, and Debezium to propagate a consistent view of a dataset spanning multiple tables.
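
To make the idea concrete, here is a minimal sketch of the outbox pattern under hypothetical table and column names (orders, order_lines, outbox), using psycopg2: the application writes its business rows and a single outbox event inside the same PostgreSQL transaction, and Debezium is then configured to capture only the outbox table, so downstream consumers see changes aligned with transaction boundaries.

```python
import json
import uuid
import psycopg2  # PostgreSQL driver; connection details below are placeholders

conn = psycopg2.connect("dbname=shop user=app password=secret host=localhost")

order_id = str(uuid.uuid4())
with conn:  # one transaction: business tables and outbox commit (or roll back) together
    with conn.cursor() as cur:
        # Business change spanning two tables.
        cur.execute(
            "INSERT INTO orders (id, customer_id, total) VALUES (%s, %s, %s)",
            (order_id, "cust-42", 99.90),
        )
        cur.execute(
            "INSERT INTO order_lines (order_id, product_id, quantity) VALUES (%s, %s, %s)",
            (order_id, "prod-7", 3),
        )
        # Outbox row describing the whole change; Debezium captures only this table.
        cur.execute(
            "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
            "VALUES (%s, %s, %s, %s, %s)",
            (
                str(uuid.uuid4()),
                "order",
                order_id,
                "OrderCreated",
                json.dumps({"order_id": order_id,
                            "lines": [{"product_id": "prod-7", "quantity": 3}]}),
            ),
        )
conn.close()
```

Because the outbox row commits atomically with the business rows, a consumer never observes an order without its lines, and a rolled-back transaction produces no event at all.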

Using PostgreSQL® JSON Functions To Navigate Reviews of Restaurants in India

The original idea behind relational databases was "structure, then data": you needed to define what the data looked like before being able to insert any content. This strict definition of the data structure helped keep datasets in order by verifying data types, referential integrity, and additional business conditions using dedicated constraints.

But sometimes life can't be predicted, and data can take different shapes. To allow some flexibility, modern databases like PostgreSQL® started adding semi-structured column options such as JSON, where only a formal check on the shape of the data is performed.
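
As a small illustration of that flexibility (the reviews table and psycopg2 connection details below are assumptions, not part of the original article): reviews with differently shaped payloads can land in a single JSONB column, and PostgreSQL's JSON operators such as ->> still let you filter and project on exactly the fields you care about.

```python
import json
import psycopg2  # PostgreSQL driver; connection details below are placeholders

conn = psycopg2.connect("dbname=food user=app password=secret host=localhost")
with conn:
    with conn.cursor() as cur:
        # A single JSONB column accepts reviews of different shapes.
        cur.execute(
            "CREATE TABLE IF NOT EXISTS reviews (id serial PRIMARY KEY, doc jsonb)"
        )
        cur.execute(
            "INSERT INTO reviews (doc) VALUES (%s), (%s)",
            (
                json.dumps({"city": "Bangalore", "restaurant": "Spice Route",
                            "rating": 4.5}),
                json.dumps({"city": "Mumbai", "restaurant": "Bay Leaf",
                            "rating": 4.1, "tags": ["vegetarian", "rooftop"]}),
            ),
        )
        # ->> extracts a JSON field as text; casting lets us treat it as a number.
        cur.execute(
            "SELECT doc ->> 'restaurant', (doc ->> 'rating')::numeric "
            "FROM reviews WHERE doc ->> 'city' = %s",
            ("Bangalore",),
        )
        for name, rating in cur.fetchall():
            print(name, rating)
conn.close()
```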

Pros and Cons of Multi-Step Data Platforms

In the modern world, it's rare for data to stay in the same shape and on the same platform from the beginning to the end of its journey. Some technologies can cover quite a wide range of functionality, but sometimes at the expense of precision, developer experience, or performance. Therefore, to achieve better or faster results, people might select a new tool for a specific task and start an implementation and integration process to move the data around.

This blog post highlights the pros and cons of a "one size fits all" approach, where a single platform is used for every use case, versus the "best tool for the job" approach, where various tools and integrations are combined to fulfill the requirements.

Is Apache Kafka Providing Real Message Ordering?

One of Apache Kafka’s best-known mantras is “it preserves the message ordering per topic-partition,” but is it always true? In this blog post, we’ll analyze a few real scenarios where accepting the dogma without questioning it could result in unexpected and erroneous sequences of messages.

Basic Scenario: Single Producer

We can start our journey with a basic scenario: a single producer sending messages to an Apache Kafka topic with a single partition, in sequence, one after the other.
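
Here is a minimal sketch of this baseline, assuming a local broker at localhost:9092 and a single-partition topic named orderings (both hypothetical), with the confluent-kafka Python client: the producer sends messages one after the other and waits for each acknowledgment, so the order of appends in the partition matches the order of sends.

```python
from confluent_kafka import Producer  # confluent-kafka Python client

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Report where each message landed (or why it failed).
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"{msg.value().decode()} -> partition {msg.partition()}, offset {msg.offset()}")

for i in range(5):
    producer.produce("orderings", value=f"message-{i}", on_delivery=on_delivery)
    # Waiting for the acknowledgment after every send keeps exactly one
    # message in flight, so the broker receives them strictly in order.
    producer.flush()
```

Flushing after every send is deliberately conservative: it keeps a single message in flight, trading throughput for a strictly sequential baseline.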

From Data Stack to Data Stuck: The Risks of Not Asking the Right Data Questions

Companies are in continuous motion: new requirements, new data streams, and new technologies are popping up every day. When designing new data platforms to support the needs of your company, failing to perform a complete assessment of the available options can have disastrous effects on the company's capability to innovate and to keep its data assets usable and reusable in the long term.

Having a standard assessment methodology is an absolute must to avoid personal bias and to properly evaluate the various solutions across all the needed axes. The SOFT Methodology provides a comprehensive guide to all the evaluation points needed to define robust and future-proof data solutions. However, the original blog doesn’t discuss a couple of important factors: why is applying a methodology like SOFT important? And, even more, what risks can we encounter if we don’t? This blog aims to cover both aspects.

A SOFT Methodology to Define Robust Data Platforms

In today's data-driven era, it's critical to design data platforms that help businesses foster innovation and compete in the market. Selecting a robust, future-proof set of tools and architectures requires a balancing act between purely technological concerns and the wider constraints of the project. These constraints include challenges regarding regulations, existing workforce skills, talent acquisition, agreed timelines, and your company’s established processes.

A modern data platform is the set of technologies, configurations, and implementation details that allows data to be stored and moved across the company's systems to satisfy business needs. The SOFT methodology introduces an approach, based on four pillars, to define future-proof and effective data platforms that are scalable, observable, fast, and trustworthy. Its aim is to enable data architects and decision-makers to evaluate both current implementations and future architectures against a common set of criteria.