Fast, Secure, and Highly Available Real-Time Data Warehousing Based on Apache Doris

This is an end-to-end guide for Apache Doris users, especially those in the financial sector, where data security and availability requirements are high. If you are unsure how to build a real-time data pipeline and make the most of Apache Doris' functionality, start with this post, and you will come away with plenty of inspiration.

It presents the practice of a non-bank payment service provider that serves over 25 million retailers and processes data from 40 million end devices. Its data sources include MySQL, Oracle, and MongoDB. The company was using Apache Hive as an offline data warehouse but felt the need to add a real-time data processing pipeline. After introducing Apache Doris, it increased data ingestion speed by 2~5 times, ETL performance by 3~12 times, and query execution speed by 10~15 times.

Handling Errors and Maintaining Data Integrity in ETL Processes

Error Mitigation in ETL Workflows

ETL — Extract, Transform, Load — is far more than a mere buzzword in today’s data-driven landscape. This methodology sits at the crossroads of technology and business, making it integral to modern data architectures. Yet, the complexities and intricacies involved in ETL processes make them susceptible to errors. These errors are not just 'bugs' but can be formidable roadblocks that could undermine data integrity, jeopardize business decisions, and lead to significant financial loss. Given the pivotal role that ETL processes play in organizational data management, understanding how to handle and mitigate these errors is non-negotiable. In this blog, we will explore the different kinds of ETL errors you might encounter and examine both proactive and reactive strategies to manage them effectively.
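To make the reactive side concrete, here is a minimal sketch of one common pattern: routing records that fail transformation to a dead-letter store so a single bad row neither aborts the batch nor silently corrupts the target. The function names and record shape are hypothetical.

```python
# A minimal sketch of one reactive error-handling pattern: route records that
# fail transformation to a dead-letter store instead of aborting the whole batch.
# transform_record, load_record, and the source rows are hypothetical.

def run_batch(rows, transform_record, load_record, dead_letters):
    loaded, failed = 0, 0
    for row in rows:
        try:
            load_record(transform_record(row))
            loaded += 1
        except Exception as exc:
            # Keep the bad record and the reason so it can be reviewed and replayed later.
            dead_letters.append({"row": row, "error": str(exc)})
            failed += 1
    return loaded, failed
```

A pattern like this keeps the pipeline moving while preserving every failed record for investigation, which is usually preferable to either crashing the job or dropping data silently.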

The Intricacies and Multilayered Complexities of ETL Workflows

The phrase "ETL" may sound straightforward—after all, it's just about extracting, transforming, and loading data. However, anyone who has architected or managed ETL workflows knows that the simplicity of the acronym belies a host of underlying complexities. The devil, as they say, is in the details.

ETL vs. ELT

At first glance, it may be difficult to discern the differences between ETL and ELT. While similar in appearance, the acronyms refer to different approaches to moving and processing data, and they reflect how data volumes and architectures have evolved over the years.

ETL and ELT are processes used by data integration tools. Through each process, data is pulled from different sources and transformed into useful information.
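The difference boils down to where the transformation happens. The sketch below is schematic only; the extract callable and the warehouse client with its load/execute methods are hypothetical stand-ins for whatever integration tool is in use.

```python
# Schematic illustration of the ordering difference between ETL and ELT.
# The extract callable and the warehouse client are hypothetical, not a real API.

def etl(extract, clean, warehouse):
    rows = extract()                                        # Extract from the source
    warehouse.load("sales", [clean(r) for r in rows])       # Transform first, then Load

def elt(extract, warehouse):
    warehouse.load("raw_sales", extract())                  # Extract and Load the raw data as-is
    # Transform afterwards, inside the warehouse itself, typically with SQL.
    warehouse.execute(
        "CREATE TABLE sales AS SELECT * FROM raw_sales WHERE amount > 0"
    )
```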

SAP S/4HANA, Microsoft SQL Integration and Hard Deletion Handling

This article demonstrates heterogeneous systems integration and the building of a BI system, focusing on delta load issues and how to overcome them. How can we compare the source and target tables when the SSIS ETL tool offers no reliable way to identify changes in the source table? One key-comparison approach is sketched below.
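One way to catch hard deletions in such a setup is to compare the full key sets of the source and target and treat keys that exist only in the target as deleted upstream. The sketch below uses illustrative table and column names and assumes ODBC connectivity to both systems.

```python
# A sketch of full-key comparison for detecting hard deletes when the source
# offers no change markers. Table and column names are illustrative.
import pyodbc  # assumes ODBC drivers for both endpoints are installed

def find_hard_deletes(src_conn_str, tgt_conn_str):
    src = pyodbc.connect(src_conn_str)
    tgt = pyodbc.connect(tgt_conn_str)
    try:
        # Pull only the business keys from each side.
        src_keys = {row[0] for row in src.execute("SELECT SalesDocID FROM SalesOrders")}
        tgt_keys = {row[0] for row in tgt.execute("SELECT SalesDocID FROM dbo.SalesOrders")}
    finally:
        src.close()
        tgt.close()
    # Keys still present in the target but gone from the source were hard-deleted upstream.
    return tgt_keys - src_keys
```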

Systems Used

  • SAP S/4HANA is an Enterprise Resource Planning (ERP) software package meant to cover all day-to-day processes of an enterprise, e.g., order-to-cash, procure-to-pay, finance & controlling, request-to-service, and core capabilities. SAP HANA is a column-oriented, in-memory relational database that combines OLAP and OLTP operations in a single system.
  • SAP Landscape Transformation (SLT) Replication is a trigger-based data replication method in the HANA system. It is well suited to real-time or schedule-based replication of data from SAP and non-SAP sources.
  • Azure SQL Database is a fully managed platform-as-a-service (PaaS) database engine that handles most database management functions, including backups, patching, upgrading, and monitoring, with minimal user involvement.
  • SQL Server Integration Services (SSIS) is a platform for building enterprise-level data integration and transformation solutions. SSIS is used to build ETL pipelines and solve complex business problems by copying or downloading files, loading data warehouses, and cleansing and mining data.
  • Power BI is an interactive data visualization software developed by Microsoft with a primary focus on business intelligence.

Business Requirement

Let us first talk about the business requirements. We receive Point-of-Sale (POS) data in more than 20 different feeds from online retailers such as Target, Walmart, Amazon, Macy's, Kohl's, and JC Penney. Apart from this, the primary business transactions happen in SAP S/4HANA, and business users require BI reports for analysis purposes.

ETL, ELT, and Reverse ETL

This is an article from DZone's 2022 Data Pipelines Trend Report.


ETL (extract, transform, load) has been a standard approach to data integration for many years. But the rise of cloud computing and the need to integrate self-service data have led to the development of new methodologies such as ELT (extract, load, transform) and reverse ETL.

Data Integration and ETL for Dummies (Like Me)

In early 2020, I was introduced to the idea of data integration through a friend who was working in the industry. Yes, I know. Extremely late. All I knew about it was that I could have my data in one (virtual) place and then have it magically appear in another (virtual) place. I had no clue how it was done or how important it was to modern businesses.

To give you some background, my past work experience is not in any kind of technical space. It is in business development and marketing for non-technical products. I probably should have been more aware of the technical world around me, but for now, you must forgive me for my ignorance.

What is Persistent ETL and Why Does it Matter?

If you’ve made it to this blog, you’ve probably heard the term “persistent” thrown around with ETL and are curious about what they really mean together. Extract, Transform, Load (ETL) is the generic concept of taking data from one or more systems and placing it in another system, often in a different format. Persistence is just a fancy word for storing data. Simply put, persistent ETL is adding a storage mechanism to an ETL process. That pretty much covers the what, but the why is much more interesting…

ETL processes have been around forever. They are a necessity for organizations that want to view data across multiple systems. This is all well and good, but what happens if that ETL process gets out of sync? What happens when the ETL process crashes? What about when one of the end systems updates? These are all very real possibilities when working with data storage and retrieval systems. Adding persistence to these processes can help ease or remove many of these concerns. 
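As a rough illustration of what persistence buys you, the sketch below checkpoints the last processed offset to durable storage so a crashed or restarted job resumes where it left off. The fetch/transform/load functions and the offset field are assumptions for the example.

```python
# A minimal sketch of adding persistence to an ETL job: checkpoint the last
# processed offset so a crash or restart resumes where it left off.
# fetch_batch, transform, load, and the "id" offset field are hypothetical.
import json
import os

CHECKPOINT = "etl_checkpoint.json"

def read_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_offset"]
    return 0

def write_checkpoint(offset):
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_offset": offset}, f)

def run(fetch_batch, transform, load):
    offset = read_checkpoint()
    while True:
        batch = fetch_batch(after=offset)
        if not batch:
            break
        load([transform(r) for r in batch])
        offset = max(r["id"] for r in batch)
        write_checkpoint(offset)  # persist progress only after a successful load
```

Because progress is written only after a successful load, a restart may reprocess the last batch but will never skip one, which is the usual trade-off persistent ETL makes.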

ETL and How it Changed Over Time

What Is ETL?

ETL is the abbreviation for Extract, Transform, and Load. In simple terms, it is just copying data between two locations.[1]

  • Extract: The process of reading the data from different types of sources including databases.
  • Transform: Converting the extracted data to a particular format. Conversion also involves enriching the data using other data in the system.
  • Load: The process of writing the data to a target database, data warehouse, or another system.

ETL can be divided into two categories with regard to the infrastructure it runs on.
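Before looking at those categories, here is a toy end-to-end example of the three steps listed above, using only the Python standard library; the CSV file and table layout are purely illustrative.

```python
# A toy illustration of Extract, Transform, and Load using the standard library only.
# The orders.csv file and the target table layout are assumptions for the example.
import csv
import sqlite3

def extract(path):
    # Extract: read rows from a source file.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Transform: normalize the format and enrich with a derived gross amount.
    return [(r["order_id"], r["country"].upper(), float(r["amount"]) * 1.2)
            for r in rows]

def load(records, db_path="warehouse.db"):
    # Load: write the transformed records into the target store.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, country TEXT, gross REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("orders.csv")))
```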

Change Data Capturing With WSO2 Streaming Integrator

Streaming integration is becoming one of the core components of the enterprise integration stack. Unlike traditional batch integration, streaming integration allows ETL operations to be performed on data in real time and delivers the results immediately. This empowers businesses to act on fresh data and make decisions as soon as the data is produced.

But where does this data come from? Most of the time, streaming data is produced by various data sources such as applications, sensors, etc. In some cases, well-known data sources such as an RDBMS can also produce such streaming data. This is where CDC, a.k.a. change data capture, comes into the picture.
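For intuition, the sketch below fakes CDC by polling a timestamp column. Real CDC tools, including WSO2 Streaming Integrator's connectors, read database triggers or transaction logs instead; the orders table, its updated_at column, and the use of SQLite are assumptions for the example.

```python
# A simplified, polling-based illustration of change data capture. This is not the
# WSO2 Streaming Integrator API; the orders table and updated_at column are assumed.
import sqlite3
import time

def poll_changes(db_path, handle_event, interval=5):
    last_seen = "1970-01-01 00:00:00"  # assumes ISO-formatted text timestamps
    con = sqlite3.connect(db_path)
    while True:
        rows = con.execute(
            "SELECT order_id, status, updated_at FROM orders "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_seen,),
        ).fetchall()
        for order_id, status, updated_at in rows:
            handle_event({"order_id": order_id, "status": status})  # act on fresh data
            last_seen = updated_at
        time.sleep(interval)
```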

How to Do a Snowflake Query Pushdown in Talend


In a typical, traditional data warehouse solution, data is read into ETL memory and processed/transformed there before being loaded into the target database. As data volumes grow, the cost of compute also increases, so it becomes vital to look for an alternative design.

Welcome to pushdown query processing. The basic idea of pushdown is that certain parts of SQL queries or the transformation logic can be "pushed" to where the data resides, in the form of generated SQL statements. So instead of bringing the data to the processing logic, we take the logic to where the data resides. This is very important for performance reasons.
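Talend achieves this with ELT components that generate SQL against Snowflake. The sketch below shows the same idea directly with the Snowflake Python connector rather than Talend; the connection parameters and the orders table are assumptions.

```python
# The pushdown idea shown with the Snowflake Python connector instead of Talend's
# ELT components. Connection parameters and table names are assumptions.
import snowflake.connector

con = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="SALES",
)

# Without pushdown: fetch raw rows and aggregate in ETL memory (slow and compute-heavy).
rows = con.cursor().execute("SELECT region, amount FROM orders").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0) + amount

# With pushdown: the aggregation runs inside Snowflake, next to the data,
# and only the small result set ever crosses the network.
con.cursor().execute(
    "CREATE OR REPLACE TABLE region_totals AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)
```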

Life Beyond Kafka With Apache Pulsar


During all my years as a Solution Architect, I have built many streaming architectures, such as real-time data ETL, reactive microservices, log collection, and even AI-driven services, all using Kafka as a core part of their architecture. Kafka is a proven stream-processing platform used for many years at companies like LinkedIn, Microsoft, and Netflix. In many cases Kafka works very well, supports large amounts of data, and has a good community. Because of that, Kafka is used for many streaming scenarios.

However, due to the design of Kafka, all of my projects using Kafka have suffered from similar problems:

StreamSets Transformer Extensibility: Spark and Machine Learning Part One

Apache Spark has been on the rise for the past few years, and it continues to dominate the landscape when it comes to in-memory and distributed computing, real-time analysis, and machine learning use cases. And with the recent release of StreamSets Transformer, a powerful tool for creating highly instrumented Apache Spark applications for modern ETL, you can quickly start leveraging all the benefits and power of Apache Spark with minimal operational and configuration overhead.

In this blog, you will learn how to extend StreamSets Transformer in order to train a Spark ML RandomForestRegressor model.
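As a preview of the kind of logic such a custom stage would run, here is a plain PySpark sketch of training a RandomForestRegressor. It does not use the StreamSets Transformer extension API itself, and the input file and column names are assumptions.

```python
# A plain PySpark sketch of training a RandomForestRegressor; it mirrors the kind of
# logic a custom StreamSets Transformer stage would run but does not use Transformer's
# extension API. The CSV file and its column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("rf-regressor-sketch").getOrCreate()

df = spark.read.csv("advertising.csv", header=True, inferSchema=True)

# Assemble the feature columns into the single vector column Spark ML expects.
assembler = VectorAssembler(inputCols=["TV", "Radio", "Newspaper"], outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

rf = RandomForestRegressor(featuresCol="features", labelCol="Sales", numTrees=50)
model = rf.fit(train)

# Score the held-out split and inspect a few predictions.
model.transform(test).select("Sales", "prediction").show(5)
```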