When To Use Apache Camel vs. Apache Kafka?

Should I use Apache Camel or Apache Kafka for my next integration project? The question is valid and comes up regularly. This blog post explores both open-source frameworks and explains the difference between application integration and event streaming. The comparison discusses when to use Kafka or Camel, when to combine them, and when not to use them at all. A decision tree shows how you can quickly qualify one over the other.

The History of Application Integration and Event Streaming

Here is my personal history and experience in application integration and event streaming. It shows my background and how I see the integration and data streaming markets.

SaaS Galore: Integrating CockroachDB With Confluent Kafka, Fivetran, and Snowflake

Motivation

The problem this tutorial is trying to solve is the lack of a native Fivetran connector for CockroachDB. My customer has built their analytics pipeline around Fivetran, and given there is no native integration, their next best option was to set up a Postgres connector:

CockroachDB is PostgreSQL wire-compatible, but it is not correct to assume the two are a 1:1 match.
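To see what wire compatibility does and does not buy us, here is a minimal sketch (not part of the original setup) that connects to a local, insecure CockroachDB demo cluster on the default SQL port 26257 using the standard psycopg2 Postgres driver. The host, user, and database names are assumptions for illustration:

```python
# Minimal sketch: CockroachDB speaks the PostgreSQL wire protocol,
# so standard Postgres drivers such as psycopg2 can connect to it.
# Host, port, user, and database below are assumptions for a local,
# insecure demo cluster; adjust for your environment.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=26257,              # CockroachDB's default SQL port
    user="root",
    dbname="defaultdb",
    sslmode="disable",       # demo only; production clusters use TLS
)

with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")
    # Returns a CockroachDB version string, not a vanilla PostgreSQL one,
    # a reminder that wire compatibility is not a 1:1 feature match.
    print(cur.fetchone()[0])
```

With that caveat in mind, let's attempt to configure the connector: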

Building an ETL Pipeline With Airflow and ECS

Each day, enterprise-level companies collect, store, and process different types of data from multiple sources. Whether it comes from a payroll system, sales records, or an inventory system, this torrent of data has to be attended to.

And if you process data from multiple sources that you want to squeeze into a centralized database, you need to:

When To Use Reverse ETL and When It Is an Anti-pattern

Most enterprises store their massive volumes of transactional and analytics data at rest in data warehouses or data lakes. Sales, marketing, and customer success teams require access to these data sets. Reverse ETL is a buzzword for the concept of collecting data from existing data stores and making it easily and quickly available to business teams.

This blog post explores why software vendors (try to) introduce new solutions for Reverse ETL, when it is needed, and how it fits into the enterprise architecture. The involvement of event streaming with tools like Apache Kafka to process data in motion is a crucial piece of Reverse ETL for real-time use cases.
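To make the real-time piece concrete, here is a minimal sketch of what the "data in motion" leg can look like: a small consumer reads enriched customer records from a Kafka topic and hands them to an operational tool. The topic name, broker address, and push_to_crm() helper are hypothetical placeholders, not the method described in the post:

```python
# Minimal sketch of the streaming side of Reverse ETL: consume enriched
# customer records from a Kafka topic and push them to a business tool
# in real time. Topic, brokers, and push_to_crm() are placeholders.
import json
from kafka import KafkaConsumer  # pip install kafka-python

def push_to_crm(record: dict) -> None:
    # Stand-in for an API call to a CRM or other operational system.
    print(f"Updating CRM for customer {record.get('customer_id')}")

consumer = KafkaConsumer(
    "customer-360-enriched",                     # hypothetical topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    push_to_crm(message.value)
```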

What is Data Lineage and How Can It Ensure Data Quality?

Introduction

Are you spending too much time tracking down bugs for your C-level dashboards? Are different teams struggling to align on what data is needed throughout the organization? Or are you struggling with getting a handle on what the impact of a potential migration could be?

Data lineage could be the answer you need for data quality issues. By improving data traceability and visibility, a data lineage system can improve data quality across your whole data stack and simplify the task of communicating about the data that your organization depends on.

How To Build a GitHub Activity Dashboard With Open-Source

In this article, we will be leveraging Airbyte, an open-source data integration platform, and Metabase, an open-source tool that lets everyone in your company ask questions and learn from data, to build the GitHub activity dashboard above.

Airbyte provides us with a rich set of source connectors, and one of those is the GitHub connector, which allows us to get data off a GitHub repo. We are going to use this connector to get the data of the Airbyte repo and copy it into a Postgres database destination. We will then connect this database to Metabase to create the activity dashboard. To do so, we will need:

Why ETL Needs Open Source to Address the Long Tail of Integrations

Over the last year, our team has interviewed more than 200 companies about their data integration use cases. What we discovered is that data integration in 2021 is still a mess.

The Unscalable Current Situation

At least 80 of the 200 interviews were with users of existing ETL technology, such as Fivetran, StitchData, and Matillion. We found that every one of them was also building and maintaining their own connectors even though they were using an ETL solution (or an ELT one — for simplicity, I will just use the term ETL). Why?

Next-Gen Data Pipes With Spark, Kafka and k8s

Introduction

Data integration has always played an essential role in the information architecture of any enterprise. Specifically, the analytical processes of the enterprise depend heavily on this integration pattern to make data from transactional systems available in an analytics-friendly format. In the traditional architecture paradigm, systems were not as interconnected, the latency between transactions and analytical insights was acceptable, and the integrations were mainly batch-oriented.

In the batch pattern, the operational systems typically generate large files (data dumps), which are processed (validated, cleansed, standardized, and transformed) to create output files that feed the analytical systems. Of course, reading such large files was memory intensive; hence, data architects used to rely on a series of staging databases to store step-by-step data processing output. As distributed computing evolved with Hadoop, MapReduce addressed the high memory requirement by distributing the processing across horizontally scalable, commoditized hardware. As computing techniques have evolved further, it is now possible to run this kind of processing in memory, which today has become a de facto standard for processing large data files.
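As a contrast to that file-based batch pattern, here is a minimal sketch of the kind of streaming pipeline the title points to: a PySpark Structured Streaming job that reads events from Kafka and processes them in memory. The broker address and topic name are assumptions, and the job simply echoes events to the console:

```python
# Minimal sketch of a streaming counterpart to the batch pattern above:
# a PySpark Structured Streaming job reading events from Kafka instead
# of processing large file dumps. Broker address and topic are assumptions.
# Requires the spark-sql-kafka-0-10 package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("kafka-stream-sketch")
    .getOrCreate()
)

# Read the Kafka topic as an unbounded streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")          # hypothetical topic
    .load()
    .select(col("value").cast("string").alias("payload"))
)

# Write the stream out; the console sink is enough for this sketch.
query = (
    events.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```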

Your Ultimate Guide to Redshift ETL: Best Practices, Advanced Tips, and Resources

Introduction

Amazon Redshift makes it easier to uncover transformative insights from big data. Analytical queries that once took hours can now run in seconds. Redshift allows businesses to make data-driven decisions faster, which in turn unlocks greater growth and success.

For a CTO, full-stack engineer, or systems architect, the question isn’t so much what is possible with Amazon Redshift, but how? How do you ensure optimal, consistent runtimes on analytical queries and reports? And how do you do that without taxing precious engineering time and resources?
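One well-known piece of the answer is to favor Redshift's bulk-loading path over row-by-row inserts. The following is a minimal sketch only; the cluster endpoint, credentials, bucket, and IAM role are placeholders:

```python
# Minimal sketch of a common Redshift ETL practice: bulk-load staged
# files from S3 with COPY instead of inserting rows one by one.
# Cluster endpoint, credentials, bucket, and IAM role are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,                       # Redshift's default port
    dbname="analytics",
    user="etl_user",
    password="********",
)

copy_sql = """
    COPY analytics.page_views
    FROM 's3://my-etl-bucket/page_views/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    CSV GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)   # Redshift parallelizes the load across slices
```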

Data Fabrics Modernize Existing Data Management

Introduction

Data management agility takes precedence among organizations with diverse, distributed, and disruptive environments. It is one of the most crucial factors in determining a company’s potential to transform data into opportunities. But managing data remains an uphill climb thanks to the rapid growth of big data and the Internet of Things (IoT).

Data management is susceptible to errors and delays that can impact operational efficiency and value generation. These problems are aggravated when traditional data management practices are used, and the overall performance of a company hits the skids.

Top 7 ETL Tools for 2021

Organizations of all sizes and industries now have access to ever-increasing amounts of data, far too vast for any human to comprehend. All this information is practically useless without a way to efficiently process and analyze it, revealing the valuable data-driven insights hidden within the noise.

The ETL (extract, transform, load) process is the most popular method of collecting data from multiple sources and loading it into a centralized data warehouse. During the ETL process, information is first extracted from a source such as a database, file, or spreadsheet, then transformed to comply with the data warehouse’s standards, and finally loaded into the data warehouse.
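As a minimal sketch of those three steps (the file name, column names, and warehouse URL below are made up for illustration), a small Python job might look like this:

```python
# Minimal sketch of the extract -> transform -> load steps described above.
# File name, column names, and the warehouse URL are illustrative only.
import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw records from a source file (could equally be a
# database query or an API call).
raw = pd.read_csv("sales.csv")

# Transform: conform the data to the warehouse's standards.
transformed = (
    raw.rename(columns=str.lower)
       .assign(order_date=lambda df: pd.to_datetime(df["order_date"]))
       .dropna(subset=["order_id"])
)

# Load: write the cleaned data into the centralized warehouse.
engine = create_engine("postgresql://etl:secret@warehouse:5432/analytics")
transformed.to_sql("sales", engine, if_exists="append", index=False)
```

Real pipelines add scheduling, incremental loads, and error handling on top of this skeleton, which is exactly where dedicated ETL tools earn their keep.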

5 Customer Data Integration Best Practices

For the last few years, you have heard the terms "data integration" and "data management" dozens of times. Your business may already invest in these practices, but are you benefitting from this data gathering? 

Too often, companies hire specialists, collect data from many sources and analyze it for no clear purpose. And without a clear purpose, all your efforts are in vain. You can take in more customer information than all your competitors and still fail to make practical use of it.  

Data Integration and ETL for Dummies (Like Me)

In early 2020, I was introduced to the idea of data integration through a friend who was working in the industry. Yes, I know. Extremely late. All I knew about it was that I could have my data in one (virtual) place and then have it magically appear in another (virtual) place. I had no clue how it was done or how important it was to modern businesses.

To give you some background, my past work experience is not in any kind of technical space. It is in business development and marketing for non-technical products. I probably should have been more aware of the technical world around me, but for now, you must forgive me for my ignorance.

Geo-Distributed Data Lakes Explained

Geo-Distributed Data Lake is quite the mouthful. It’s a pretty interesting topic and I think you will agree after finishing this breakdown. There is a lot to say about how awesome it is to combine the flexibility of a data lake with the power of a distributed architecture, but I’ll get more into the benefits of both as a joint solution later. To start, I want to look at geo-distributed data lakes in two parts before we marry them together, for my non-developer brain that made the most sense! No time to waste, let’s kick things off with the one and only… data lakes.

It’s a Data LAKE, Not Warehouse!

It shouldn’t be a shock to the system to point out that we are living in a data-driven world going into 2021. Because of this, 'data lake' is a fitting term for the amount of data companies are collecting. In my opinion, we could probably start calling them data oceans, expansive and seemingly never-ending. So what is a data lake exactly?

Making Your Data Flow Resiliently With Apache NiFi Clustering

Introduction 

In a previous article, we covered the need to take into account a number of factors relevant to both the infrastructure and application when evaluating the placement and performance of the workload within an edge computing environment. These data points included standard measurements around network bandwidth, CPU and RAM utilization, disk I/O performance, as well as other more transient items, such as adjacent services and resource availability. 

Each of these data points is critical input toward operating an efficient edge computing cloud environment and ensuring the overall health of the applications. In this article, we’ll touch on some of the challenges encountered when collecting and transforming data into a format that is serviceable for analytics, as well as how to construct a resilient data flow that ensures data continuity.

How Has COVID-19 Impacted Data Science?

The COVID-19 pandemic disrupted supply chains and brought economies around the world to a standstill. In turn, businesses need access to accurate, timely data more than ever before. As a result, the demand for data analytics is skyrocketing as businesses try to navigate an uncertain future. However, the sudden surge in demand comes with its own set of challenges. 

Here is how the COVID-19 pandemic is affecting the data industry and how enterprises can prepare for the data challenges to come in 2021 and beyond.  

What Is Chaos Engineering?

In the past, software systems ran in highly controlled on-premises environments, managed by an army of sysadmins. Today, migration to the cloud is relentless; the stage has completely shifted. Systems are no longer monolithic and localized; they depend on many globally distributed, loosely coupled systems working in unison, often in the form of ephemeral microservices.

It is no surprise that Site Reliability Engineers have risen to prominence in the last decade. Modern IT infrastructure requires robust systems thinking and reliability engineering to keep the show on the road. Downtime is not an option. A 2020 ITIC Cost of Downtime survey indicated that 98% of organizations said a single hour of downtime costs more than $150,000, and 88% reported that 60 minutes of downtime costs their business more than $300,000. And 40% of enterprises reported that one hour of downtime costs their organizations $1 million to more than $5 million.