What is Persistent ETL and Why Does it Matter?

If you’ve made it to this blog you’ve probably heard the term “persistent” thrown around with ETL, and are curious about what they really mean together. Extract, Transform, Load (ETL) is the generic concept of taking data from one or more systems and placing it in another system, often in a different format. Persistence is just a fancy word for storing data. Simply put, persistent ETL is adding a storage mechanism to an ETL process. That pretty much covers the what, but the why is much more interesting… 

ETL processes have been around forever. They are a necessity for organizations that want to view data across multiple systems. This is all well and good, but what happens if that ETL process gets out of sync? What happens when the ETL process crashes? What about when one of the end systems updates? These are all very real possibilities when working with data storage and retrieval systems. Adding persistence to these processes can help ease or remove many of these concerns. 

Streaming ETL With Apache Flink – Part 1

Flink: as fast as squirrels

Introduction

After working in multiple projects involving Batch ETL through polling data sources, I started working on Streaming ETL. Streaming computation is necessary for use cases where real or near real-time analysis is required. For example, in IT Operations Analytics, it is paramount that Ops get critical alert information in real-time or within acceptable latency (near real-time) to help them mitigate downtime or any errors caused due to misconfiguration.

While there are many introductory articles on Flink (my personal favorite are blogs from Ivan Mushketyk), not many have been into details of streaming ETL and advanced aspects of the Flink framework, which are useful in a production environment.

Top 5 Enterprise ETL Tools

With the ever-growing amounts of data, enterprises create an increasing demand for data warehousing projects and systems for advanced analytics. ETL is their essential element. It ensures successful data integration within various databases and applications. In this ETL tools comparison, we will look at:

  1. Apache NiFi
  2. Apache StreamSets
  3. Apache Airflow
  4. AWS Data Pipeline
  5. AWS Glue

They are among the most popular ETL tools 2019. Let's compare the pros and cons to find out the best solution for your project.