Next-Gen Data Pipes With Spark, Kafka and k8s

Introduction

Data integration has always played an essential role in the information architecture of any enterprise. Specifically, an enterprise's analytical processes depend heavily on this integration pattern to make data from transactional systems available in an analytics-friendly format. In the traditional architecture paradigm, systems were not highly interconnected, the latency between transactions and analytical insights was acceptable, and integrations were mainly batch-oriented.

In the batch pattern, the operational systems typically generate large files (data dumps), which are then processed (validated, cleansed, standardized, and transformed) to produce output files that feed the analytical systems. Reading such large files was memory-intensive, so data architects relied on a series of staging databases to store the output of each processing step. As distributed computing evolved with Hadoop, MapReduce addressed the high memory requirement by distributing the processing across horizontally scalable commodity hardware. Computing techniques have evolved further since then, and it is now possible to run MapReduce-style processing in memory (as Apache Spark does), which today has become the de-facto standard for processing large data files.
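To make the in-memory batch pattern concrete, below is a minimal PySpark sketch of such a job. The file paths, column names, and transformation rules are hypothetical placeholders, not part of the original text; the point is that the validate, cleanse, standardize, and transform steps run in memory across the cluster instead of landing in a chain of staging databases.

    # Minimal sketch of the batch pattern with PySpark.
    # Paths and column names below are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .appName("batch-data-dump-processing")
        .getOrCreate()
    )

    # Extract: read the operational system's data dump (a large CSV file)
    orders = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("/data/landing/orders_dump.csv")  # hypothetical landing path
    )

    # Validate and cleanse: drop records missing keys, de-duplicate, fix types
    cleansed = (
        orders
        .dropna(subset=["order_id", "customer_id"])
        .dropDuplicates(["order_id"])
        .withColumn("order_ts", F.to_timestamp("order_ts"))
    )

    # Standardize and transform: derive analytics-friendly columns
    transformed = (
        cleansed
        .withColumn("order_date", F.to_date("order_ts"))
        .withColumn("amount", F.col("amount").cast("double"))
    )

    # Load: write a columnar output for the analytical system; the
    # intermediate steps above stay in memory rather than in staging tables.
    transformed.write.mode("overwrite").parquet("/data/curated/orders/")

    spark.stop()

The same job could equally be written in Scala or submitted to a cluster manager such as Kubernetes; the structure of the pipeline does not change.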