Next-Gen Data Pipes With Spark, Kafka, and K8s: Part 2

Introduction 

In our previous article, we discussed two emerging options for building new-age data pipes using stream processing. One option leverages Apache Spark for stream processing, and the other uses a Kafka-Kubernetes combination on any cloud platform for distributed computing. The first approach is reasonably popular, and a lot has already been written about it. The second option, however, is catching up in the market because it is far less complex to set up and easier to maintain. Also, data on the cloud is a natural outcome of the technological drivers prevailing in the market. So, this article focuses on the second approach and shows how it can be implemented in different cloud environments.

Kafka-K8s Streaming Approach in Cloud

In this approach, if the number of partitions in the Kafka topic matches the number of consumer pod replicas in the Kubernetes cluster, the pods together form a consumer group and deliver all the advantages of distributed computing. This can be depicted with the simple relation below:

Number of partitions in the Kafka topic = Number of consumer pod replicas in the Kubernetes cluster
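
As a minimal sketch of this setup (the topic name, container image, broker address, and consumer group id below are illustrative assumptions, not details from the original article), a consumer Deployment can simply pin its replica count to the topic's partition count:

```yaml
# Hypothetical example: a topic named "orders" is assumed to have 3 partitions,
# so the consumer Deployment is scaled to 3 replicas. Because all pods share the
# same consumer group id, Kafka assigns each pod exactly one partition.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-consumer
spec:
  replicas: 3                 # must equal the number of topic partitions
  selector:
    matchLabels:
      app: orders-consumer
  template:
    metadata:
      labels:
        app: orders-consumer
    spec:
      containers:
        - name: consumer
          image: example.com/orders-consumer:latest   # illustrative image
          env:
            - name: KAFKA_BOOTSTRAP_SERVERS
              value: "kafka:9092"
            - name: KAFKA_TOPIC
              value: "orders"
            - name: KAFKA_GROUP_ID
              value: "orders-consumer-group"
```

Scaling the Deployment beyond the partition count adds idle consumers, while scaling below it leaves some pods handling more than one partition, which is why the two numbers are kept equal.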

What Is a Kubernetes Operator?

Kubernetes is popular largely because it lets teams deploy new apps quickly. Thanks to "Infrastructure as data" (specifically, YAML), you can express all your Kubernetes resources, such as Pods, Deployments, Services, and Volumes, in a YAML file. These default objects make it much easier for DevOps and SRE engineers to fully express their workloads without needing to write code in a programming language like Python, Java, or Ruby.
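
As a minimal sketch of that idea (the names and image are illustrative assumptions), here is a Pod and the Service that exposes it, declared entirely as YAML with no application code involved:

```yaml
# "Infrastructure as data": the desired state of a web workload and its
# Service, both expressed declaratively in YAML.
apiVersion: v1
kind: Pod
metadata:
  name: web
  labels:
    app: web
spec:
  containers:
    - name: web
      image: nginx:1.25        # illustrative image
      ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 80
```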

Kubernetes is designed for automation. Out of the box, you get lots of built-in automation from the core of Kubernetes. It can speed up your development process through easy, automated deployments, rolling updates, and management of your apps and services with almost zero downtime. However, Kubernetes can’t natively automate this process for stateful applications. For example, say you have a stateful workload, such as a database application, running on several nodes. If a majority of the nodes go down, you’ll need to reload the database from a specific snapshot by following specific steps. Using only the existing default objects, types, and controllers in Kubernetes, this would be impossible to achieve.
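
This is the gap an operator fills: it pairs a Custom Resource Definition (CRD) with a controller that encodes those operational steps. The sketch below shows a hypothetical CRD and custom resource for such a database workload; the group, kind, and spec fields are illustrative assumptions rather than any existing operator's API.

```yaml
# Hypothetical CRD: teaches the Kubernetes API about a new "DatabaseCluster" type.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databaseclusters.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: DatabaseCluster
    plural: databaseclusters
    singular: databasecluster
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
                restoreFromSnapshot:
                  type: string
---
# Hypothetical custom resource: the operator's controller watches objects of this
# kind and carries out the stateful steps (such as restoring from the named
# snapshot) that default controllers cannot perform on their own.
apiVersion: example.com/v1alpha1
kind: DatabaseCluster
metadata:
  name: orders-db
spec:
  replicas: 3
  restoreFromSnapshot: "s3://backups/orders-db/2023-01-01"
```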

Setting Up Apache Druid on Kubernetes in Under 30 Minutes

I was introduced to Apache Druid a year and a half ago, and during this time I've focused on operationalizing Apache Druid on Kubernetes (K8s). I started with Helm charts to spin up Druid clusters in this complex, distributed Druid + K8s system, but I realized Helm charts alone were not enough.

I’ve written Golang-based operators and custom controllers in Kubernetes for different use cases and contributed to various open-source operators, so I was familiar with extending Kubernetes using Custom Resource Definitions (CRDs). I was thrilled to discover the Druid Operator, which had just been open-sourced in the Druid community in late 2019. The project was less than a month old when I started contributing to it.

Getting Started With Rancher

What is Rancher, and how does it make Kubernetes crazy easy? Rancher is a complete Kubernetes stack that's easy to navigate, whether it runs on physical servers on-prem or in the cloud. This Refcard helps you get started with Rancher, taking you from zero to fully production-ready.