How to Build and Debug a Flink Pipeline Based on Event Time

One of the most important concepts in stream-processing frameworks is time, and there are several notions of it:

  • Processing time: the time based on the clock of the machine where the event is processed. It’s easy to use, but because that clock varies between executions, the job’s results aren’t consistent: each time you run the job, you may get different results. This isn’t an acceptable trade-off for many use cases.
  • Event time: the time based on one of the fields in the event, typically a timestamp field. Each time you execute the pipeline with the same input, you obtain the same result, which is a good thing. But it also tends to be harder to work with, for several reasons we’ll cover later in the article; a minimal example of wiring it up follows this list.
  • Ingestion time: the time at which the event was ingested into the streaming platform (e.g., Kafka), usually carried in the metadata. From a Flink perspective, we can consider it a particular mix of event time and processing time, with the disadvantages of both.
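
To make event time concrete, here is a minimal sketch (not the article’s own code) of attaching event-time timestamps and watermarks to a stream with Flink’s WatermarkStrategy API. The readings stream and the SensorReading type with its getTimestamp() accessor are hypothetical:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;

    import java.time.Duration;

    // readings is assumed to be an existing DataStream<SensorReading>.
    DataStream<SensorReading> withEventTime = readings.assignTimestampsAndWatermarks(
            WatermarkStrategy
                    // Tolerate events arriving up to 5 seconds out of order.
                    .<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                    // Extract the event-time field carried inside each record.
                    .withTimestampAssigner((reading, recordTs) -> reading.getTimestamp()));

The bounded-out-of-orderness strategy tells Flink how long to wait for late events, which is one of the trade-offs that makes event time harder to work with than processing time.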

Apache Flink has excellent support for event-time processing, arguably the best among the available stream-processing frameworks. For more information, you can read Notions of Time: Event Time and Processing Time in the official documentation. If you prefer videos, Streaming Concepts and Introduction to Flink - Event Time and Watermarks is a good explanation.

The Foundations for Building an Apache Flink Application

In this article, we'll work to give you a better understanding of stream processing with Flink from the bottom up. Cloud services and other platforms provide stream-processing solutions (for some, Flink is integrated under the hood), but if you're missing the basics, this guide is for you.

Stream Processing in Flink

Our monolithic solution cannot cope with the increased load of incoming data, so it has to evolve. This is the time for the next generation of our product. Stream processing is the new data-ingestion paradigm, as compared to the batch processing we’ve implemented so far.

Apache Flink With Kafka – Consumer and Producer

Overview

Apache Flink provides various connectors to integrate with other systems. In this article, I will share an example of consuming records from Kafka through FlinkKafkaConsumer and producing records to Kafka using FlinkKafkaProducer.

Setup

I installed Kafka locally and created two topics, TOPIC-IN and TOPIC-OUT.
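
As a minimal sketch of the consume-and-produce flow (assuming a local broker at localhost:9092; adjust for your setup), the job below simply relays string records from TOPIC-IN to TOPIC-OUT:

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

    import java.util.Properties;

    public class KafkaPassThrough {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // local broker assumed
            props.setProperty("group.id", "flink-demo");

            // Consume string records from TOPIC-IN.
            DataStream<String> input = env.addSource(
                    new FlinkKafkaConsumer<>("TOPIC-IN", new SimpleStringSchema(), props));

            // Produce them, unchanged, to TOPIC-OUT.
            input.addSink(new FlinkKafkaProducer<>("TOPIC-OUT", new SimpleStringSchema(), props));

            env.execute("Kafka pass-through");
        }
    }

SimpleStringSchema treats each record as a UTF-8 string; for structured payloads you would swap in a custom (de)serialization schema.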

Deep Dive Into Apache Flink’s TumblingWindow – Part 1

In this article, I will share coding examples of some of the key aspects of TumblingWindow in Flink. Those not familiar with Flink streaming can get an introduction here.

Before we get into TumblingWindow, let us get a basic understanding of "Window" in the context of stream processing or streaming computation. In a data stream, the source is continuously producing data, making it infeasible to compute a single final value.
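
As a brief, hypothetical sketch of what a tumbling window looks like in code, the snippet below sums per-key counts over fixed 10-second event-time windows; the events stream of (key, count) tuples is assumed to exist and to already carry event-time timestamps and watermarks:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // events is assumed to be an existing DataStream<Tuple2<String, Integer>>.
    DataStream<Tuple2<String, Integer>> windowedCounts = events
            .keyBy(value -> value.f0)                              // partition the stream by key
            .window(TumblingEventTimeWindows.of(Time.seconds(10))) // fixed, non-overlapping 10s windows
            .sum(1);                                               // sum the count field per window

Because tumbling windows are fixed-size and non-overlapping, each element lands in exactly one window, which turns the unbounded stream into a series of finite computations.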

Streaming ETL With Apache Flink

Streaming data computation is becoming more and more common in the growing Big Data landscape. Many enterprises are also adopting, or moving toward, streaming for message passing instead of relying solely on REST APIs.

Apache Flink has emerged as a popular framework for streaming data computation in a very short amount of time. It has many advantages in comparison to Apache Spark: for example, it is lightweight, offers rich APIs, is developer-friendly, delivers high throughput, and has an active and vibrant community.

The Complete Apache Spark Collection [Tutorials and Articles]

In this edition of "Best of DZone," we've compiled our best tutorials and articles on one of the most popular analytics engines for data processing, Apache Spark. Whether you're a beginner or a long-time user who has run into the inevitable bottlenecks, we've got your back!

Before we begin, we'd like to thank those who were a part of this article. DZone has been, and continues to be, a community powered by contributors like you who are eager and passionate to share what they know with the rest of the world.

Streaming ETL With Apache Flink – Part 1

Flink: as fast as squirrels

Introduction

After working on multiple projects involving batch ETL through polling data sources, I started working on streaming ETL. Streaming computation is necessary for use cases where real-time or near-real-time analysis is required. For example, in IT Operations Analytics, it is paramount that Ops get critical alert information in real time, or within acceptable latency (near real time), to help them mitigate downtime or errors caused by misconfiguration.

While there are many introductory articles on Flink (my personal favorites are the blogs by Ivan Mushketyk), not many go into the details of streaming ETL and the advanced aspects of the Flink framework that are useful in a production environment.

Make Crucial Predictions as Data Arrives

Flink: as fast as a squirrel

Walk down the hottest IT streets these days and you've likely heard about Streaming Machine Learning, i.e., moving AI toward streaming scenarios and exploiting real-time capabilities along with new Artificial Intelligence techniques. You'll also notice the lack of research on this topic, despite the growing interest in it.

If we investigate a little deeper, we realize that a step is missing: today's well-known streaming applications still don't handle the concept of Model Serving properly, and industry still leans on the Lambda architecture to achieve this goal. Suppose a bank has a concrete, frequently updated, batch-trained Machine Learning model (e.g., an optimized Gradient Descent model applied to past buffer overflow attack attempts) and wants to deploy that model directly to its own canary.
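
One way Flink itself can close this gap is the broadcast-state pattern: model updates are broadcast to every parallel task, which uses the latest model to score the main event stream. The sketch below is an illustration under stated assumptions, not the bank's actual pipeline; Event and Model (with a score(Event) method) are hypothetical types, and events and modelUpdates are assumed to be existing DataStreams:

    import org.apache.flink.api.common.state.MapStateDescriptor;
    import org.apache.flink.streaming.api.datastream.BroadcastStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
    import org.apache.flink.util.Collector;

    // Descriptor for the broadcast state slot that holds the current model.
    MapStateDescriptor<String, Model> modelDescriptor =
            new MapStateDescriptor<>("model", String.class, Model.class);

    BroadcastStream<Model> modelBroadcast = modelUpdates.broadcast(modelDescriptor);

    DataStream<Double> scored = events
            .connect(modelBroadcast)
            .process(new BroadcastProcessFunction<Event, Model, Double>() {
                @Override
                public void processBroadcastElement(Model model, Context ctx,
                                                    Collector<Double> out) throws Exception {
                    // Swap in the freshly trained model on every parallel instance.
                    ctx.getBroadcastState(modelDescriptor).put("current", model);
                }

                @Override
                public void processElement(Event event, ReadOnlyContext ctx,
                                           Collector<Double> out) throws Exception {
                    Model current = ctx.getBroadcastState(modelDescriptor).get("current");
                    if (current != null) {
                        out.collect(current.score(event)); // serve a prediction with the latest model
                    }
                }
            });

Because the model rides through the job as just another stream, retraining upstream and re-broadcasting replaces the model without redeploying the pipeline, which sidesteps the separate serving layer that a Lambda architecture would otherwise require.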