3 Simple Ideas to Make Your Life Easier With Kafka

Apache Kafka was open-sourced by LinkedIn in early 2011. Despite its initial limitations, it was a huge success and became the de-facto standard for streaming data. Its performance, the ability to replay events, and support for multiple independent consumers were some of the features that disrupted the streaming arena.
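To see the last two features in action, here is a minimal sketch using the plain Java kafka-clients consumer; the broker address, the topic name “events”, and the group id are assumptions made up for the example:

    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;

    public class ReplayExample {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            // Each group.id tracks its own offsets, so this consumer reads the
            // topic independently of every other consumer group.
            props.put("group.id", "replay-demo");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("events")); // assumed topic name
                consumer.poll(Duration.ofSeconds(1));  // first poll joins the group and gets partitions
                // Replay: events aren't deleted on consumption, so we can rewind
                // every assigned partition back to the beginning and read again.
                consumer.seekToBeginning(consumer.assignment());
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> System.out.printf(
                        "offset=%d key=%s value=%s%n", r.offset(), r.key(), r.value()));
            }
        }
    }

Running the same class with a different group.id reads the same events again without affecting any other consumer, which is what makes independent consumers (and replays) so cheap in Kafka.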

But Kafka has also been known for its steep learning curve and for being difficult to operate. In my experience, both have improved a lot in the last few years, but the original gotchas remain:

How to Build and Debug a Flink Pipeline Based in Event Time

One of the most important concepts in stream-processing frameworks is time, and there are several notions of it:

  • Processing time: It’s based on the clock of the machine where the event is processed. It’s easy to use, but because that clock reading changes every time the job runs, the results aren’t consistent: each execution over the same input may produce different results. That isn’t an acceptable trade-off for many use cases.
  • Event time: It’s based on a field of the event itself, typically a timestamp. Each time you execute the pipeline over the same input, you obtain the same result, which is a good thing. But it also tends to be a bit harder to work with, for several reasons that we’ll cover later in the article (see the sketch after this list).
  • Ingestion time: It’s based on the timestamp at which the event was ingested into the streaming platform (Kafka), usually carried in the event’s metadata. From a Flink perspective, we can consider it a particular mix of event time and processing time, with the disadvantages of both.
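To make event time concrete, here is a minimal sketch using Flink’s DataStream API and the WatermarkStrategy introduced in Flink 1.11; the SensorReading type, its field names, and the five-second out-of-orderness bound are assumptions made up for the example:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    import java.time.Duration;

    public class EventTimeExample {

        // Hypothetical event type: a timestamped sensor reading.
        public static class SensorReading {
            public String sensorId;
            public long timestampMillis; // the event-time field carried in the event
            public double value;

            public SensorReading() {} // required by Flink's POJO serializer

            public SensorReading(String sensorId, long timestampMillis, double value) {
                this.sensorId = sensorId;
                this.timestampMillis = timestampMillis;
                this.value = value;
            }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // In a real job this would come from Kafka; a fixed list keeps the sketch self-contained.
            DataStream<SensorReading> readings = env.fromElements(
                    new SensorReading("sensor-1", 1_000L, 0.5),
                    new SensorReading("sensor-1", 30_000L, 0.7),
                    new SensorReading("sensor-1", 61_000L, 0.2));

            DataStream<SensorReading> withTimestamps = readings.assignTimestampsAndWatermarks(
                    // Tolerate events arriving up to 5 seconds out of order.
                    WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                            // Event time comes from the event itself, not from the wall clock.
                            .withTimestampAssigner((reading, recordTs) -> reading.timestampMillis));

            // The window is evaluated in event time: re-running the job over the
            // same input always produces the same windows and the same results.
            withTimestamps
                    .keyBy(r -> r.sensorId)
                    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                    .sum("value")
                    .print();

            env.execute("event-time-example");
        }
    }

Because the windows are assigned in event time, this job is reproducible over the same input, which is exactly the property that processing time cannot offer.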

Apache Flink has excellent support for event-time processing, probably the best among the stream-processing frameworks available. For more information, you can read Notions of Time: Event Time and Processing Time in the official documentation. If you prefer videos, Streaming Concepts and Introduction to Flink - Event Time and Watermarks is a good explanation.

A Gentle (and Practical) Introduction to Apache Avro (Part 1)

This post is a gentle introduction to Apache Avro. It grew out of several discussions with Dario Cazas about what’s possible with Apache Avro: he did some research and summarized it in an email. I found myself looking for that email several times to forward it to different teams to clarify doubts about Avro. After a while, I thought it could be useful for others, and that’s how this series of three posts was born.

In summary, Apache Avro is a binary format with the following characteristics:
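  • It’s compact and fast: data is encoded in binary, with no field names or other markup in the payload.
  • It’s schema-based: schemas are defined in JSON, and data is always written and read using a schema.
  • Code generation is optional: records can be produced and consumed generically at runtime.
  • It supports schema evolution: data written with one version of a schema can be read with a compatible newer one.

To give a taste of what that looks like in practice, here is a minimal sketch using the official Java library (org.apache.avro:avro); the User schema and its fields are made up for the example:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryDecoder;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.DecoderFactory;
    import org.apache.avro.io.EncoderFactory;

    import java.io.ByteArrayOutputStream;

    public class AvroExample {
        // A made-up schema for the example, defined in JSON.
        private static final String SCHEMA_JSON = """
                {
                  "type": "record",
                  "name": "User",
                  "fields": [
                    {"name": "name", "type": "string"},
                    {"name": "age",  "type": "int"}
                  ]
                }""";

        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

            // Build a record generically: no code generation needed.
            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Ada");
            user.put("age", 36);

            // Serialize: the binary encoding carries no field names,
            // which is why Avro payloads are so compact.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
            encoder.flush();

            // Deserialize: reading always requires a schema.
            BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
            GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
            System.out.println(decoded); // {"name": "Ada", "age": 36}
        }
    }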