Voxxed Days Microservices: Katherine Stanley on “Creating Event-Driven Microservices: The Why, How and What”

Hi Katherine, tell us who you are and what led you into microservices?

My name is Katherine Stanley, although I am generally known as Kate. I work at IBM as a software engineer on a product called IBM Event Streams. IBM Event Streams is a fully supported Apache Kafka offering with value-add capabilities. I first became interested in microservices when I was working for the IBM WebSphere Liberty team. My role was to work with our customers to make sure getting started with Liberty was as easy as possible. My team quickly discovered that although Liberty is easy to use, our customers were getting stuck on microservices. I have always enjoyed public speaking, so since 2016 I have been presenting at conferences all around the world to help people understand and write better microservices.

Apache Kafka In Action

Challenges and Limitations in Messaging

Messaging is a fairly simple paradigm for the transfer of data between applications and data stores. However, there are multiple challenges associated with it:

  1. Limited scalability, because a single broker can become a bottleneck.
  2. Strained message brokers as message sizes grow.
  3. Consumers that cannot keep up with the rate at which messages are produced.
  4. Consumers lacking fault tolerance, since a message, once consumed, may be gone forever.

Messaging Limitations Due to:

High Volume

Traditional messaging brokers typically run on a single host or node. As a result, the broker can become a bottleneck, constrained by that single host and its local storage.

What Happened to Hadoop? What Should You Do Now?

Apache Hadoop emerged on the IT scene in 2006 with the promise to provide organizations with the capability to store an unprecedented volume of data using commodity hardware. This promise not only addressed the size of the data sets, but also the type of data, such as data generated by IoT devices, sensors, servers, and social media that businesses were increasingly interested in analyzing. The combination of data volume, velocity, and variety was popularly known as Big Data.

Schema-on-read played a vital role in the popularity of Hadoop. Businesses thought they no longer had to worry about the tedious process of defining which tables contained what data and how they were connected to each other, a process that took months and during which not a single data warehouse query could be executed. In this brave new world, businesses could store as much data as they could get their hands on in Hadoop-based repositories known as data lakes and worry later about how it would be analyzed.

Ingesting Data From Apache Kafka to TimescaleDB

The Glue Conference (better known as GlueCon) is always a treat for me. I've been speaking there since 2012, and this year I presented a session explaining how I use StreamSets Data Collector to ingest content delivery network (CDN) data from compressed CSV files in S3 to MySQL for analysis, using the Kickfire API to turn IP addresses into company data. The slides are here, and I'll write it up in a future post.

As well as speaking, I always enjoy the keynotes (shout out to Leah McGowen-Hare for an excellent presentation on inclusion!) and breakouts. In one of this year's breakouts, Diana Hsieh, director of product management at Timescale, focused on the TimescaleDB time series database.

JavaLand 2019 Retrospective

In this article, I talk about my impressions from the JavaLand 2019 conference. This was my second time at the international conference, which, this year, took place in the theme park "Phantasialand" in Bruehl, near Cologne, Germany, from March 18th-20th.

Additionally, you can download the presentations here, as well as lecture recordings here.

This Week in Spring: Spring Cloud, Kafka, and More

Hi, Spring fans, and welcome to another installment of This Week in Spring! This week, I’m in colorful Chicago, Illinois, and magnificent Milwaukee, Wisconsin. I am so excited to be in both places. I’m in Chicago for the GOTO Chicago event, which is always fun, and I’m in Milwaukee for a meetup and, of course, to partake of local delicacies like Kopps, a Cousins sub, and a Spotted Cow beer. Life is great!

Did you see this epic image of the Chicago Lakeshore area I took the other day? Spring is truly in full swing! (I’ll ignore the buckets of rain that have been dumped on Chicago and Milwaukee just a day after that photo was taken…)

Lambda Architecture: How to Build a Big Data Pipeline, Part 1

The Internet of Things is the current hype, but what challenges do we face in consuming such large amounts of data? With a large number of smart devices generating huge amounts of data, it would be ideal to have a big data system that holds the history of that data. However, batch processing of large data sets is too slow to keep the state of devices up to date in real time. These two requirements, real-time tracking and accurately up-to-date results, can be satisfied by building a lambda architecture.

"Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data."
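The batch/speed split described above can be sketched in a few lines of Python. This is a toy model, not tied to any particular framework, and the device names and event shapes are invented for the example:

```python
from collections import defaultdict

# Immutable master dataset: every raw event ever received (input to the batch layer).
master_events = [("device-1", 10), ("device-2", 5), ("device-1", 7)]

def batch_view(events):
    """Batch layer: periodically recompute a complete, accurate view."""
    totals = defaultdict(int)
    for device, reading in events:
        totals[device] += reading
    return dict(totals)

# Speed layer: incrementally maintain a view over events that arrived
# after the last batch run.
realtime_totals = defaultdict(int)

def on_new_event(device, reading):
    realtime_totals[device] += reading

def query(device, batch):
    """Serving layer: merge the accurate batch view with the real-time view."""
    return batch.get(device, 0) + realtime_totals[device]

batch = batch_view(master_events)
on_new_event("device-1", 3)      # arrives after the batch run
print(query("device-1", batch))  # 10 + 7 from batch, plus 3 real-time = 20
```

The key trade-off this illustrates: the batch view is comprehensive but stale, the real-time view is fresh but partial, and only the merged query satisfies both requirements at once.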

How Are Your Microservices Talking?


In this piece, which originally appeared here, we’ll look at the challenges of refactoring SOAs to MSAs, in light of different communication types between microservices, and see how pub-sub message transmission — as a managed Apache Kafka Service — can mitigate or even eliminate these challenges.

If you’ve developed or updated any kind of cloud-based application in the last few years, chances are you’ve done so using a Microservices Architecture (MSA), rather than the slightly more dated Service-Oriented Architecture (SOA). So, what’s the difference?
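As a minimal illustration of the pub-sub decoupling the piece discusses, here is an in-process sketch in Python. The topic and service names are invented; in a real MSA a broker such as Kafka sits between the services, so the publisher never needs to know who is listening:

```python
from collections import defaultdict

# Minimal in-process pub-sub: each topic maps to a list of subscriber callbacks.
subscribers = defaultdict(list)

def subscribe(topic, handler):
    subscribers[topic].append(handler)

def publish(topic, event):
    # The publisher only names the topic; it has no knowledge of the consumers.
    for handler in subscribers[topic]:
        handler(event)

received = []
subscribe("orders.created", lambda e: received.append(("billing", e)))
subscribe("orders.created", lambda e: received.append(("shipping", e)))

publish("orders.created", {"order_id": 42})
print(received)  # both services saw the event, with no coupling to the publisher
```

Adding a third consumer requires no change to the publisher, which is exactly the property that makes pub-sub attractive when refactoring point-to-point SOA calls into microservices.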

Apache Kafka Topics: Architecture and Partitions

What Is a Kafka Topic?

A Kafka topic is essentially a named stream of records. Kafka stores each topic as a log, and a topic's log is broken up into several partitions, which Kafka spreads across multiple servers or disks. In other words, a topic in Kafka is a category, stream name, or feed.

Topics in Apache Kafka follow a pub-sub style of messaging: a topic can have anywhere from zero to many subscribers, called consumer groups. Topics are broken up into partitions for speed, scalability, and size.
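To see why keyed records preserve ordering, here is a toy model of how records map to partitions. Kafka's default partitioner actually uses a murmur2 hash of the key; Python's `hashlib.md5` stands in here only to get a stable hash, and the key names are invented:

```python
import hashlib

NUM_PARTITIONS = 3  # a topic configured with three partitions

def partition_for(key: str) -> int:
    # Hash the record key and map it onto one of the topic's partitions.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Records with the same key always land in the same partition,
# which is what gives Kafka per-key ordering within a topic.
partitions = {}
for key in ["user-1", "user-2", "user-3", "user-1"]:
    partitions.setdefault(partition_for(key), []).append(key)

print(partitions)
```

Because each partition can live on a different broker, adding partitions is how a topic scales out, at the cost of ordering guarantees holding only within a partition, not across the whole topic.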

Apache Kafka Load Testing Using JMeter

In simple terms, Apache Kafka is a hybrid of a distributed database and a message queue. Many large companies use it to process terabytes of information, and its feature set has made it widely popular. For example, LinkedIn uses it to stream data about user activity, while Netflix uses it for data collection and buffering for downstream systems like Elasticsearch, Amazon EMR, Mantis, and many more.

Now, let’s highlight some features of Kafka that are important for Kafka load testing: