From Batch ML To Real-Time ML

Real-time machine learning refers to the application of machine learning algorithms that continuously learn from incoming data and make predictions or decisions in real-time. Unlike batch machine learning, where data is collected over a period and processed in batches offline, real-time ML operates instantaneously on streaming data, allowing for immediate responses to changes or events.

Common use cases include fraud detection in financial transactions, predictive maintenance in manufacturing, recommendation systems in e-commerce, and personalized content delivery in media. Challenges in building real-time ML capabilities include managing high volumes of streaming data efficiently, ensuring low latency for timely responses, maintaining model accuracy and performance over time, and addressing privacy and security concerns associated with real-time data processing. This article delves into these concepts and provides insights into how organizations can overcome these challenges to deploy effective real-time ML systems.

Building Real-Time Applications to Process Wikimedia Streams Using Kafka and Hazelcast

In this tutorial, developers, solution architects, and data engineers can learn how to build high-performance, scalable, and fault-tolerant applications that react to real-time data using Kafka and Hazelcast.

We will be using Wikimedia as a real-time data source. Wikimedia provides various streams and APIs (Application Programming Interfaces) to access real-time data about edits and changes made to their projects. For example, this source provides a continuous stream of updates on recent changes, such as new edits or additions to Wikipedia articles. Developers and solution architects often use such streams to monitor and analyze the activity on Wikimedia projects in real-time or to build applications that rely on this data, like this tutorial. Kafka is great for event streaming architectures, continuous data integration (ETL), and messaging systems of record (database). Hazelcast is a unified real-time stream data platform that enables instant action on data in motion by combining stream processing and a fast data store for low-latency querying, aggregation, and stateful computation against event streams and traditional data sources. It allows you to build resource-efficient, real-time applications quickly. You can deploy it at any scale from small edge devices to a large cluster of cloud instances.

When Speed Matters: Real-Time Stream Processing With Hazelcast and Redpanda

In this tutorial, we explore the powerful combination of Hazelcast and Redpanda to build high-performance, scalable, and fault-tolerant applications that react to real-time data.

Redpanda is a streaming data platform designed to handle high-throughput, real-time data streams. Compatible with Kafka APIs, Redpanda provides a highly performant and scalable alternative to Apache Kafka. Redpanda's unique architecture enables it to handle millions of messages per second while ensuring low latency, fault tolerance, and seamless scalability.

Boosting Similarity Search With Stream Processing

The goal of similarity search and vector databases is to find similar results to the search query for unstructured data, such as text, images, and videos. The unstructured data first is vectorized, and stored in a vector format. There are publicly available tools to create vectors from unstructured data; similarly, there are vector databases to store and perform similarity searches. This is important because of the rising popularity of Large Language Models (LLMs) and their combination with vector databases.

Here, we present a hybrid approach by taking the strengths of vector databases and boosting them with traditional search and filtering techniques based on real-time stream processing. Vector databases are good for building high-performance vector search applications. On the other hand, Hazelcast can be used for real-time stream processing and fast data storage for structured data (filters, tags, and contextual data). Some vector databases offer filtering on structural data, which can be used or replaced with Hazelcast. In either case, Hazelcast can be used to enrich your query results, from additional resources.

Enriching Kafka Applications With Contextual Data

Developing high-performance large-stream processing applications is a challenging task. Choosing the right tool(s) is crucial to get the job done; as developers, we tend to focus on performance, simplicity, and cost. However, the cost becomes relatively high if we end up with two or more tools to do the same task. Simply put, you need to multiply development time, deployment time, and maintenance costs by the number of tools. 

Kafka is great for event streaming architectures, continuous data integration (ETL), and messaging systems of record (database). However, Kafka has some challenges, such as a complex architecture with many moving parts, it can’t be embedded, and it’s a centralized middleware, just like a database. Moreover, Kafka does not offer batch processing, and all intermediate steps are materialized to disk in Kafka. This leads to enormous disk space usage. 

How To Create a Failover Client Using the Hazelcast Viridian Serverless

Failover is an important feature of systems that rely on near-constant availability. In Hazelcast, a failover client automatically redirects its traffic to a secondary cluster when the client cannot connect to the primary cluster. Consider using a failover client with WAN replication as part of your disaster recovery strategy. In this tutorial, you’ll update the code in a Java client to automatically connect to a secondary, failover cluster if it cannot connect to its original, primary cluster. You’ll also run a simple test to make sure that your configuration is correct and then adjust it to include exception handling. You'll learn how to collect all the resources that you need to create a failover client for a primary and secondary cluster, create a failover client based on the sample Java client, test failover and add exception handling for operations.

Step 1: Set Up Clusters and Clients

Create two Viridian Serverless clusters that you’ll use as your primary and secondary clusters and then download and connect sample Java clients to them.