Stateful Streaming in Spark

Apache Spark is a fast, general-purpose cluster computing system. With Spark, we can do both batch processing and stream processing. Its stream processing is near real-time, meaning it processes data in micro-batches rather than one record at a time. I discussed Spark Streaming in more detail in my previous blog; in this one, I'll cover Stateful Streaming in Spark. So let's start!

What Is Stateful Streaming?

Stateful stream processing means that a "state" is shared between events, so past events can influence how current events are processed.
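As a minimal sketch of that idea (not code from the original post), the example below uses the Java API of Spark Structured Streaming: the running per-word count is the state Spark carries forward across micro-batches. It assumes a local socket source on port 9999, fed by something like "nc -lk 9999".

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.*;

public class StatefulWordCount {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("StatefulWordCount")
        .master("local[*]")
        .getOrCreate();

    // Read lines from a socket source (assumes `nc -lk 9999` is running locally).
    Dataset<Row> lines = spark.readStream()
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load();

    // Split each line into words and keep a running count per word.
    // That running count is the "state" Spark maintains across micro-batches.
    Dataset<Row> counts = lines
        .select(explode(split(col("value"), " ")).as("word"))
        .groupBy("word")
        .count();

    // "complete" output mode emits the full updated state after every micro-batch.
    StreamingQuery query = counts.writeStream()
        .outputMode("complete")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
```

Because the query runs in "complete" output mode, every micro-batch prints the full, updated counts, so you can watch the state evolve as new lines arrive.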

Unary Streaming via gRPC

If you want to send multiple requests and receive multiple responses over a single connection, it is time to use gRPC's streaming concepts. A typical REST API over HTTP/1.1 doesn't give you this kind of bidirectional streaming, whereas gRPC is built on HTTP/2, which does. In this blog, I will go through how you can do Unary Streaming via gRPC.
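To make the unary case concrete, here is a minimal grpc-java server sketch. It assumes a hypothetical Greeter service whose proto file declares rpc SayHello (HelloRequest) returns (HelloReply); the GreeterGrpc, HelloRequest, and HelloReply classes would be generated by protoc, so none of these names come from the original post.

```java
import io.grpc.Server;
import io.grpc.ServerBuilder;
import io.grpc.stub.StreamObserver;

// Assumes GreeterGrpc, HelloRequest, and HelloReply were generated by protoc
// from a proto declaring: rpc SayHello (HelloRequest) returns (HelloReply);
public class UnaryGreeterServer extends GreeterGrpc.GreeterImplBase {

  @Override
  public void sayHello(HelloRequest request, StreamObserver<HelloReply> responseObserver) {
    // Unary RPC: exactly one request in, exactly one response out.
    HelloReply reply = HelloReply.newBuilder()
        .setMessage("Hello, " + request.getName())
        .build();
    responseObserver.onNext(reply);
    responseObserver.onCompleted();
  }

  public static void main(String[] args) throws Exception {
    Server server = ServerBuilder.forPort(50051)
        .addService(new UnaryGreeterServer())
        .build()
        .start();
    server.awaitTermination();
  }
}
```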

Types of APIs or Streaming in gRPC

gRPC supports four types of APIs: unary, server streaming, client streaming, and bidirectional streaming.
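Unary is the simplest of the four: the client sends one request and gets back one response. As a usage sketch for the hypothetical Greeter service above (again, the class names are assumed to come from protoc, not from the original post), a client calls the generated blocking stub like this:

```java
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Client for the hypothetical Greeter service sketched above.
public class UnaryGreeterClient {
  public static void main(String[] args) {
    ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 50051)
        .usePlaintext()
        .build();

    // A blocking stub is the natural fit for a unary call: one request, one response.
    GreeterGrpc.GreeterBlockingStub stub = GreeterGrpc.newBlockingStub(channel);

    HelloReply reply = stub.sayHello(
        HelloRequest.newBuilder().setName("gRPC").build());
    System.out.println(reply.getMessage());

    channel.shutdown();
  }
}
```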

What Is Recon? How We Augmented XML and JSON For Streaming Data

Ever since applications started moving data records, we’ve needed ways to annotate those records with formatting instructions. Many of these record notation formats are familiar to developers. For example, according to the IETF, JSON “defines a small set of formatting rules for the portable representation of structured data.” In practical terms, JSON makes it possible to describe name/value pairs, arrays, or a series of values as a human-readable document.

Similarly, the prolific XML markup language makes it possible to encode data into a format that is both human- and machine-readable. Without the formatting instructions provided by JSON and XML, machines would lack the context necessary to express and analyze documents. However, what happens when data cannot be expressed as a document?

How to Build Scalable, Stateful Java Services in Under 15 Minutes

Five years ago, when I started tracking media buzz around stateful architectures, I’d see a few articles every month about running stateful containers. That's about when Caitie McCaffrey first shared her awesome presentation about building scalable stateful architectures. Since then, the dominant software paradigm has become functional application design. The actor model and other object-oriented paradigms are still in use, but database-centric RESTful architectures are the standard means of building web applications today.

However, the tides are beginning to shift. Due to innovations like the blockchain, growing demand for real-time applications, the digitization of OT assets, and the proliferation of cheap compute resources at the network edge, there’s renewed interest in decentralized application architectures. As such, there’s also been increased focus on stateful applications. For example, at least five Apache Foundation projects (Beam, Flink, Spark, Samza, and TomEE) are touting statefulness as a benefit today. Modern applications communicate across multiple application silos and must span real-world machines, devices, and distributed data centers around the world. Stateful application architectures provide a way to abstract away the logistical effort of state management, thereby reducing the development and management effort necessary to operate massive-scale distributed applications.

Build a Scalable, Stateful To-Do List in 15 Minutes or Less

For the rest of this post, I want to disprove the notion that building scalable, stateful applications is a task too complex for everyday Java developers. To illustrate how easily a stateful application can be set up, we’ll walk through a tutorial for building a simple to-do list using the Swim platform. You can find all the source code for the to-do list tutorial here on GitHub.