High-Performance Batch Processing Using Apache Spark and Spring Batch

Batch processing handles large amounts of data: it is a method of running high-volume, repetitive data jobs in which each job performs a specific task without user interaction. This style of processing dates back to the early days of computing and is still in use today, with new methods, algorithms, and tools continuing to appear. The input to a batch processing system is offline data gathered in any format, such as CSV files, RDBMS tables, or well-formed JSON or XML files. The system applies the specified processing to every record and then generates the desired output.
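To make the read, process, write cycle concrete, here is a minimal sketch in plain Java. It is not Spring Batch or Spark code. The class name BatchSketch, the Order type, the "id,amount" line format, and the chunk size are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

public class BatchSketch {

    /** Hypothetical record type, used only for this illustration. */
    public static final class Order {
        final String id;
        final double amount;

        Order(String id, double amount) {
            this.id = id;
            this.amount = amount;
        }
    }

    /** Parses one input line of the assumed form "id,amount". */
    public static Order parse(String line) {
        String[] parts = line.split(",");
        return new Order(parts[0].trim(), Double.parseDouble(parts[1].trim()));
    }

    /**
     * Reads every input item, transforms it with the processor, and hands
     * the results to the writer in chunks of at most chunkSize items.
     * Returns the total number of items written.
     */
    public static <I, O> int runBatch(Iterable<I> input,
                                      Function<I, O> processor,
                                      Consumer<List<O>> writer,
                                      int chunkSize) {
        List<O> chunk = new ArrayList<>(chunkSize);
        int written = 0;
        for (I item : input) {
            chunk.add(processor.apply(item));
            if (chunk.size() == chunkSize) {
                // Chunk is full: write a copy, then start a new chunk.
                writer.accept(new ArrayList<>(chunk));
                written += chunk.size();
                chunk.clear();
            }
        }
        if (!chunk.isEmpty()) {
            // Write the final, partially filled chunk.
            writer.accept(new ArrayList<>(chunk));
            written += chunk.size();
        }
        return written;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("A-1, 10.50", "A-2, 4.25", "A-3, 99.00");
        int count = runBatch(
                lines,
                BatchSketch::parse,
                chunk -> chunk.forEach(o -> System.out.println(o.id + " -> " + o.amount)),
                2);
        System.out.println("Records written: " + count);
    }
}
```

Writing in chunks rather than one record at a time is a common design choice in batch jobs, because it keeps memory use bounded while reducing the number of write operations.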

Note: Batch processing and stream processing are two different concepts.

Spring Batch — Read From XML

In this example, I will show you how to read from XML and, for now, simply print the details to the console. However, you can also save this XML data to a relational or non-relational database, or write it to a CSV file. I will write other articles to cover those topics.
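Before bringing in Spring Batch, it may help to see the underlying idea with the JDK alone. The sketch below streams through an XML document with the standard StAX API and prints one field per record to the console. It is not the Spring Batch reader used in this example. The class name XmlReadSketch, the students/student/name element names, and the sample data are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlReadSketch {

    /** Returns the trimmed text of every <name> element in the given XML. */
    public static List<String> readNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    int event = reader.next();
                    // React only to the start of a <name> element.
                    if (event == XMLStreamConstants.START_ELEMENT
                            && "name".equals(reader.getLocalName())) {
                        // getElementText() reads up to the matching end tag.
                        names.add(reader.getElementText().trim());
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical sample input.
        String xml = "<students>"
                + "<student><name>Ann</name></student>"
                + "<student><name>Bob</name></student>"
                + "</students>";
        readNames(xml).forEach(System.out::println);
    }
}
```

A streaming parser reads one element at a time instead of loading the whole document into memory, which is why this style suits large batch inputs.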

I have used the latest versions of Spring Boot and Spring Batch:

Running Spring Batch Applications in PCF

1. Overview

Many developers are building microservices these days and deploying them to cloud platforms. Pivotal Cloud Foundry (PCF) is one of the best-known cloud platforms. Applications deployed on PCF are mostly long-running processes that never end, such as web applications, SPAs, or REST-based services. PCF monitors all of these long-running instances, and if one goes down, it spins up a new instance to replace the failed one. This works well when the process is expected to run continuously, but for a batch process it is overkill: the container keeps running all the time, mostly idle, and the cost adds up. Many developers mistakenly believe that PCF cannot run a batch application that is started only on request. That is not correct.

Spring Batch enables us to create a batch application and provides many out-of-the-box features that reduce boilerplate code. More recently, Spring Cloud Task was added to the list of Spring projects for creating short-running processes. With both of these options, we can create microservices, deploy them on PCF, and then stop them so that PCF doesn't try to self-heal them. With the help of PCF Scheduler, we can then schedule the task to run at a certain time of day. Let's see in this article how we can do that in very few steps.