On Some Aspects of Big Data Processing in Apache Spark, Part 4: Versatile JSON and YAML Parsers

In my previous post, I presented design patterns for programming Spark applications in a modular, maintainable, and serializable way. This time I demonstrate how to configure versatile JSON and YAML parsers for use in Spark applications.

A Spark application typically needs to ingest JSON data, transform it, and then save it to a data source. YAML data, on the other hand, is needed primarily to configure Spark jobs. In both cases, the data has to be parsed against a predefined template. In a Java Spark application, these templates are POJOs. How can a single parser method be programmed to process a wide class of such POJO templates, with data taken from local or distributed file systems?
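
As an illustration of the idea (not the parser from the post itself), here is a minimal Java sketch of such a generic method, assuming Jackson's ObjectMapper and YAMLFactory are on the classpath; the PojoParser class name and the isYaml flag are hypothetical.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.yaml.YAMLFactory;

import java.io.IOException;
import java.io.InputStream;

// A sketch of a generic parser: it maps a JSON or YAML stream onto any POJO template class.
public final class PojoParser {

    // Jackson mapper for JSON input
    private static final ObjectMapper JSON_MAPPER = new ObjectMapper();
    // Jackson mapper for YAML input
    private static final ObjectMapper YAML_MAPPER = new ObjectMapper(new YAMLFactory());

    // Parse the given stream (a local file, a classpath resource, or a stream
    // opened via the Hadoop FileSystem API) into an instance of templateClass.
    public static <T> T parse(InputStream in, Class<T> templateClass, boolean isYaml)
            throws IOException {
        ObjectMapper mapper = isYaml ? YAML_MAPPER : JSON_MAPPER;
        return mapper.readValue(in, templateClass);
    }
}
```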

On Some Aspects of Big Data Processing in Apache Spark, Part 3: How To Deal With Malformed Data?

In my previous post, I presented design patterns to program Spark applications in a modular, maintainable, and serializable way. This time I demonstrate a solution for dealing with malformed date/time data and how to set a default value for malformed data.
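
For illustration only (not necessarily the exact solution described in the post), one possible approach uses Spark's built-in to_date and coalesce functions; the event_date column and the 1970-01-01 default below are hypothetical. Note that in Spark 3.x the handling of some malformed inputs also depends on spark.sql.legacy.timeParserPolicy.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.coalesce;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.to_date;

// Sketch: to_date() yields null for values it cannot parse,
// and coalesce() replaces those nulls with a default date.
public final class MalformedDates {

    public static Dataset<Row> withDefaultDate(Dataset<Row> df) {
        return df.withColumn(
                "event_date",                                      // hypothetical column name
                coalesce(
                        to_date(col("event_date"), "yyyy-MM-dd"),  // null when malformed
                        to_date(lit("1970-01-01"), "yyyy-MM-dd")   // assumed default value
                ));
    }
}
```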

When I worked on a big data project, my tasks were to load data in different formats (JSON, ORC, etc.) from different sources (Kafka, Hadoop Distributed File System, Apache Hive, Postgres, Oracle), transform the data, and then save it to the same or different sources. The simplest task was to load data from a single source (Postgres) and save it to another (Hive), without any transformations.
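
A minimal sketch of that simplest task, assuming a SparkSession with Hive support and hypothetical JDBC connection settings (URL, table names, credentials):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

// Read a Postgres table over JDBC and save it as a Hive table, with no transformations.
public final class PostgresToHive {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("postgres-to-hive")
                .enableHiveSupport()
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://db-host:5432/source_db") // hypothetical URL
                .option("dbtable", "public.events")                        // hypothetical table
                .option("user", "spark")
                .option("password", "secret")
                .load();

        df.write().mode(SaveMode.Overwrite).saveAsTable("target_db.events"); // hypothetical Hive table
        spark.stop();
    }
}
```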

On Some Aspects of Big Data Processing in Apache Spark, Part 2: Useful Design Patterns

In my previous post, I demonstrated how Spark creates and serializes tasks. In this post, I show how to use this knowledge to construct Spark applications in a maintainable and upgradable way, while avoiding "task not serializable" exceptions.

When I participated in a big data project, I needed to program Spark applications to move and transform data between relational and distributed databases, such as Apache Hive. I found such applications to have a number of pitfalls, so problems like "hard to read code" and "a method too large to fit on a single screen" need to be avoided to let us focus on deeper issues. Also, Spark jobs are similar in structure: data is loaded from one or more databases, transformed, and then saved to one or more databases. So it seems reasonable to apply GoF design patterns to programming Spark applications.
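
As a hedged illustration of that idea (not necessarily one of the patterns presented in the post), the GoF Template Method pattern fixes the load-transform-save skeleton of a Spark job while concrete jobs supply the steps; the SparkJob class name is hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Template Method sketch: the skeleton is fixed, subclasses implement the steps.
public abstract class SparkJob {

    public final void run(SparkSession spark) {
        Dataset<Row> input = load(spark);        // step 1: read from one or more sources
        Dataset<Row> output = transform(input);  // step 2: apply business logic
        save(output);                            // step 3: write to one or more sinks
    }

    protected abstract Dataset<Row> load(SparkSession spark);

    protected abstract Dataset<Row> transform(Dataset<Row> input);

    protected abstract void save(Dataset<Row> output);
}
```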

On Some Aspects of Big Data Processing in Apache Spark, Part 1: Serialization

Many beginner Spark programmers encounter a "Task not serializable" exception when they try to break their Spark applications into Java classes. There are a number of posts that instruct developers on how to solve this problem, as well as excellent overviews of Spark. Nonetheless, I think it is worthwhile to look at the Spark source code to see where and how tasks get serialized and such exceptions are thrown, in order to better understand those instructions.
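
To make the problem concrete, here is a hypothetical minimal example of how the exception typically arises; the Formatter class and its prefix field are made up for illustration.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

// The lambda passed to map() reads a field of the enclosing class, so Spark must
// serialize the whole Formatter instance; because Formatter is not Serializable,
// the task serialization check fails.
public class Formatter {   // note: does NOT implement java.io.Serializable

    private final String prefix = ">> ";

    public JavaRDD<String> format(JavaSparkContext sc) {
        JavaRDD<String> lines = sc.parallelize(Arrays.asList("a", "b", "c"));
        // 'prefix' is really 'this.prefix', so 'this' is captured by the closure
        return lines.map(line -> prefix + line);   // fails: "Task not serializable"
    }
}
```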

This post is organized as follows: