On Some Aspects of Big Data Processing in Apache Spark, Part 3: How To Deal With Malformed Data?

In my previous post, I presented design patterns for writing Spark applications in a modular, maintainable, and serializable way. This time I demonstrate a way to deal with malformed date/time data and to set a default value for malformed data.

When I worked on a big data project, my tasks were to load data in different formats (JSON, ORC, etc.) from different sources (Kafka, Hadoop Distributed File System, Apache Hive, Postgres, Oracle), transform the data, and save it to the same or a different source. The simplest task was to load data from a single data source (Postgres) and save it to another (Hive) without any transformations.