Big Data File Formats Explained

Apache Spark supports many data formats, including the ubiquitous CSV and the web-friendly JSON. Two formats commonly used for big data analytics are Apache Parquet and Apache Avro.

In this post, we’re going to cover the properties of these four formats (CSV, JSON, Parquet, and Avro) when used with Apache Spark.
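As a rough illustration of how these four formats are read in Spark, here is a minimal PySpark-style sketch. The helper name `load_dataset`, the option choices, and any file paths are assumptions made for this example and are not part of Spark itself. Reading Avro additionally assumes the external `spark-avro` package is available to the Spark session.

```python
# Sketch only: `load_dataset` and READER_OPTIONS are hypothetical names.
# Per-format options passed to Spark's generic DataFrameReader.
READER_OPTIONS = {
    # CSV has no embedded schema, so ask Spark to use the header row
    # for column names and to infer column types by scanning the data.
    "csv": {"header": "true", "inferSchema": "true"},
    # JSON: Spark infers the schema from the records by default.
    "json": {},
    # Parquet stores its schema in the file metadata.
    "parquet": {},
    # Avro also carries its schema; assumes spark-avro is on the classpath.
    "avro": {},
}


def load_dataset(spark, fmt, path):
    """Read `path` as a DataFrame using the reader for format `fmt`.

    `spark` is expected to be a pyspark.sql.SparkSession.
    """
    if fmt not in READER_OPTIONS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return spark.read.format(fmt).options(**READER_OPTIONS[fmt]).load(path)
```

With an existing SparkSession named `spark`, a call such as `load_dataset(spark, "parquet", "data/events.parquet")` would return a DataFrame; the path here is hypothetical. The same call shape works for all four formats, which makes it easy to swap one for another when comparing them.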

Apache Parquet vs. CSV Files

You have surely read about Google Cloud services (e.g. BigQuery, Dataproc), Amazon Redshift Spectrum, and Amazon Athena, and now you are looking to take advantage of one or two of them. However, before you jump into the deep end, you will want to familiarize yourself with the benefits of using Apache Parquet instead of regular text, CSV, or TSV files. If you are not thinking about how to optimize for these new query service models, you are throwing money out the window.

What Is Apache Parquet?

Apache Parquet is a columnar storage format with the following characteristics: