Part 2 – How to Hive on GCP using Google Dataproc and Cloud Storage

In Part 1 of this series, we saw how to create a Google Dataproc cluster, create external tables in Hive that point to data stored on Cloud Storage, and perform exploratory data analysis in a staging environment. As part of this analysis, we found that our sample datasets contained roughly:

  • 11% non-conforming records in the Green Taxi Y2019 dataset
  • 33% non-conforming records in the Yellow Taxi Y2019 dataset

Identifying non-conforming records is an important step in exploratory data analysis, because such records can lead to wrong or faulty interpretation of results. So, as the next step, we will create a new environment, i.e. new external tables in Hive that contain only the valid data required for deep-dive analysis, with the non-conforming records eliminated.