Five Tips to Fasten Your Skewed Joins in Apache Spark

Joins are one of the most fundamental transformations in a typical data processing routine. A Join operator makes it possible to correlate, enrich and filter across two input datasets. The two input datasets are generally classified as a left dataset and a right dataset based on their placing with respect to the Join clause/operator.

Fundamentally, a Join works on a conditional statement that includes a boolean expression based on the comparison between a left key derived from a record from the left dataset and a right key derived from a record from the right dataset. The left and the right keys are generally called ‘Join Keys’. The boolean expression is evaluated against each pair of records across the two input datasets. Based on the boolean output from the evaluation of the expression, the conditional statement includes a selection clause to select either one of the records from the pair or a combined record of the records forming the pair.