Why Your Spark Apps Are Slow Or Failing, Part II: Data Skew and Garbage Collection

The second part of the series “Why Your Spark Apps Are Slow or Failing” follows Part I on memory management and deals with issues that arise with data skew and garbage collection in Spark. Like many performance challenges with Spark, the symptoms increase as the scale of the data handled by the application increases.

What Is Data Skew?

In an ideal Spark application run, when Spark wants to perform a join, for example, join keys would be evenly distributed and each partition that needed processing would be nicely organized. However, real business data is rarely so neat and cooperative. We often end up with less than ideal data organization across the Spark cluster that results in degraded performance due to data skew.