Let’s Unblock: Spark Setup (Intellij)

Before getting our hands dirty with code, let's set up our environment and prepare our IDE to work with the Scala language and SBT.

Prerequisites:

  1. Java (preferably JDK 8+).
  2. IntelliJ IDEA (Community or Ultimate).

IDE Setup:

Let's start by configuring the plugin required for the Scala and SBT environment with the following steps:

Accumulator and Broadcast Variables in Spark

At a high level, accumulators and broadcast variables are both shared variables in Spark. In distributed computing, understanding closures is very important. The scope and life cycle of variables and methods when code runs on a cluster often confuse programmers, and most of the time you will end up with an exception like this:

org.apache.spark.SparkException: Job aborted due to stage failure: 
Task not serializable: java.io.NotSerializableException: ...
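
To make the cause of that exception concrete, here is a minimal sketch in plain Java with no Spark dependency. It assumes only that task closures are shipped with standard Java serialization, which is what Spark does by default. The names ClosureSerializationSketch, Formatter, SerializableFunction, and canSerialize are hypothetical and chosen for illustration. A function that captures a non-serializable object fails to serialize, while a self-contained one succeeds.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

final class ClosureSerializationSketch {

    // Hypothetical helper that does NOT implement Serializable.
    static class Formatter {
        String format(int x) {
            return "value=" + x;
        }
    }

    // A serializable function type, standing in for a task closure.
    interface SerializableFunction extends Serializable {
        String apply(int x);
    }

    // Returns true if Java serialization accepts the object,
    // false if it raises NotSerializableException.
    static boolean canSerialize(Object task) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(task);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Captures a Formatter instance, so the Formatter must be
    // serialized along with the function.
    static SerializableFunction capturingTask() {
        Formatter formatter = new Formatter();
        return x -> formatter.format(x);
    }

    // Captures nothing from the enclosing scope.
    static SerializableFunction selfContainedTask() {
        return x -> "value=" + x;
    }

    public static void main(String[] args) {
        System.out.println("capturing task serializable: "
            + canSerialize(capturingTask()));
        System.out.println("self-contained task serializable: "
            + canSerialize(selfContainedTask()));
    }
}
```

In this sketch the fix is either to make the captured class Serializable or to create the helper inside the function so that nothing has to be captured. The same reasoning applies to objects referenced from functions passed to Spark operations.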

Apache Spark for the Impatient

Below is a list of the most important Spark topics for anyone who does not have time to read an entire book but wants to discover the power of this distributed computing framework.

Architecture

Spark Architecture Diagram

Tips and Best Practices to Take Advantage of Spark 2.x

Apache Spark 2.0 and later versions introduced major improvements that make Spark execute faster, rendering many earlier tips and best practices obsolete. This post first gives a quick overview of the changes and then offers some tips for taking advantage of them.

This post is an excerpt from the eBook Getting Started with Spark 2.x: From Inception to Production, which you can download to learn more about Spark 2.x.