Apache Spark: Resilient Distributed Datasets

The RDD is both Apache Spark's core representation of a large dataset and its abstraction for working with that dataset. This section covers the former, and the following sections cover the latter. According to the seminal paper on Spark, "RDDs are immutable, fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators." Let's dissect this description to truly understand the ideas behind the RDD concept.
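Before dissecting each property, it may help to see all of them in one place. The following is a minimal sketch in Scala (the application name, variable names, and local master setting are illustrative, not prescribed by the paper) showing an RDD being partitioned, transformed with operators, and explicitly persisted in memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddCapabilities {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-capabilities").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A parallel data structure: the data is split across 8 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)

    // Control partitioning to influence data placement.
    val repartitioned = numbers.repartition(4)

    // Manipulate the data with a rich set of operators (map, filter, reduce, ...).
    val evens = repartitioned.filter(_ % 2 == 0)

    // Explicitly persist an intermediate result in memory for reuse.
    evens.persist(StorageLevel.MEMORY_ONLY)

    println(evens.count()) // first action computes and caches the partitions
    println(evens.sum())   // second action reuses the cached partitions

    sc.stop()
  }
}
```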

Immutable

RDDs are designed to be immutable, which means you can't modify a specific row in the dataset an RDD represents. Instead, you call one of the available RDD operations to transform the rows into the shape you want, and that operation returns a new RDD. The original RDD stays unchanged, and the new RDD contains the transformed data. Immutability requires each RDD to carry its lineage information, which Spark leverages to provide fault tolerance efficiently.
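A short sketch makes this concrete. Assuming an existing SparkContext named `sc` (as provided by spark-shell), the `map` transformation below returns a new RDD and leaves the original untouched, and each RDD can report the lineage it carries:

```scala
val original = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map returns a *new* RDD; it does not modify `original` in place.
val doubled = original.map(_ * 2)

println(original.collect().mkString(", ")) // 1, 2, 3, 4, 5  (unchanged)
println(doubled.collect().mkString(", "))  // 2, 4, 6, 8, 10

// Each RDD records its lineage (how it was derived from other RDDs);
// Spark uses this to recompute lost partitions for fault tolerance.
println(doubled.toDebugString)
```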