Apache Spark: Resilient Distributed Datasets

The RDD is both Apache Spark's core representation of a large dataset and its abstraction for working with that dataset. This section covers the former, and the following sections cover the latter. According to the seminal paper on Spark, "RDDs are immutable, fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize data placement, and manipulate them using a rich set of operators." Let's dissect this description to truly understand the ideas behind the RDD concept.
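Before dissecting each property, it may help to see all of them in one place. The following is a minimal sketch in Scala (the application name, variable names, and local master setting are illustrative, not prescribed by the paper) showing an RDD being partitioned, transformed with operators, and explicitly persisted in memory:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddCapabilities {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rdd-capabilities").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A parallel data structure: the data is split across 8 partitions.
    val numbers = sc.parallelize(1 to 1000, numSlices = 8)

    // Control partitioning to influence data placement.
    val repartitioned = numbers.repartition(4)

    // Manipulate the data with a rich set of operators (map, filter, reduce, ...).
    val evens = repartitioned.filter(_ % 2 == 0)

    // Explicitly persist an intermediate result in memory for reuse.
    evens.persist(StorageLevel.MEMORY_ONLY)

    println(evens.count()) // first action computes and caches the partitions
    println(evens.sum())   // second action reuses the cached partitions

    sc.stop()
  }
}
```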

Immutable

RDDs are designed to be immutable, which means you can't modify a specific row in the dataset an RDD represents. Instead, you call one of the available RDD operations to transform the rows into the shape you want, and that operation returns a new RDD. The original RDD stays unchanged, and the new RDD contains the transformed data. Immutability requires each RDD to carry its lineage information, which Spark leverages to provide fault tolerance efficiently.
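A short sketch makes this concrete. Assuming an existing SparkContext named `sc` (as provided by spark-shell), the `map` transformation below returns a new RDD and leaves the original untouched, and each RDD can report the lineage it carries:

```scala
val original = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map returns a *new* RDD; it does not modify `original` in place.
val doubled = original.map(_ * 2)

println(original.collect().mkString(", ")) // 1, 2, 3, 4, 5  (unchanged)
println(doubled.collect().mkString(", "))  // 2, 4, 6, 8, 10

// Each RDD records its lineage (how it was derived from other RDDs);
// Spark uses this to recompute lost partitions for fault tolerance.
println(doubled.toDebugString)
```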