Configuring Spark-Submit

In the vast landscape of big data processing, Apache Spark stands out as a powerful and versatile framework. While developing Spark applications is crucial, deploying and executing them efficiently is equally vital. One key aspect of deploying Spark applications is the use of "spark-submit," a command-line interface that facilitates the submission of Spark applications to a cluster.

Understanding Spark Submit

At its core, spark-submit is the entry point for submitting Spark applications. Whether you are dealing with a standalone cluster, Apache Mesos, Hadoop YARN, or Kubernetes, spark-submit acts as the bridge between your developed Spark code and the cluster where it will be executed.
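To make that concrete, here is a sketch of what a typical invocation might look like when submitting to YARN; the class name, JAR path, and resource figures are placeholders to adapt to your own application, not values taken from this article:

```shell
# Hypothetical example: submit an application JAR to a YARN cluster.
# Class name, JAR path, and resource settings below are placeholders.
spark-submit \
  --class com.example.MySparkApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 4g \
  --num-executors 4 \
  target/scala-2.12/my-spark-app.jar \
  input/path output/path
```

The same command shape works for the other cluster managers by changing the --master value (for example, a Spark standalone or Kubernetes URL).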

Let’s Unblock: Spark Setup (IntelliJ)

Before getting our hands dirty with code, let's set up our environment and prepare our IDE to understand the Scala language and SBT plugins.

Prerequisites:

  1. Java (preferably JDK 8+).
  2. IntelliJ IDEA (Community or Ultimate edition).

IDE Setup:

So let's start configuring the plugins required for the Scala and SBT environment with the following steps:

Deno JS: Introduction

Joke

The story goes back to the time of the dinosaurs, the gigantic reptiles that flourished on Earth millions of years ago. After their time was up, they went extinct. As Justin Timberlake said, "What goes around comes back around." The same happened to the dinosaurs: in this digital age, they have returned under the name Deno. Again, a bad analogy to start with, so please pardon me for it.

Definition

According to deno.land, Deno is a simple, modern, and secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust with Tokio. If the name sounds like Node, that is no accident: Deno is an anagram of Node, created by Node's own developer, Ryan Dahl. As a first-time Deno user, it felt like Node in a new package, but there are substantial differences between the two, which we will come to in the latter part.
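To get a first feel for the runtime, here is a minimal sketch (the URL is just an example). Deno runs TypeScript directly, and it is secure by default, so network access must be granted explicitly:

```typescript
// hello.ts — run with: deno run --allow-net hello.ts
// Without the --allow-net flag, Deno's sandbox denies the network call.
const res = await fetch("https://deno.land");
console.log(`deno.land responded with status ${res.status}`);
```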

Deno JS: CRUD and MySQL Connection

Deno is a new backend runtime built around the JavaScript ecosystem. It is a simple, modern, and secure runtime for JavaScript and TypeScript that uses V8 and is built in Rust. In this tutorial, we'll learn how to develop a complete CRUD web application using Deno (with Oak as the framework) and MySQL as the database.
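Before the CRUD routes, here is a sketch of the database connection using the third-party deno_mysql module; the hostname, credentials, and database name are placeholders for your own setup:

```typescript
import { Client } from "https://deno.land/x/mysql/mod.ts";

// Placeholder credentials — substitute your own MySQL setup.
// Run with: deno run --allow-net app.ts
const client = await new Client().connect({
  hostname: "127.0.0.1",
  username: "root",
  password: "password",
  db: "employees_db",
});

// Quick sanity check that the connection works.
const rows = await client.query("SELECT 1 AS ok");
console.log(rows);
```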

Overview

This project takes the example of an Employee object with four attributes: id, name, department, and isActive. We will start by adding an employee object to our DB and then perform further operations on it.
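As a starting point, here is a sketch of the Employee type and the first two routes using Oak; the version is pinned because Oak's request-body API has changed across releases, the port and paths are arbitrary choices, and an in-memory array stands in for MySQL:

```typescript
import { Application, Router } from "https://deno.land/x/oak@v10.6.0/mod.ts";

// The Employee shape described above.
interface Employee {
  id: number;
  name: string;
  department: string;
  isActive: boolean;
}

// In-memory store standing in for the MySQL table in this sketch.
const employees: Employee[] = [];

const router = new Router();
router
  // Create: add an employee from the JSON request body.
  .post("/employees", async (ctx) => {
    const employee: Employee = await ctx.request.body({ type: "json" }).value;
    employees.push(employee);
    ctx.response.status = 201;
    ctx.response.body = employee;
  })
  // Read: list all employees.
  .get("/employees", (ctx) => {
    ctx.response.body = employees;
  });

const app = new Application();
app.use(router.routes());
app.use(router.allowedMethods());
await app.listen({ port: 8000 });
```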

MapReduce and YARN: Hadoop Processing Unit Part 1

In my previous article, HDFS Architecture and Functioning, I described the filesystem of Hadoop. Today, we will be learning about its processing unit. There are mainly two mechanisms by which processing takes place in a Hadoop cluster: MapReduce and YARN. In a traditional system, the major focus is on bringing the data to the processing unit. Hadoop shifts the focus toward bringing the processing power to the data to enable parallel processing. So, here, we will be going through MapReduce and, in part two, YARN.

MapReduce

As the name suggests, processing mainly takes place in two steps: mapping and reducing. There is a single master (the JobTracker) that controls job execution on multiple slaves (the TaskTrackers). The JobTracker accepts MapReduce jobs submitted by the client, pushes map and reduce tasks out to the TaskTrackers, and monitors their status. The TaskTrackers' major function is to run the map and reduce tasks; they also manage and store the intermediate output of those tasks.
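To make the two steps concrete, here is a conceptual word-count sketch in TypeScript; this illustrates the map/shuffle/reduce idea only and is not Hadoop API code:

```typescript
// Map step emits (key, value) pairs; reduce step aggregates values per key.
type Pair = [string, number];

// Map step: turn one input line into (word, 1) pairs.
function mapLine(line: string): Pair[] {
  return line.split(/\s+/).filter(Boolean).map((word): Pair => [word, 1]);
}

// Shuffle: group intermediate pairs by key (the framework does this in Hadoop).
function shuffle(pairs: Pair[]): Map<string, number[]> {
  const groups = new Map<string, number[]>();
  for (const [word, count] of pairs) {
    groups.set(word, [...(groups.get(word) ?? []), count]);
  }
  return groups;
}

// Reduce step: sum the counts collected for each word.
function reduceWord(word: string, counts: number[]): Pair {
  return [word, counts.reduce((a, b) => a + b, 0)];
}

const lines = ["big data needs big clusters", "data moves less compute moves more"];
const intermediate = lines.flatMap(mapLine);
const counts = [...shuffle(intermediate)].map(([word, c]) => reduceWord(word, c));
console.log(counts); // [["big", 2], ["data", 2], ["needs", 1], ...]
```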

HDFS Architecture and Functioning

First of all, thank you for the overwhelming response to my previous article (Big Data and Hadoop: An Introduction), in which I gave a brief overview of Hadoop and its benefits. If you have not read it yet, please spend some time on it to get a glimpse of this rapidly growing technology. In this article, we will take a deep dive into the file system used by Hadoop, called HDFS (Hadoop Distributed File System).

HDFS is the storage part of the Hadoop system. It is a block-structured file system in which each file is divided into blocks of a predetermined size, and these blocks are stored across a cluster of one or more machines. HDFS works with two types of nodes: the NameNode (master) and the DataNodes (slaves). So let's dive in.
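As a quick illustration of block splitting, here is a small sketch assuming the common 128 MB default block size (configurable via dfs.blocksize); the 500 MB file size is just an example:

```typescript
// Sketch of HDFS block splitting, assuming the 128 MB default block size
// used by Hadoop 2.x and later (configurable via dfs.blocksize).
const BLOCK_SIZE_MB = 128;

// Return the sizes of the blocks a file of the given size is split into.
function splitIntoBlocks(fileSizeMb: number): number[] {
  const blocks: number[] = [];
  let remaining = fileSizeMb;
  while (remaining > 0) {
    blocks.push(Math.min(BLOCK_SIZE_MB, remaining));
    remaining -= BLOCK_SIZE_MB;
  }
  return blocks;
}

// A 500 MB file becomes three full blocks plus one 116 MB block;
// each block is then stored (and replicated) across the DataNodes.
console.log(splitIntoBlocks(500)); // [128, 128, 128, 116]
```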

Big Data and Hadoop: An Introduction

A very common misconception is that big data is some technology or tool. Big data, in reality, is a very large, heterogeneous set of data. This data mostly comes in an unstructured or semi-structured form, so extracting useful information from it is very difficult. With the growth of cloud technologies, the rate at which data is generated has increased tremendously.

Therefore, we need a solution that allows us to process such "Big Data" at optimal speed without compromising data security. There is a cluster of technologies that deal with this, and one of the best is Hadoop.