MapReduce Algorithms: Understanding Data Joins, Part II

It’s been a while since I last posted, and like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time. In this post we resume our series on implementing the algorithms found in Data-Intensive Text Processing with MapReduce, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and skip the reducing phase entirely. In the last post on data joins we covered reduce-side joins. Reduce-side joins are easy to implement, but they have the drawback that all data is sent across the network to the reducers. Map-side joins offer substantial performance gains since we avoid the cost of sending data across the network. However, unlike reduce-side joins, map-side joins require that very specific criteria be met. Today we will discuss the requirements for map-side joins and how we can implement them.

Map-Side Join Conditions

To take advantage of map-side joins, our data must meet one of the following criteria:

Import Data From Hadoop Into CockroachDB

CockroachDB can natively import data from HTTP endpoints, object stores via their respective APIs, and local/NFS mounts. The full list of supported schemes can be found here.

It does not support the HDFS file scheme, so we're left to our imagination to find alternatives.
As previously discussed, the Hadoop community is working on Hadoop Ozone, a native scalable object store with S3 API compatibility. For reference, here's my article demonstrating CockroachDB and Ozone integration. The limitation here is that you need to run Hadoop 3 to get access to it. 

What if you're on Hadoop 2? There are a couple of choices I can think of off the top of my head. One approach is to expose WebHDFS and IMPORT using an HTTP endpoint. The second is to leverage the previously discussed Minio to expose HDFS via HTTP or S3. Today, we're going to look at both approaches.

My setup consists of a single-node pseudo-distributed Hadoop cluster with Apache Hadoop 2.10.0 running inside a VM provisioned by Vagrant. Minio runs as a service inside the VM, and CockroachDB runs inside a Docker container on my host machine.
  • Information on CockroachDB can be found here.
  • Information on Hadoop Ozone can be found here.
  • Information on Minio can be found here.
  1. Upload a file to HDFS.

I have a CSV file I created with my favorite data generator tool, Mockaroo.

curl "https://api.mockaroo.com/api/38ef0ea0?count=1000&key=a2efab40" > "part5.csv"
hdfs dfs -mkdir /data
hdfs dfs -chmod -R 777 /data
hdfs dfs -put part5.csv /data


Serverless Kafka in a Cloud-Native Data Lake Architecture

Apache Kafka became the de facto standard for processing data in motion. Kafka is open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use a serverless Kafka SaaS offering to focus on business logic. However, hybrid scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This blog post explores how to leverage cloud-native and serverless Kafka offerings in a hybrid cloud architecture. We start from the perspective of data at rest with a data lake and explore its relation to data in motion with Kafka.

Data at Rest - Still the Right Approach?

Data at rest means storing data in a database, data warehouse, or data lake. In many use cases this means the data is processed too late - even if a real-time streaming component (like Kafka) ingests it. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing an answer to your problem.

Top 10 June ’21 Big Data Articles to Read Now

Introduction

Big Data is now adopted by a lot of businesses. Its popularity and use are expanding globally. How awesome would it be to find the top trending Big Data articles in one place so that you can always stay up to date with the latest trends in technology? We dug into Google Analytics to find the top 10 most popular Big Data articles in June. Let's get started!

10. Kafka Administration and Monitoring UI Tools

Kafka is used for streaming data and much more! This article covers Kafka basics, Kafka administration, Kafka Manager, and monitoring tools.

5 Essential Diagnostic Views to Fix Hive Queries

A perpetual debate rages about the effectiveness of a modern-day data analyst in a distributed computing environment. Analysts are used to SQL returning answers to their questions in short order, and the typical RDBMS user is often unable to comprehend the root cause when queries don't return results for multiple hours. Opinions are divided, despite broad acceptance that query engines such as Hive and Spark are complex even for the best engineers. At Acceldata, we see full table scans run on multi-terabyte tables just to get a row count, which is, to say the least, taboo in the Hadoop world. The result is a frustrating conversation between cluster admins and data users, one that lacks supporting data because that data is hard to collect. It is also a fact that data needs to be converted into insights to make business decisions. More importantly, the value in big data needs to be unlocked without delays.

From here, we start at the point where the Hadoop admin/engineer is ready to unravel the scores of metrics and interpret the reasons for queries performing poorly and taking resources away from the cluster, causing:

Big-Data Project Guidelines

The aim of the following article is to share with you some of the most relevant guidelines for cloud-based big data projects that I’ve applied in my recent engagement. The following list is not exhaustive and may be extended according to each organization's or project's specifications.

Guidelines for Cloud-Based and Data-Based Projects

Data Storage

  • Use data partitioning and/or bucketing to reduce the amount of data to process.
  • Use the Parquet format to reduce the amount of data to store (see the sketch after this list).
  • Prefer Snappy compression for frequently consumed data, and prefer GZIP compression for data that is infrequently consumed.
  • Try as much as possible to store a few reasonably large files instead of many small files (roughly 256 MB to 1 GB each) to improve read/write performance and reduce costs; the ideal file size depends on the use case and the underlying block storage file system.
  • Consider a Delta Lake/Iceberg table format before managing schema evolution and data updates with custom solutions.
  • “Design by query” can help improve consumption performance; for instance, you can store the same data in different layouts and at different levels of detail depending on the consumption pattern.
  • Secure data stored on S3 with an appropriate model that uses versioning and archiving.
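
Below is a minimal Spark (Scala) sketch tying the first few points together: it writes a hypothetical events DataFrame as Snappy-compressed Parquet, partitioned by date. The paths and column names are illustrative assumptions, not part of the original guidelines.

import org.apache.spark.sql.{SaveMode, SparkSession}

object PartitionedParquetWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("partitioned-parquet-write")
      .master("local[*]")                  // local run, for illustration only
      .getOrCreate()

    // Hypothetical input: a CSV of events containing an event_date column.
    val events = spark.read.option("header", "true").csv("/data/events.csv")

    events.write
      .mode(SaveMode.Overwrite)
      .partitionBy("event_date")           // data partitioning on a query-friendly column
      .option("compression", "snappy")     // Snappy for frequently consumed data
      .parquet("/data/events_parquet")     // columnar Parquet output

    spark.stop()
  }
}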

Data Processing

  • When developing distributed applications, rethink your code to avoid data shuffling as much as possible, as it degrades performance.
  • Broadcasting small tables can help achieve better performance (see the sketch after this list).
  • Once again, use the Parquet format to reduce the amount of data to process, thanks to predicate pushdown and projection pushdown.
  • When consuming data, use native data protocols as much as possible to stay close to the data and avoid unnecessary calls and protocol overhead.
  • Before choosing a computation framework, first identify whether your problem needs to be solved with parallelization or with distribution.
  • Consider merging and compacting your files to improve read performance and reduce cost (Delta.io can help achieve that).
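
Here is a minimal Spark (Scala) sketch of the broadcast and pushdown points above, assuming a hypothetical sales Parquet dataset and a small in-memory country lookup table; all names and paths are illustrative.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Reading Parquet lets Spark push the column selection and filter down to the files.
    val sales = spark.read.parquet("/data/sales_parquet")
      .select("country_code", "amount")          // projection pushdown
      .filter($"amount" > 100)                   // predicate pushdown

    // Small lookup table, built inline here purely for illustration.
    val countries = Seq(("FR", "France"), ("US", "United States"))
      .toDF("country_code", "country_name")

    // broadcast() ships the small table to every executor, avoiding a shuffle of `sales`.
    val enriched = sales.join(broadcast(countries), Seq("country_code"))

    enriched.show()
    spark.stop()
  }
}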

Data Locality

  • Move the processing next to the data, not the opposite; the data is generally much larger than the jobs or scripts that process it.
  • Process data in the cloud and extract only the most relevant and necessary data.
  • Limit inter-region transfers.
  • Avoid as much data travel as possible between infrastructures, proxies, and patterns.

Various

  • Use a data lake for analytical & exploratory use cases and use operational databases for operational ones.
  • To ingest data, prefer config-based solutions like DMS/Debezium over custom solutions. Also, prefer CDC solutions for long-running ingests.
  • Keep structures (tables, prefixes, paths…) that are not meant to be shared private.

SQL as an Essential Tool for the Big Data Landscape

Introduction

The recent advancements in big data systems have led to faster processing, efficient distribution, and data storage in data lakes and data warehouses. This has led to a great migration of analytics from the traditional relational database world to the big data landscape.

The transition wasn’t as hard as expected thanks to the existence of SQL as a query language in big data systems that are highly distributed and scalable. 
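
As a small, hedged illustration of that point, the following Scala sketch runs ordinary SQL on Spark, a distributed engine; the view and column names are hypothetical.

import org.apache.spark.sql.SparkSession

object SqlOnBigData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-on-big-data")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Register a small, illustrative dataset as a SQL view.
    Seq(("books", 12.5), ("games", 40.0), ("books", 7.0))
      .toDF("category", "amount")
      .createOrReplaceTempView("orders")

    // The same SQL an analyst would write against an RDBMS, executed by a distributed engine.
    spark.sql(
      """SELECT category, SUM(amount) AS total
        |FROM orders
        |GROUP BY category
        |ORDER BY total DESC""".stripMargin
    ).show()

    spark.stop()
  }
}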

Big Data Trends to Consider in 2021

Intro

Big data is growing so fast it's almost hard to imagine. According to some studies, there are 40 times more bytes in the world than there are stars in the observable universe. There is simply an unimaginable amount of data being produced by billions of people every single day. The global market size predictions prove it beyond any doubt.

It’s not a question of if you will use big data in your daily business routine, it’s when you’re going to start using it (if somehow you haven’t yet). Big data is here and it’s here to stay for the foreseeable future.

‘mapPartitions’ in Apache Spark: 5 Key Benefits

'mapPartitions' is the only narrow transformation provided by the Apache Spark framework for partition-wise processing, meaning it processes each data partition as a whole. All the other narrow transformations, such as map and flatMap, process partitions record by record. If used judiciously, 'mapPartitions' can improve the performance and efficiency of the underlying Spark job many times over.

'mapPartitions' passes an iterator over the partition's data to the computing function and expects an iterator over a new data collection as the return value. Below is the 'mapPartitions' API applicable to a Dataset of type <T>: it expects a functional interface of type 'MapPartitionsFunction' to process each data partition as a whole, along with an Encoder of type <U>, where <U> represents the data type of the returned Dataset.
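
The article's API listing is not reproduced above, so as a rough sketch, here is a Scala example of partition-wise processing with 'mapPartitions'. The article describes the Java-flavored variant that takes a 'MapPartitionsFunction' and an 'Encoder'; the Scala overload below takes an Iterator-to-Iterator function and resolves the Encoder implicitly. The record type and values are hypothetical.

import org.apache.spark.sql.SparkSession

object MapPartitionsSketch {
  // Hypothetical record type used only for illustration.
  case class Reading(sensorId: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("map-partitions-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val readings = Seq(Reading("a", 1.0), Reading("a", 2.0), Reading("b", 3.0)).toDS()

    // The function receives an Iterator over one whole partition and must return
    // an Iterator over the output records, so per-partition setup (e.g. opening a
    // connection or loading a model) happens once per partition, not once per record.
    val scaled = readings.mapPartitions { partition =>
      val factor = 10.0                    // stand-in for expensive per-partition setup
      partition.map(r => r.copy(value = r.value * factor))
    }

    scaled.show()
    spark.stop()
  }
}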

Identify and Resolve Stragglers in Your Spark Application

Stragglers are detrimental to the overall performance of Spark applications and lead to wasted resources on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark job, find the root cause behind them, and apply the required fixes or put preventive measures in place.

What Is a Straggler in a Spark Application?

A straggler is a very slowly executing task belonging to a particular stage of a Spark application (every stage in Spark is composed of one or more tasks, each computing a single partition out of the total partitions designated for the stage). A straggler task takes an exceptionally long time to complete compared to the median or average time taken by the other tasks in the same stage. There could be multiple stragglers in a Spark job, present either in the same stage or across multiple stages.
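
As a back-of-the-envelope illustration of that definition (this is not an API offered by Spark itself), the small Scala sketch below flags tasks whose duration greatly exceeds the stage median; the 3x threshold and the sample durations are assumptions.

object StragglerCheck {
  def median(xs: Seq[Double]): Double = {
    val s = xs.sorted
    if (s.size % 2 == 1) s(s.size / 2) else (s(s.size / 2 - 1) + s(s.size / 2)) / 2.0
  }

  // Given the task durations of one stage (e.g. copied from the Spark UI),
  // return the (task index, duration) pairs that exceed `factor` times the median.
  def stragglers(taskDurationsSec: Seq[Double], factor: Double = 3.0): Seq[(Int, Double)] = {
    val m = median(taskDurationsSec)
    taskDurationsSec.zipWithIndex.collect {
      case (d, i) if d > factor * m => (i, d)
    }
  }

  def main(args: Array[String]): Unit = {
    val durations = Seq(12.0, 14.0, 13.0, 15.0, 96.0)  // one task is roughly 7x the median
    println(stragglers(durations))                     // prints List((4,96.0))
  }
}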

Deep Dive Into Join Execution in Apache Spark

Join operations are often used in a typical data analytics flow in order to correlate two data sets. Apache Spark, being a unified analytics engine, has also provided a solid foundation to execute a wide variety of Join scenarios.

At a very high level, a join operates on two input data sets by matching each data record of one input data set against the data records of the other input data set. On finding a match or a non-match (as per a given condition), the join operation can output either an individual matched record from one of the two data sets or a joined record, where the joined record represents the combination of the matched records from both data sets.
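
A minimal Spark (Scala) sketch of that idea, using hypothetical orders and customers data: the inner join emits combined (joined) records, while the left-semi join emits only the matching records from one side.

import org.apache.spark.sql.SparkSession

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "c1", 20.0), (2, "c2", 35.0), (3, "c9", 5.0))
      .toDF("order_id", "customer_id", "amount")
    val customers = Seq(("c1", "Alice"), ("c2", "Bob"))
      .toDF("customer_id", "name")

    // Joined records: the combination of matched records from both data sets.
    orders.join(customers, Seq("customer_id"), "inner").show()

    // Individual records from one data set only, kept when a match is found.
    orders.join(customers, Seq("customer_id"), "left_semi").show()

    spark.stop()
  }
}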

A Credible Approach of Big Data Processing and Analysis of Telecom Data

Telecom providers have a treasure trove of captive data: customer data, CDR (call detail records), call center interactions, tower logs, etc., and are metaphorically “sitting on a gold mine.” Ideally, each category of the generated data contains the following information:

  • Customer data consolidates customer ID, plan details, demographics, subscribed services, and spending patterns
  • The service data category consolidates customer type, customer history, complaint category, resolved queries, and so on
  • For smartphone subscribers, the location data category typically consolidates GPS-based data, roaming data, current location, frequently visited locations, etc.

Why Disintegration of Apache Zookeeper From Kafka Is in the Pipeline

The main objective of this article is to highlight why the bridge between Apache Zookeeper and Kafka is being cut, an upcoming project from the Apache Software Foundation. The proposed architecture/solution aims to make Kafka completely independent in delivering all of the functionality it currently offers with Zookeeper.

Article Structure

This article has been segmented into 4 parts.

Reducing Large S3 API Costs Using Alluxio

I. Introduction

Previous Works

There have been numerous articles and online webinars dealing with the benefits of using Alluxio as an intermediate storage layer between S3 data storage and the data processing system used for ingestion or retrieval of data (e.g., Spark, Presto).

To name a few use cases:

MapReduce and Yarn: Hadoop Processing Unit Part 1

In my previous article, HDFS Architecture and Functionality, I described the filesystem of Hadoop. Today, we will be learning about its processing unit. There are mainly two mechanisms by which processing takes place in a Hadoop cluster, namely MapReduce and YARN. In a traditional system, the major focus is on bringing data to the storage unit. In the Hadoop process, the focus shifts toward bringing the processing power to the data to enable parallel processing. So, here, we will go through MapReduce and, in part two, YARN.

MapReduce

As the name suggests, processing mainly takes place in two steps: mapping and reducing. There is a single master (the Job Tracker) that controls job execution on multiple slaves (Task Trackers). The Job Tracker accepts MapReduce jobs submitted by the client, pushes map and reduce tasks out to Task Trackers, and monitors their status. The Task Trackers' major function is to run the map and reduce tasks. They also manage and store the intermediate output of the tasks.
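
To make the two steps concrete, here is a hedged word-count sketch in Scala against the Hadoop MapReduce API (the classic example, not code from this article): the mapper emits a (word, 1) pair for every token, and the reducer sums the counts for each word.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

class TokenizerMapper extends Mapper[Object, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val word = new Text()

  override def map(key: Object, value: Text,
                   context: Mapper[Object, Text, Text, IntWritable]#Context): Unit = {
    // Map phase: emit (word, 1) for every whitespace-separated token in the line.
    value.toString.split("\\s+").filter(_.nonEmpty).foreach { token =>
      word.set(token)
      context.write(word, one)
    }
  }
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    // Reduce phase: sum all the counts emitted for this word.
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get
    context.write(key, new IntWritable(sum))
  }
}

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = HDFS input path, args(1) = HDFS output path (must not exist yet).
    val job = Job.getInstance(new Configuration(), "word count")
    job.setJarByClass(classOf[TokenizerMapper])
    job.setMapperClass(classOf[TokenizerMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}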

Hadoop Ecosystem: Hadoop Tools for Crunching Big Data

Hadoop Ecosystem

In this blog, let's understand the Hadoop Ecosystem. It is an essential topic to understand before you start working with Hadoop. This Hadoop ecosystem blog will familiarize you with industry-wide used Big Data frameworks, required for a Hadoop certification.

The Hadoop Ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems. You can consider it as a suite that encompasses a number of services (ingesting, storing, analyzing, and maintaining) inside it. Let us discuss and get a brief idea about how the services work individually and in collaboration.