MapReduce Algorithms: Understanding Data Joins, Part II

It’s been a while since I last posted, and like the last time I took a big break, I was taking some classes on Coursera. This time it was Functional Programming Principles in Scala and Principles of Reactive Programming. I found both of them to be great courses and would recommend taking either one if you have the time. In this post we resume our series on implementing the algorithms found in Data-Intensive Text Processing with MapReduce, this time covering map-side joins. As we can guess from the name, map-side joins join data exclusively during the mapping phase and completely skip the reducing phase. In the last post on data joins we covered reduce-side joins. Reduce-side joins are easy to implement, but they have the drawback that all data is sent across the network to the reducers. Map-side joins offer substantial performance gains since we avoid the cost of sending data across the network. However, unlike reduce-side joins, map-side joins require that very specific criteria be met. Today we will discuss the requirements for map-side joins and how we can implement them.

Map-Side Join Conditions

To take advantage of map-side joins, our data must meet one of the following criteria:

1. The datasets to be joined are already sorted by the same join key, have the same number of partitions, and keep all records for a given key in the same partition.
2. One of the datasets is small enough to fit in memory, so it can be shipped to every mapper and joined against the larger dataset as it streams through the map phase.
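For the first case, Hadoop ships with CompositeInputFormat, which can merge pre-sorted, identically partitioned inputs inside the mappers. The driver sketch below is only a minimal illustration using the new-API classes in org.apache.hadoop.mapreduce.lib.join; it is not the implementation from this series, and the class names and input paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;
import org.apache.hadoop.mapreduce.lib.join.TupleWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MapSideJoinDriver {

  // The mapper receives the join key plus a TupleWritable holding one value
  // from each input; since there is no reducer, the joined record is emitted directly.
  public static class JoinMapper extends Mapper<Text, TupleWritable, Text, Text> {
    @Override
    protected void map(Text key, TupleWritable values, Context context)
        throws java.io.IOException, InterruptedException {
      // values.get(0) comes from the first dataset, values.get(1) from the second.
      context.write(key, new Text(values.get(0) + "\t" + values.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Both inputs must already be sorted and identically partitioned on the join key.
    Path first = new Path(args[0]);
    Path second = new Path(args[1]);
    Path out = new Path(args[2]);

    // Build the join expression: "inner" for an inner join, "outer" for a full outer join.
    conf.set(CompositeInputFormat.JOIN_EXPR,
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class, first, second));

    Job job = Job.getInstance(conf, "map-side join");
    job.setJarByClass(MapSideJoinDriver.class);
    job.setMapperClass(JoinMapper.class);
    job.setNumReduceTasks(0);                    // the whole join happens map-side
    job.setInputFormatClass(CompositeInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, out);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because the join happens entirely during the map phase, the job runs with zero reducers and nothing is shuffled across the network.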

When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake

Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.

If not managed well, it slows down the performance of your data systems and leaves you with stale analytics. This kind of defeats the purpose, doesn’t it? HDFS stores small files inefficiently: they bloat NameNode memory, inflate the number of RPC calls, degrade block-scanning throughput, and reduce application-layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the problem of small files. Distributed file systems are great, but let’s face it: the more files your data is split across, the greater the overhead of reading it. So the idea is to optimize file sizes to best serve your use case, while also actively optimizing your data lake.
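One common mitigation is to compact directories of small files into fewer, larger files on a schedule. The sketch below is only an illustration using the public HDFS FileSystem API to concatenate every file under one directory into a single output file; the paths are hypothetical, and real compaction pipelines usually rewrite data into splittable container formats such as SequenceFile, ORC, or Parquet rather than raw concatenation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SmallFileCompactor {

  // Concatenates every file under inputDir into a single, larger output file.
  public static void compact(Configuration conf, Path inputDir, Path outputFile)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    try (FSDataOutputStream out = fs.create(outputFile)) {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (!status.isFile()) {
          continue; // skip subdirectories
        }
        try (FSDataInputStream in = fs.open(status.getPath())) {
          IOUtils.copyBytes(in, out, conf, false); // false: keep the output stream open
        }
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // args[0]: directory full of small files, args[1]: compacted output file
    compact(new Configuration(), new Path(args[0]), new Path(args[1]));
  }
}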

Data Processing Using Functions in Prosto: An Alternative to Map-Reduce and SQL

Why Prosto? Having Only Set Operations Is Not Enough

Typical data processing tasks have to access and analyze data stored in multiple tables. They could be called relations, collections, or sets in different systems, but we will refer to them as tables for simplicity. The general task of data processing is to derive new data from these multiple tables, and any solid data processing model must answer three important questions: how to compute new columns within one table, how to link tables, and how to aggregate data. Below we briefly describe how these tasks are solved in a traditional set-oriented model and where these solutions have significant flaws.

Calculation. Given a table, we frequently need to add a new column with values computed from other columns in the same table. Conceptually, the task is similar to defining a cell in a spreadsheet, for example, C1=A1+B1. Easy and natural? Yes. However, it is not so easy in traditional data processing frameworks. The main problem is that we need to define a whole new table, because adding a column to an existing table is not possible. The Prosto toolkit is intended to fix this flaw by providing a dedicated operation where a new column can be added in place, as in this example: ColumnC=ColumnA+ColumnB
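To make the contrast concrete, here is a toy column-oriented table written in Java. It is deliberately not Prosto’s actual API; it only illustrates the idea of a calculation operation that attaches a derived column to the existing table in place instead of producing a new table.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// A toy columnar table, used only to illustrate column-oriented calculation.
public class ColumnTable {
  private final Map<String, List<Double>> columns = new HashMap<>();

  public void addColumn(String name, List<Double> values) {
    columns.put(name, values);
  }

  // Adds a new column computed row by row from two existing columns, in place.
  public void calculate(String result, String left, String right, BinaryOperator<Double> fn) {
    List<Double> a = columns.get(left);
    List<Double> b = columns.get(right);
    List<Double> out = new ArrayList<>(a.size());
    for (int i = 0; i < a.size(); i++) {
      out.add(fn.apply(a.get(i), b.get(i)));
    }
    columns.put(result, out); // the same table simply gains a column
  }

  public List<Double> column(String name) {
    return columns.get(name);
  }

  public static void main(String[] args) {
    ColumnTable t = new ColumnTable();
    t.addColumn("ColumnA", List.of(1.0, 2.0, 3.0));
    t.addColumn("ColumnB", List.of(10.0, 20.0, 30.0));
    t.calculate("ColumnC", "ColumnA", "ColumnB", Double::sum); // ColumnC = ColumnA + ColumnB
    System.out.println(t.column("ColumnC")); // [11.0, 22.0, 33.0]
  }
}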

MapReduce and YARN: Hadoop Processing Unit, Part 1

In my previous article, HDFS Architecture and Functionality, I described the filesystem of Hadoop. Today, we will be learning about its processing unit. There are mainly two mechanisms by which processing takes place in a Hadoop cluster, namely, MapReduce and YARN. In a traditional system, the major focus is on bringing data to the storage unit. In the Hadoop process, the focus is shifted towards bringing the processing power to the data to enable parallel processing. So, here, we will be going through MapReduce and, in part two, YARN.

MapReduce

As the name suggests, processing mainly takes place in two steps: mapping and reducing. There is a single master (the Job Tracker) that controls job execution on multiple slaves (the Task Trackers). The Job Tracker accepts MapReduce jobs submitted by the client, pushes map and reduce tasks out to the Task Trackers, and monitors their status. The Task Trackers' major function is to run the map and reduce tasks. They also manage and store the intermediate output of the tasks.
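The canonical word count job shows the two steps in code: map tasks tokenize their input split and emit (word, 1) pairs, and reduce tasks receive all the pairs for a given word and sum them. A minimal version against the standard Hadoop Java API looks roughly like this.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: each input line is split into words, and (word, 1) is emitted.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: all counts for the same word arrive together and are summed.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}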

One Challenge With 10 Solutions

The technologies we use for data analytics have evolved a lot recently. Good old relational database systems are becoming less popular every day. Now, we have to find our way through several new technologies that can handle big (and streaming) data, preferably in distributed environments.

Python is all the rage now, but of course there are lots of alternatives as well. SQL will always shine, and some other oldies-but-goldies, which we should never underestimate, are still out there.

Drilling Into Big Data: A Gold Mine of Information

The volume of data generated every day is hard to fathom, as it keeps increasing at a rapid rate. Although data is everywhere, the intelligence that we can glean from it matters more. These large volumes of data are what we call "Big Data." Organizations generate and gather huge volumes of data believing that it might help them advance their products and improve their services. For example, a shop may have its customer information, stock details, purchase history, and website visits.

Oftentimes, organizations store this data for regular business activities but fail to use it for further analytics and business relationships. The data that goes unanalyzed and unused is what we call "Dark Data."

Big Data and Hadoop: An Introduction

A very common misconception is that big data is some technology or tool. Big data, in reality, is a very large, heterogeneous set of data. This data comes mostly in an unstructured or semi-structured form, so extracting useful information from it is very difficult. With the growth of cloud technologies, the rate at which data is generated has increased tremendously.

Therefore, we need a solution that allows us to process such "Big Data" at optimal speed without compromising data security. There is a cluster of technologies that deal with this, and one of the best is Hadoop.