HDFS Architecture and Functioning

First of all, thank you for the overwhelming response to my previous article (Big Data and Hadoop: An Introduction). In that article, I gave a brief overview of Hadoop and its benefits. If you have not read it yet, please take some time to read it for a glimpse of this rapidly growing technology. In this article, we will take a deep dive into HDFS (Hadoop Distributed File System), the file system used by Hadoop.

HDFS is the storage layer of the Hadoop system. It is a block-structured file system in which each file is divided into blocks of a predetermined size, and these blocks are stored across a cluster of one or more machines. HDFS follows a master/slave architecture with two types of nodes: the NameNode (master) and DataNodes (slaves). So let's dive in.
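To make the block structure concrete, here is a minimal sketch in Python of how a file's size maps to HDFS blocks. The 128 MB block size is Hadoop's default in recent versions (dfs.blocksize); the 300 MB file is a made-up example.

```python
# A minimal sketch of how HDFS splits a file into fixed-size blocks.
# The 128 MB default matches recent Hadoop versions; the file size
# used below is a made-up example, not from this article.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default block size


def split_into_blocks(file_size_bytes):
    """Return the sizes of the HDFS blocks a file of this size occupies."""
    full_blocks = file_size_bytes // BLOCK_SIZE
    remainder = file_size_bytes % BLOCK_SIZE
    sizes = [BLOCK_SIZE] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only uses what it needs
    return sizes


# Example: a 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
blocks = split_into_blocks(300 * 1024 * 1024)
print(len(blocks), [s // (1024 * 1024) for s in blocks])  # -> 3 [128, 128, 44]
```

The NameNode keeps the mapping from each block to the DataNodes that store its replicas; the file data itself flows between clients and DataNodes and never passes through the NameNode.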

Automating Hadoop Computations on AWS

The Hadoop framework provides a lot of useful tools for big data projects, but it is too complex to manage it all by yourself. Several months ago, I was deploying a Hadoop cluster using Cloudera, and I discovered that it works well only for an architecture in which compute and storage capacity are constant. Using a tool like Cloudera for a system that needs to scale is a nightmare. That is where cloud technologies come in and make our lives easier. Amazon Web Services (AWS) is the best option for this use case. AWS provides a managed Hadoop solution called Elastic MapReduce (EMR). EMR allows developers to quickly start a Hadoop cluster, do the necessary computations, and terminate it when all the work is done. To automate this process even further, AWS provides an SDK for the EMR service. Using it, you can launch a Hadoop task with a single command. I'll show how it is done in an example below.
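As a quick preview of what that looks like with the Python SDK (boto3), here is a minimal sketch. The cluster name, bucket paths, instance types, and EMR release label are placeholders rather than values from this article.

```python
# A minimal sketch of launching a transient EMR cluster with boto3
# (the AWS SDK for Python). Bucket names, script paths, instance types,
# and the release label are placeholders; adjust them to your own setup.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-demo-cluster",                 # hypothetical cluster name
    ReleaseLabel="emr-5.30.0",                 # an example EMR release
    Applications=[{"Name": "Spark"}],
    LogUri="s3://my-bucket/emr-logs/",         # placeholder log bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when steps finish
    },
    Steps=[
        {
            "Name": "average-review-length",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",   # standard EMR step runner
                "Args": ["spark-submit", "s3://my-bucket/scripts/avg_length.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",         # default EMR instance role
    ServiceRole="EMR_DefaultRole",             # default EMR service role
)

print("Started cluster:", response["JobFlowId"])
```

With KeepJobFlowAliveWhenNoSteps set to False, the cluster shuts itself down as soon as its steps complete, so you pay only for the minutes of actual computation.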

I am going to execute a Spark job on a Hadoop cluster in EMR. My goal is to compute the average comment length for each star rating (1-5) over a large dataset of customer reviews from amazon.com. Usually, to execute Hadoop computations, we need all the data to be stored in HDFS. But EMR integrates with S3, so we don't need to launch data instances and copy large amounts of data for the sake of a two-minute computation. This compatibility with S3 is a big advantage of using EMR. Many datasets are distributed via S3, including the one I'm using in this example (you can find it here).
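To illustrate what such a Spark step might look like, here is a minimal PySpark sketch. The S3 path and the column names (star_rating, review_body) are assumptions about the dataset layout, not taken from this article, so check them against the actual files before running.

```python
# A minimal PySpark sketch: average review length per star rating.
# The S3 path and the column names (star_rating, review_body) are
# assumptions about the dataset layout, not values from this article.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, length

spark = SparkSession.builder.appName("avg-review-length").getOrCreate()

# Read the reviews straight from S3 -- no copy into HDFS needed on EMR.
reviews = (
    spark.read
    .option("header", "true")
    .option("sep", "\t")                     # assuming a tab-separated file
    .csv("s3://my-bucket/amazon-reviews/")   # placeholder dataset path
)

result = (
    reviews
    .where(col("review_body").isNotNull())
    .groupBy("star_rating")
    .agg(avg(length(col("review_body"))).alias("avg_comment_length"))
    .orderBy("star_rating")
)

result.show()
spark.stop()
```

Because Spark on EMR reads s3:// paths natively, the same script works whether the data sits in S3 or HDFS; only the input path changes.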