Capacity and Compliance in Hybrid Cloud, Multi-Tenant Big Data Platforms

As organizations realize how data-driven insights can inform their strategic decisions and increase their ROI, the focus is shifting to building data lakes and data warehouses where big data can be safely archived. That data can then power data engineering, data science, business analytics, and operational analytics initiatives that improve operational efficiency, reduce operating costs, and support better strategic decisions. However, the exponential growth in the data we consume and generate every day makes a well-structured approach to capacity governance in the big data platform necessary.


Capacity governance and scalability engineering are interrelated disciplines: developing an appropriate scalability strategy for a big data platform requires a comprehensive understanding of compute and storage capacity demands, infrastructure supply, and the dynamics between them. Technical risk resolution and security compliance are equally important aspects of capacity governance.

How (and Why) to Move from Spark on YARN to Kubernetes

Apache Spark is among the most usable open-source distributed computing frameworks because it allows data engineers to parallelize the processing of large amounts of data across a cluster of machines.

When it comes to data operations, Spark provides a tremendous advantage because it aligns with the things that make data ops valuable. It is optimized for machine learning and AI workloads, it handles batch and real-time processing at scale, and it is adept at operating within different types of environments.

MapReduce and Yarn: Hadoop Processing Unit Part 1

In my previous article, HDFS Architecture and Functionality, I described the filesystem of Hadoop. Today, we will learn about its processing unit. There are two main mechanisms by which processing takes place in a Hadoop cluster: MapReduce and YARN. In a traditional system, the major focus is on bringing the data to the processing unit. In Hadoop, the focus shifts toward bringing the processing power to the data so that it can be processed in parallel. Here, we will go through MapReduce and, in part two, YARN.


As the name suggests, processing mainly takes place in two steps: mapping and reducing. A single master (the Job Tracker) controls job execution on multiple slaves (the Task Trackers). The Job Tracker accepts MapReduce jobs submitted by the client, pushes map and reduce tasks out to the Task Trackers, and monitors their status. The Task Trackers' major function is to run the map and reduce tasks. They also manage and store the intermediate output of those tasks.
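
To make the two steps concrete, here is a minimal in-process sketch of a word count written in the map-then-reduce style. It is only an illustration in plain Python, not the Hadoop API: the function names (map_phase, shuffle, reduce_phase, run_job) are hypothetical, and the grouping step stands in for the intermediate output that the Task Trackers would manage in a real cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key (stand-in for the intermediate output)."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, group, and reduce sequentially in a single process."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(["big data big cluster", "data node"]))
```

In a real cluster, each call to the map function would run as a separate task close to the data block it reads, and the reduce tasks would run only after the intermediate pairs had been grouped by key; the sketch runs everything sequentially to keep the data flow visible.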

Node.js and Yarn for Happy Local Package Development

This is not another praise piece for npm package management with Yarn, but rather a concise recipe for working with locally developed packages.

npm modules begin their lives when you init them on your local dev machine, but there comes a point when you want to test them out or simply use them with other Node.js projects you have.