5 Essential Diagnostic Views to Fix Hive Queries

A perpetual debate rages about the effectiveness of the modern-day data analyst in a distributed computing environment. Analysts are used to SQL returning answers to their questions in short order, and the typical RDBMS user cannot comprehend the root cause when a query returns nothing for hours. Opinions are divided, despite broad acceptance that query engines such as Hive and Spark are complex even for the best engineers. At Acceldata, we see full table scans run on multi-terabyte tables just to get a row count, which is, to say the least, taboo in the Hadoop world. What results is a frustrating conversation between cluster admins and data users, one devoid of supporting evidence because that evidence is hard to collect. It is also a fact that data must be converted into insights to drive business decisions; more importantly, the value in Big Data needs to be unlocked without delays.
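To make the row-count example concrete: Hive can usually answer a simple COUNT(*) from table statistics instead of scanning every block. Here is a minimal sketch, assuming a hypothetical table named web_logs and a Hive version that supports stats-based query answers:

```sql
-- Gather table-level statistics once (row count, raw data size, etc.).
-- This scans the table a single time instead of on every count query.
ANALYZE TABLE web_logs COMPUTE STATISTICS;

-- Allow the optimizer to answer simple aggregates from metastore stats.
SET hive.compute.query.using.stats=true;

-- With up-to-date statistics, this is answered from metadata,
-- not a full table scan.
SELECT COUNT(*) FROM web_logs;
```

The caveat is that the statistics must be current: with hive.stats.autogather enabled (the default), inserts through Hive keep them fresh, but data dropped directly into HDFS will not be reflected until the next ANALYZE.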

From here, we start at the point where the Hadoop admin or engineer is ready to unravel the scores of available metrics and interpret the reasons why queries perform poorly and pull resources away from the rest of the cluster, beginning with the first diagnostic view:

When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake

Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.

If not managed well, small files slow down the performance of your data systems and leave you with stale analytics. This kind of defeats the purpose, doesn't it? HDFS stores small files inefficiently: every file, directory, and block is an object in Namenode memory (roughly 150 bytes of heap each), so millions of small files bloat the Namenode heap, inflate RPC call volume, degrade block-scanning throughput, and reduce performance at the application layer. If you are a big data administrator on any modern data lake, you will invariably come face to face with the small files problem. Distributed file systems are great, but let's face it: the more you split the storage layer, the greater your overhead when reading those files. So the idea is to optimize file sizes to best serve your use case, while also continuously optimizing your data lake.
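On the Hive side, one common remedy is to compact small files into fewer, well-sized ones. Below is a minimal sketch; the table web_logs, the compacted copy web_logs_compacted, and the dt partition are hypothetical, and the size thresholds are illustrative rather than prescriptive. ALTER TABLE ... CONCATENATE applies to ORC-backed tables, while the hive.merge.* settings tell Hive to merge small output files as jobs write them:

```sql
-- Compact an existing ORC table partition in place by merging its
-- small files into fewer, larger ones (ORC tables only).
ALTER TABLE web_logs PARTITION (dt='2021-01-01') CONCATENATE;

-- Or prevent the problem at write time: have Hive merge small output
-- files at the end of a job, before they land in the table.
SET hive.merge.tezfiles=true;                 -- merge Tez job outputs
SET hive.merge.mapfiles=true;                 -- merge map-only job outputs
SET hive.merge.mapredfiles=true;              -- merge map-reduce job outputs
SET hive.merge.smallfiles.avgsize=134217728;  -- merge when avg file < 128 MB
SET hive.merge.size.per.task=268435456;       -- aim for ~256 MB merged files

-- Any rewrite done with these settings in effect, such as copying into
-- a compacted table, now produces well-sized files.
INSERT OVERWRITE TABLE web_logs_compacted
SELECT * FROM web_logs;
```

Targeting files at or above the HDFS block size (typically 128 MB) keeps the Namenode object count down and lets readers stream whole blocks instead of opening thousands of tiny files.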