Install and Configuration of Apache Hive-3.1.2 on Multi-Node

Apache Hive is a data warehouse system built on top of Apache Hadoop. Hive can be used for easy data summarization, ad-hoc queries, and analysis of large datasets stored in various databases or file systems integrated with Hadoop. Ideally, we use Hive to apply structure (tables) to large amounts of unstructured data persisted in HDFS and subsequently query that data for analysis.

The objective of this article is to provide a step-by-step procedure to install and configure the latest version of Apache Hive (3.1.2) on top of an existing multi-node Hadoop cluster. In a future post, I will detail how we can use Kibana for data visualization by integrating Elasticsearch with Hive. Apache Hadoop 3.2.0 was already deployed and running successfully in the cluster. Here is the list of environment details and required components.

Storing and Aggregating Time Series Data With Elastic Search

Time series data tends to be very large, and the number of records grows with the granularity level. If the granularity is one minute, we will get 60 records per hour for a single instance.

For example, suppose we want to store the CPU percentage of a device for each minute, and let's assume we are getting data for the last 30 days.
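To make the volume concrete, here is a minimal sketch of the arithmetic, assuming one sample per minute per device. The function name `records_for` and the 100-device figure are hypothetical and chosen only for illustration.

```python
# Rough estimate of how many records a single metric produces
# at one-minute granularity (one sample per minute per device).
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def records_for(days: int, instances: int = 1) -> int:
    """Number of per-minute samples for `instances` devices over `days` days."""
    return MINUTES_PER_HOUR * HOURS_PER_DAY * days * instances


per_day = records_for(1)                        # 60 * 24 = 1,440
per_month = records_for(30)                     # 1,440 * 30 = 43,200
fleet_month = records_for(30, instances=100)    # hypothetical fleet of 100 devices
print(per_day, per_month, fleet_month)
```

A single device already yields 43,200 records over 30 days, which is why aggregation matters once many devices and metrics are involved.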

How to Build your First Real-Time Streaming (CDC) System Part 1

Introduction

With the exponential growth of data and a lot of businesses moving online, it has become imperative to design systems that can act in real time or near real time to support business decisions. After working on multiple backend projects over many years, I finally got to build a real-time streaming platform. While working on the project, I started experimenting with different tech stacks for this problem, and I am sharing what I learned in a series of articles. Here is the first of them.

Target Audience

This post is aimed at engineers who are already familiar with microservices and Java and are looking to build their first real-time streaming pipeline. This POC is divided into four articles for readability. They are as follows:

Sprinkle Some ELK on Your Spring Boot Logs

One day, I heard about the ELK stack and about its advantages, so I decided to get my hands on it. Unfortunately, I struggled to find solid documentation and supplemental content on getting started. So, I decided to write my own.

Elasticsearch on Google Cloud Platform

This tutorial focuses on how to set up Elasticsearch on Google Cloud Platform (GCP). At the end of this tutorial, you will be able to connect to an Elasticsearch instance and use it. In this tutorial, we will deploy Elasticsearch on a compute VM (instead of using the one-click install, just for the kicks of it).

Before You Start

A basic understanding of Elasticsearch is useful. If you are new to it, you could browse through the material below first.