The Lakehouse: An Uplift of Data Warehouse Architecture

In short, the initial data warehouse architecture was designed to provide analytical insights by collecting data from various heterogeneous sources into a centralized repository, and it acted as a fulcrum for decision support and business intelligence (BI). But it has continued to face numerous challenges: more time spent on data model design because it supports only schema-on-write, the inability to store unstructured data, the tight coupling of compute and storage into an on-premises appliance, and so on.

This article highlights how the architectural pattern has been enhanced to transform the traditional data warehouse, rolling through the second-generation data lake platform and eventually arriving at the lakehouse. Although the present data warehouse supports a three-tier architecture with an online analytical processing (OLAP) server as the middle tier, it still lacks a consolidated platform for machine learning and data science, as well as the metadata, caching, and indexing layers that are not yet available as a separate tier.

Import Data From Hadoop Into CockroachDB

CockroachDB can natively import data from HTTP endpoints, object storage with respective APIs, and local/NFS mounts. The full list of supported schemes can be found here.

It does not support the HDFS file scheme, and we're left to our wild imagination to find alternatives.
As previously discussed, the Hadoop community is working on Hadoop Ozone, a native scalable object store with S3 API compatibility. For reference, here's my article demonstrating CockroachDB and Ozone integration. The limitation here is that you need to run Hadoop 3 to get access to it. 

What if you're on Hadoop 2? There are several choices I can think of off the top of my head. One approach is to expose WebHDFS and IMPORT using an HTTP endpoint. The second is to leverage the previously discussed Minio to expose HDFS via HTTP or S3. Today, we're going to look at both approaches.

My setup consists of a single-node pseudo-distributed Hadoop cluster with Apache Hadoop 2.10.0 running inside a VM provisioned by Vagrant. Minio runs as a service inside the VM, and CockroachDB runs inside a Docker container on my host machine.
  • Information on CockroachDB can be found here.
  • Information on Hadoop Ozone can be found here.
  • Information on Minio can be found here.
  1. Upload a file to HDFS.

I have a CSV file I created with my favorite data generator tool, Mockaroo.

curl "https://api.mockaroo.com/api/38ef0ea0?count=1000&key=a2efab40" > "part5.csv"
hdfs dfs -mkdir /data
hdfs dfs -chmod -R 777 /data
hdfs dfs -put part5.csv /data
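
With the file in HDFS, a rough sketch of the WebHDFS route looks like the following. It assumes WebHDFS is enabled on the NameNode (port 50070 on Hadoop 2), that the Vagrant VM is reachable from the CockroachDB container under the illustrative hostname hadoop-vm, and that a table named part5 matching the CSV columns already exists; adjust hostnames and add a column list to IMPORT if your version requires one.

# Sanity check: the file is readable over HTTP (curl follows the redirect to a DataNode).
curl -sL "http://hadoop-vm:50070/webhdfs/v1/data/part5.csv?op=OPEN" | head -5

# Import from the HTTP endpoint; skip = '1' drops the header row if the CSV has one.
cockroach sql --insecure -e \
  "IMPORT INTO part5 CSV DATA ('http://hadoop-vm:50070/webhdfs/v1/data/part5.csv?op=OPEN') WITH skip = '1';"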


What Is Data Engineering? Skills and Tools Required

In the last decade, as most organizations have undergone digital transformation, the data scientist and the data engineer have developed into two separate jobs, obviously with some overlap. The business generates data constantly from people and products. Every event is a snapshot of company functions (and dysfunctions) such as revenue, losses, third-party partnerships, and goods received. But if the data isn’t explored, there will be no insights gained. The intention of data engineering is to support that process and make the data workable for its consumers. In this article, we’ll explore the definition of data engineering, data engineering skills, what data engineers do and their responsibilities, and the future of data engineering.

Data Engineering: What Is It?

In the world of data, a data scientist is only as good as the data they have access to. Most companies store their data in a variety of formats across databases and text files. This is where data engineering comes in. In simple terms, data engineering means organizing and designing the data, which is done by data engineers. They construct data pipelines that transform that data, organize it, and make it useful. Data engineering is just as significant as data science. However, data engineering requires knowing how to derive value from data, as well as the practical engineering skills to move data from point A to point B without corruption.

A Short Introduction to Apache Iceberg

Apache Iceberg introduces the concept of hidden partitioning, where reading unnecessary partitions is avoided automatically. Data consumers that fire the queries don’t need to know how the table is partitioned or add extra filters to their queries.

A table can be defined as an arrangement of data in rows and columns. In a similar fashion, if you look at it from the big data perspective, the large number of individual files that hold the actual data can be organized in a tabular manner too.
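
As a minimal sketch of what hidden partitioning looks like in practice (assuming a Spark 3 installation with the Iceberg runtime jar on the classpath and an Iceberg catalog named demo already configured; the table and column names are illustrative):

cat > iceberg_demo.sql <<'EOF'
CREATE TABLE demo.db.events (id BIGINT, message STRING, ts TIMESTAMP)
USING iceberg
PARTITIONED BY (days(ts));   -- hidden partitioning: Iceberg derives the day from ts

-- Consumers simply filter on ts; Iceberg prunes the daily partitions automatically,
-- without any partition column or extra filter in the query.
SELECT count(*) FROM demo.db.events
WHERE ts >= TIMESTAMP '2021-01-01 00:00:00';
EOF

spark-sql -f iceberg_demo.sql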

Why Lambda Architecture in Big Data Processing?

Due to the exponential growth of digitalization, the entire globe creates at least 2.5 quintillion (2.5 x 10^18) bytes of data every day, and that is what we denote as big data. Data generation happens everywhere: social media sites, various sensors, satellites, purchase transactions, mobile devices, GPS signals, and much more. With the advancement of technology, there is no sign of data generation slowing down; instead, it will keep growing in massive volumes. All the major organizations, retailers, companies in different verticals, and enterprise products have started focusing on leveraging big data technologies to produce actionable insights, business expansion, growth, etc.

Overview

Lambda Architecture is an excellent design framework for processing huge volumes of data using both streaming and batch methods. The stream processing method analyzes data on the fly, while it is in motion, without persisting it to a storage area, whereas the batch processing method is applied to data at rest, i.e., data already persisted in storage such as databases and data warehousing systems. Lambda Architecture can be utilized effectively to balance latency, throughput, scaling, and fault tolerance, achieving comprehensive and accurate views from batch and real-time stream processing simultaneously.
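
As a rough sketch of the two processing paths (assuming a Spark installation, with the same events landing both in a Kafka topic and in HDFS; every script name, topic, host, and path below is illustrative):

# Batch layer: periodic, complete recomputation over data at rest in HDFS.
spark-submit batch_views.py \
  hdfs://namenode:9000/events/ hdfs://namenode:9000/views/batch/

# Speed layer: continuous, low-latency processing of the same events in motion from Kafka.
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 \
  speed_views.py kafka-broker:9092 events

# Serving layer: queries merge the latest batch view with the incremental real-time view.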

Snapshot: Data Governance and Security Mechanism in Distributed Data Storage System

We are very much aware that the traditional data storage mechanism is incapable of holding the massive volume of data generated at lightning speed for further utilization, even with vertical scaling. Going forward, we anticipate only one fuel, namely DATA, to accelerate rapid growth across all sectors, from business to natural resources to medicine. But the question is how to persist this massive volume of data for processing. The answer is to store the data in a distributed manner on a multi-node cluster, where it can be scaled linearly on demand. That is made physically achievable by the Hadoop Distributed File System (HDFS). Using HDFS, we can store data in a distributed manner (a multi-node cluster where the number of nodes can be increased linearly as the data grows). Using Hive and HBase, we can organize the HDFS data and make it more meaningful as the data becomes queryable.

To accelerate the movement towards growth as mentioned, the next hurdle is to govern this huge volume of persisted data and to address its security implications. In a single statement, data governance can be defined as the consolidation of managing data access, accountability, and security. By default, HDFS does not provide any strong security mechanism to achieve complete governance, but in combination with the following approaches, we can move towards it.

  • Integration with LDAP – To secure read/write operations on the persisted data, appropriate authorization backed by proper authentication is mandatory. Authentication can be achieved in HDFS by integrating with an LDAP server across all the nodes. LDAP is often used as a central repository for user information and as an authentication service. An organization that has ingested huge volumes of data into Hadoop for analysis can define security policies to avoid data theft, leaks, and misuse, and to ensure the right access to data inside HDFS directories, the execution of Hive queries, etc. A user or team needs to be authenticated via the LDAP server before processing or querying data from the cluster. LDAP integration with Hadoop can be done either by using OS-level configuration to read LDAP groups or by explicitly configuring Hadoop to use LDAP-based group mapping.
  • Introducing the Apache Knox gateway – Apache Knox provides a single access point to a multi-node Hadoop cluster for all REST and HTTP interactions. Complex client-side configuration and libraries can be eliminated by using Apache Knox. Besides securing access to data in the cluster, it can also secure job execution in the cluster.
  • Kerberos for authentication – The Kerberos network authentication protocol provides strong authentication for two-tier (client and server) applications. The Kerberos server verifies identities for every request when a client wants to access the Hadoop cluster. The Kerberos database stores and controls all principals and realms. Kerberos uses secret-key cryptography to provide strong user-to-server authentication. A Kerberos server, usually called a Key Distribution Center (KDC), should be installed on one physical host, and its database contains user and service entries such as the user’s principal, maximum ticket validity, maximum renewal time, password expiration, etc. (A minimal client-side sketch follows this list.)
  • Apache Ranger for centralized and comprehensive data security – By integrating Apache Ranger with a multi-node Hadoop cluster, many of the requirements mandatory for governance and security can be fulfilled. It can manage all security-related tasks via centralized security administration, either in a central UI or through REST APIs. Besides, Apache Ranger can be used effectively to perform fine-grained authorization for specific actions and to standardize the authorization method across all Hadoop components. Apache Ranger also provides dynamic column masking as well as row-level filtering through Ranger-specific policies to protect sensitive data from being queried out of Hive tables in real time.
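
As a minimal client-side sketch of the Kerberos flow against a kerberized HDFS (the realm EXAMPLE.COM and the principal analyst are illustrative and would already have been created by the KDC administrator):

kinit analyst@EXAMPLE.COM   # obtain a ticket-granting ticket (TGT) from the KDC
klist                       # confirm the credential cache holds a valid TGT
hdfs dfs -ls /data          # the HDFS client now authenticates to the NameNode via Kerberos
kdestroy                    # discard the tickets when finished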

Basic Understanding of Stateful Data Streaming Supported by Apache Flink

The Apache Flink big data processing platform and its related technologies are maturing to execute data streaming efficiently, which is becoming a major focal point for businesses as it allows them to make decisions quickly.

Data Streaming

By leveraging data streaming applications, we can process and analyze a continuous flow of data without storing it (i.e., the data stays in motion) to uncover any discrepancies, issues, errors, behavioral patterns, etc. that can help you make informed, data-driven decisions.
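
As a minimal sketch of running a streaming job on a local Flink cluster (assuming a standard Flink binary distribution; the bundled streaming WordCount example keeps a running count per word as state, and the input/output paths are illustrative):

./bin/start-cluster.sh                                 # start a local Flink cluster
./bin/flink run examples/streaming/WordCount.jar \
  --input /tmp/notes.txt --output /tmp/word_counts     # submit the streaming WordCount job
./bin/flink list                                       # show scheduled and running jobs
./bin/stop-cluster.sh                                  # tear the cluster down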

A Credible Approach of Big Data Processing and Analysis of Telecom Data

Telecom providers have a treasure trove of captive data, including customer data, CDRs (call detail records), call center interactions, and tower logs, and are metaphorically “sitting on a gold mine.” Ideally, each category of the generated data has the following information:

  • Customer data consolidates customer ID, plan details, demographics, subscribed services, and spending patterns
  • Service data consolidates customer type, customer history, complaint categories, resolved queries, and so on
  • For smartphone subscribers, location data usually consolidates GPS-based data, roaming data, current location, frequently visited locations, etc.

Deep Learning at Alibaba Cloud With Alluxio – Running PyTorch on HDFS

Google’s TensorFlow and Facebook’s PyTorch are two Deep Learning frameworks that have been popular with the open source community. Although PyTorch is still a relatively new framework, many developers have successfully adopted it due to its ease of use.

By default, PyTorch does not support Deep Learning model training directly on data in HDFS, which brings challenges to users who store their data sets in HDFS. These users need to either export HDFS data at the start of each training job or modify the source code of PyTorch to support reading from HDFS. Neither approach is ideal, because both require additional manual work that may introduce additional uncertainties into the training job.
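
One way Alluxio bridges this gap is by mounting HDFS into its namespace and exposing it as a local POSIX path through Alluxio FUSE, so PyTorch reads ordinary files. A minimal sketch, assuming a running Alluxio cluster (the NameNode address, dataset path, and mount points are illustrative):

# 1. Mount an HDFS directory into the Alluxio namespace.
./bin/alluxio fs mount /training-data hdfs://namenode:9000/datasets/imagenet

# 2. Expose the Alluxio namespace as a local POSIX mount via Alluxio FUSE.
./integration/fuse/bin/alluxio-fuse mount /mnt/alluxio /

# 3. PyTorch can now read the dataset as local files, e.g. with
#    torchvision.datasets.ImageFolder("/mnt/alluxio/training-data").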

MapReduce and Yarn: Hadoop Processing Unit Part 1

In my previous article, HDFS Architecture and Functionality, I described the filesystem of Hadoop. Today, we will be learning about its processing unit. There are mainly two mechanisms by which processing takes place in a Hadoop cluster: MapReduce and YARN. In our traditional systems, the major focus is on bringing data to the storage unit. In the Hadoop process, the focus shifts towards bringing processing power to the data to initiate parallel processing. So, here, we will go through MapReduce and, in part two, YARN.

MapReduce

As the name suggests, processing mainly takes place in two steps: mapping and reducing. There is a single master (the JobTracker) that controls job execution on multiple slaves (the TaskTrackers). The JobTracker accepts MapReduce jobs submitted by the client. It pushes map and reduce tasks out to TaskTrackers and also monitors their status. The TaskTrackers' major function is to run the map and reduce tasks. They also manage and store the intermediate output of the tasks.
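
As a minimal sketch of submitting a job (assuming a Hadoop 2.x installation with the bundled examples jar; the jar path varies by distribution and the input file is illustrative):

hdfs dfs -mkdir -p /wordcount/input
hdfs dfs -put notes.txt /wordcount/input

# The map step tokenizes each line into (word, 1) pairs; the reduce step sums the counts per word.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.10.0.jar \
  wordcount /wordcount/input /wordcount/output

hdfs dfs -cat /wordcount/output/part-r-00000   # inspect the reduced output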

The Complete Apache Spark Collection [Tutorials and Articles]

In this edition of "Best of DZone," we've compiled our best tutorials and articles on one of the most popular analytics engines for data processing, Apache Spark. Whether you're a beginner or a long-time user who has run into inevitable bottlenecks, we've got your back!

Before we begin, we'd like to thank those who were a part of this article. DZone has been, and continues to be, a community powered by contributors like you who are eager and passionate to share what they know with the rest of the world.

Things to Know About Big Data Testing

Within a short span of time, data has emerged as one of the world’s most valuable resources. In an interconnected world, thanks to the technology revolution, huge amounts of data are generated every second. Big data is a collection of huge datasets that grow tremendously over time.

This data is so complex and large that traditional database management tools are unable to store and process it efficiently. This is where big data testing comes into the picture. Let us discuss big data testing, its uses, and the process of performing these tests.

Four Different Ways to Write in Alluxio

Alluxio is an open-source data orchestration system for analytics and AI workloads. Distributed applications like Apache Spark or Apache Hive can access Alluxio through its HDFS-compatible interface without code changes. We refer to external storage such as HDFS or S3 as under storage. Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options to read, write, and manage files. This article focuses on describing the different ways to write files to Alluxio and the tradeoffs they make in performance, consistency, and level of fault tolerance compared to HDFS.

Consider an application, such as a Spark job, that saves its output to an external storage service. Writing the job output to the memory layer of a colocated Alluxio worker achieves the best write performance. However, due to the volatility of memory, when a node in Alluxio goes down or restarts, any data in that node’s memory is lost. To prevent data loss, Alluxio provides the ability to write the data to the persistent under storage either synchronously or asynchronously by configuring client-side Write Types. Each Write Type has benefits and drawbacks associated with it. Applications that write to Alluxio storage should consider the different Write Types and perform a cost-benefit analysis to determine which one best suits their requirements.
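
As a minimal sketch of selecting a Write Type from a Spark job (assuming the Alluxio client jar is on Spark's classpath; the script name, master address, and output path are illustrative):

# ASYNC_THROUGH caches the output in Alluxio memory first and persists it to under storage asynchronously.
spark-submit \
  --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=ASYNC_THROUGH' \
  --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.file.writetype.default=ASYNC_THROUGH' \
  my_job.py alluxio://alluxio-master:19998/output

# Other values: MUST_CACHE (memory only, fastest, no persistence),
# CACHE_THROUGH (cache and persist synchronously), THROUGH (persist only, no caching).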

Arm Twisting Apache NiFi

Introduction

Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems.

Early this year, I created a generic, metadata-driven data offloading framework using Talend. While I was championing that tool, many accounts raised concerns regarding the Talend license. While some were apprehensive about the additional cost, many others questioned the tool itself, because their accounts already had licenses for other competing ETL tools like DataStage and Informatica (to name a few). A few accounts also wanted to know if the same concept of offloading could be made available using NiFi. Therefore, it was most logical to explore NiFi.

Apache Flume and Data Pipelines

What Is Apache Flume?

Apache Flume is an efficient, distributed, reliable, and fault-tolerant data-ingestion tool. It facilitates the streaming of huge volumes of log files from various sources (like web servers) into the Hadoop Distributed File System (HDFS), distributed databases such as HBase on HDFS, or even destinations like Elasticsearch, at near-real-time speeds. In addition to streaming log data, Flume can also stream event data generated by web sources like Twitter, Facebook, and Kafka brokers.
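
As a minimal sketch of such a pipeline (assuming a Flume installation; the agent name, log path, and NameNode address are illustrative), an agent that tails a web-server access log into HDFS could be configured and started like this:

cat > tail-to-hdfs.conf <<'EOF'
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the web-server access log.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink.
a1.channels.c1.type = memory

# Sink: write events into date-partitioned HDFS directories.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:9000/flume/events/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.channel = c1
EOF

flume-ng agent --conf conf --conf-file tail-to-hdfs.conf --name a1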

The History of Apache Flume

Apache Flume was developed by Cloudera to provide a way to quickly and reliably stream large volumes of log files generated by web servers into Hadoop. There, applications can perform further analysis on data in a distributed environment. Initially, Apache Flume was developed to handle only log data. Later, it was equipped to handle event data as well.

Comparing Apache Hive and Spark

Introduction

Hive and Spark are two very popular and successful products for processing large-scale data sets. In other words, they do big data analytics. This article focuses on describing the history and various features of both products. A comparison of their capabilities will illustrate the various complex data processing problems these two products can address.
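
As a small illustration of the overlap (assuming a shared Hive metastore with a table named web_logs, HiveServer2 on its default port, and both CLIs installed; all names are illustrative), the same query can be run through either engine:

# Through Hive: HiveServer2 compiles and executes the query as a Hive job.
beeline -u jdbc:hive2://localhost:10000 \
  -e "SELECT status, COUNT(*) FROM web_logs GROUP BY status;"

# Through Spark: the same table, resolved via the Hive metastore, executed by Spark.
spark-sql -e "SELECT status, COUNT(*) FROM web_logs GROUP BY status;"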


The Practice of Alluxio in Ctrip Real-Time Computing Platform

Today, real-time computing platforms are becoming increasingly important in many organizations. In this article, we will describe how Ctrip.com applies Alluxio to accelerate Spark SQL real-time jobs and maintain the jobs’ consistency during downtime of our internal data lake (HDFS). In addition, we leverage Alluxio as a caching layer to reduce the workload pressure on our HDFS NameNode.

Background and Architecture

Ctrip.com is the largest online travel booking website in China. It provides online travel services including hotel reservations, transportation ticketing, and packaged tours, and it handles hundreds of millions of online visits every day. Driven by this high demand, a massive amount of data is stored on big data platforms in different formats. Handling nearly 300,000 offline and real-time analytics jobs every day, our main Hadoop cluster is at the scale of a thousand servers, with more than 50 PB of data stored, increasing by 400 TB daily.

HDFS Architecture and Functioning

First of all, thank you for the overwhelming response to my previous article, Big Data and Hadoop: An Introduction, in which I gave a brief overview of Hadoop and its benefits. If you have not read it yet, please spend some time getting a glimpse of this rapidly growing technology. In this article, we will take a deep dive into the file system used by Hadoop, called HDFS (Hadoop Distributed File System).

HDFS is the storage part of the Hadoop system. It is a block-structured file system in which each file is divided into blocks of a predetermined size. These blocks are stored across a cluster of one or more machines. HDFS works with two types of nodes: the NameNode (master) and DataNodes (slaves). So let's dive in.
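
As a minimal sketch of seeing this structure on a running cluster (assuming a file has already been uploaded, e.g. /data/part5.csv from earlier; the path is illustrative):

hdfs fsck /data/part5.csv -files -blocks -locations   # list the file's blocks and the DataNodes holding each replica
hdfs dfsadmin -report                                 # the NameNode's view of DataNodes, capacity, and replication
hdfs getconf -confKey dfs.blocksize                   # the configured block size (128 MB by default in Hadoop 2)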