Data Lakes, Warehouses, and Lakehouses: Which Is Best?

Twenty years ago, your data warehouse probably wouldn’t have been voted hottest technology on the block. These bastions of the office basement were long associated with siloed data workflows, on-premises computing clusters, and a limited set of business-related tasks (e.g., processing payroll and storing internal documents).

Now, with the rise of data-driven analytics, cross-functional data teams, and most importantly, the cloud, the phrase “cloud data warehouse” is nearly synonymous with agility and innovation.

When NOT To Use Apache Kafka

Apache Kafka is the de facto standard for event streaming to process data in motion. As its adoption grows significantly across all industries, I get asked a very valid question every week: When do I NOT use Apache Kafka? What limitations does the event streaming platform have? When does Kafka simply not provide the needed capabilities? How do I qualify Kafka out as not the right tool for the job?

This blog post explores the DOs and DON'Ts. Separate sections explain when to use Kafka, when NOT to use Kafka, and when to MAYBE use Kafka.

Data Fabric: What Is It and Why Do You Need It?

Insight-driven businesses have the edge over others; they grow at an average of more than 30% annually. Noting this pattern, modern enterprises are trying to become data-driven organizations and get more business value out of their data. But the rise of the cloud, the emergence of the Internet of Things (IoT), and other factors mean that data is not limited to on-premises environments. In addition, there are voluminous amounts of data, many data types, and multiple storage locations. As a consequence, managing data is getting more difficult than ever.

One of the ways organizations are addressing these data management challenges is by implementing a data fabric. Using a data fabric is a viable strategy to help companies overcome the barriers that previously made it hard to access data and process it in a distributed data environment. It empowers organizations to manage mounting amounts of data with more efficiency. Data fabric is one of the more recent additions to the lexicon of data analytics. 

Data Fabric vs. Data Lake: Operational Comparison

This article focuses on which is the most appropriate big data store for high-scale, real-time, operational use cases: the data fabric or the data lake. It also discusses data warehouses, as well as relational and non-relational databases.

What Are Operational Use Cases?

Data-intensive enterprises are driven by a broad array of real-time use cases requiring a high-scale, high-speed data architecture that can support millions of concurrent transactions. Examples include:

Here’s How You Can Purge Big Data From Unstructured Data Lakes

Without a doubt, big data keeps getting bigger with the passage of time, going above and beyond. Here is some evidence to back that up.

According to the big data and business analytics report from Statista, global cloud IP traffic will reach approximately 19.5 zettabytes in 2021. Moreover, the big data market will reach 274.3 billion US dollars by 2022, growing at a five-year compound annual growth rate (CAGR) of 13.2%. Plus, Forbes predicted that over 150 trillion gigabytes (150 zettabytes) of real-time data will be required by the year 2025. Also, Forbes found that more than 95% of companies need some assistance with managing unstructured data, while 40% of organizations affirmed that they need to deal with big data more frequently.

Building a High-Performance Data Lake Using Apache Hudi and Alluxio at T3Go

T3Go is China’s first platform for smart travel based on the Internet of Vehicles. In this article, Trevor Zhang and Vino Yang from T3Go describe the evolution of their data lake architecture, built on cloud-native or open-source technologies including Alibaba OSS, Apache Hudi, and Alluxio. Today, their data lake stores petabytes of data, supporting hundreds of pipelines and tens of thousands of tasks daily. It is essential for business units at T3Go including Data Warehouse, Internet of Vehicles, Order Dispatching, Machine Learning, and self-service query analysis.

In this blog, you will see how we slashed data ingestion time by half using Hudi and Alluxio. Furthermore, data analysts using Presto, Hudi, and Alluxio saw queries speed up by 10 times. We built our data lake based on data orchestration for multiple stages of our data pipeline, including ingestion and analytics.
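As a rough sketch of what ingesting into Hudi through an Alluxio path can look like with Spark's Java API (the table name, key fields, and alluxio:// paths below are illustrative placeholders, not T3Go's actual configuration):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class HudiOnAlluxioSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("hudi-on-alluxio-sketch")
                    // Hudi requires Kryo serialization in Spark
                    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                    .getOrCreate();

            // Raw events previously landed in the lake (path is a placeholder)
            Dataset<Row> events = spark.read().parquet("alluxio://alluxio-master:19998/raw/ride_events");

            events.write()
                    .format("hudi")
                    // Record key and precombine field stand in for real schema columns
                    .option("hoodie.datasource.write.recordkey.field", "event_id")
                    .option("hoodie.datasource.write.precombine.field", "event_ts")
                    .option("hoodie.table.name", "ride_events_hudi")
                    .mode(SaveMode.Append)
                    // Writing through Alluxio keeps the cache layer between compute and object storage
                    .save("alluxio://alluxio-master:19998/lake/ride_events_hudi");
        }
    }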

Serverless Kafka in a Cloud-Native Data Lake Architecture

Apache Kafka became the de facto standard for processing data in motion. Kafka is open, flexible, and scalable. Unfortunately, the latter makes operations a challenge for many teams. Ideally, teams can use a serverless Kafka SaaS offering to focus on business logic. However, hybrid scenarios require a cloud-native platform that provides automated and elastic tooling to reduce the operations burden. This blog post explores how to leverage cloud-native and serverless Kafka offerings in a hybrid cloud architecture. We start from the perspective of data at rest with a data lake and explore its relation to data in motion with Kafka.

Data at Rest - Still the Right Approach?

Data at rest means storing data in a database, data warehouse, or data lake. In many use cases this means the data is processed too late - even if a real-time streaming component (like Kafka) ingests it. The processing is still a web service call, SQL query, or MapReduce batch job away from providing an answer to your problem.
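By contrast, processing data in motion means reacting to each event as it arrives instead of querying a store after the fact. A minimal consumer sketch (the broker address and topic name are invented for illustration):

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    public class DataInMotionSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // placeholder broker address
            props.put("group.id", "sensor-alerts");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("sensor-readings"));  // placeholder topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        // Act while the event is still fresh instead of waiting for a later batch query
                        System.out.printf("reacting to %s=%s%n", record.key(), record.value());
                    }
                }
            }
        }
    }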

How to Run SQL Queries With Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes, handling both interactive queries and batch workloads across multiple data sources. This tutorial will show you how to run SQL queries with Presto (running on Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
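For example, a federated join through the Presto JDBC driver might look like the sketch below. The catalog, schema, and table names are placeholders; it assumes a bigquery catalog has already been configured with the BigQuery connector and that the presto-jdbc jar is on the classpath:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PrestoFederatedQuerySketch {
        public static void main(String[] args) throws Exception {
            // Coordinator address, user, and default catalog/schema are illustrative
            String url = "jdbc:presto://localhost:8080/hive/default";
            try (Connection conn = DriverManager.getConnection(url, "analyst", null);
                 Statement stmt = conn.createStatement();
                 // Join a Hive table with a BigQuery table via fully qualified catalog names
                 ResultSet rs = stmt.executeQuery(
                         "SELECT o.order_id, c.segment "
                       + "FROM hive.sales.orders o "
                       + "JOIN bigquery.crm.customers c ON o.customer_id = c.customer_id")) {
                while (rs.next()) {
                    System.out.println(rs.getString("order_id") + " -> " + rs.getString("segment"));
                }
            }
        }
    }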

Hands-on Presto Tutorial: Presto 101

In this blog, we'll show you how to get started with Presto, the open source SQL query engine for the data lake. By the end, you'll be able to run Presto locally on your machine.

Presto Installation

Presto can be installed manually or by using Docker images.
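For example, a single-node sandbox can be started from a community Docker image (the prestodb/presto image on Docker Hub is assumed here; your image name and tag may differ):

    docker pull prestodb/presto:latest
    docker run -d --name presto -p 8080:8080 prestodb/presto:latest
    # Once the container is up, the Presto UI should be reachable at http://localhost:8080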

How Carbon Uses PrestoDB With Ahana to Power Real-Time Customer Dashboards

The author, Jordan Hoggart, was not compensated by Ahana for this review.

The Background

At the base of Carbon’s real-time, first-party data platform is our analytics component, which combines a range of behavioral, contextual, and revenue data and displays it within a dashboard as a series of charts, graphs, and breakdowns to give a visual representation of the most important actionable data. Whilst we pre-calculate as much of the information as possible, there are different filters that allow users to drill deeper into the data, which makes querying critical.

Data Lake and Data Mesh Use Cases

As data mesh advocates suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years as she has held leadership positions at Couchbase, Kinetica, and Alluxio.

Definitions

  • A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format, in addition to the originating data stores.
  • A data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. The mesh is an abstraction layer that sits atop data sources and provides access.
According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case. 

Data Sources

Not all data sources are equal. There are different dimensions of data:
  • Amount of data being stored
  • Importance of the data
  • Type of data
  • Type of analysis to be supported
  • Longevity of the data being stored
  • Cost of managing and processing the data
Each data source has its purpose. Some are built for fast access for small amounts of data, some are meant for real transactions, some are meant for data that applications need, and some are meant for getting insights on large amounts of data. 

AWS S3

Things changed when AWS commoditized the storage layer with the AWS S3 object-store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of this data to cloud object stores and building data lakes, where it can be analyzed in many different ways.

Because of the low cost, enterprises can store all of their data — enterprise, third-party, IoT, and streaming — in an S3 data lake. However, the data cannot be processed in the object store itself. You need engines on top, like Hive, Presto, and Spark, to process it. Hadoop tried to do this with limited success; Presto and Spark have solved the SQL-on-S3 query problem.
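As a rough illustration of the SQL-on-S3 pattern with Spark's Java API (the bucket, path, and column names are invented, and it assumes S3 credentials and the hadoop-aws/s3a connector are already configured):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class SqlOnS3Sketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("sql-on-s3-sketch")
                    .getOrCreate();

            // Register Parquet files sitting in the S3 data lake as a temporary view
            Dataset<Row> clicks = spark.read().parquet("s3a://example-lake/raw/clickstream/");
            clicks.createOrReplaceTempView("clickstream");

            // Query the object-store data with plain SQL, no load into a database required
            spark.sql("SELECT country, count(*) AS events "
                    + "FROM clickstream GROUP BY country ORDER BY events DESC")
                 .show();
        }
    }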

Data in Transition

Different enterprises are able to get their data into the data lake at different rates. Innovators can land their data with a 30-minute lag time, while laggards may take a week. This is where data mesh, or federated access, comes in.

Today, 5 to 10% of compute is on the mesh workload while 90 to 95% are SQL queries to the data lake. All data is eventually in the data lake; however, data that's still in transition is where the mesh workload lives. 
 
There are two different use cases for the data lake and the data mesh. If your primary goal is to be data-driven, then a data lake approach should be the primary focus. If it's important to analyze data in transition, then augmenting a data lake with a data mesh would make sense.

While data mesh is great for data in motion, it does not eliminate the need for other data sources like RDBMS and Elasticsearch as they are serving different purposes for the applications they are supporting.

When Small Files Crush Big Data — How to Manage Small Files in Your Data Lake

Big Data faces an ironic small file problem that hampers productivity and wastes valuable resources.

If not managed well, it slows down the performance of your data systems and leaves you with stale analytics. This kind of defeats the purpose, doesn’t it? HDFS stores small files inefficiently, leading to inefficient NameNode memory utilization, excessive RPC calls, degraded block-scanning throughput, and reduced application-layer performance. If you are a big data administrator on any modern data lake, you will invariably come face to face with the problem of small files. Distributed file systems are great, but let's face it: the more files your data is split into, the greater your overhead when reading them. So the idea is to optimize the file size to best serve your use case, while also actively optimizing your data lake.
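One common mitigation is periodic compaction: read a directory full of tiny files and rewrite it as a handful of larger ones. A simplified Spark sketch (the paths and target file count are illustrative; the right size depends on your block size and query engine):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class SmallFileCompactionSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("small-file-compaction-sketch")
                    .getOrCreate();

            // A partition that has accumulated thousands of tiny Parquet files
            Dataset<Row> events = spark.read().parquet("hdfs:///lake/events/date=2021-06-01");

            // Rewrite the partition as a few large files into a staging path,
            // then swap it in place of the original once the job succeeds
            events.coalesce(8)
                  .write()
                  .mode(SaveMode.Overwrite)
                  .parquet("hdfs:///lake/events_compacted/date=2021-06-01");
        }
    }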

Data Lake Architecture

With the rapid advancement in technologies, companies are now in search of a better way to ensure that organizational data and information are kept safe and organized. One way businesses are doing this is by using data lakes to create a centralized data management infrastructure that allows an organization to manage, store, analyze, and classify its data.

The concept of data lake architecture has recently become a hot topic. These days, businesses use data to define their internal business objectives and metrics. Data lakes offer agile analytics to measure your continually evolving business. Data lakes have really become the cornerstones of modern big data architecture.

Geo-Distributed Data Lakes Explained

Geo-Distributed Data Lake is quite the mouthful. It’s a pretty interesting topic and I think you will agree after finishing this breakdown. There is a lot to say about how awesome it is to combine the flexibility of a data lake with the power of a distributed architecture, but I’ll get more into the benefits of both as a joint solution later. To start, I want to look at geo-distributed data lakes in two parts before we marry them together, for my non-developer brain that made the most sense! No time to waste, let’s kick things off with the one and only… data lakes.

It’s a Data LAKE, Not Warehouse!

It shouldn’t be a shock to the system to point out that we are living in a data-driven world going into 2021. Because of this, 'data lake' is a fitting term for the amount of data companies are collecting. In my opinion, we could probably start calling them data oceans, expansive and seemingly never-ending. So what is a data lake exactly?

Data Lakes: All You Need to Know

Every organization in the modern world relies on data to capitalize on opportunities, solve business challenges, and make informed decisions for their business. However, with the increasing volume, variety, and velocity of data generated, companies in every industry are continuously seeking innovative solutions for storing, processing, and managing their data. Various technologies are being developed to support the big data revolution and address common challenges in data management. 

One such burgeoning technology, and a buzzword in today’s world where data is the ultimate foundation of business, is data lakes. This article provides more details about data lakes by explaining what they are and their importance. You will also learn more about the data lake architecture and the key best practices for deploying data lakes. 

Databricks Delta Lake Using Java

Delta Lake is an open source release by Databricks that provides a transactional storage layer on top of data lakes. In real-world systems, the data lake can be Amazon S3, Azure Data Lake Store/Azure Blob Storage, Google Cloud Storage, or the Hadoop Distributed File System (HDFS).

Delta Lake acts as a storage layer that sits on top of the data lake and brings additional features that a data lake alone cannot provide.
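For example, writing and reading a Delta table through Spark's Java API looks roughly like the following (the paths are placeholders, and it assumes the delta-core package is on the classpath):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class DeltaLakeJavaSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("delta-lake-java-sketch")
                    .getOrCreate();

            // Write raw JSON as a Delta table; the transaction log makes the write atomic
            Dataset<Row> trips = spark.read().json("s3a://example-lake/raw/trips/");
            trips.write().format("delta").mode(SaveMode.Overwrite).save("s3a://example-lake/delta/trips");

            // Read it back; older versions remain queryable via the versionAsOf option
            Dataset<Row> current = spark.read().format("delta").load("s3a://example-lake/delta/trips");
            current.show();
        }
    }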

Unleashing Data With Data Lakes [Webinar Sign-up]

Today, most organizations recognize that their data is a valuable resource. Yet, in most cases, they fail to realize much of the potential value hidden therein. The inherent value of data is mostly hidden from view until it is activated by the presence of other synergistic data. Two datasets with little apparent utility can often be combined to form a unique value proposition. This data alchemy can produce new insights, but can also enable completely new businesses or industries. But when we try to apply traditional data processing techniques to unlock the value of data, we not only fail to glean the value expected, but we often inhibit the very process by which the value is discovered.

A data lake is an architecture and methodology for the continuous extraction of value from complex and diverse data resources; it enables businesses to keep uncovering new value even as earlier insights are put into production.

Colocation vs. In-House Data Center: What’s Better for Your Business?

Data storage and management are integral to the daily operations of a business. As your business grows, the question of data storage must be addressed. According to a recent Gartner report, the worldwide cloud services market is projected to grow by nearly 18 percent in 2019, totaling $214.3 billion.

Despite the growing popularity of public cloud services, there are numerous convenient, affordable, and safe ways to store enterprise data, such as colocation and in-house data centers. Depending on the specific needs of your business, you may choose to prioritize factors like data control and overhead costs over convenience or vice versa. So, what is colocation, and how does it compare to in-house data centers?