Data Lake and Data Mesh Use Cases

With data mesh advocates suggesting that the data mesh should replace the monolithic, centralized data lake, I checked in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years, having held leadership positions at Couchbase, Kinetica, and Alluxio.

Definitions

A data lake is a collection of storage instances holding various data assets. These assets are stored as near-exact, or even exact, copies of the source format, and they exist in addition to the originating data stores.
A data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. The mesh is an abstraction layer that sits atop the data sources and provides access to them.

According to Dipti, data lakes and data mesh each have use cases they work well for, but data mesh can’t replace the data lake unless all data sources are created equal, and for many organizations that’s not the case.

Data Sources

Not all data sources are equal. They differ along several dimensions:

Amount of data being stored
Importance of the data
Type of data
Type of analysis to be supported
Longevity of the data being stored
Cost of managing and processing the data

Each data source has its purpose. Some are built for fast access to small amounts of data, some are meant for real transactions, some hold the data that applications need, and some are meant for getting insights from large amounts of data.

AWS S3

Things changed when AWS commoditized the storage layer with the S3 object store, launched in 2006. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of their data to cloud object stores and building data lakes, where the data can be analyzed in many different ways.

Because of the low cost, enterprises can store all of their data (enterprise, third-party, IoT, and streaming) in an S3 data lake. However, the data cannot be processed there. You need engines such as Hive, Presto, and Spark on top to process it. Hadoop tried to do this with limited success. Presto and Spark have solved the problem of running SQL queries against data in S3.
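To make the "engine on top" idea concrete, here is a minimal sketch of the SQL such an engine might be given. It assumes Presto's Hive connector with its `format` and `external_location` table properties. The catalog `hive`, the schema `weblogs`, the table `page_views`, the bucket path, and the helper `daily_views_query` are all hypothetical names made up for illustration. The Python below only builds the SQL text and does not connect to anything.

```python
from datetime import date

# Hypothetical names throughout: the catalog "hive", schema "weblogs",
# table "page_views", and the bucket path are placeholders for illustration.
# The DDL assumes Presto's Hive connector, which maps a table onto files
# that already sit in object storage.
CREATE_TABLE_SQL = """
CREATE TABLE hive.weblogs.page_views (
    user_id   bigint,
    url       varchar,
    view_date date
)
WITH (
    format = 'PARQUET',
    external_location = 's3://example-bucket/weblogs/page_views/'
)
""".strip()


def daily_views_query(day: date) -> str:
    """Return a SQL query that counts page views per URL for one day."""
    # date.isoformat() yields YYYY-MM-DD, so the literal is safe to embed.
    return (
        "SELECT url, count(*) AS views\n"
        "FROM hive.weblogs.page_views\n"
        f"WHERE view_date = DATE '{day.isoformat()}'\n"
        "GROUP BY url\n"
        "ORDER BY views DESC"
    )


if __name__ == "__main__":
    print(CREATE_TABLE_SQL)
    print()
    print(daily_views_query(date(2021, 6, 1)))
```

The point of the sketch is the division of labor: S3 only holds the files, while the table definition and the query live in the engine.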

Data in Transition

Different enterprises get their data into the data lake at different rates. Innovators land their data in the lake with a 30-minute lag, while laggards may take a week. This is where data mesh, or federated access, comes in.

Today, 5 to 10% of compute goes to the mesh workload, while 90 to 95% goes to SQL queries against the data lake. All data eventually lands in the data lake; the mesh workload covers the data that is still in transition.
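The split between the two workloads can be sketched as a simple routing rule: data older than an organization's ingestion lag is assumed to have landed in the lake, and anything newer is reached through federated access to the source. This is an illustrative simplification, not a description of how any particular product routes queries. The function name `choose_path` and the labels `"data_lake"` and `"federated"` are hypothetical.

```python
from datetime import datetime, timedelta


def choose_path(event_time: datetime, now: datetime, lake_lag: timedelta) -> str:
    """Pick where a query for data created at event_time should run."""
    # Data at least as old as the ingestion lag is assumed to be in the lake.
    if now - event_time >= lake_lag:
        return "data_lake"
    # Newer data is still in transition, so it is read from the source system.
    return "federated"


if __name__ == "__main__":
    now = datetime(2021, 6, 1, 12, 0)
    two_hours_old = now - timedelta(hours=2)

    # An "innovator" with a 30-minute lag already has this data in the lake.
    print(choose_path(two_hours_old, now, timedelta(minutes=30)))

    # A "laggard" with a one-week lag still has to reach back to the source.
    print(choose_path(two_hours_old, now, timedelta(days=7)))
```

Under this rule, the shorter the ingestion lag, the smaller the share of queries that need the federated path.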

Data lake and data mesh serve two different use cases. If your primary goal is to be data-driven, then a data lake should be the primary focus. If it's also important to analyze data in transition, then augmenting the data lake with a data mesh makes sense.

While data mesh is great for data in motion, it does not eliminate the need for other data sources such as relational databases and Elasticsearch, which serve different purposes for the applications they support.