The Definitive Guide to Building a Data Mesh With Event Streams

Data mesh. This oft-discussed architecture has no shortage of blog posts, conference talks, podcasts, and discussions. One thing you may have found lacking is a concrete guide on precisely how to get started building your own data mesh implementation. We have you covered. In this blog post, we’ll show you how to build a data mesh using event streams, highlighting our design decisions and the key benefits and challenges you’ll need to consider along the way. In fact, we’ll go one better: we’ve built a data mesh prototype that you can explore to see what this looks like in action, or fork to bootstrap a data mesh for your own organization.

Data mesh is technology agnostic, so there are a few different ways you can go about building one. The canonical approach is to build the mesh using event streaming technology, which provides a secure, governed, real-time mechanism for moving data between different points in the mesh.
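To make that concrete, here is a minimal sketch (in Python, using the confluent-kafka client) of a domain team publishing its data product as an event stream. The broker address, topic name, and "orders" payload are illustrative assumptions, not details of the prototype described above.

```python
# A minimal sketch, not the actual prototype. Assumes a reachable Kafka broker;
# the topic name and event payload are illustrative only.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def delivery_report(err, msg):
    # Called once per message to confirm delivery or surface an error.
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [partition {msg.partition()}]")

# Hypothetical "orders" data product owned by the sales domain.
order_event = {"order_id": "o-1001", "customer_id": "c-42", "total": 99.95}

producer.produce(
    topic="sales.orders.v1",          # versioned, domain-owned topic
    key=order_event["order_id"],
    value=json.dumps(order_event),
    callback=delivery_report,
)
producer.flush()
```

Keying events by a stable identifier and versioning the topic name are common conventions that make the stream easier for downstream domains to consume and evolve against.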

Streaming Data Exchange With Kafka and a Data Mesh in Motion

Data Mesh is a new architecture paradigm that gets a lot of buzz these days. Every data and platform vendor describes how to build the best Data Mesh with their platform. The Data Mesh story includes cloud providers like AWS, data analytics vendors like Databricks and Snowflake, and event streaming solutions like Confluent. This blog post takes a deeper look at the paradigm to explore why no single technology is a perfect fit for building a Data Mesh. Examples show why an open, scalable, decentralized real-time platform like Apache Kafka is often the heart of the Data Mesh infrastructure, complemented by many other data platforms, to solve business problems.

Data at Rest vs. Data in Motion

Before we get into the Data Mesh discussion, it is crucial to clarify the difference and relevance of Data at Rest and Data in Motion. Data at Rest sits in a database, data warehouse, or object store and is queried after the fact; Data in Motion is continuously produced and consumed as event streams, so it can be processed while it is still fresh.
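As a rough illustration of the difference, the sketch below contrasts querying data at rest (a table in a local database) with consuming data in motion (events read continuously from a Kafka topic). The database file, table, topic, and broker address are all hypothetical.

```python
# Illustrative only: the database file, table, topic, and broker address are
# hypothetical, not details from the article.
import json
import sqlite3
from confluent_kafka import Consumer

# Data at rest: a stored dataset queried after the fact.
conn = sqlite3.connect("warehouse.db")
(count,) = conn.execute("SELECT COUNT(*) FROM orders").fetchone()
print(f"orders at rest: {count}")

# Data in motion: events processed continuously as they arrive.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "orders-analytics",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sales.orders.v1"])
try:
    while True:                      # Ctrl-C to stop the stream processor
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())
        print(f"order in motion: {order['order_id']}")
finally:
    consumer.close()
```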

Importance of Data Discovery in Data Mesh Architecture


Data Discovery

Data Mesh/Discovery — Panel Recap

Recently, I came across a great panel hosted by Data Mesh Learning in collaboration with the Open Source Data podcast, discussing the significance of data discovery in data mesh architecture along with other important issues surrounding data mesh delivery.

The panel consisted of experts including Shinji Kim, CEO of Select Star; Sophie Watson, Principal Data Scientist at Red Hat; Mark Grover, Founder of Stemma; and Shirshanka Das, CEO of Acryl Data.

Data Lake and Data Mesh Use Cases

As data mesh advocates have begun to suggest that the data mesh should replace the monolithic, centralized data lake, I wanted to check in with Dipti Borkar, co-founder and Chief Product Officer at Ahana. Dipti has been a tremendous resource for me over the years, having held leadership positions at Couchbase, Kinetica, and Alluxio.

Definitions

  • A data lake is a collection of storage instances holding various data assets. These assets are stored in a near-exact, or even exact, copy of the source format, in addition to the originating data stores.
  • A data mesh is a type of data platform architecture that embraces the ubiquity of data in the enterprise by leveraging a domain-oriented, self-serve design. The mesh is an abstraction layer that sits atop data sources and provides access to them.
According to Dipti, while data lakes and data mesh both have use cases they work well for, data mesh can’t replace the data lake unless all data sources are created equal — and for many, that’s not the case. 

Data Sources

Not all data sources are equal. There are several dimensions along which data differs:
  • Amount of data being stored
  • Importance of the data
  • Type of data
  • Type of analysis to be supported
  • Longevity of the data being stored
  • Cost of managing and processing the data
Each data source has its purpose. Some are built for fast access to small amounts of data, some are meant for transactional workloads, some are meant for data that applications need, and some are meant for deriving insights from large amounts of data. 

AWS S3

Things changed when AWS commoditized the storage layer with the S3 object store 15 years ago. Given the ubiquity and affordability of S3 and other cloud storage, companies are moving most of their data to cloud object stores and building data lakes, where it can be analyzed in many different ways.

Because of the low cost, enterprises can store all of their data — enterprise, third-party, IoT, and streaming — in an S3 data lake. However, S3 itself cannot process the data; you need engines on top, such as Hive, Presto, and Spark. Hadoop tried to do this with limited success, while Presto and Spark have largely solved the problem of running SQL over data in S3.
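For example, a minimal Spark SQL sketch along these lines might look as follows. The bucket, path, and column names are hypothetical, and reading via s3a assumes the cluster already has the Hadoop AWS libraries and credentials configured.

```python
# A hedged sketch of SQL over S3 with Spark. Bucket, path, and columns are
# hypothetical assumptions, not details from the article.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-query").getOrCreate()

# The raw data sits at rest in the object store as Parquet files.
orders = spark.read.parquet("s3a://example-data-lake/orders/")
orders.createOrReplaceTempView("orders")

# The query engine, not S3 itself, does the processing.
daily_totals = spark.sql("""
    SELECT order_date, SUM(total) AS revenue
    FROM orders
    GROUP BY order_date
    ORDER BY order_date
""")
daily_totals.show()
```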

Data in Transition

Different enterprises are able to get their data into the data lake at different rates. Innovators can land their data with a 30-minute lag time, while laggards may take a week. This is where data mesh, or federated access, comes in.

Today, 5 to 10% of compute goes to the mesh workload, while 90 to 95% is SQL queries against the data lake. All data eventually ends up in the data lake; data that's still in transition is where the mesh workload lives.
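To illustrate what a federated, mesh-style query could look like with Presto, here is a hedged sketch using the presto-python-client. The coordinator host, catalogs, schemas, and table names are assumptions for illustration only.

```python
# A hedged sketch of federated access across sources. Coordinator host,
# catalogs, schemas, and table names are assumptions, not article details.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.internal",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Join data already landed in the lake (hive catalog over S3) with operational
# data still in transition (mysql catalog) in a single query.
cur.execute("""
    SELECT o.customer_id, SUM(o.total) AS lifetime_spend, c.segment
    FROM hive.sales.orders AS o
    JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    GROUP BY o.customer_id, c.segment
""")
for row in cur.fetchall():
    print(row)
```

Because each table reference is fully qualified with its catalog, the engine can push work down to each source and join the results, which is the essence of the federated access pattern described above.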
 
Data lakes and data mesh serve two different use cases. If your primary goal is to be data-driven, then a data lake approach should be the primary focus. If it's important to analyze data in transition, then augmenting a data lake with a data mesh makes sense.

While data mesh is great for data in motion, it does not eliminate the need for other data stores like RDBMSs and Elasticsearch, as they serve different purposes for the applications they support.