How Do CRDTs Solve Distributed Data Consistency Challenges?

This article looks at why maintaining data consistency in distributed environments is complex. It introduces conflict-free replicated data types (CRDTs) as a way to resolve concurrent data changes.

Common Data Consistency Challenges

Consider a situation where there are several distributed entities that each hold a copy of the same data. Data consistency is maintained if those copies continue to match each other, even when one or more of them are updated.
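To make the idea of copies that "continue to match each other" concrete, here is a minimal sketch of one of the simplest CRDTs, a grow-only counter (G-Counter). The class name `GCounter`, the replica IDs, and the method names are assumptions chosen for illustration; they do not come from the article or from any particular library.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class GCounter:
    """Grow-only counter: each replica only ever increments its own slot."""
    replica_id: str
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        # A G-Counter cannot decrease, so negative amounts are rejected.
        if amount < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        # The counter's value is the sum of every replica's slot.
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Take the per-replica maximum. This operation is commutative,
        # associative, and idempotent, so replicas converge regardless of
        # the order or number of merges.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)


# Two replicas (hypothetical IDs) update their copies concurrently.
replica_a = GCounter("a")
replica_b = GCounter("b")
replica_a.increment(2)
replica_b.increment(3)

# After exchanging state in either direction, both copies agree.
replica_a.merge(replica_b)
replica_b.merge(replica_a)
```

Because the merge takes a per-replica maximum rather than overwriting state, neither concurrent update is lost, and merging the same state twice changes nothing.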

HarperDB: More Than a Database

Introduction

I recently had a very interesting conversation on our podcast with Ron Lewis, the Director of Innovation and Engineering at Lumen Technologies. Ron brought up the notion that HarperDB is more than just a database, and for certain users or projects, HarperDB is not serving as a database at all. How can this be possible?

Database, Explained

Well, what really is a database? Wikipedia states, “In computing, a database is an organized collection of data stored and accessed electronically from a computer system.” Another site simply states that “A database is a systematic collection of data. They support electronic storage and manipulation of data. Databases make data management easy.”

Industries That Need a High Performing Low Latency Distributed Database

Certain industries benefit greatly from high-performing, low-latency, geo-distributed technologies, while other organizations might be more focused on vertically scaling architectures. This depends on numerous factors, including the data pipeline, network, data structure, type of product or solution, and short- and long-term goals. Many databases and tools currently provide vertical scaling capabilities, but few focus on horizontal scaling, and there’s still a need for both.

Latency

Before jumping into specific industries that benefit from high-performing, low-latency, geo-distributed databases (it’s a mouthful, I know), let’s define a few terms. High-performing is pretty self-explanatory, so I’ll skip over that one. For the next term, I’ll refer to my colleague Jacob Cohen’s blog on Geo-Distributed Databases. Latency generally measures the duration between an action and a response. In user-facing applications, that can be narrowed down to the delay between when a user makes a request and when the application responds to that request. So, technologies that enable low latency usually improve performance and response times, leading to improved user experience and cost savings.
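As a rough illustration of that definition, the sketch below times the gap between issuing a request and receiving its response. The `handle_request` function is a hypothetical stand-in for a real application call, and the 10 ms sleep is an arbitrary assumption that simulates work.

```python
import time


def measure_latency(action):
    """Return the action's result and the elapsed time in seconds."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


def handle_request():
    # Hypothetical request handler; the sleep simulates processing
    # and network delay.
    time.sleep(0.01)
    return "ok"


response, latency_seconds = measure_latency(handle_request)
print(f"response={response!r} latency={latency_seconds * 1000:.1f} ms")
```

In practice, latency is usually reported as a distribution (for example, median and tail percentiles over many requests) rather than a single measurement.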

Scala Futures: Concurrency Interpreted!

Futures allow us to run computations off the main thread and to handle values that are still being computed in the background, or have yet to be computed, by attaching callbacks to them.

If you come from a Java background, you might be aware of java.util.concurrent.Future. There are several challenges in using this:

Distributed Balanced Partition-Queues Assignment Using Kubernetes statefulSet

Partitioning a domain is a useful way to achieve scalability of a system. The idea behind partitioning is that instead of putting everything in a single place, you divide the dataset or the work into multiple places based on some attribute, which is usually the identifier of the entity.

This division lets us spread the storage and/or the processing across multiple machines or containers, which gives us the horizontal scaling we were after.
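A minimal sketch of partitioning by entity identifier might look like the following. The function names and the choice of SHA-256 are assumptions made for illustration; the article itself may use a different assignment scheme.

```python
import hashlib
from typing import Dict, Iterable, List


def partition_for(entity_id: str, num_partitions: int) -> int:
    """Map an entity identifier to a partition index in [0, num_partitions)."""
    # A cryptographic digest is used only because it is stable across
    # processes and machines; Python's built-in hash() for strings is
    # randomized per process by default and would not be.
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def assign(entity_ids: Iterable[str], num_partitions: int) -> Dict[int, List[str]]:
    """Group entity identifiers by the partition they map to."""
    buckets: Dict[int, List[str]] = {p: [] for p in range(num_partitions)}
    for entity_id in entity_ids:
        buckets[partition_for(entity_id, num_partitions)].append(entity_id)
    return buckets


# Hypothetical entity identifiers spread over four partitions.
ids = [f"order-{n}" for n in range(20)]
buckets = assign(ids, num_partitions=4)
```

Note that with plain modulo hashing, changing the number of partitions remaps most identifiers, which is one reason systems often reach for consistent hashing or a fixed partition count.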

Smart Pipes and Smart Endpoints With Service Mesh

Microservices communicate heavily over the network. As the number of services in your architecture grows, the risks due to an unreliable network grow too. Handling service-to-service communication within a microservices architecture is challenging. Hence, the recommended solution has been to build services that have dumb pipes and smart endpoints.

The first fallacy from the comprehensive list of 'Eight Fallacies of Distributed Computing' is that the 'Network is reliable.'

Understanding the CAP Theorem

The CAP theorem is a tool used to make system designers aware of the trade-offs involved in designing networked shared-data systems. CAP has influenced the design of many distributed data systems by making designers aware of the wide range of trade-offs they need to consider. Over the years, however, the CAP theorem has also been widely misunderstood and misused as a tool for categorizing databases. There is much misinformation floating around about CAP. Most blog posts on CAP are historical and possibly incorrect.

It is important to understand CAP so that you can identify the misinformation around it.