From Zookeeper to Raft: HA and Fault-Tolerance

Introduction

Alluxio implements a virtual distributed file system that allows accessing independent large data stores with compute engines like Hadoop or Spark through a single interface. Within an Alluxio cluster, the Alluxio master is responsible for coordinating and keeping access to the underlying data stores or filesystems (UFS for short) consistent. The master contains a global snapshot of filesystem metadata, and when a client wants to read or modify a file, it first contacts the master. Given its central role in the system, the master must be fault-tolerant, highly available, strongly consistent, and fast. This blog will discuss the evolution of the Alluxio master from a complex multi-component system using Zookeeper to a simpler and more efficient one using Raft.

The operation of a file system can be thought of as a sequence of commands performed on the system (e.g., create/delete/read/write). Executing these commands one at a time in a single thread gives us a sequential specification for our file system that is easy to reason about and to implement applications on top of. While this sequential specification is simple, the actual implementation of Alluxio’s virtual distributed file system consists of many moving parts all executing concurrently: the master who coordinates the system, the workers who store the file data and act as a cache between the client and the UFS, the UFSs themselves, and the clients who, at the direction of the master, access all other parts of the system. Again here, we may want to reason about each of these individual components as a sequential thread executing one operation at a time, but in reality, they are running complex systems themselves. 

A Tentative Comparison of Fault Tolerance Libraries on the JVM

If you're implementing microservices or not, the chances are that you're calling HTTP endpoints. With HTTP calls, a lot of things can go wrong. Experienced developers plan for this and design beyond just the happy path. In general, fault tolerance encompasses the following features:

  • Retry
  • Timeout
  • Circuit Breaker
  • Fallback
  • Rate Limiter to avoid server-side 429 responses
  • Bulkhead: Rate Limiter limits the number of calls in a determined timeframe, while Bulkhead limits the number of concurrent calls

A couple of libraries implement these features on the JVM. In this post, we will look at Microprofile Fault Tolerance, Failsafe, and Resilience4J.

How To Automate PostgreSQL and repmgr on Vagrant

I often get asked if it's possible to build a resilient system with PostgreSQL.

Considering that resilience should feature cluster high-availability, fault tolerance, and self-healing, it's not an easy answer. But there is a lot to be said about this.

Consistency Through Compensation in Microservices

This article addresses the eventual consistency aspect of transactions in a microservices environment where a transaction spans more than one microservice and where transaction failure midway through is imminent.

Existing Business Use Case

Suppose that we currently have a monolithic order management system that is backed by an RDBMS.

Redis Reconnection Resiliency

Background

It's a world of microservices. Such applications or microservices are required to store data temporarily with frequent and super quick access to avoid disk IO operations using Redis-like in-memory databases. These applications have multiple in-memory database clusters to handle huge amounts of traffic and to avoid request failures. To access this data quickly, applications are required to have the preconfigured, established pooled connections ready for service from the applications.Image title

Problem Statement

Applications built for resiliency have backup options in case of application or infrastructure failures. In-memory database clusters that exist in different data centers on different servers allow for backup connectivity in case of data center or server issues.

Applying Bulkheads and Backpressure Using MicroProfile (Video)

I’ve recorded a video on how to implement bulkheads and backpressure using MicroProfile Fault Tolerance. The idea behind bulkheads is to split applications into several execution units that isolate functionality. In enterprise Java applications, this typically means defining multiple thread pools.

Applying backpressure to clients results in either adding information about the current pressure on the system to the client so that they will react to it, or to explicitly deny the request with a temporary error response.