From Zookeeper to Raft: HA and Fault-Tolerance

Introduction

Alluxio implements a virtual distributed file system that allows accessing independent large data stores with compute engines like Hadoop or Spark through a single interface. Within an Alluxio cluster, the Alluxio master is responsible for coordinating and keeping access to the underlying data stores or filesystems (UFS for short) consistent. The master contains a global snapshot of filesystem metadata, and when a client wants to read or modify a file, it first contacts the master. Given its central role in the system, the master must be fault-tolerant, highly available, strongly consistent, and fast. This blog will discuss the evolution of the Alluxio master from a complex multi-component system using Zookeeper to a simpler and more efficient one using Raft.

The operation of a file system can be thought of as a sequence of commands performed on the system (e.g., create/delete/read/write). Executing these commands one at a time in a single thread gives us a sequential specification for our file system that is easy to reason about and to implement applications on top of. While this sequential specification is simple, the actual implementation of Alluxio’s virtual distributed file system consists of many moving parts all executing concurrently: the master who coordinates the system, the workers who store the file data and act as a cache between the client and the UFS, the UFSs themselves, and the clients who, at the direction of the master, access all other parts of the system. Again here, we may want to reason about each of these individual components as a sequential thread executing one operation at a time, but in reality, they are running complex systems themselves.