Avoid Data Silos in Presto at Meta: The Journey From Raptor to RaptorX

Raptor is a Presto connector (presto-raptor) used to power some critical interactive query workloads at Meta (formerly Facebook). Although it is mentioned in the ICDE 2019 paper Presto: SQL on Everything, it remains somewhat mysterious to many Presto users because no public documentation is available for it. This article sheds some light on the history of Raptor and explains why Meta eventually replaced it with a new architecture based on local caching, namely RaptorX.

The Story of Raptor

Generally speaking, Presto, as a query engine, does not own storage. Instead, connectors are developed to query different external data sources. This framework is very flexible, but in disaggregated compute-storage architectures it is hard to offer low-latency guarantees: network and storage latency make performance variability difficult to avoid. To address this limitation, Raptor was designed as a shared-nothing storage engine for Presto.

Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

Running Presto with Alluxio is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSDs or memory to cache hot datasets close to the Presto workers. Presto supports hash-based soft affinity scheduling to ensure that only one or two copies of the same data are cached in the entire cluster, which improves cache efficiency by allowing more hot data to be cached locally. The hashing algorithm currently used, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.
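To make the resizing problem concrete, the toy sketch below (not Presto's actual implementation; the ring class, replica count, and MD5-based hash are illustrative assumptions) compares how many splits change their preferred worker when one worker joins a four-node cluster, under plain modulo hashing versus a simple consistent-hash ring:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    """Deterministic integer hash for a string key (MD5 chosen for illustration)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted list of (hash point, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node contributes `replicas` virtual points on the ring,
        # which spreads its share of keys around the hash space.
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def locate(self, key):
        # A key belongs to the first ring point clockwise from its hash.
        points = [p for p, _ in self.ring]
        idx = bisect.bisect(points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

splits = [f"split-{i}" for i in range(10_000)]
old_workers = ["w1", "w2", "w3", "w4"]
new_workers = old_workers + ["w5"]

# Modulo hashing: adding one worker remaps most splits (~80% here),
# invalidating most of the cluster's cached data.
moved_mod = sum(
    _hash(s) % len(old_workers) != _hash(s) % len(new_workers) for s in splits
) / len(splits)

# Consistent hashing: only roughly 1/len(new_workers) of splits move.
ring_old = ConsistentHashRing(old_workers)
ring_new = ConsistentHashRing(new_workers)
moved_ch = sum(
    ring_old.locate(s) != ring_new.locate(s) for s in splits
) / len(splits)

print(f"modulo remapped: {moved_mod:.0%}, consistent remapped: {moved_ch:.0%}")
```

The key property is that a consistent-hash ring only reassigns the keys that fall into the new node's arcs, leaving everything else cached where it was.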

Soft Affinity Scheduling

Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (its preferred node). The mapping from a split to a Presto worker is computed by a hash function on the split, ensuring that the same split is always hashed to the same worker. The first time a split is processed, its data is cached on the preferred worker node. When subsequent queries process the same split, they are scheduled to the same worker node again; since the data is already cached locally, no remote read is necessary.
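The split-to-worker mapping above can be sketched in a few lines. This is a minimal illustration, not Presto's real scheduler code; the function name `preferred_node` and the MD5-based hash are assumptions made for the example:

```python
import hashlib

def preferred_node(split_id: str, workers: list) -> str:
    """Pick a split's preferred worker with a simple hash-mod scheme.

    Hashing the split identifier deterministically guarantees that the
    same split always maps to the same worker while the worker list is
    unchanged, so repeated queries hit that worker's local cache.
    """
    digest = int(hashlib.md5(split_id.encode()).hexdigest(), 16)
    return workers[digest % len(workers)]

workers = ["worker-1", "worker-2", "worker-3"]
# The same split always resolves to the same preferred worker:
assert preferred_node("hdfs://warehouse/part-0001", workers) == \
       preferred_node("hdfs://warehouse/part-0001", workers)
```

The scheduling is "soft" because the preferred node is only a hint: if that worker is overloaded or unavailable, the split can still be run elsewhere, at the cost of a remote read.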