data locality | The Blog Pros

Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

Running Presto with Alluxio is gaining popularity in the community. The combination avoids the long latency of reading data from remote storage by using SSD or memory to cache hot datasets close to Presto workers. Presto supports hash-based soft affinity scheduling, which ensures that only one or two copies of the same data are cached across the entire cluster; this improves cache efficiency by allowing more hot data to be cached locally. The current hashing algorithm, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.
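To make the cluster-size problem concrete, here is a minimal Python sketch, not Presto's actual scheduler code. It assumes the existing scheme behaves like a simple "hash modulo worker count" mapping (an assumption made for illustration) and compares it with a textbook consistent hash ring using virtual nodes. The worker names, split paths, `stable_hash`, and `ConsistentHashRing` are all hypothetical.

```python
import bisect
import hashlib


def stable_hash(key):
    """Deterministic 64-bit hash; Python's built-in hash() is salted per process."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


def modulo_worker(split, workers):
    """Assumed baseline: pick a worker by hash modulo the current cluster size."""
    return workers[stable_hash(split) % len(workers)]


class ConsistentHashRing:
    """Textbook hash ring: each worker owns several points (virtual nodes)."""

    def __init__(self, workers, virtual_nodes=100):
        self._points = sorted(
            (stable_hash(f"{worker}#{i}"), worker)
            for worker in workers
            for i in range(virtual_nodes)
        )
        self._hashes = [h for h, _ in self._points]

    def worker_for(self, split):
        # First ring point clockwise from the split's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, stable_hash(split)) % len(self._points)
        return self._points[idx][1]


def fraction_moved(before, after, splits):
    """Share of splits whose preferred worker differs between two mappings."""
    return sum(1 for s in splits if before(s) != after(s)) / len(splits)


# Hypothetical workload: 10,000 splits, cluster grows from 10 to 11 workers.
splits = [f"hdfs://warehouse/table/part-{i:05d}" for i in range(10_000)]
old_workers = [f"worker-{i}" for i in range(10)]
new_workers = old_workers + ["worker-10"]

old_ring = ConsistentHashRing(old_workers)
new_ring = ConsistentHashRing(new_workers)

modulo_moved = fraction_moved(
    lambda s: modulo_worker(s, old_workers),
    lambda s: modulo_worker(s, new_workers),
    splits,
)
ring_moved = fraction_moved(old_ring.worker_for, new_ring.worker_for, splits)

print(f"modulo hashing:     {modulo_moved:.1%} of splits changed worker")
print(f"consistent hashing: {ring_moved:.1%} of splits changed worker")
```

Under these assumptions, the modulo mapping reassigns most splits when one worker joins, so most cached data ends up on the wrong node. The ring reassigns only the splits that the new worker takes over, leaving the rest of the cache usable.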

Soft Affinity Scheduling

Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (its preferred node) each time. The mapping from a split to a Presto worker is computed by a hash function on the split, so the same split is always hashed to the same worker. The first time a split is processed, its data is cached on the preferred worker node. When subsequent queries process the same split, those requests are scheduled to the same worker node again. Since the data is already cached locally, no remote read is necessary.
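The flow described above can be sketched in a few lines of Python. This is an illustrative toy, not Presto code: the `Worker` class, the `preferred_worker` function, the split path, and the choice of a modulo mapping are all assumptions made for the example.

```python
import hashlib


class Worker:
    """Hypothetical worker that remembers which splits it has cached locally."""

    def __init__(self, name):
        self.name = name
        self.cache = set()

    def process(self, split):
        if split in self.cache:
            return "local cache hit"
        # First access: fetch from remote storage, then keep a local copy.
        self.cache.add(split)
        return "remote read"


def preferred_worker(split, workers):
    """Deterministically map a split to one worker (assumed modulo mapping)."""
    digest = hashlib.sha256(split.encode("utf-8")).digest()
    return workers[int.from_bytes(digest[:8], "big") % len(workers)]


workers = [Worker(f"worker-{i}") for i in range(4)]
split = "s3://bucket/orders/part-00001.orc"  # hypothetical split identifier

# The first query touching this split pays for the remote read.
first_worker = preferred_worker(split, workers)
first_result = first_worker.process(split)

# A later query on the same split hashes to the same worker and hits the cache.
second_worker = preferred_worker(split, workers)
second_result = second_worker.process(split)

print(first_worker.name, first_result)
print(second_worker.name, second_result)
```

Because the mapping depends only on the split and the worker list, every query agrees on the preferred node as long as the worker list stays the same.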

January 11, 2019 / July 10, 2019

Top 10 Tips for Making the Spark + Alluxio Stack Blazing Fast

The Apache Spark + Alluxio stack is getting quite popular, particularly for unifying data access across S3 and HDFS. In addition, compute and storage are increasingly being separated, which causes larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But to get the best performance, as with any technology stack, you need to follow best practices. This article provides the top 10 tips for tuning the performance of real-world workloads when running Spark on Alluxio with data locality, giving the most bang for the buck.

A Note on Data Locality

High data locality can greatly improve the performance of Spark jobs. When data locality is achieved, Spark tasks can read data stored in Alluxio from local Alluxio workers at memory speed (when a ramdisk is configured) instead of transferring the data over the network. The first few tips are related to locality.

