Using Consistent Hashing in Presto to Improve Caching Data Locality in Dynamic Clusters

Running Presto with Alluxio is gaining popularity in the community. It avoids the long latency of reading data from remote storage by using SSD or memory to cache hot datasets close to Presto workers. Presto supports hash-based soft affinity scheduling, which ensures that only one or two copies of the same data are cached in the entire cluster. This improves cache efficiency because more hot data can be cached locally. The current hashing algorithm, however, does not work well when the cluster size changes. This article introduces a new hashing algorithm for soft affinity scheduling, consistent hashing, to address this problem.

Soft Affinity Scheduling

Presto uses a scheduling strategy called soft affinity scheduling to schedule a split (the smallest unit of data processing) to the same Presto worker (its preferred node) every time. The mapping from a split to a Presto worker is computed by a hash function on the split, so the same split is always hashed to the same worker. The first time a split is processed, its data is cached on the preferred worker node. When subsequent queries process the same split, those requests are scheduled to the same worker node again. Since the data is already cached locally, no remote read is necessary.
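
To make the split-to-worker mapping concrete, here is a minimal Python sketch. It is not Presto's implementation: it assumes the existing mapping behaves like a simple "hash modulo worker count" scheme, and it compares that with a basic consistent hash ring. The split paths, worker names, the helper names (stable_hash, preferred_worker_modulo, build_ring, preferred_worker_ring) and the choice of 100 virtual nodes per worker are all hypothetical, chosen only for illustration.

```python
import hashlib
from bisect import bisect_right


def stable_hash(key: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


def preferred_worker_modulo(split_id: str, workers: list[str]) -> str:
    """Assumed baseline: hash the split, then take it modulo the worker count."""
    return workers[stable_hash(split_id) % len(workers)]


def build_ring(workers: list[str], virtual_nodes: int = 100):
    """Place each worker at several points on a ring, sorted by hash value."""
    entries = sorted(
        (stable_hash(f"{worker}#{i}"), worker)
        for worker in workers
        for i in range(virtual_nodes)
    )
    points = [point for point, _ in entries]
    owners = [worker for _, worker in entries]
    return points, owners


def preferred_worker_ring(split_id: str, ring) -> str:
    """Consistent hashing: pick the first ring point after the split's hash."""
    points, owners = ring
    index = bisect_right(points, stable_hash(split_id)) % len(points)
    return owners[index]


# Hypothetical splits and workers; the cluster grows from 10 to 11 workers.
splits = [f"hdfs://warehouse/table/part-{i:05d}" for i in range(2000)]
old_workers = [f"worker-{i}" for i in range(10)]
new_workers = old_workers + ["worker-10"]

old_ring, new_ring = build_ring(old_workers), build_ring(new_workers)

moved_modulo = sum(
    preferred_worker_modulo(s, old_workers) != preferred_worker_modulo(s, new_workers)
    for s in splits
) / len(splits)
moved_ring = sum(
    preferred_worker_ring(s, old_ring) != preferred_worker_ring(s, new_ring)
    for s in splits
) / len(splits)

print(f"splits remapped with modulo hashing:     {moved_modulo:.0%}")
print(f"splits remapped with consistent hashing: {moved_ring:.0%}")
```

Under these assumptions, adding one worker changes the preferred node of most splits with modulo hashing, so most cached data stops being local. With the ring, the only splits that move are those now assigned to the new worker.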

How to Generate and Compare Perceptual Image Hashes in Java

Perceptual image hashing is a relatively new process used primarily in the multimedia industry for content identification and authentication. The process uses an algorithm to extract specific features from an image and calculate a hash value based on that information. The resulting hash value acts as a kind of ‘fingerprint’ for the image: a compact identifier derived from its parent image.

As you may have guessed from the fingerprint comparison, perceptual image hashing is particularly useful for digital forensics, but it has also become an important tool for combating online copyright infringement. By calculating the Hamming Distance between the hash value of an original/authentic image and the hash value of a similar image, you can identify and match images. For reference, Hamming Distance here measures the minimum number of bit substitutions it takes to change one hash value into the other, so a smaller distance means the images are more similar.
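
The comparison step can be sketched in a few lines. The example below is in Python for brevity and assumes the perceptual hashes are already available as 64-bit integers. The three hash values are made up for illustration and do not come from real images, and the helper names hamming_distance and similarity are hypothetical.

```python
def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions in which the two hash values differ."""
    return bin(hash_a ^ hash_b).count("1")


def similarity(hash_a: int, hash_b: int, bits: int = 64) -> float:
    """Fraction of matching bits: 1.0 means identical hashes."""
    return 1.0 - hamming_distance(hash_a, hash_b) / bits


# Made-up 64-bit hash values, for illustration only.
original = 0xF0F0F0F0F0F0F0F0
near_copy = 0xF0F0F0F0F0F0F0F1   # differs from the original in one bit
unrelated = 0x0F0F0F0F0F0F0F0F   # differs from the original in every bit

print(hamming_distance(original, near_copy))   # 1
print(hamming_distance(original, unrelated))   # 64
print(similarity(original, near_copy))
```

In practice a threshold on the distance decides whether two images count as a match. The right threshold depends on the hashing algorithm and is not specified here.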

What Is a Hash Table?

Why Should I Care?

Have you ever wanted to know:

  • How does the hash map, associative array, or dictionary data structure in your language work?
  • When is it appropriate to use a hash table to store items?
  • How do we deal with 'collisions' in a hash table?

In 5 Minutes or Less:

Imagine we want to store a list of users so that we can find them later using their names.
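
As a rough sketch of that idea, the Python class below stores users by name in a fixed number of buckets and resolves collisions by chaining, meaning each bucket holds a small list of entries. Chaining is only one common collision strategy. The class name UserTable, the bucket count, and the sample users are assumptions made for illustration.

```python
class UserTable:
    """A tiny hash table keyed by user name, using chaining for collisions."""

    def __init__(self, bucket_count: int = 8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _bucket(self, name: str) -> list:
        # The hash of the name decides which bucket the entry lives in.
        return self.buckets[hash(name) % len(self.buckets)]

    def put(self, name: str, user: dict) -> None:
        bucket = self._bucket(name)
        for i, (existing_name, _) in enumerate(bucket):
            if existing_name == name:
                bucket[i] = (name, user)   # overwrite an existing entry
                return
        bucket.append((name, user))        # new entry, possibly a collision

    def get(self, name: str):
        for existing_name, user in self._bucket(name):
            if existing_name == name:
                return user
        return None


# Two buckets and five users guarantee that some names share a bucket.
table = UserTable(bucket_count=2)
for name in ["ada", "grace", "linus", "margaret", "dennis"]:
    table.put(name, {"name": name})

print(table.get("grace"))
print(table.get("nobody"))
```

A lookup only scans the one bucket selected by the hash, which is why it stays fast as long as the buckets stay short.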

Hashing Names Does Not Protect Privacy

Secure hash functions are practically impossible to reverse, but only if the input is unrestricted.

If you generate 256 random bits and apply a secure 256-bit hash algorithm, an attacker wanting to recover your input can’t do much better than brute force: hashing 256-bit strings in the hope of finding one that matches your hash value. Even then, the attacker may find a collision, another string of bits that happens to have the same hash value.
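
Names are a restricted input, and the short Python sketch below shows why that matters. It assumes names were hashed with plain SHA-256 and no secret key, and that the attacker has lists of common first and last names. The names and the lower-case "first last" formatting are hypothetical.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# "Anonymized" data set: only the hashes of the names are published.
published_hashes = [sha256_hex("alice johnson"), sha256_hex("bob smith")]

# The attacker's candidate lists (hypothetical and far smaller than real ones).
first_names = ["alice", "bob", "carol", "dave"]
last_names = ["smith", "johnson", "lee", "garcia"]

# Hash every plausible name once and build a reverse lookup table.
lookup = {
    sha256_hex(f"{first} {last}"): f"{first} {last}"
    for first in first_names
    for last in last_names
}

recovered = [lookup.get(h) for h in published_hashes]
print(recovered)   # ['alice johnson', 'bob smith']
```

The attacker never reverses the hash function. Enumerating the small space of likely inputs is enough.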

Decoded: Examples of How Hashing Algorithms Work

If cryptography were a body, its hashing algorithm would be the heart. If cryptography were a car, its hashing algorithm would be the engine. If cryptography were a movie, its hashing algorithm would be the star. If cryptography were the solar system, its hashing algorithm would be the sun. Okay, that’s probably going too far, but you get the point, right? Before we get to what a hashing algorithm is, why it’s there, and how it works, it’s important to understand its nuts and bolts. Let’s start with hashing.

What Is Hashing?

Let’s imagine a hypothetical situation. Suppose you want to send a message/file to someone and it is absolutely imperative that it reaches its intended recipient exactly as you sent it. How would you do it? One option is to send it multiple times and compare the copies to verify that it wasn’t tampered with. But what if the message is too long? What if the file is measured in gigabytes? It would be utterly absurd, impractical, and, quite frankly, boring to verify every single letter, right? Well, that’s where hashing comes into play.
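
Here is a minimal Python sketch of that idea, assuming SHA-256 as the hash function and assuming the sender can pass the short digest to the recipient over a channel the recipient trusts. The function names sha256_of_stream and is_intact are hypothetical. Data is read in chunks so that a file of several gigabytes never has to fit in memory.

```python
import hashlib
import hmac
import io


def sha256_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Hash a file-like object chunk by chunk and return the hex digest."""
    hasher = hashlib.sha256()
    while chunk := stream.read(chunk_size):
        hasher.update(chunk)
    return hasher.hexdigest()


def is_intact(received_stream, expected_digest: str) -> bool:
    """Compare the digest of what arrived with the digest the sender shared."""
    return hmac.compare_digest(sha256_of_stream(received_stream), expected_digest)


# In-memory stand-ins for the sent and received files.
sent = b"Transfer 100 units to account 42"
expected = sha256_of_stream(io.BytesIO(sent))

print(is_intact(io.BytesIO(sent), expected))
print(is_intact(io.BytesIO(b"Transfer 900 units to account 42"), expected))
```

Instead of comparing every letter, the recipient compares two short digests. Changing even a single character of the message produces a different digest.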