Optimizing Distributed Joins: Google Cloud Spanner and DataStax Astra DB

Distributed joins are commonly considered too expensive to use for real-time transaction processing. That is because, besides joining data, they also frequently require moving or shuffling data between nodes in a cluster, which can significantly affect query response times and database throughput. However, there are certain optimizations that can completely eliminate the need to move data to enable faster joins. In this article, we first review the four types of distributed joins, including shuffle join, broadcast join, co-located join, and pre-computed join. We then demonstrate how leading fully managed Relational and NoSQL databases, namely Google Cloud Spanner and DataStax Astra DB, support optimized joins that are suitable for real-time applications.

Four Types of Distributed Joins

Joins are used in databases to combine related data from one or more tables or datasets. Data is usually combined based on some condition that relates columns from participating tables. We call columns used in a join condition join keys and assume they are always related by equality operators.