Optimizing Distributed Joins: Google Cloud Spanner and DataStax Astra DB

Distributed joins are commonly considered too expensive to use for real-time transaction processing. That is because, besides joining data, they frequently require moving or shuffling data between nodes in a cluster, which can significantly affect query response times and database throughput. However, certain optimizations can completely eliminate the need to move data, enabling faster joins. In this article, we first review the four types of distributed joins: shuffle join, broadcast join, co-located join, and pre-computed join. We then demonstrate how leading fully managed relational and NoSQL databases, namely Google Cloud Spanner and DataStax Astra DB, support optimized joins that are suitable for real-time applications.
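To make the co-located case concrete, here is a minimal sketch in Google Cloud Spanner DDL (the schema is illustrative, not taken from the article). Interleaving a child table in its parent stores related rows physically together, so a join on the interleave key can be served without moving data between nodes:

```sql
-- Hypothetical parent/child schema with Albums interleaved in Singers.
CREATE TABLE Singers (
  SingerId INT64 NOT NULL,
  Name     STRING(MAX),
) PRIMARY KEY (SingerId);

CREATE TABLE Albums (
  SingerId INT64 NOT NULL,
  AlbumId  INT64 NOT NULL,
  Title    STRING(MAX),
) PRIMARY KEY (SingerId, AlbumId),
  INTERLEAVE IN PARENT Singers ON DELETE CASCADE;

-- Each Singers row and its Albums rows live in the same split, so this
-- join reads co-located data instead of shuffling rows across nodes.
SELECT s.Name, a.Title
FROM Singers AS s
JOIN Albums AS a ON s.SingerId = a.SingerId;
```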

Four Types of Distributed Joins

Joins are used in databases to combine related data from one or more tables or datasets. Data is usually combined based on a condition that relates columns from the participating tables. We call the columns used in a join condition join keys, and we assume they are always related by equality operators.
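For instance, in the following standard SQL sketch (the customers and orders tables are hypothetical, not from the article), customers.id and orders.customer_id are the join keys:

```sql
-- Hypothetical tables: customers(id, name) and orders(id, customer_id, total).
-- The equality condition in the ON clause names the join keys.
SELECT c.name, o.total
FROM customers AS c
JOIN orders AS o ON c.id = o.customer_id;
```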

Reaper 3.0 for Apache Cassandra Is Available

The K8ssandra team is pleased to announce the release of Reaper 3.1. Let’s dive into the features and improvements that 3.0 recently introduced (along with some notable removals) and how the newest update to 3.1 builds on that.

JDK11 Support

Starting with 3.1.0, Reaper can now compile and run with JDK 11. Note that JDK 8 is still supported at runtime.

The End of the Beginning for Apache Cassandra

Editor’s note: This story originally ran on July 27, 2021, the day that Apache Cassandra 4.0 was released.

Today is a big day for those of us in the Apache Cassandra community. After a long uphill climb, Apache Cassandra 4.0 has finally shipped. I say finally because it has at times seemed like an elusive goal. I’ve been involved in the Cassandra project for almost 10 years now, and I have seen a lot of ups and downs. So I feel this day marks more than a version number: it is an important milestone in the lifecycle of a database that has come into its own and is used around the world. The 4.0 release is not only the most stable in Cassandra’s history, but quite possibly the most stable release of any database. Now it’s ready to launch into the next 10 years of cloud-native data; it has the computer science and the hard-won history to make a huge impact. Today’s milestone is the end of the beginning.

Taking Your Database Beyond a Single Kubernetes Cluster

Global applications need a data layer that is as distributed as the users they serve. Apache Cassandra has risen to this challenge, handling data needs for the likes of Apple, Netflix, and Sony. Traditionally, the data layer for a distributed application was managed by dedicated teams handling the deployment and operations of thousands of nodes, both on-premises and in the cloud.

To alleviate much of the load felt by DevOps teams, we evolved a number of these practices and patterns in K8ssandra, leveraging the common control plane afforded by Kubernetes (K8s). There has been a catch, though: running a database (or indeed any application) across multiple regions or K8s clusters is tricky without proper care and planning up front.

Kubernetes and Apache Cassandra: What Works (and What Doesn’t)

“I need it now and I need it reliable.”

– ANYONE WHO HASN’T DEPLOYED APPLICATION INFRASTRUCTURE

If you’re on the receiving end of this statement, we in the K8ssandra community understand you. We do have reason for hope, though: recent surveys have shown that Kubernetes (K8s) is growing in popularity, not only because it’s powerful technology, but because it actually delivers on reducing the toil of deployment.

Multi-Cluster Cassandra Deployment With Google Kubernetes Engine (Pt. 2)

This is the second in a series of posts examining patterns for using K8ssandra to create Cassandra clusters with different deployment topologies.

In the first article in this series, we looked at how you could create a Cassandra cluster with two datacenters in a single cloud region, using separate Kubernetes namespaces in order to isolate workloads. For example, you might want to create a secondary Cassandra datacenter to isolate a read-heavy analytics workload from the datacenter supporting your main application.
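As a rough sketch of that pattern in CQL (the keyspace and datacenter names here are hypothetical, not from the series), the application keyspace is replicated to both datacenters, and analytics clients then pin their driver’s local datacenter to the secondary one:

```sql
-- CQL sketch: replicate the app keyspace to the main and analytics
-- datacenters. Analytics clients connect with their local datacenter
-- set to dc2_analytics, so heavy reads stay off dc1.
ALTER KEYSPACE my_app
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc1': 3,
    'dc2_analytics': 3
  };
```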

How to Put a Database in Kubernetes

The idea of running a stateful workload in Kubernetes (K8s) can be intimidating, especially if you haven’t done it before. How do you deploy a database? Where is the actual storage? How is the storage mapped to the database or the application using it?

At KubeCon North America 2021, I gave a talk on “How to put a database in Kubernetes,” where I demystified the deployment of databases and stateful workloads in K8s. Basically, it boils down to a few key steps.

Deploy a Multi-Datacenter Apache Cassandra Cluster in Kubernetes (Pt. 1)

The Get Started examples on the K8ssandra site are primarily concerned with spinning up a single Apache Cassandra™ datacenter in a single Kubernetes cluster. However, there are many situations that can benefit from other deployment options. In this series of posts, we’ll examine different deployment patterns and show how to implement them using K8ssandra.

Flexible Topologies With Cassandra

From its earliest days, Cassandra has included the ability to assign nodes to datacenters and racks. A rack was originally conceived as mapping to a single rack of servers connected to shared resources, like power, network, and cooling. A datacenter could consist of multiple physically separated racks. These constructs allowed developers to create high-availability deployments by replicating data across different fault domains, ensuring that Cassandra clusters remain operational amid failures ranging from a single physical server or rack to an entire datacenter facility.
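A minimal CQL sketch of how these constructs surface to developers (the keyspace and datacenter names are illustrative): replication is declared per datacenter, and Cassandra’s rack-aware placement then spreads each datacenter’s replicas across distinct racks where possible:

```sql
-- CQL sketch: three replicas in each of two datacenters. Within each
-- datacenter, NetworkTopologyStrategy tries to place replicas on
-- distinct racks, so losing a rack costs at most one replica.
CREATE KEYSPACE inventory
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us_east': 3,
    'eu_west': 3
  };
```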

Why We Decided to Build a K8ssandra Operator – Part 4

In the first, second, and third posts in this series, we’ve shared conversations with K8ssandra core team members on our journey to build a Kubernetes operator for K8ssandra. We’ve discussed the virtues of the Helm package manager versus Kubernetes operators for deploying and managing infrastructure in Kubernetes and some of our implementation choices for the operator.

In this final post of the series, we pick up from the previous post with a discussion of how we decided to structure our projects in GitHub, how we test the K8ssandra operator, and our hopes for how the operator will expand the K8ssandra developer community.

Backing Up K8ssandra With MinIO

K8ssandra includes Medusa for Apache Cassandra® to handle backup and restore for your Cassandra nodes. Recently, Medusa was upgraded to support all S3-compatible backends, including MinIO, the popular K8s-native object storage suite. Let’s see how to set up K8ssandra and MinIO to back up Cassandra in just a few steps.

Deploy MinIO

Similar to K8ssandra, MinIO can be deployed simply through Helm.

The Future of Cloud-Native Databases Begins With Apache Cassandra 4.0

“Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust.” 

This was the first line of the highly influential paper “Dynamo: Amazon’s Highly Available Key-Value Store.” Published in 2007, it was written at a time when the status quo of database systems was not working for the massive explosion of internet-based applications. A team of computer engineers and scientists at Amazon completely rethought the idea of data storage in terms of what would be needed for the future, with a firm footing in the computer science of the past.

Requirements for Running K8ssandra for Development

K8ssandra is a complete stack for running Apache Cassandra® in production. As such, it comes with several components that can consume a lot of resources and make it challenging to run on a dev laptop. Let’s explore how we can configure K8ssandra for this environment and run some simple benchmarks to determine what performance we can expect.

Managing Expectations

The K8ssandra Quickstart is an excellent guide for doing a full installation of K8ssandra on a dev laptop and trying out the various components of the K8ssandra stack. While this is a great way to get your first hands-on experience with K8ssandra, let’s state the obvious: running K8ssandra locally on a dev laptop is not aimed at performance. In this blog post, we will start Apache Cassandra® locally, then explain how to run benchmarks to help evaluate what level of performance (especially throughput) you can expect from a dev laptop deployment.

Survey Finds Data on Kubernetes Is No Longer a Pipe Dream

For people who work in infrastructure and application development, the pace of change is quick. Finish one project and it’s on to the next. Each iteration requires evaluating whether the right technology is being used and whether it still provides an advantage. Kubernetes has been on the fast track of continuous evaluation: new projects and methodologies are constantly emerging, and it can be hard to keep up. Then there is the question of running stateful services.

The Data on Kubernetes community has released a report, “Data on Kubernetes 2021,” to give us a snapshot of where our industry sits with stateful workloads. Over 500 executives and tech leaders were asked some very direct and insightful questions about how they use Kubernetes. It turns out there were a lot of surprising findings, some I would never have predicted. Let’s dig into some of the highlights that stood out to me.

MySQL Tutorial: Understanding The Seconds Behind Master Value

In a MySQL replication setup, the parameter Seconds_Behind_Master (SBM), as displayed by the SHOW SLAVE STATUS command, is commonly used as an indication of the current replication lag of the slave. In this post, we examine how to understand and interpret this value in various situations.

Possible Values of Seconds Behind Master

The value of SBM, as explained in the MySQL documentation, depends on the state of the MySQL slave in general, and on the states of the slave SQL_THREAD and IO_THREAD in particular. While the IO_THREAD connects to the master and reads the updates, the SQL_THREAD applies these updates on the slave. Let’s examine the possible values of SBM during the different states of the MySQL slave.
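As a quick sketch of how to inspect this on a replica (the readings summarized in the comments follow the MySQL documentation, not output from this article):

```sql
-- Run on the slave; Seconds_Behind_Master is one field of the result.
SHOW SLAVE STATUS;

-- Typical readings, per the MySQL documentation:
--   NULL  -> the SQL_THREAD or the IO_THREAD is not running
--   0     -> the slave has applied all events read from the master so far
--   N > 0 -> the slave is an estimated N seconds behind the master
```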