How to Put a Database in Kubernetes

The idea of running a stateful workload in Kubernetes (K8s) can be intimidating, especially if you haven’t done it before. How do you deploy a database? Where is the actual storage? How is the storage mapped to the database or the application using it?

At KubeCon North America 2021, I gave a talk on “How to put a database in Kubernetes” where I demystified the deployment of databases and stateful workloads in K8s. Basically, it boils down to a few key steps:

Deploy a Multi-Datacenter Apache Cassandra Cluster in Kubernetes (Pt. 1)

The Get Started examples on the K8ssandra site are primarily concerned with spinning up a single Apache Cassandra™ datacenter in a single Kubernetes cluster. However, there are many situations that can benefit from other deployment options. In this series of posts, we’ll examine different deployment patterns and show how to implement them using K8ssandra.

Flexible Topologies With Cassandra

From its earliest days, Cassandra has included the ability to assign nodes to datacenters and racks. A rack was originally conceived as mapping to a single rack of servers connected to shared resources, like power, network, and cooling. A datacenter could consist of multiple racks with physical separation. These constructs allowed developers to create high-availability deployments by replicating data across different fault domains. This ensured that Cassandra clusters remain operational amid failures ranging from a single physical server or rack to an entire datacenter facility.

Why We Decided to Build a K8ssandra Operator – Part 4

In the first, second, and third posts in this series, we’ve shared conversations with K8ssandra core team members on our journey to build a Kubernetes operator for K8ssandra. We’ve discussed the virtues of the Helm package manager versus Kubernetes operators for deploying and managing infrastructure in Kubernetes and some of our implementation choices for the operator.

In this final post of the series, we pick up from the previous post with a discussion of how we decided to structure our projects in GitHub, how we test the K8ssandra operator, and our hopes for how the operator will expand the K8ssandra developer community.

Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

When DataStax started investing in streaming with Apache Pulsar™, we knew that one of the first things people would want to do was connect existing enterprise data sources to Apache Cassandra™ using Pulsar.

Apache Pulsar has a powerful framework called Pulsar IO to enable this kind of use case, and at DataStax we already had a best-in-class Kafka Connect Sink that enables you to store structured data coming from one or more Kafka topics into DataStax Enterprise, Apache Cassandra, and Astra.

What CTOs Say vs. What Their Developers Hear

Anyone who’s been in a rapidly scaling company with an ever-expanding engineering team knows that communication is never as simple as it seems. 

That’s why we were so excited when Shankar Ramaswamy decided to sit down with Dev Interrupted.

Backing Up K8ssandra With MinIO

K8ssandra includes Medusa for Apache Cassandra® to handle backup and restore for your Cassandra nodes. Medusa was recently upgraded to support all S3-compatible backends, including MinIO, the popular k8s-native object storage suite. Let’s see how to set up K8ssandra and MinIO to back up Cassandra in just a few steps.

Deploy MinIO

Similar to K8ssandra, MinIO can be simply deployed through Helm.

The Future of Cloud-Native Databases Begins With Apache Cassandra 4.0

“Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust.” 

This was the first line of the highly impactful paper titled “Dynamo: Amazon’s Highly Available Key-Value Store.” Published in 2007, it was written at a time when the status quo of database systems was not working for the massive explosion of internet-based applications. A team of computer engineers and scientists at Amazon completely re-thought the idea of data storage in terms of what would be needed for the future, with a firm footing in the computer science of the past. 

Requirements for Running K8ssandra for Development

K8ssandra is a complete stack for running Apache Cassandra® in production. As such, it comes with several components that can consume a lot of resources and make it challenging to run on a dev laptop. Let’s explore how we can configure K8ssandra for this environment and run some simple benchmarks to determine what performance we can expect.

Managing Expectations

The K8ssandra Quickstart is an excellent guide for doing a full installation of K8ssandra on a dev laptop and trying out the various components of the K8ssandra stack. While this is a great way to get your first hands-on experience with K8ssandra, let’s state the obvious: running K8ssandra locally on a dev laptop is not aimed at performance. In this blog post, we will start Apache Cassandra® locally, then explain how to run benchmarks to help evaluate what level of performance (especially throughput) you can expect from a dev laptop deployment.

Survey Finds Data on Kubernetes Is No Longer a Pipe Dream

For people that work in infrastructure and application development, the pace of change is quick. Finish one project and it’s on to the next. Each iteration requires an evaluation asking if the right technology is being used and if it provides a new advantage. Kubernetes has been on the fast track of continuous evaluation. New projects and methodologies are continuously emerging and it can be hard to keep up. Then there is the question of running stateful services. 

The Data on Kubernetes community has released a report titled “Data on Kubernetes 2021” to give us a snapshot of where our industry sits with stateful workloads. Over 500 executives and tech leaders were asked some very direct and insightful questions about how they use Kubernetes. It turns out there were a lot of surprising findings, some of which I would never have predicted. Let’s dig into some of the highlights that stood out to me.

Simplify Migrating From Kafka to Pulsar With Kafka Connect Support

Large-scale implementations of any system, such as the event-streaming platform Apache Kafka, often involve customizations and tools and plugins developed in-house. When it’s time to transition from one system to another, the task can become complicated, drawn-out, and error-prone. Often the benefits of an alternative system (which can include significant cost savings and other efficiencies) are outweighed by the risks and costs of migration. As a result, an organization can end up locked into a suboptimal situation, footing a bigger bill than necessary and missing out on modern features that help move the business forward faster. 

These risks and costs can be mitigated by making the transition process iterative, breaking off the vendor lock-in in small, manageable steps, and avoiding the "big bang" switch that often results in delayed delivery and increases the cost of running two systems in parallel for A/B testing.

How To Connect Stateful Workloads Across Kubernetes Clusters

One of the biggest selling points of Apache Cassandra™ is its shared-nothing architecture, making it an ideal choice for deployments that span multiple physical data centers. So when our Cassandra as-a-service single-region offering reached maturity, we naturally started looking into offering it cross-region and cross-cloud. One of the biggest challenges in providing a solution that spans multiple regions and clouds is correctly configuring the network so that Cassandra nodes in different data centers can communicate with each other successfully, even as individual nodes are added, replaced, or removed. 

From the start of the cloud journey at DataStax, we selected Kubernetes as our orchestration platform, so our search for a networking solution started there. While we’ve benefited immensely from the ecosystem and have our share of war stories, this time we chose to forge our own path, landing on ad-hoc overlay virtual application networks (how’s that for a buzzword soup?). In this post, we’ll go over how we arrived at our solution, its technical overview, and a hands-on example with the Cassandra operator.

Build A Crypto Price Tracker Using Node.js and Cassandra

Since the big bang in the data technology landscape a decade and a half ago, which gave rise to technologies like Hadoop that cater to the four ‘V’s (volume, variety, velocity, and veracity), there has been an uptick in the use of databases with specialized capabilities to cater to different types of data and usage patterns. You can now see companies using graph databases, time-series databases, document databases, and others for different customer and internal workloads.

Apache Cassandra™ is a wide-column, NoSQL database that is ideal when used for append-only, write-intensive workloads that capture data from IoT sensors, GPS devices, transaction logs, any time-series applications, and more. A lot of these applications need to be coupled with visualization engines for creating reports and dashboards. As most of the visualization libraries are written in JavaScript, using Node.js to interact with the database and the visualization engine is a good idea.  

Apache Cassandra 4.0: Taming Tail Latencies with Java 16 ZGC

Like so many others in the Apache Cassandra community, I’m extremely excited to see that the 4.0 release is finally here. There are many, many improvements to Cassandra 4.0. One enhancement that is more important than it might look is the addition of support for Java versions 9 and up. This was not trivial, because Java 9 made changes to some internal APIs that the most performance-oriented Java projects like Cassandra relied on (you can read more about this here).

This is a big deal because with Cassandra 4.0, you not only get the direct improvements to performance added by the Apache Cassandra committers, you also unlock the ability to take advantage of seven years of improvements in the JVM (Java Virtual Machine) itself.

Best Practices for Data Pipeline Error Handling in Apache NiFi

According to a McKinsey report, “the best analytics are worth nothing with bad data.” We as data engineers and developers know this simply as "garbage in, garbage out". Today, with the success of the cloud, data sources are many and varied. Data pipelines help us to consolidate data from these different sources and work on it. However, we must ensure that the data used is of good quality. As data engineers, we mold data into the right shape, size, and type with high attention to detail.

Fortunately, we have tools such as Apache NiFi, which allow us to design and manage our data pipelines, reducing the amount of custom programming and increasing overall efficiency. Yet, when it comes to creating them, a key and often neglected aspect is minimizing potential errors.

Cloud-Native Data Platform Frees Developers to Focus on App Development

I had the opportunity to meet with Robin Schumacher, Chief Product Officer, and Jonathan Ellis, Co-founder and Chief Technology Officer of DataStax, at their Accelerate user conference, where DataStax CEO Billy Bosworth introduced DataStax Constellation in his keynote.

Constellation is a cloud data platform that simplifies development and operation of modern applications. Constellation will launch later this year with two cloud services: DataStax Apache Cassandra as a Service and DataStax Insights. DataStax Apache Cassandra as a Service will deliver scale-up and scale-down Cassandra clusters, on consumption-based pricing.

Cassandra DataStax: Developer Guide With Spring Data Cassandra

I built this proof of concept (POC) when Spring 4.x was the latest version, so please check the current versions of Cassandra and Spring before following along. We will discuss a Cassandra implementation.

Download and Installation

1. Tarball Installation

DataStax DB

sudo mkdir -p /var/log/cassandra
sudo chmod 777 /var/log/cassandra

sudo mkdir -p /var/lib/cassandra/data
sudo chmod 777 /var/lib/cassandra/data

sudo mkdir -p /var/lib/cassandra/commitlog
sudo chmod 777 /var/lib/cassandra/commitlog

sudo mkdir -p /var/lib/cassandra/saved_caches
sudo chmod 777 /var/lib/cassandra/saved_caches
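
The directory setup above repeats the same two commands for each path, so it can also be written as a loop. The following is only a sketch: the function name setup_cassandra_dirs and its optional root-prefix argument are assumptions made for illustration and are not part of Cassandra or DSE. The prefix lets you try the steps in a scratch folder before touching the real /var paths.

```shell
#!/bin/sh
# Sketch: create the Cassandra log and data directories in one pass.
# setup_cassandra_dirs and its "root" argument are hypothetical helpers,
# not something shipped with Cassandra or DSE.
setup_cassandra_dirs() {
  # Optional prefix; empty means the real /var/... locations.
  root="${1:-}"
  for dir in \
    /var/log/cassandra \
    /var/lib/cassandra/data \
    /var/lib/cassandra/commitlog \
    /var/lib/cassandra/saved_caches
  do
    # Create the directory (and any missing parents), then open up
    # its permissions, mirroring the mkdir/chmod pairs shown above.
    mkdir -p "${root}${dir}" || return 1
    chmod 777 "${root}${dir}" || return 1
  done
}
```

For a dry run, call it with a scratch folder, for example setup_cassandra_dirs "$HOME/cassandra-sandbox". Calling it with no argument targets the real /var paths, which normally requires running the script as root.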
    • How to run Cassandra: go to the DataStax Cassandra installation folder (Mac/Linux/Unix):
      cd /Users/<userName>/dse-<version>/bin
      sudo ./dse cassandra -f
      
      //The command above starts the Cassandra DB on your local system. Hit Enter to leave the server running in the background and start the CQL query console.
      sudo ./cqlsh

Create Schema: