Is Your Data Infrastructure Stifling Innovation?

There are myriad reasons why an estimated 90% of startups fail. You need a great idea (and not just one great idea); you need inspiration, funding, smart people — and a fair amount of luck. Miss any one of these factors, and failure might be a foregone conclusion.

For young companies or small teams that build applications, data can be another stumbling block. The databases they rely on have historically stymied innovation by being complex and costly to spin up, manage, and maintain. Proofs of concept — ideas with the potential to turn into something big — can die before even being tested due to a lack of funding or database capacity.

The Distributed Data Problem

Today, online retailers sell millions of products and services to customers all around the world. This was never more evident than in 2020, when COVID-19 restrictions all but eliminated visits to brick-and-mortar stores and in-person transactions. Of course, consumers still needed to purchase food, clothing, and other essentials, and, as a result, worldwide digital sales climbed to $4.2 trillion, up $900 billion from just a year prior.

Was it enough for those retailers to have robust websites and mobile apps to keep their customers from shopping with competitors? Unfortunately not. Looking across the e-commerce landscape of 2020, there were clear winners and losers. But what was the deciding factor?

Optimizing Distributed Joins: Google Cloud Spanner and DataStax Astra DB

Distributed joins are commonly considered too expensive to use for real-time transaction processing. That is because, besides joining data, they also frequently require moving or shuffling data between nodes in a cluster, which can significantly affect query response times and database throughput. However, certain optimizations can completely eliminate the need to move data, enabling faster joins. In this article, we first review the four types of distributed joins: shuffle, broadcast, co-located, and pre-computed. We then demonstrate how leading fully managed relational and NoSQL databases, namely Google Cloud Spanner and DataStax Astra DB, support optimized joins that are suitable for real-time applications.

Four Types of Distributed Joins

Joins are used in databases to combine related data from one or more tables or datasets. Data is usually combined based on some condition that relates columns from the participating tables. We call the columns used in a join condition join keys, and we assume here that they are always related by equality operators.
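To make the terminology concrete, here is a minimal sketch of an equality join issued against Google Cloud Spanner from Python. The instance, database, tables, and column names are hypothetical, chosen only to illustrate a join key (customer_id) used in an equality condition.

```python
# A minimal sketch of an equality join on Google Cloud Spanner.
# Instance, database, table, and column names are hypothetical.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("demo-instance").database("demo-db")

# customer_id is the join key; the join condition uses an equality operator.
query = """
    SELECT c.name, o.order_id, o.total
    FROM customers AS c
    JOIN orders AS o ON c.customer_id = o.customer_id
    WHERE c.customer_id = @customer_id
"""

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        query,
        params={"customer_id": 42},
        param_types={"customer_id": spanner.param_types.INT64},
    )
    for name, order_id, total in rows:
        print(name, order_id, total)
```

In Spanner, declaring orders as interleaved in customers would physically co-locate the related rows, which is the idea behind the co-located join discussed in the article.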

Building Scalable Streaming Applications

DataStax recently released Astra Streaming, which enables developers to build streaming applications on top of an elastically scalable, multi-cloud messaging and event streaming platform powered by Apache Pulsar. This article walks you through a short demo that provides a great starting point for familiarizing yourself with this powerful new streaming service.
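Before walking through the demo, it may help to see the basic shape of a Pulsar producer and consumer in Python. The sketch below uses the pulsar-client library; the service URL, topic, and token are placeholders, and an Astra Streaming connection would substitute the broker URL and token provided for your tenant.

```python
# A minimal sketch of producing and consuming messages with the Pulsar Python
# client (pip install pulsar-client). URL, topic, and token are placeholders.
import pulsar

client = pulsar.Client(
    "pulsar+ssl://<broker-host>:6651",
    authentication=pulsar.AuthenticationToken("<streaming-token>"),
)

producer = client.create_producer("persistent://my-tenant/my-namespace/orders")
producer.send(b'{"order_id": 1, "status": "created"}')

consumer = client.subscribe(
    "persistent://my-tenant/my-namespace/orders",
    subscription_name="order-processor",
)
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)

client.close()
```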


Kubernetes Data Simplicity: Getting Started With K8ssandra

You might have heard about the K8ssandra project and want to start contributing, or maybe you want to start using all of its features. If you aren’t familiar with K8ssandra (pronounced like “Kate Sandra”), you can read this overview before digging into the developer activities in this post.

In a nutshell, K8ssandra is an open-source distribution of Apache Cassandra™ for Kubernetes that includes a rich set of trusted open-source services and tooling. K8ssandra comes with handy features that are both baked in and pluggable, allowing for flexible deployment and configuration.
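As a hedged taste of the developer experience once K8ssandra is running, the sketch below connects to the deployed Cassandra cluster from Python. It assumes the cluster’s CQL service has been port-forwarded to localhost and that the superuser credentials have been read from the Kubernetes secret K8ssandra creates; all names here are placeholders.

```python
# A hedged sketch of connecting to a Cassandra cluster deployed by K8ssandra.
# Assumes something like `kubectl port-forward svc/<cluster>-<dc>-service 9042:9042`
# is running and that <superuser>/<password> came from the superuser secret.
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth = PlainTextAuthProvider(username="<superuser>", password="<password>")
cluster = Cluster(["127.0.0.1"], port=9042, auth_provider=auth)
session = cluster.connect()

for row in session.execute("SELECT release_version FROM system.local"):
    print(row.release_version)

cluster.shutdown()
```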

Why Pulsar Beats Kafka for a Scalable, Distributed Data Architecture

The leading open-source event streaming platforms are Apache Kafka and Apache Pulsar. For enterprise architects and application developers, choosing the right event streaming approach is critical, because these technologies determine how their applications scale around data to support operations in production.

Everyone wants results faster. We want applications that know what we want, even before we know ourselves. We want systems that constantly check for fraud or security issues to protect our data. We want applications that are smart enough to react and change plans when faced with the unexpected. And we want those services to be continuously available.

Why a Cloud-Native Database Must Run on K8s

We’ve been talking about migrating workloads to the cloud for a long time, but a look at the application portfolios of many IT organizations demonstrates that there’s still a lot of work to be done. In many cases, challenges with persisting and moving data in clouds continue to be the key limiting factor slowing cloud adoption, despite the fact that databases in the cloud have been available for years.

For this reason, there has been a recent surge of interest in data infrastructure designed to take maximum advantage of the benefits that cloud computing provides. A cloud-native database is one that achieves the goals of scalability, elasticity, resiliency, observability, and automation; the K8ssandra project is a great example. It packages Apache Cassandra and supporting tools into a production-ready Kubernetes deployment.

Create a Full-Stack App Using Nuxt.js, NestJS, and DataStax Astra DB (With a Little Help From GitHub Copilot)

Building a full-stack application can be daunting because you have to think not only about how the frontend will display the data, but also about where the data will come from and where it will be stored. However, it’s not as hard as you might think to get the basics of a full-stack application up and running.

If you want to create a full-stack application, complete with dynamic data retrieved from a cloud database by an API, then watch the tutorial below, created by Eddie Jaoude. In his tutorial, Eddie shows you how to do it in less than an hour using Nuxt.js with VuetifyJS for the frontend, NestJS to create RESTful APIs, and DataStax’s Astra DB for the cloud database service. Also, you’ll use GitHub Copilot as your AI-powered pair programmer.

Reaper 3.0 for Apache Cassandra Is Available

The K8ssandra team is pleased to announce the release of Reaper 3.1. Let’s dive into the features and improvements that 3.0 recently introduced (along with some notable removals) and how the newest update to 3.1 builds on that.

JDK 11 Support

Starting with 3.1.0, Reaper can now compile and run with JDK 11. Note that JDK 8 is still supported at runtime.

Bring Streaming to Apache Cassandra with Apache Pulsar

Twitch, YouTube, Instagram, Facebook — virtually every major brand nowadays uses live streaming to connect with and engage its audience. For enterprises and developers building cloud-native applications, this growing trend creates a need for streaming technologies that can reliably handle the rush of massive amounts of data, while remaining flexible and easy for developers to manage.

One such technology is Apache Pulsar® — an open-source, distributed messaging and streaming platform that’s easy to deploy, simple to scale, and packed with developer-friendly APIs. So the next question is: how can you stream from Pulsar to Apache Cassandra®, the powerful NoSQL database designed to support data-heavy applications in the cloud?

Join our beginner-friendly Pulsar workshop on YouTube and learn how to connect Pulsar with Cassandra for streaming! In this post, we’ll set the scene with an introduction to Pulsar and guide you through four hands-on exercises where you’ll use these free, cloud-native technologies: Katacoda, Kesque, GitPod, and DataStax Astra DB. Each exercise will also be linked to the step-by-step instructions on the DataStax Developers GitHub wiki.
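Conceptually, the bridge between the two can be as simple as a consumer loop that writes each Pulsar message into a Cassandra table, which is what the sketch below shows. It is only an illustration: the broker address, topic, keyspace, and table schema are made-up placeholders, and the workshop’s own exercises are built on the hosted services listed above.

```python
# A hedged sketch of streaming from Pulsar into Cassandra: consume each message
# and insert it into a table. All names and the schema are illustrative.
import json

import pulsar
from cassandra.cluster import Cluster

pulsar_client = pulsar.Client("pulsar://localhost:6650")
consumer = pulsar_client.subscribe("sensor-readings", subscription_name="cassandra-writer")

session = Cluster(["127.0.0.1"]).connect("demo")
insert = session.prepare("INSERT INTO readings (sensor_id, ts, value) VALUES (?, ?, ?)")

while True:
    msg = consumer.receive()
    try:
        reading = json.loads(msg.data())
        session.execute(insert, (reading["sensor_id"], reading["ts"], reading["value"]))
        consumer.acknowledge(msg)
    except Exception:
        consumer.negative_acknowledge(msg)  # let Pulsar redeliver the message later
```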

7 Reasons to Choose Apache Pulsar over Apache Kafka

So why did we build our messaging service using Apache Pulsar?

At DataStax, our mission is to empower developers to build cloud-native distributed applications by making cloud-agnostic, high-performance messaging technology easily available to everyone. Developers want to write distributed applications or microservices but don’t want the hassle of managing complex message infrastructure or getting locked into a particular cloud vendor. They need a solution that just works. Everywhere.

A Case for Databases on Kubernetes from a Former Skeptic

Kubernetes is everywhere. Transactional apps, video streaming services, and machine learning workloads are finding a home on this ever-growing platform. But what about databases? If you had asked me this question five years ago, the answer would have been a resounding “No!” — based on my experience in development and operations. In the following years, as more resources emerged for stateful applications, my answer would have changed to “Maybe,” but always with a qualifier: “It’s fine for development or test environments…” or “If the rest of your tooling is Kubernetes-based, and you have extensive experience…”

But how about today? Should you run a database on Kubernetes? With complex operations and the requirements of persistent, consistent data, let’s retrace the stages in the journey to my current answer: “In a cloud-native environment? Yes!”

The End of the Beginning for Apache Cassandra

Editor’s note: This story originally ran on July 27, 2021, the day that Apache Cassandra 4.0 was released.

Today is a big day for those of us in the Apache Cassandra community. After a long uphill climb, Apache Cassandra 4.0 has finally shipped. I say finally because it has at times seemed like an elusive goal. I’ve been involved in the Cassandra project for almost 10 years now, and I have seen a lot of ups and downs. This day marks more than a version number: it’s an important milestone in the lifecycle of a database project that has come into its own as a database used around the world. The 4.0 release is not only the most stable in the history of Cassandra, it’s quite possibly the most stable release of any database. Now it’s ready to launch into the next 10 years of cloud-native data; it has the computer science and hard-won history to make a huge impact. Today’s milestone is the end of the beginning.

Aggregate Functions in Stargate’s GraphQL API

A new release of Stargate.io has been applied to Astra DB, and it includes an exciting new feature: aggregate functions! If you’re not familiar with aggregate functions, they look at the data as a whole and compute a single result using functions like min(), max(), sum(), count(), and avg().

Until now, aggregate functions were only available using cqlsh (the CQL Shell). However, with the Stargate 1.0.25 release they are now also available through the GraphQL API. In this blog entry, I’ll walk you through getting early access to this exciting new functionality in Stargate and show you how to set up everything you need to test your own aggregate queries.
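For context, here is what those aggregates look like on the CQL side, issued through the Python driver rather than cqlsh. The keyspace, table, and partition key are illustrative placeholders, and a connection to Astra DB would use the driver’s secure connect bundle rather than a local contact point; the post itself covers expressing the equivalent queries through the GraphQL API.

```python
# The CQL side of aggregate functions, shown via the Python driver instead of
# cqlsh. Keyspace, table, and partition key are illustrative placeholders.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")

row = session.execute(
    """
    SELECT count(value) AS cnt,
           min(value)   AS lo,
           max(value)   AS hi,
           avg(value)   AS mean,
           sum(value)   AS total
    FROM readings
    WHERE sensor_id = %s
    """,
    ("sensor-42",),
).one()

print(row.cnt, row.lo, row.hi, row.mean, row.total)
```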

Taking Your Database Beyond a Single Kubernetes Cluster

Global applications need a data layer that is as distributed as the users they serve. Apache Cassandra has risen to this challenge, handling data needs for the likes of Apple, Netflix, and Sony. Traditionally, managing data layers for a distributed application meant dedicated teams handling the deployment and operations of thousands of nodes, both on-premises and in the cloud.

To alleviate much of the load felt by DevOps teams, we evolved a number of these practices and patterns in K8ssandra, leveraging the common control plane afforded by Kubernetes (K8s). There has been a catch, though: running a database (or indeed any application) across multiple regions or K8s clusters is tricky without proper care and planning up front.

Kubernetes and Apache Cassandra: What Works (and What Doesn’t)

“I need it now and I need it reliable.”

– ANYONE WHO HASN’T DEPLOYED APPLICATION INFRASTRUCTURE

If you’re on the receiving end of this statement, we in the K8ssandra community understand you. We do have reason for hope, though: recent surveys have shown that Kubernetes (K8s) is growing in popularity, not only because it’s powerful technology, but because it actually delivers on reducing the toil of deployment.

Multi-Cluster Cassandra Deployment With Google Kubernetes Engine (Pt. 2)

This is the second in a series of posts examining patterns for using K8ssandra to create Cassandra clusters with different deployment topologies.

In the first article in this series, we looked at how you could create a Cassandra cluster with two datacenters in a single cloud region, using separate Kubernetes namespaces in order to isolate workloads. For example, you might want to create a secondary Cassandra datacenter to isolate a read-heavy analytics workload from the datacenter supporting your main application.