The Distributed Data Problem

Today, online retailers sell millions of products and services to customers all around the world. Never was this more evident than in 2020, when COVID-19 restrictions all but eliminated visits to brick-and-mortar stores and in-person transactions. Of course, consumers still needed to purchase food, clothing, and other essentials, and as a result, worldwide digital sales climbed to $4.2 trillion, up $900 billion from just a year prior.

Was it enough for those retailers to have robust websites and mobile apps to keep their customers from shopping with competitors? Unfortunately, no. Looking across the eCommerce landscape of 2020, there were clear winners and losers. But what was the deciding factor?

Why Kubernetes Is the Best Technology for Running a Cloud-Native Database

We’ve been talking about migrating workloads to the cloud for a long time, but a look at the application portfolios of many IT organizations demonstrates that there’s still a lot of work to be done. In many cases, challenges with persisting and moving data in clouds continue to be the key limiting factor slowing cloud adoption, despite the fact that databases in the cloud have been available for years. 

For this reason, there has been a surge of recent interest in data infrastructure that is designed to take maximum advantage of the benefits that cloud computing provides. A cloud-native database achieves the goals of scalability, elasticity, resiliency, observability, and automation; the K8ssandra project is a great example. It packages Apache Cassandra and supporting tools into a production-ready Kubernetes deployment.

Kubernetes Data Simplicity: Getting Started With K8ssandra

You might have heard about the K8ssandra project and want to start contributing, or maybe you want to start using all of its features. If you aren’t familiar with K8ssandra (pronounced like “Kate Sandra”), you can read this overview before digging into the developer activities in this post.

In a nutshell, K8ssandra is an open-source distribution of Apache Cassandra™ for Kubernetes, which includes a rich set of trusted open-source services and tooling. K8ssandra comes with handy features that are baked-in and pluggable, which allows for flexible deployment and configuration.

Create a Full-Stack App Using Nuxt.js, NestJS, and DataStax Astra DB (With a Little Help From GitHub Copilot)

Building a full-stack application can be daunting because you have to think not only about how the frontend will display the data but also about where the data comes from and how it's stored. However, it's not as hard as you might think to get the basics of a full-stack application up and running.

If you want to create a full-stack application, complete with dynamic data retrieved from a cloud database by an API, then watch the tutorial below, created by Eddie Jaoude. In his tutorial, Eddie shows you how to do it in less than an hour using Nuxt.js with VuetifyJS for the frontend, NestJS to create RESTful APIs, and DataStax’s Astra DB for the cloud database service. Also, you’ll use GitHub Copilot as your AI-powered pair programmer.

Getting Started With Apache Cassandra

Apache Cassandra® is a distributed NoSQL database that is used by the vast majority of Fortune 100 companies. By helping companies like Apple, Facebook, and Netflix process large volumes of fast-moving data in a reliable, scalable way, Cassandra has become essential for the mission-critical features we rely on today.
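If you'd like to jump straight into code, here's a minimal sketch of connecting to a local Cassandra node with the community gocql driver. The contact point, `demo` keyspace, and `users` table are placeholders for your own setup:

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Contact point, keyspace, and table are hypothetical; adjust to your cluster.
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "demo"
	cluster.Consistency = gocql.Quorum

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Assumes: CREATE TABLE users (id uuid PRIMARY KEY, name text);
	id := gocql.TimeUUID()
	if err := session.Query(
		`INSERT INTO users (id, name) VALUES (?, ?)`, id, "Kate Sandra",
	).Exec(); err != nil {
		log.Fatal(err)
	}

	var name string
	if err := session.Query(
		`SELECT name FROM users WHERE id = ?`, id,
	).Scan(&name); err != nil {
		log.Fatal(err)
	}
	fmt.Println("read back:", name)
}
```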

In this post, we will:

Developing an Enterprise-Level Apache Cassandra Sink Connector for Apache Pulsar

When DataStax started investing in streaming with Apache Pulsar™, we knew that one of the first things people would want to do was connect existing enterprise data sources to Apache Cassandra™ using Pulsar.

Apache Pulsar has a powerful framework called Pulsar IO to enable this kind of use case, and at DataStax we already had a best-in-class Kafka Connect Sink that enables you to store structured data coming from one or more Kafka topics into DataStax Enterprise, Apache Cassandra, and Astra.

How To Connect Stateful Workloads Across Kubernetes Clusters

One of the biggest selling points of Apache Cassandra™ is its shared-nothing architecture, making it an ideal choice for deployments that span multiple physical data centers. So when our single-region Cassandra-as-a-service offering reached maturity, we naturally started looking into offering it cross-region and cross-cloud. One of the biggest challenges in providing a solution that spans multiple regions and clouds is correctly configuring the network so that Cassandra nodes in different data centers can communicate with each other successfully, even as individual nodes are added, replaced, or removed.

From the start of the cloud journey at DataStax, we selected Kubernetes as our orchestration platform, so our search for a networking solution started there. While we’ve benefited immensely from the ecosystem and have our share of war stories, this time we chose to forge our own path, landing on ad-hoc overlay virtual application networks (how’s that for a buzzword soup?). In this post, we’ll go over how we arrived at our solution, its technical overview, and a hands-on example with the Cassandra operator.

Build a Crypto Price Tracker Using Node.js and Cassandra

Since the big bang in the data technology landscape a decade and a half ago, which gave rise to technologies like Hadoop that cater to the four ‘V’s (volume, variety, velocity, and veracity), there has been an uptick in the use of databases with specialized capabilities to cater to different types of data and usage patterns. You can now see companies using graph databases, time-series databases, document databases, and others for different customer and internal workloads.

Apache Cassandra™ is a wide-column NoSQL database that is ideal for append-only, write-intensive workloads that capture data from IoT sensors, GPS devices, transaction logs, and other time-series applications. Many of these applications need to be coupled with visualization engines for creating reports and dashboards. Since most visualization libraries are written in JavaScript, using Node.js to interact with both the database and the visualization engine is a natural fit.
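The post builds the tracker in Node.js, but the write pattern itself is driver-agnostic. As a rough sketch of the append-only, time-series shape described above (shown here with the Go gocql driver; the keyspace and `prices` table are hypothetical), each price tick becomes one row, partitioned by symbol and clustered by time:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // replace with your contact points
	cluster.Keyspace = "tracker"             // hypothetical keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Assumes: CREATE TABLE prices (symbol text, ts timestamp, price double,
	//          PRIMARY KEY ((symbol), ts)) WITH CLUSTERING ORDER BY (ts DESC);
	// Append-only: every tick is a new row, never an update.
	if err := session.Query(
		`INSERT INTO prices (symbol, ts, price) VALUES (?, ?, ?)`,
		"BTC", time.Now(), 47123.55,
	).Exec(); err != nil {
		log.Println("insert failed:", err)
	}
}
```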

Apache Cassandra 4.0: Taming Tail Latencies With Java 16 ZGC

Like so many others in the Apache Cassandra community, I’m extremely excited to see that the 4.0 release is finally here. There are many, many improvements to Cassandra 4.0. One enhancement that is more important than it might look is the addition of support for Java versions 9 and up. This was not trivial, because Java 9 made changes to some internal APIs that the most performance-oriented Java projects like Cassandra relied on (you can read more about this here).

This is a big deal because with Cassandra 4.0, you not only get the direct performance improvements added by the Apache Cassandra committers, but you also unlock the ability to take advantage of seven years of improvements in the JVM (Java Virtual Machine) itself.

Best Practices for Data Pipeline Error Handling in Apache NiFi

According to a McKinsey report, “the best analytics are worth nothing with bad data”. We as data engineers and developers know this simply as "garbage in, garbage out". Today, with the success of the cloud, data sources are many and varied. Data pipelines help us to consolidate data from these different sources and work on it. However, we must ensure that the data used is of good quality. As data engineers, we mold data into the right shape, size, and type with high attention to detail.

Fortunately, we have tools such as Apache NiFi, which allow us to design and manage our data pipelines, reducing the amount of custom programming and increasing overall efficiency. Yet, when it comes to creating them, a key and often neglected aspect is minimizing potential errors.

Build a TikTok Clone With a Twist

It is a really great time to be a developer. 

We have tons of APIs integrated within great tools for building dynamic, full-stack apps. If you are a developer, you're probably using technologies like schemaless data stores, serverless architectures, JSON APIs, and/or the GraphQL language.

Improving Apache Cassandra’s Front Door and Backpressure

We have improved Apache Cassandra's ability to handle high-throughput workloads, while having enough safeguards in place to protect itself from potentially going out of memory. To better explain the change we have made, let's look at a high level at how Cassandra processed an incoming request before the fix, what we changed, and the new configuration knobs that are now available.

How Inbound Requests Were Handled Before

Let us take the scenario of a client application sending requests to a C* cluster. For the purposes of this blog, let us focus on one of the C* coordinator nodes.

Designing Microservices With Cassandra

As a thriving software development technique, microservices — and their underlying architecture — remain foundational to cloud-native applications. Apache Cassandra is a natural complement given that it's a database designed for the cloud. This Refcard examines the benefits of microservices architecture, demonstrates recommended data modeling techniques, and explains key microservice design principles for Cassandra using a sample hotel application.

Apache Cassandra

Distributed non-relational database Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors and is used at some of the most well-known, global organizations. This Refcard covers data modeling, Cassandra architecture, replication strategies, querying and indexing, libraries across eight languages, and more.

Performing Speculative Query Executions in Apache Cassandra With GoCQL

Speculative query executions can be a particularly valuable technique for addressing a number of Cassandra database issues — from faulty, slow, or unresponsive nodes, to network interruptions. Using speculative query execution allows a client to make database requests from multiple endpoints at the same time and have the requests compete to see which provides the quickest response. While setting up this race between requests does determine which node is most performant, performance isn’t usually the goal of a speculative query execution. Instead, the purpose is to make sure that queries receive successful server responses (streamlined execution time can certainly be a happy byproduct, however).

Important note before we go any further: speculative queries require the use of CPU and network resources, so it’s important to remember that the reliability or performance improvements they may yield aren’t without a cost.
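As a concrete illustration, here's roughly what this looks like with gocql's built-in `SimpleSpeculativeExecution` policy (the contact point, keyspace, and `users` table are placeholders). Note that the query must be marked idempotent; the driver will only race queries that are safe to run more than once:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // replace with your contact points
	cluster.Keyspace = "demo"                // hypothetical keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Start up to 2 extra attempts on other nodes, each launched if no
	// response has arrived within 100ms; the fastest responder wins.
	policy := &gocql.SimpleSpeculativeExecution{
		NumAttempts:  2,
		TimeoutDelay: 100 * time.Millisecond,
	}

	var name string
	err = session.Query(`SELECT name FROM users WHERE id = ?`, gocql.TimeUUID()).
		Idempotent(true). // speculative execution only applies to idempotent queries
		SetSpeculativeExecutionPolicy(policy).
		Scan(&name)
	if err != nil {
		log.Println("query failed:", err)
	}
}
```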

Cassitory: Redundancy Tables Within Cassandra

After you've worked with Cassandra for a while, and when Apache Spark isn't a viable option, running different kinds of queries can become quite a challenge.

When you want to execute a query that isn't possible with the current table structure, the recommended Cassandra approach is to maintain redundancy tables, since storage is cheaper than memory.
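Done by hand, that denormalization looks something like the sketch below (gocql, with hypothetical `orders_by_user` and `orders_by_status` tables, each keyed for a different query). Keeping such pairs of tables in sync is the kind of bookkeeping Cassitory aims to take off your hands:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	cluster := gocql.NewCluster("127.0.0.1") // replace with your contact points
	cluster.Keyspace = "shop"                // hypothetical keyspace
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// The same order is written to two tables, each partitioned for a
	// different access pattern ("orders by user" vs. "orders by status").
	// A logged batch keeps the two redundant writes consistent with each other.
	orderID := gocql.TimeUUID()
	batch := session.NewBatch(gocql.LoggedBatch)
	batch.Query(`INSERT INTO orders_by_user (user_id, order_id, status) VALUES (?, ?, ?)`,
		"user-42", orderID, "PLACED")
	batch.Query(`INSERT INTO orders_by_status (status, order_id, user_id) VALUES (?, ?, ?)`,
		"PLACED", orderID, "user-42")
	if err := session.ExecuteBatch(batch); err != nil {
		log.Println("batch failed:", err)
	}
}
```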

The State of Databases 2019

I had the opportunity to hear Dinesh Joshi, Senior Software Engineer and Architect at Apple and an active member of the Apache Software Foundation, share his thoughts on The State of Databases 2019 while attending Percona Live in Austin, Texas.

Data is growing and will continue to do so. In 2019, humans will generate 40 zettabytes of data. Data continues to grow in importance from a business standpoint. Things like flight systems data for airlines and electronic medical records for hospitals need to be backed up, protected, and available at a moment's notice.