Loading Vector Data Into Cassandra in Parallel Using Ray

This blog will delve into the nuances of combining the prowess of DataStax Astra with the power of Ray and is a companion to this demo on GitHub. We’ll explore the step-by-step procedure, the pitfalls to avoid, and the advantages this dynamic duo brings to the table. Whether you’re a data engineer, a developer looking to optimize your workflows, or just a tech enthusiast curious about the latest in data solutions, this guide promises insights aplenty. Soon, you’ll be able to use Cassandra 5.0 in place of AstraDB in this demo. For now, AstraDB is a quick way to get started with a vector-search-capable Cassandra database!

Introduction

Vector search is a technology that works by turning the data we are interested in into numerical representations of locations in a coordinate system, called vectors. Similar items end up with vector locations close to each other in this space, so we can take an item and find the items most similar to it. A database that holds and operates on vectors is called a vector store. This functionality is coming in Cassandra 5.0, which will be released soon; to preview it, we can make use of DataStax Astra. In this case, the items are bits of text that have been embedded: embedding passes text through a machine-learning model that returns a vector representing the data. You can think of embedding as translating data from real text into vectors.
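To make the idea concrete, here is a minimal sketch of similarity search over toy three-dimensional "embeddings." This is purely illustrative: real embedding models return hundreds or thousands of dimensions, and a vector store such as Astra uses an index rather than the linear scan shown here.

```python
import math

def cosine_similarity(a, b):
    # Similar items have vectors that point in similar directions,
    # so their cosine similarity is close to 1.0.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" -- hypothetical values, not real model output.
embeddings = {
    "cats are great pets": [0.9, 0.1, 0.2],
    "dogs are loyal companions": [0.8, 0.2, 0.3],
    "the stock market fell today": [0.1, 0.9, 0.4],
}

def most_similar(query_vector, store):
    # A vector store does this at scale with an index; here we scan.
    return max(store, key=lambda text: cosine_similarity(query_vector, store[text]))

query = [0.85, 0.15, 0.25]  # pretend this embeds "I love my kitten"
print(most_similar(query, embeddings))
```

The query vector lands closest to the cat sentence, which is exactly the behavior a vector store exploits: nearness in the coordinate system stands in for semantic similarity.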

Loading Streaming Data Into Cassandra Using Spark Structured Streaming

When creating real-time data platforms, data streaming is a low-latency, high-throughput method of moving data. Where batch processing methods necessarily introduce delays in order to gather a batch worth of data, stream processing methods act on stream events as they occur, with as little delay as possible. In this blog and the associated repo, we discuss how streaming data can be made compatible with Cassandra, with Spark Structured Streaming as an intermediary. Cassandra is designed for high-volume interactions and is thus a great fit for streaming workflows. For simplicity and speed, we are using DataStax’s AstraDB in this demo.
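The heart of the intermediary pattern is that Spark Structured Streaming hands each micro-batch of rows to a callback (via `foreachBatch`), and the callback writes them to Cassandra. The sketch below isolates that write step with an injected session object so the pattern is visible without a running cluster; the table and column names are made up for illustration, and a real job would write via the Spark Cassandra Connector or the DataStax driver.

```python
# Hypothetical statement; schema is illustrative, not from the demo repo.
INSERT_CQL = "INSERT INTO demo.events (id, ts, payload) VALUES (%s, %s, %s)"

def write_micro_batch(session, rows):
    """Write one micro-batch of (id, ts, payload) tuples to Cassandra.

    In Spark, this is the body of the foreachBatch callback.
    """
    for row in rows:
        session.execute(INSERT_CQL, row)
    return len(rows)

class FakeSession:
    """Stand-in for a cassandra-driver Session; records statements."""
    def __init__(self):
        self.writes = []
    def execute(self, cql, params):
        self.writes.append((cql, params))

session = FakeSession()
count = write_micro_batch(session, [
    (1, "2024-01-01T00:00:00Z", "login"),
    (2, "2024-01-01T00:00:01Z", "click"),
])
```

Injecting the session keeps the write logic testable on its own, which matters in streaming jobs where the callback runs on executors far from your laptop.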

Introduction 

Streaming data is normally a poor fit for standard SQL and NoSQL databases, since a stream can consist of differently structured messages differentiated only by timestamp. With advances in database technologies and continuous development, many databases have evolved to better accommodate streaming use cases. Additionally, there are specialized systems, such as time-series databases and stream processors, designed explicitly for handling streaming data with high efficiency and low latency.

Predicting Stock Data With Cassandra and TensorFlow

The scenario this blog covers is time-series forecasting of a specific stock price. The problem itself is common and widely known. That said, this is a technology demo and is in no way intended as market advice. The purpose of the demo is to show how Astra and TensorFlow can work together to do time-series forecasting. The first step is setting up your database.
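For the database setup step, the sketch below shows the standard cassandra-driver shape for connecting to Astra with a secure connect bundle, plus a helper that builds a table for daily prices. The table name and schema are illustrative assumptions, not the demo's actual schema; the driver imports live inside the function so the file can be read and the helper tested without a live database.

```python
def connect_to_astra(bundle_path, client_id, client_secret):
    """Open a session against an Astra database.

    Requires the cassandra-driver package and a downloaded
    secure connect bundle; this is the driver's documented
    `cloud` connection shape for Astra.
    """
    from cassandra.cluster import Cluster
    from cassandra.auth import PlainTextAuthProvider

    cluster = Cluster(
        cloud={"secure_connect_bundle": bundle_path},
        auth_provider=PlainTextAuthProvider(client_id, client_secret),
    )
    return cluster.connect()

def create_prices_table_cql(keyspace, table="stock_prices"):
    # Hypothetical schema for daily close prices; adjust to your data.
    return (
        f"CREATE TABLE IF NOT EXISTS {keyspace}.{table} ("
        "symbol text, trade_date date, close double, "
        "PRIMARY KEY (symbol, trade_date))"
    )
```

Partitioning by symbol with the trade date as a clustering column keeps each stock's history together and ordered, which is the access pattern a forecasting job wants.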

Getting Test Data

Now that you have the database, credentials, and bundle file on your machine, let’s get the data needed to run this tutorial. This data is open source and available on many platforms, one of which is Kaggle, where people have already done the work of gathering it via APIs.
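Once the price history is in hand, a time-series model such as a TensorFlow LSTM trains on sliding windows: a fixed number of past values as input, the next value as the target. A minimal sketch of that windowing step, using made-up closing prices:

```python
def sliding_windows(prices, window=5):
    """Turn a price series into (input_window, next_value) training pairs."""
    pairs = []
    for i in range(len(prices) - window):
        # The model sees `window` consecutive closes and learns to
        # predict the close that immediately follows them.
        pairs.append((prices[i:i + window], prices[i + window]))
    return pairs

closes = [10.0, 10.5, 10.2, 10.8, 11.0, 11.3, 11.1]  # toy data
pairs = sliding_windows(closes, window=3)
# first pair: ([10.0, 10.5, 10.2], 10.8)
```

In the actual demo these pairs would be converted to arrays and fed to the model's fit routine; the windowing logic itself is the same regardless of framework.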

Training a Handwritten Digits Classifier in PyTorch With Apache Cassandra Database

Handwritten digit recognition is one of the classic tasks undertaken by students when learning the basics of Neural Networks and Computer Vision. The basic idea is to take a number of labeled images of handwritten digits and use those to train a neural network that is able to classify new unlabeled images. For this demo, we show how to use data stored in a large-scale database as our training data. We also explain how to use that same database as a basic model registry. This addition can enable model serving, as well as future retraining.
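The model-registry idea reduces to a simple round trip: serialize the trained weights to bytes, store them in a blob column keyed by model name and version, and read them back for serving. With PyTorch you would serialize `model.state_dict()` via `torch.save` into a `BytesIO` buffer; the sketch below uses plain pickle on a dict so the round trip runs without torch installed, and the registry schema mentioned in the comment is a hypothetical example.

```python
import io
import pickle

def serialize_weights(weights):
    """Serialize a weights mapping to bytes for a Cassandra blob column.

    e.g. INSERT INTO models (name, version, weights) VALUES (%s, %s, %s)
    -- hypothetical schema, shown only to place the bytes in context.
    """
    buffer = io.BytesIO()
    pickle.dump(weights, buffer)
    return buffer.getvalue()

def deserialize_weights(blob):
    """Restore the weights mapping from the stored blob."""
    return pickle.load(io.BytesIO(blob))

# Toy "state dict" standing in for real layer tensors.
weights = {"layer1.weight": [[0.1, 0.2], [0.3, 0.4]], "layer1.bias": [0.0, 0.1]}
blob = serialize_weights(weights)
restored = deserialize_weights(blob)
```

Because the blob is opaque to the database, the same pattern works for any framework; only the serialization call changes.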

Introduction

MNIST is a family of datasets that share a particular format, useful for educating students about neural networks while presenting them with diverse problems. The MNIST datasets used in this demo are collections of 28x28-pixel grayscale images, with the classifications 0-9 as labels. This demo works with both the original MNIST handwritten digits dataset and the MNIST fashion dataset.

Apache Flink

Apache Flink is an open-source, unified stream-processing and batch-processing framework developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming data-flow engine written in Java and Scala. Flink executes arbitrary dataflow programs in a data-parallel and pipelined manner. This tutorial will show you step-by-step how to use Astra as a sink for results computed by Flink. 

This code is intended as a fairly simple demonstration of how to enable an Apache Flink job to interact with Astra. There is certainly room for optimization here. 
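Whatever the language of the job (the demo itself targets Flink's Java API), a Flink sink follows an open/invoke/close lifecycle: open() acquires the connection once per task, invoke() runs once per record, and close() releases resources. This Python sketch mirrors that contract with an injected session factory so it runs without a cluster; it is an illustration of the lifecycle, not Flink's actual API.

```python
class CassandraSink:
    """Mirrors the shape of a Flink sink writing results to Cassandra."""

    def __init__(self, session_factory, insert_cql):
        self.session_factory = session_factory
        self.insert_cql = insert_cql
        self.session = None

    def open(self):
        # Called once when the task starts: connect to Astra/Cassandra.
        self.session = self.session_factory()

    def invoke(self, record):
        # Called once per computed result flowing out of the job.
        self.session.execute(self.insert_cql, record)

    def close(self):
        # Called when the task shuts down: release the connection.
        if self.session is not None:
            self.session.shutdown()
            self.session = None

class FakeSession:
    """Stand-in session so the lifecycle can be exercised anywhere."""
    def __init__(self):
        self.rows, self.closed = [], False
    def execute(self, cql, params):
        self.rows.append(params)
    def shutdown(self):
        self.closed = True

fake = FakeSession()
# Hypothetical results table; the demo defines its own schema.
sink = CassandraSink(lambda: fake,
                     "INSERT INTO demo.results (word, count) VALUES (%s, %s)")
sink.open()
sink.invoke(("cassandra", 42))
sink.close()
```

Connecting in open() rather than per record is the key optimization the lifecycle enables: one session per task instead of one per event.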

Building a Cassandra To-Do List ChatGPT Plugin

ChatGPT plugins offer a way to extend the capabilities of OpenAI's ChatGPT by integrating custom functionalities directly into the conversational AI interface. These plugins enable users to interact with specialized features, transforming ChatGPT into a versatile tool for various tasks. Think of a ChatGPT plugin as a handy tool belt that equips OpenAI's ChatGPT with specialized superpowers. Just like adding a new gadget to your arsenal, a plugin empowers ChatGPT to perform specific tasks seamlessly within the conversation. 

In this blog, we'll dive into implementing the Cassandra to-do list ChatGPT plugin, which acts as a virtual personal assistant for managing your to-do list. It's like having a dedicated task organizer right beside you during your AI-powered conversations. With this plugin, you can effortlessly create, view, and delete tasks, bringing a new level of productivity and organization to your chat-based interactions with ChatGPT.
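The plugin's three operations map naturally onto three CQL statements. The keyspace, table, and column names below are illustrative assumptions rather than the plugin's real schema, and the session is injected so the logic can be exercised with a stand-in.

```python
import uuid

# Hypothetical schema: todos.tasks (id uuid PRIMARY KEY, task text).
ADD_CQL = "INSERT INTO todos.tasks (id, task) VALUES (%s, %s)"
LIST_CQL = "SELECT id, task FROM todos.tasks"
DELETE_CQL = "DELETE FROM todos.tasks WHERE id = %s"

def add_task(session, task):
    """Create a task and return its generated id."""
    task_id = uuid.uuid4()
    session.execute(ADD_CQL, (task_id, task))
    return task_id

def list_tasks(session):
    """View all tasks as (id, task) pairs."""
    return list(session.execute(LIST_CQL, ()))

def delete_task(session, task_id):
    """Delete a task by id."""
    session.execute(DELETE_CQL, (task_id,))

class FakeSession:
    """In-memory stand-in for a cassandra-driver Session."""
    def __init__(self):
        self.tasks = {}
    def execute(self, cql, params):
        if cql == ADD_CQL:
            self.tasks[params[0]] = params[1]
        elif cql == LIST_CQL:
            return [(tid, t) for tid, t in self.tasks.items()]
        elif cql == DELETE_CQL:
            self.tasks.pop(params[0], None)

session = FakeSession()
tid = add_task(session, "write blog post")
delete_task(session, tid)
```

In the actual plugin these functions sit behind HTTP endpoints that ChatGPT calls; the database interaction itself is no more complicated than this.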

CassIO: The Best Library for Generative AI Inspired by OpenAI

If you’re a frequent user of ChatGPT, you know the tendency it has to wander off into what is known as hallucinations: a great collection of statistically correct words that have no basis in reality. A few months ago, a prompt about using Apache Cassandra for large language models (LLMs) and LangChain resulted in a curious response. ChatGPT reported that not only was Cassandra a good tool choice when creating LLMs, but OpenAI used Cassandra with an MIT-licensed Python library they called CassIO. Into the rabbit hole we went, and through more prompting, ChatGPT described many details about how CassIO was used. It even included some sample code and a website. Subsequent research found no evidence of CassIO outside of ChatGPT responses, but the seed was sown. If this library didn’t exist, it needed to, and we started work on it shortly after.

Best hallucination ever. 

The Serverless Database You Really Want

Eventually, every site reliability engineer (SRE) faces the dreaded part of the job: capacity planning. You know, the dance between all the stakeholders when deploying your applications. Did engineering really simulate the right load, and do we understand how the application scales? Did product managers accurately estimate the amount of usage? Did we make architectural decisions that will keep us from meeting our SLA goals? And then the question that everyone will have to answer eventually: how much is this going to cost? This forces SREs to assume the roles of engineer, accountant, and fortune teller.

The large cloud providers understood this a long time ago, and so the term “cloud economics” was coined. Essentially this means: rent everything and only pay for what you need. I would say this message worked because we all love some cloud. It’s not a fad either. SREs can eliminate a lot of the downside of an initial capacity estimate that was a little off. Being wrong is no longer devastating. Just add more of what you need, and in the best cases, the services scale themselves, giving everyone a nice night’s sleep. All this without provisioning a server, which gave rise to the term “serverless.”

The End of the Beginning for Apache Cassandra

Editor’s note: This story originally ran on July 27, 2021, the day that Apache Cassandra 4.0 was released.

Today is a big day for those of us in the Apache Cassandra community. After a long uphill climb, Apache Cassandra 4.0 has finally shipped. I say finally because it has at times seemed like an elusive goal. I’ve been involved in the Cassandra project for almost 10 years now and I have seen a lot of ups and downs. So I feel this day marks an important milestone that isn’t just a version number. This is an important milestone in the lifecycle of a database project that has come into its own as an important database used around the world. The 4.0 release is not only the most stable release in the history of Cassandra, it’s quite possibly the most stable release of any database. Now it’s ready to launch into the next 10 years of cloud-native data; it has the computer science and hard-won history to make a huge impact. Today’s milestone is the end of the beginning.

Kubernetes and Apache Cassandra: What Works (and What Doesn’t)

“I need it now and I need it reliable.”

– ANYONE WHO HASN’T DEPLOYED APPLICATION INFRASTRUCTURE

If you’re on the receiving end of this statement, we in the K8ssandra community understand you. And we do have reason for hope: recent surveys have shown that Kubernetes (K8s) is growing in popularity, not only because it’s powerful technology, but because it actually delivers on reducing the toil of deployment.

The Future of Cloud-Native Databases Begins With Apache Cassandra 4.0

“Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust.” 

This was the first line of the highly impactful paper titled “Dynamo: Amazon’s Highly Available Key-Value Store.” Published in 2007, it was written at a time when the status quo of database systems was not working for the massive explosion of internet-based applications. A team of computer engineers and scientists at Amazon completely re-thought the idea of data storage in terms of what would be needed for the future, with a firm footing in the computer science of the past. 

Survey Finds Data on Kubernetes Is No Longer a Pipe Dream

For people who work in infrastructure and application development, the pace of change is quick. Finish one project and it’s on to the next. Each iteration requires evaluating whether the right technology is being used and whether it provides a new advantage. Kubernetes has been on the fast track of continuous evaluation: new projects and methodologies are continuously emerging, and it can be hard to keep up. Then there is the question of running stateful services.

The Data on Kubernetes community has released a report titled “Data on Kubernetes 2021” to give us a snapshot of where our industry sits with stateful workloads. Over 500 executives and tech leaders were asked some very direct and insightful questions about how they use Kubernetes. It turns out there were a lot of surprising findings, some I would never have predicted. Let’s dig into some of the highlights that stood out to me.