Data Modeling in Cassandra and Astra DB

What does it take to build an efficient and sound data model for Apache Cassandra and DataStax Astra DB? Where would one start? Are there any data modeling rules to follow? Can it be done consistently time and time again? The answers to these and many other questions can be found in the Cassandra data modeling methodology.

In this post, we present a high-level overview of the data modeling methodology for Cassandra and Astra DB, and share over half a dozen complete data modeling examples from various real-life domains. We apply the methodology to create Cassandra and Astra DB data models for IoT, messaging data, digital library, investment portfolio, time series, shopping cart, and order management. We even provide our datasets and queries for you to try.

5 Data Models for IoT

Apache Cassandra is a rock-solid choice for managing IoT and time series data at scale. The most popular use case of storing, querying, and analyzing time series generated by IoT devices in Cassandra is well-understood and documented. In general, a time series is stored and queried based on its source IoT device. However, there exists another class of IoT applications that require quick access to the most recent data generated by a collection of IoT devices based on a known state. The question that such applications need to answer is: Which IoT devices or sensors are currently reporting a specific state? In this blog post, we focus on this question and provide five possible data modeling solutions to efficiently answer it in Cassandra.

Introduction

The Internet of Things (IoT) is generating massive amounts of time series data that needs to be stored, queried, and analyzed. Apache Cassandra is an excellent choice for this task: not only because of its speed, reliability, and scalability but also because its internal data model has built-in support for time-ordered data.
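
To make this concrete, here is a minimal sketch of one possible table shape for the state question, using the DataStax Java driver. The keyspace, table, and column names are illustrative and not taken from the post's five solutions; partitioning by the reported state makes "which devices are currently reporting state X?" a single-partition read, while the clustering order keeps the newest reports first:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class DevicesByState {
    public static void main(String[] args) {
        // Connects to localhost with the driver defaults; an Astra DB
        // connection would use a secure connect bundle instead.
        try (CqlSession session = CqlSession.builder().withKeyspace("iot").build()) {

            // Partition by state; cluster by time, newest first.
            session.execute(
                "CREATE TABLE IF NOT EXISTS devices_by_state ("
                    + " state      text,"
                    + " updated_at timestamp,"
                    + " device_id  uuid,"
                    + " PRIMARY KEY ((state), updated_at, device_id))"
                    + " WITH CLUSTERING ORDER BY (updated_at DESC, device_id ASC)");

            // The question from the post: which devices currently report 'ALERT'?
            session.execute(
                "SELECT device_id, updated_at FROM devices_by_state"
                    + " WHERE state = 'ALERT' LIMIT 100")
                .forEach(row -> System.out.printf("%s at %s%n",
                    row.getUuid("device_id"), row.getInstant("updated_at")));
        }
    }
}
```

Note the trade-off in a design like this: when a device changes state, its old row must be removed or expired (for example, via a delete or a TTL), and handling that cleanly is the kind of concern that separates one candidate model from another.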

Synchronization Methods for Many-To-Many Associations

The many-to-many association is common in data modeling. In JPA entities, it is implemented with collections that store the associated entities from the other side of the association. To keep the collections on both sides consistent, developers usually implement data synchronization methods. This article highlights common issues that arise when adding synchronization methods to many-to-many bidirectional associations in JPA entities.

Many-To-Many Bidirectional Associations: Why Synchronize?

Before speaking about sync methods, let's look at bidirectional associations and their implementation in JPA in detail. Imagine a blog application where we can mark every post with several tags. We can create two JPA entities for this application: Post and Tag, with appropriate attributes like ID, text, etc. To establish the association between one post and many tags, let's define a tags attribute of type Set<Tag> on the Post entity. Notice that each tag can be reused for more than one post, so we also create a posts attribute of type Set<Post> on the Tag entity. This is a bidirectional many-to-many association: we have references to several entities on both sides of the association. Our data model now looks like this:
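
Below is a minimal sketch of the two entities, assuming standard jakarta.persistence annotations. The addTag/removeTag helpers are the synchronization methods in question; their names are illustrative:

```java
import jakarta.persistence.*;
import java.util.HashSet;
import java.util.Set;

// Post.java — the owning side of the association.
@Entity
public class Post {
    @Id @GeneratedValue
    private Long id;
    private String text;

    @ManyToMany
    @JoinTable(name = "post_tag",
        joinColumns = @JoinColumn(name = "post_id"),
        inverseJoinColumns = @JoinColumn(name = "tag_id"))
    private Set<Tag> tags = new HashSet<>();

    // Synchronization methods: keep both in-memory sides consistent,
    // so the Set<Post> on Tag never disagrees with the Set<Tag> here.
    public void addTag(Tag tag) {
        tags.add(tag);
        tag.getPosts().add(this);
    }

    public void removeTag(Tag tag) {
        tags.remove(tag);
        tag.getPosts().remove(this);
    }
}

// Tag.java — the inverse side, mapped by the "tags" attribute above.
@Entity
public class Tag {
    @Id @GeneratedValue
    private Long id;
    private String name;

    @ManyToMany(mappedBy = "tags")
    private Set<Post> posts = new HashSet<>();

    public Set<Post> getPosts() { return posts; }
}
```

Calling post.addTag(tag) instead of mutating post.getTags() directly is what keeps the two collections in sync; the issues the article goes on to highlight arise around exactly this pattern.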

NoSQL Migration Essentials

Need help with your NoSQL migration? Look no further than our "NoSQL Migration Essentials" Refcard. We walk through the primary steps for moving out of a relational database, plus important design principles to understand and consider in your migration process.

Readers will review key concepts that range from denormalizing and modeling data to defining access patterns, designing primary keys and indexes, and creating an entity relationship diagram — all demonstrated with a simple site application example. As a bonus, readers can use the included JSON structure at the end to interact with a NoSQL playground.

MongoDB to Couchbase, Part 4: Data Modeling

To Embed or Not to Embed. That is the Question. - Hamlet

Data modeling is a well-defined and mature field in relational database systems. The model provides a consistent framework for application developers to add and manipulate data. Everything is great until you need to change, and this lack of schema flexibility was a key trigger for NoSQL database systems. As we've learned before, both MongoDB and Couchbase are JSON-based document databases. JSON gives developers schema flexibility; indexes, collections, and the query engine provide access paths to this data. The developer uses MQL in MongoDB and N1QL in Couchbase to query it. Let's compare the modeling methods in MongoDB and Couchbase. Note: This article is short because the modeling options are similar. That's a good thing. Some differences in modeling options, access methods, and optimizations are highlighted below.
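
To make the embed-or-reference question concrete, here is a minimal sketch of both shapes using the Couchbase Java SDK; the bucket, document keys, and field names are illustrative:

```java
import com.couchbase.client.java.Cluster;
import com.couchbase.client.java.Collection;
import com.couchbase.client.java.json.JsonArray;
import com.couchbase.client.java.json.JsonObject;

public class EmbedOrReference {
    public static void main(String[] args) {
        Cluster cluster = Cluster.connect("couchbase://127.0.0.1", "user", "password");
        Collection docs = cluster.bucket("shop").defaultCollection();

        // Embedded: line items live inside the order document. One read
        // fetches everything, at the cost of a larger, growing document.
        docs.upsert("order::1001", JsonObject.create()
            .put("customerId", "customer::42")
            .put("items", JsonArray.from(
                JsonObject.create().put("sku", "SKU-1").put("qty", 2),
                JsonObject.create().put("sku", "SKU-7").put("qty", 1))));

        // Referenced: the order above stores only the customer's key; the
        // customer document is fetched (or joined in N1QL) when needed.
        docs.upsert("customer::42", JsonObject.create().put("name", "Ada"));

        cluster.disconnect();
    }
}
```

As a rule of thumb, embedding favors read-one-document access patterns, while referencing keeps documents small and avoids duplicating data that changes independently.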

Introduction to Couchbase for Oracle Developers and Experts: Part 4: Data Modeling

There are three things important in the database world: Performance, Performance, and Performance. - Bruce Lindsay

Here’s Part 1, Part 2, and Part 3 of this series.

Let me start with a real-world example of the effect the right data model has on application performance. Here's an excerpt from a talk by Amadeus engineers on their customer experience management application (a traveler loyalty app), which they migrated from an enterprise RDBMS to Couchbase.

Best Practices for Transforming Data in Snowflake

The death of the star schema is not exaggerated. Gone are the days of all-encompassing data warehouse models and the 24-month projects to build them.

We live in a highly disruptive, event-driven world. New analytics are required almost daily to understand how our customers, business, and markets shift. A modern data stack using the speed, flexibility, and scalability of Snowflake needs to allow an organization to “model as you go” to answer critical business questions on the fly.

Migrating a Spacecraft Engineering Model in UML to a Knowledge Graph

Goal: Migrate the UML-based engineering model of a spacecraft to TypeQL

Why Do This Migration in the First Place?

The spacecraft lifecycle is roughly divided into seven consecutive design phases. Part of the early design phases deals with the feasibility of the intended mission. Feasibility is identified by assessing each design aspect that is needed to accomplish the specific mission.

This requires that engineers lay out all possible design options and iteratively go through them in relation to all the other engineering design options, ultimately ending up with a sound system solution.

Data Mining: Use Cases, Benefits, and Tools

In the last decade, advances in processing power and speed have allowed us to move from tedious and time-consuming manual practices to fast and easy automated data analysis. The more complex the data sets collected, the greater the potential to uncover relevant information. Retailers, banks, manufacturers, healthcare companies, etc., are using data mining to uncover the relationships between everything from price optimization, promotions, and demographics to how economics, risk, competition, and online presence affect their business models, revenues, operations, and customer relationships. Today, data scientists have become indispensable to organizations around the world as companies seek to achieve bigger goals than ever before with data science. In this article, you will learn about the main use cases of data mining and how it has opened up a world of possibilities for businesses.

Today, organizations have access to more data than ever before. However, making sense of the sheer volume of structured and unstructured data to implement improvements across the organization can be extremely difficult.

No-Code: “It’s a Trap!”

Gartner predicts that by 2023, over 50% of medium to large enterprises will have adopted a Low-code/No-code application as part of their platform development.

The proliferation of Low-code/No-code tooling can be partially attributed to the COVID-19 pandemic, which has put pressure on businesses around the world to rapidly implement digital solutions. However, adoption of these tools — while indeed accelerated by the pandemic — would have occurred either way.

Even before the pandemic, the largest, richest companies had already formed an oligopsony around the best tech talent and most advanced development tools. Low-Code/No-code, therefore, is an attractive solution for small and mid-sized organizations to level the playing field, and it does so by giving these smaller players the power to do more with their existing resources.

While these benefits are often realized in the short term, the long-term effect of these tools is often shockingly different. The promise of faster and cheaper delivery is the catch — or lure — inside this organizational mousetrap, whereas backlogs, vendor contracts, technical debt, and constant updates are the hammer.

So, what exactly is the No-Code trap, and how can we avoid it?

What is a No-Code Tool?

First, let's make sure we clear up any confusion regarding naming. So far, I have referred to Low-Code and No-Code as if they were one term. It’s certainly easy to confuse them — even large analyst firms seem to have a hard time differentiating between the two — and in the broader context of this article, both can lead to the same set of development pitfalls.

Under the magnifying glass, however, there are lots of small details and capabilities that differentiate Low-Code and No-Code solutions. Most of them aren’t apparent at the UI level, which is where much of the confusion between the two comes from.

In this section, I will spend a little bit of time exploring the important differences between those two, but only to show that when it comes to the central premise of this article they are virtually equivalent.

Low-Code vs. No-Code Tools

The goal behind Low-Code is to minimize the amount of coding necessary for complex tasks through a visual interface (such as Drag 'N' Drop) that integrates existing blocks of code into a workflow.

Skilled professionals have the potential to work smarter and faster with Low-Code tools because repetitive coding and duplicated work are streamlined. Through this, they can spend less time on the 80% of work that builds the foundation and focus more on optimizing the 20% that makes it different. Low-Code, therefore, takes on the role of an entry-level employee doing the grunt work for more senior developers/engineers.

No-Code has a very similar look and feel to Low-Code, but is different in one very important dimension. Where Low-Code is meant to optimize the productivity of developers or engineers that already know how to code (even if just a little), No-Code is built for business and product managers that may not know any actual programming languages. It is meant to equip non-technical workers with the tools they need to create applications without formal development training.

No-Code applications need to be self-contained: everything the No-Code vendor thinks the user may need is already built into the tool.

As a result, No-Code applications create a lot of restrictions in the long term in exchange for quick results in the short term. This is a great example of a 'deliberate-prudent' scenario in the context of the Technical Debt Quadrant, but more on this later.

Advantages of No-Code Solutions

The appeal of both Low-Code and No-Code is pretty obvious. By removing code, organizations can remove those who write it — developers — because they are expensive, in short supply, and fundamentally don’t produce things quickly.

The benefits of these two types of applications, in their best forms, can be pretty substantial:
  • Resources: Human capital is becoming increasingly scarce — and therefore expensive. This can stop a lot of ambitious projects dead in their tracks. Low-Code and No-Code tools minimize the amount of specialized technical skill needed to get an application off the ground, which means things can get done more quickly and at a lower cost.
  • Low Risk/High ROI: Security processes, data integrations, and cross-platform support are all built into Low-Code and No-Code tools, meaning less risk and more time to focus on your business goals.
  • Moving to Production: Similarly, for both types of tools a single click is all it takes to send or deploy a model or application you built to production.
Looking at these advantages, it is no wonder that both Low-Code and No-Code have been taking industries by storm recently. While distinctly different in terms of users, they serve the same goal — that is to say, faster, safer, and cheaper deployment. Given these similarities, both terms will be grouped together under the 'No-Code' term for the rest of this article unless otherwise specified.

List of No-Code Data Tools

So far, we have covered the applications of No-Code in a very general way, but for the rest of this article, I would like to focus on data modeling. No-Code tools are prevalent in software development but have also started to take hold in this space, and some applications even claim to be an alternative to SQL and other querying languages (crazy, right?!). My reasons for focusing on this are twofold:
Firstly, there is a lot of existing analysis of this problem for software development and very little for data modeling. Secondly, this is also the area in which I have the most expertise.
Now let's take a look at some of the vendors that provide No-Code solutions in this space. This is in no way a complete list, and these tools are, for the most part, not exclusively built for data modeling.

1. No-Code Data Modeling in Power BI

Power BI was created by Microsoft and aims to provide interactive visualizations and business intelligence capabilities to all types of business users. Its simple interface is meant to allow end-users to create their own reports and dashboards through a number of features, including data mapping, transformation, and dashboard visualization. Power BI does support some R coding capabilities for visualization, but when it comes to data modeling, it is a true No-Code tool.

2. Alteryx as a Low-Code Alternative

Alteryx is meant to make advanced analytics accessible to any data worker. To achieve this, it offers several data analytics solutions. Alteryx specializes in self-service analytics with an intuitive UI, and its offerings can be used as Extract, Transform, Load (ETL) tools within its own framework. Alteryx allows data workers to organize their data pipelines through custom features and SQL code blocks. As such, it is easily identified as a Low-Code solution.

3. Is Tableau a No-Code Data Modeling Solution?

Tableau is a visual analytics platform and a direct competitor to Power BI. It was recently acquired by Salesforce, which is now hoping to 'transform the way we use data to solve problems—empowering people and organizations to make the most of their data.' It is also a pretty obvious No-Code platform that is meant to appeal to all types of end-users. As of now, it offers fewer tools for data modeling than Power BI, but that is likely to change in the future.

4. Looker is a No-Code Alternative to SQL

Looker is a business intelligence software and big data analytics platform that promises to help you explore, analyze, and share real-time business analytics easily. Very much in line with Tableau and Power BI, it aims to make non-technical end-users proficient in a variety of data tasks such as transformation, modeling, and visualization.

You might be wondering why I am including so many BI/Visualization platforms when talking about potential alternatives to SQL. After all, these tools are only set up to address an organization's reporting needs, which constitute only one of the use cases for data queries and SQL. This is certainly a valid point, so allow me to clarify my reasoning a bit more.

While it is true that reporting is only one of many potential uses for SQL, it is nevertheless an extremely important one. There is a good reason why there are so many No-Code BI tools on the market — to address growing demand from enterprises around the world — and therefore, it is worth taking a closer look at their almost inevitable shortcomings.

Building a Scalable E-Commerce Data Model

Introduction

If selling products online is a core part of your business, then you need to build an e-commerce data model that’s scalable, flexible, and fast. Most off-the-shelf providers like Shopify and BigCommerce are built for small stores selling a few million dollars in orders per month, so many e-commerce retailers working at scale start to investigate creating a bespoke solution.

Enterprise Data Management: Stick to the Basics

Lots of people have increasing volumes of data and are trying to run data management programs to better organize it. Interestingly, their problems are pretty much the same across different sectors and industries, and data management helps them configure solutions.

The fundamentals of enterprise data management (EDM), which one uses to tackle these kinds of initiatives, are the same whether one is in the health sector, a telco, a travel company, or a government agency. The fundamental practices that one needs to follow to manage data are similar from one industry to another.

Data Modeling in Salesforce and Heroku Data Services

This is the third article documenting what I’ve learned from a series of 10 Trailhead Live video sessions on Modern App Development on Salesforce and Heroku.

In these articles I’m walking you through how to combine Salesforce with Heroku to build an “eCars” app —  a sales and service application for a fictitious electric car company (“Pulsar”) that allows users to customize and buy cars, service techs to view live diagnostic info from the car, and more. In case you missed my first article, you can find the link to it below and start from the beginning. Otherwise, if you’re specifically looking for data modeling, you’re in the right place.

Data Modeling in Cassandra

In relational data models, we model a relation/table for every object in the domain. This is not exactly the case in Cassandra. This post elaborates on the aspects we need to consider while doing data modeling in Cassandra. The following is a rough overview of Cassandra data modeling.

Conceptual data modeling and application queries are the two inputs to be considered when building the model. Conceptual data modeling remains the same for any kind of modeling (be it for a relational database or Cassandra), as it is about capturing knowledge of the needed system functionality in terms of entities, relations, and their attributes (hence the name: ER model).
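
In practice, taking application queries as an input means each table is shaped around a query it must answer. Here is a minimal, hypothetical sketch using the DataStax Java driver; the comments_by_video table and its columns are illustrative, not from the post:

```java
import com.datastax.oss.driver.api.core.CqlSession;

public class QueryFirstTable {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().withKeyspace("app").build()) {
            // Application query: "show the comments for a video, newest first."
            // The filter column becomes the partition key and the sort column
            // becomes the clustering key: the query shapes the table.
            session.execute(
                "CREATE TABLE IF NOT EXISTS comments_by_video ("
                    + " video_id  uuid,"
                    + " posted_at timestamp,"
                    + " user_id   uuid,"
                    + " comment   text,"
                    + " PRIMARY KEY ((video_id), posted_at, user_id))"
                    + " WITH CLUSTERING ORDER BY (posted_at DESC, user_id ASC)");
        }
    }
}
```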

Data Modeling Tools Detailed Comparison

A data modeling tool, or database modeling tool, is an application that helps data modelers create and design database structures. Data modeling tools thus make the data modeling process easier and provide many features that help data modelers understand their data.

There are many different data modeling tools available for different database platforms. This multitude of tools makes it very difficult to choose one that suits the user's needs.

Designing Microservices With Cassandra

As a thriving software development technique, microservices — and its underlying architecture — remain foundational to cloud-native applications. Apache Cassandra is a natural complement given that it's a database designed for the cloud. This Refcard examines the benefits of microservices architecture, demonstrates recommended data modeling techniques, and explains key microservice design principles for Cassandra using a sample hotel application.

Hybrid Relational/JSON Data Modeling and Querying

JSON has become the de facto standard for sending and receiving data. Within relational databases, JSON support includes hybrid data modeling and querying via standard SQL. For new applications, or those being refactored, now is the perfect time to adopt a hybrid relational/JSON data model to streamline development and provide greater flexibility in the future.
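
A minimal sketch of the hybrid pattern over JDBC is shown below. The connection URL, table, and JSON column type are placeholders, and the exact SQL/JSON functions (JSON_VALUE here, per SQL:2016) vary by database:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HybridProductModel {
    public static void main(String[] args) throws Exception {
        // Placeholder URL: point this at a database with SQL/JSON support.
        try (Connection con = DriverManager.getConnection("jdbc:yourdb://localhost/shop");
             Statement st = con.createStatement()) {

            // Stable, frequently queried fields stay relational;
            // sparse or fast-changing attributes live in a JSON column.
            st.execute("CREATE TABLE product ("
                + " id INT PRIMARY KEY,"
                + " name VARCHAR(100),"
                + " attrs JSON)");

            st.execute("INSERT INTO product VALUES"
                + " (1, 'Phone', '{\"color\":\"black\",\"storageGb\":128}')");

            // SQL:2016 JSON_VALUE extracts a scalar from the document,
            // so relational and JSON predicates mix in one query.
            try (ResultSet rs = st.executeQuery(
                "SELECT name, JSON_VALUE(attrs, '$.storageGb') AS storage"
                    + " FROM product WHERE JSON_VALUE(attrs, '$.color') = 'black'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("name") + ": "
                        + rs.getString("storage") + " GB");
                }
            }
        }
    }
}
```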

Apache Cassandra

Distributed non-relational database Cassandra combines the benefits of major NoSQL databases to support data management needs not covered by traditional RDBMS vendors and is used at some of the most well-known, global organizations. This Refcard covers data modeling, Cassandra architecture, replication strategies, querying and indexing, libraries across eight languages, and more.