Exploring Multi-Region Database Deployment Options With a Slack-Like Messenger

Distributed database deployments across multiple regions are becoming commonplace, and for several reasons: more and more applications have to comply with data residency requirements such as GDPR, serve user requests as fast as possible from the data centers closest to the user, and withstand cloud region-level outages.

This article reviews the most widespread multi-region deployment options for distributed transactional databases by designing a data layer for a Slack-like corporate messenger.

Getting Started With ScyllaDB Cloud Using Node.js (Part 1)

In this article, we will review the basics of ScyllaDB, then create and deploy a cluster on AWS using Scylla Cloud.

What’s a CRUD App?

CRUD stands for Create, Read, Update, and Delete. In this article, we will build a simple application that connects to our database and does just that, using Node.js and Scylla Cloud.
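Before wiring up a real driver, the four operations themselves are easy to picture. Here is a minimal, language-agnostic sketch of CRUD semantics, using an in-memory dict as a stand-in for a database table (the article itself uses Node.js and Scylla Cloud; this snippet is purely illustrative):

```python
class CrudStore:
    """Toy stand-in for a database table, keyed by primary key."""

    def __init__(self):
        self._rows = {}                    # primary key -> row

    def create(self, key, row):            # maps to INSERT
        self._rows[key] = dict(row)

    def read(self, key):                   # maps to SELECT
        return self._rows.get(key)

    def update(self, key, **changes):      # maps to UPDATE
        self._rows[key].update(changes)

    def delete(self, key):                 # maps to DELETE
        self._rows.pop(key, None)

store = CrudStore()
store.create(1, {"name": "Ada", "role": "admin"})
store.update(1, role="user")
print(store.read(1))   # {'name': 'Ada', 'role': 'user'}
store.delete(1)
print(store.read(1))   # None
```

In the real application, each method body becomes a CQL statement executed through the driver, but the four-operation shape stays the same.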

What Java Developers Need to Know About Geo-Distributed Databases

I’ve been working with distributed systems, platforms, and databases for the last seven years. Back in 2015, many architects began using distributed databases to scale beyond the boundaries of a single machine or server. They selected such a database for its horizontal scalability, even if its performance remained comparable to a conventional single-server database. 

Now, with the rise of cloud-native applications and serverless architecture, distributed databases need to do more than provide horizontal scalability. Architects require databases that can stay available during major cloud region outages, enable hybrid cloud deployments, and serve data close to customers and end users. This is where geo-distributed databases come into play. 

ClickHouse or StarRocks? Here is a Detailed Comparison

A New Choice of Column DBMS

Hadoop was developed 13 years ago. Its vendors have been enthusiastic about offering open-source plug-ins as well as technical solutions. On one hand, this has solved users' problems; on the other, it has driven up maintenance costs, and Hadoop has gradually lost market share. Users want a simple, scalable database at a low cost, which is why columnar DBMSs have been getting increased attention.

Brief Intro to ClickHouse

ClickHouse is an open-source database developed by Yandex, Russia's largest search engine. Its performance exceeds that of many commercial MPP databases, such as Vertica and InfiniDB. ClickHouse has gained popularity well beyond Yandex: ordinary analytical workloads, which are largely structured and see few data changes, can be denormalized into flat tables and loaded into ClickHouse.

Coordinating an Apache Ignite Cluster With GridGain Control Center

Bundling various data sources, APIs, services, applications, and data streams while managing application data integration can become cumbersome. It's so complex that it typically results in application performance loss. So database administrators turn to Apache Ignite, a distributed database that provides high-performance computing capabilities at in-memory speed. Integrating Apache Ignite as an in-memory cache or distributed database helps improve the velocity and performance of a complex architecture.
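The performance win comes from the read-through caching pattern that an in-memory layer like Ignite provides. The sketch below illustrates the idea only; it is not Ignite's API, and `slow_lookup` is a hypothetical stand-in for a round trip to a slow backing store:

```python
import time

class ReadThroughCache:
    """Serve reads from memory; fall back to a slow loader on a miss."""

    def __init__(self, loader, ttl_seconds=60.0):
        self._loader = loader        # function that hits the backing store
        self._ttl = ttl_seconds
        self._entries = {}           # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._entries.get(key)
        now = time.monotonic()
        if entry is not None and entry[1] > now:
            return entry[0]                       # hit: served from memory
        value = self._loader(key)                 # miss: load from the store
        self._entries[key] = (value, now + self._ttl)
        return value

calls = []
def slow_lookup(key):
    calls.append(key)        # stands in for a database round trip
    return key.upper()

cache = ReadThroughCache(slow_lookup)
cache.get("user:1")
cache.get("user:1")          # second read never touches the backing store
print(len(calls))            # 1
```

Ignite layers distribution, eviction, and consistency on top of this basic idea, which is exactly where the operational complexity discussed next comes from.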

But, at the same time, this solution presents a new challenge: we're integrating yet another component into our already complex architecture. GridGain addresses this challenge by enabling monitoring, managing, and troubleshooting of Apache Ignite clustered environments, whether they're running on-premises or as a SaaS offering in the cloud.

Understanding MongoDB and NoSQL Database

NoSQL databases arose in the late 2000s as the cost of storage dropped dramatically. Gone were the days of having to craft a complicated, hard-to-manage data model just to avoid data duplication; the primary cost of software development had shifted to the developers themselves, so NoSQL databases were brought into the picture to boost their productivity.

As storage costs fell, the amount of data that applications needed to store grew, and so did the need to query it. This data arrived in all shapes and sizes: structured, semi-structured, and polymorphic. Defining the schema ahead of time became nearly impossible. NoSQL databases allowed developers to store enormous amounts of unstructured data, giving them a great deal of flexibility.
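That schema flexibility is easy to demonstrate: a document store can hold records of different shapes side by side, with nothing declared up front. Plain Python dicts stand in for JSON/BSON documents here; this is an illustrative sketch, not any particular database's API:

```python
# Three documents with three different shapes, coexisting in one collection.
documents = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},       # flat fields
    {"_id": 2, "name": "Grace", "tags": ["dev", "navy"]},        # array field
    {"_id": 3, "name": "Alan", "address": {"city": "London"}},   # nested field
]

def find(collection, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

print(find(documents, name="Grace"))

# Adding a new field later requires no migration:
documents[0]["phone"] = "+44 20 0000 0000"
```

A relational table would require a schema change (and often a migration) for each of these shape differences; a document model simply absorbs them.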

How We Trace a KV Database With Less Than 5% Performance Impact

TiKV is a distributed key-value database. It has higher performance requirements than a regular application, so tracing tools must have minimal impact. This article describes how we achieved tracing all requests' time consumption in TiKV with less than 5% performance impact.
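The core trick behind low-overhead tracing is to keep the hot path to a bare minimum: a span records only raw monotonic timestamps, and all formatting, duration math, and reporting is deferred until the trace is collected. The sketch below illustrates that general technique; it is not TiKV's actual implementation (which is in Rust), and the names are illustrative:

```python
import time

class SpanRecorder:
    """Collects raw span timestamps; defers all expensive work to report()."""

    def __init__(self):
        self._spans = []                 # (name, start_ns, end_ns)

    def record(self, name, start_ns, end_ns):
        self._spans.append((name, start_ns, end_ns))   # cheap append, no I/O

    def report(self):
        # Formatting and duration math happen off the hot path.
        return [(name, (end - start) / 1e6)            # duration in ms
                for name, start, end in self._spans]

recorder = SpanRecorder()

def traced(name, fn, *args):
    """Run fn, recording only two timestamps around it."""
    start = time.monotonic_ns()
    result = fn(*args)
    recorder.record(name, start, time.monotonic_ns())
    return result

traced("kv_get", sum, [1, 2, 3])
for name, ms in recorder.report():
    print(f"{name}: {ms:.3f} ms")
```

Per traced request, the added cost is two clock reads and one list append; everything observable about the span is reconstructed later.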

Background Knowledge

Logs, metrics, and traces are the three pillars of system observability. The following figure shows their relationship:

TiDE: Developing a Distributed Database in a Breeze

Contributing to TiDB's codebase is not easy, especially for newbies. As a distributed database, TiDB has multiple components and numerous tools, written in multiple languages, including Go and Rust. Getting started with such a complicated system takes quite an effort.

So, in order to welcome newcomers to TiDB and make it easier for them to contribute to our community, we've developed a TiDB integrated development environment: TiDE. Created during TiDB Hackathon 2020, TiDE is a Visual Studio Code extension that makes developing TiDB a breeze. With this extension, developing a distributed system can be as easy as developing a local one.

Edge Persistence Explained

Have you heard the term “edge persistence” floating around the webiverse? If so, what does it mean to you? If your answer is “not sure,” then this blog is for you! If you think you have an idea, let me know where you think I got it right and where I might have been off.

Edge computing is the first aspect of this concept. Edge computing is the salt to cloud’s pepper. Blending edge computing with the cloud creates flexibility you could not achieve with one or the other alone, with the added benefits of improved performance and reduced latency. So what do I mean when I say salt and pepper? They go great on dishes separately, but when combined they can add the perfect finish to your cooking. Edge computing brings your computation and data storage closer to the location where you need it, improving performance and reducing latency when running your application or technology. As for the word persistence, I used a good old-fashioned dictionary: to be persistent is “existing for a long or longer than usual time or continuously.”

So edge persistence allows companies to globally distribute their applications, software, and technologies closer to the end-user location, where it improves performance and reduces latency, and it does so continuously, for long periods of time. When I say latency, what does that mean? Latency “is an expression of how much time it takes for a data packet to travel from one designated point to another.” I could end this blog here, but that wouldn’t be a very informative blog, so let’s go deeper.

Distributed Databases: An Overview

A single database server has historically worked well for a small set of applications and data. However, when exposed to a large, public user base, the only way to increase the capacity of such a server is to upgrade it to a bigger, more expensive machine.

To improve capacity, move the database software to another single machine with more memory, more disk space, and more processors. This is "vertical scaling". The drawback to this approach is that it may require downtime. There's also a ceiling on the performance that can be obtained from a single machine. (See Herb Sutter's The Free Lunch is Over).

New Feature of Interference Cluster Release in Version 2021.1

Introduction

The 2021.1 version of the interference cluster has been released. (The previous article, in which I talk about the basic features of this software, can be found here.) Much attention was paid to improving overall performance and stability. In my opinion, an interesting new feature has appeared, which I want to talk about in this short article.

Previously, the concept of the interference cluster was kept strictly within the framework of a server-side service that provided persistence and event-interaction services to server-side Java applications. Since the concept of interference as a database did not provide for JDBC connections, we could not access the data from the outside in any way. Interactions were possible only between applications on the cluster nodes, each of which contains persistent storage.

Why We Disable Linux’s THP Feature for Databases

Linux's memory management system is transparent to the user. However, if you're not familiar with how it works, you might run into unexpected performance issues. That's especially true for sophisticated software like databases. When databases are running on Linux, even small system variations can impact performance.

After an in-depth investigation, we found that Transparent Huge Page (THP), a Linux memory management feature, often slows down database performance. In this post, I'll describe how THP causes performance to fluctuate, the typical symptoms, and our recommended solutions.
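On a Linux host, the current THP mode is exposed in sysfs as a one-line file such as `always madvise [never]`, with brackets marking the active setting. A quick pre-flight check before tuning a database host might look like the following sketch (the sysfs path is standard; the fallback sample line is only for non-Linux machines):

```python
import re

THP_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"

def active_thp_mode(sysfs_line: str) -> str:
    """Extract the bracketed (active) mode from the sysfs line."""
    match = re.search(r"\[(\w+)\]", sysfs_line)
    if match is None:
        raise ValueError(f"unexpected THP line: {sysfs_line!r}")
    return match.group(1)

try:
    with open(THP_PATH) as f:
        line = f.read()
except OSError:
    line = "always madvise [never]"     # sample value for non-Linux hosts

mode = active_thp_mode(line)
print(f"THP mode: {mode}")
if mode != "never":
    print("Consider disabling THP for database workloads.")
```

The same bracketed-value format applies to the companion `defrag` file, so the parser can be reused to audit both settings.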

Distributed SQL Essentials

Distributed SQL databases combine the resilience and scalability of a NoSQL database with the full functionality of a relational database. In this Refcard, we explore the fundamentals of distributed SQL, including architecting for availability, handling schema design challenges, using JSON and columnar indexes, as well as assessing approaches to replication.

RedisTimeSeries GA: Making the 4th Dimension (in Redis) Truly Immersive

On the 27th of June, we announced the general availability (GA) of RedisTimeSeries v1.0. RedisTimeSeries is a Redis module developed by Redis Labs to enhance your experience managing time series data with Redis. We released RedisTimeSeries in preview/beta mode over six months ago and appreciate all the great feedback and suggestions we received from the community and our customers as we worked together on this first GA version. To mark this release, we ran a benchmark in which RedisTimeSeries achieved 125K queries per second, compared against other time series approaches in Redis. Skip ahead for the full results, or take a moment to first learn about what led us to build this new module.

Why RedisTimeSeries?

Many Redis users have been using Redis for time series data for almost a decade and have been happy and successful doing so. As we will explain later, these developers are using the generic native data structures of Redis. So let’s first take a step back to explain why we decided to build a module with a dedicated time series data structure.
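A common generic-structure pattern was to keep samples in a sorted structure keyed by timestamp and do range reads plus manual downsampling on the client. In the sketch below, a sorted Python list stands in for a Redis sorted set; this illustrates the pattern, not the Redis API, and the class and method names are made up for the example:

```python
import bisect

class NaiveTimeSeries:
    """Samples kept sorted by timestamp, queried by range with downsampling."""

    def __init__(self):
        self._timestamps = []   # kept sorted ascending
        self._values = []

    def add(self, ts, value):
        i = bisect.bisect(self._timestamps, ts)
        self._timestamps.insert(i, ts)
        self._values.insert(i, value)

    def range_avg(self, start_ts, end_ts, bucket_ms):
        """Average samples in [start_ts, end_ts) into fixed-width buckets."""
        lo = bisect.bisect_left(self._timestamps, start_ts)
        hi = bisect.bisect_left(self._timestamps, end_ts)
        buckets = {}
        for ts, v in zip(self._timestamps[lo:hi], self._values[lo:hi]):
            buckets.setdefault(ts // bucket_ms * bucket_ms, []).append(v)
        return {b: sum(vs) / len(vs) for b, vs in sorted(buckets.items())}

series = NaiveTimeSeries()
for ts, v in [(1000, 10.0), (1500, 20.0), (2100, 30.0)]:
    series.add(ts, v)
print(series.range_avg(1000, 3000, 1000))   # {1000: 15.0, 2000: 30.0}
```

Everything past the range read happens client-side here, which is precisely the kind of work a dedicated time series data structure can push into the server.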

Optimizing Database Performance and Efficiency

It's easy for modern, distributed, high-scale applications to hide database performance and efficiency problems. Optimizing performance of such complex systems at scale requires some skill, but more importantly it requires a sound strategy and good observability, because you can't optimize what you can't measure. This session explains a performance measurement and optimization process anyone can use to deliver results predictably, optimizing customer experience while freeing up compute resources and saving money.

The session begins with what to measure and how; how to analyze it; how to categorize problems into one of three types; and three matching strategies to use in optimization as a result. It is a recursive method that can be used at any scale, from a data center with many types of databases cooperating as one, to a single server and drilling down to a single query. Along the way, we'll discuss related concepts such as internally- and externally-focused golden signals of performance and resource sufficiency, workload quality of service, and more.