The Difference Between TokuMX Partitioning and Sharding

In my last post, I described a new feature in TokuMX 1.5—partitioned collections—that’s aimed at making it easier and faster to work with time series data. Feedback from that post made me realize that some users may not immediately understand the differences between partitioning a collection and sharding a collection. In this post, I hope to clear that up.

On the surface, partitioning a collection and sharding a collection seem similar. Both actions take a collection and break it into smaller pieces for some performance benefit. Also, the terms are sometimes used interchangeably when discussing other technologies. But for TokuMX, the two features are very different in purpose and implementation. In describing each feature’s purpose and implementation, I hope to clarify the differences between the two features.

What Is Sharding?

In this article, I will tell you a few things about sharding and explain why it is actually an important technique. 

Despite its significance, sharding also has some cons, and there are certain problems you may encounter if you decide to use it. What are they? I’ll explain that below.

MongoDB to Couchbase for Developers, Part 2: Database Objects

MongoDB developers and DBAs work with physical clusters, machines, instances, storage systems, disks, etc. All MongoDB users, developers, and their applications work with logical entities: databases, collections, documents, fields, shards, users, and data types. There are many similarities with Couchbase, since both are document-oriented (JSON) databases. Let’s compare and contrast the two with respect to database objects. You may also refer back to Part 1 of this series, which compares the architectures.

MongoDB Organization

MongoDB schema objects are listed in full here and here. A database instance can have many databases, a database can have many collections, and each collection can have many indexes. Each collection can be sharded into multiple chunks on multiple nodes of a cluster using a hash-sharding strategy or an index sharding strategy. MongoDB indexes are local to their data. Therefore, each index uses the same sharding strategy as the collection it is created on.
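To make the two chunking strategies concrete, here is a minimal sketch that contrasts hash-based chunk assignment with range-based assignment (MongoDB's documentation calls these hashed and ranged sharding). It is not MongoDB's implementation: `NUM_CHUNKS`, `SPLIT_POINTS`, and both function names are hypothetical, and CRC32 stands in for whatever hash a real system uses.

```python
import bisect
import zlib

NUM_CHUNKS = 4  # hypothetical number of chunks

def hashed_chunk(shard_key: str) -> int:
    """Hash strategy: the chunk depends on a hash of the key,
    so neighbouring keys tend to scatter across chunks."""
    return zlib.crc32(shard_key.encode("utf-8")) % NUM_CHUNKS

# Hypothetical split points: chunk 0 holds keys < "g",
# chunk 1 holds "g" <= key < "n", chunk 2 holds "n" <= key < "t",
# and chunk 3 holds everything from "t" upward.
SPLIT_POINTS = ["g", "n", "t"]

def ranged_chunk(shard_key: str) -> int:
    """Range strategy: the chunk depends on where the key falls
    in sorted order, so neighbouring keys stay together."""
    return bisect.bisect_right(SPLIT_POINTS, shard_key)

for key in ["alice", "bob", "zoe"]:
    print(key, hashed_chunk(key), ranged_chunk(key))
```

With the range strategy, "alice" and "bob" land in the same chunk, which favours range scans; the hash strategy trades that locality for a more even spread of keys.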

Choosing the Right Database for Your Applications

As our business, Shopee, boomed, our team faced severe challenges in scaling our back-end system to meet the demand. Our previous article introduced how we use TiDB, an open-source, MySQL-compatible, hybrid transactional and analytical processing (HTAP) database, to scale out our system so that we can deliver better service for our users without worrying about database capacity.

There are so many databases available in the market. How do you choose the right one? In this post, I'll share our thoughts with you. I hope this post can help you when you're comparing multiple databases and looking for the right fit for your application.

To Shard or Not to Shard

The IT industry is one big word generator. Even better, it is a meaning generator. There are words or acronyms to describe every single functionality, but at the same time, one word can describe a bunch of different functionalities, or one functionality can be described by two or more different words. Recently, I found an old article about the differences between sharding and partitioning in general. This article tries to describe sharding as partitioning. It is not my aim to judge this, but I would like to write a couple of things about the use of sharding.

How Does It Work?

Sharding is simply the division of a big set of data into many small packs. Databases are the typical example: you divide one big database into many small databases, each located on a separate machine. You rely on an application-level layer to decide which "shard" receives a query (here is a small difference from partitioning, where that decision is made at the database level).
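The application-level routing described above can be sketched roughly as follows. This is an illustration under stated assumptions, not a production router: the shard names, `shard_for`, and `ShardedStore` are hypothetical, and one in-memory dict per shard stands in for a database on a separate machine.

```python
import hashlib

# Hypothetical shard names; in practice each would be a separate database server.
SHARDS = ["shard-0", "shard-1", "shard-2"]

def shard_for(key: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so the same key always maps to the same shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]

class ShardedStore:
    """Application-level layer: it, not the database, decides where a key lives."""

    def __init__(self, shards=SHARDS):
        self.shards = list(shards)
        # One dict per shard stands in for one database per machine.
        self.data = {name: {} for name in self.shards}

    def put(self, key: str, value) -> None:
        self.data[shard_for(key, self.shards)][key] = value

    def get(self, key: str):
        return self.data[shard_for(key, self.shards)].get(key)

store = ShardedStore()
store.put("user:42", {"name": "Ada"})
print(shard_for("user:42"), store.get("user:42"))
```

Because the modulo depends on the number of shards, adding or removing a shard in this simple scheme remaps most keys. That is one reason real deployments often use a lookup table or consistent hashing instead.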

Overview of Common Data-Keeping Techniques Used in a Distributed Environment

This article gives a very high-level overview of the common data-handling techniques used in distributed environments, along with some of their key points and advantages.

Normalization

Remember the old days of RDBMS, when we organized related columns in one table and referenced them from other tables through a foreign key, mostly to reduce redundancy across tables? For example, instead of putting an 'employee_name' column in both the employee's personal_detail table and the address_detail table, we kept it in personal_detail only, with 'emp_id' as a foreign key in the address_detail table.
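The employee example can be sketched with Python's built-in sqlite3 module. The table and key names come from the text; the `address_id` and `city` columns and the sample row are assumptions added only so the join has something to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE personal_detail (
        emp_id        INTEGER PRIMARY KEY,
        employee_name TEXT NOT NULL          -- the name is stored here and nowhere else
    );
    CREATE TABLE address_detail (
        address_id INTEGER PRIMARY KEY,      -- hypothetical column
        emp_id     INTEGER NOT NULL REFERENCES personal_detail(emp_id),
        city       TEXT NOT NULL             -- hypothetical column
    );
""")
conn.execute("INSERT INTO personal_detail VALUES (1, 'Asha')")
conn.execute("INSERT INTO address_detail VALUES (10, 1, 'Pune')")

# The name is recovered through the foreign key rather than duplicated.
rows = conn.execute("""
    SELECT p.employee_name, a.city
    FROM address_detail AS a
    JOIN personal_detail AS p ON p.emp_id = a.emp_id
""").fetchall()
print(rows)
```

Renaming an employee now touches a single row in personal_detail, and the foreign key rejects an address that points at an employee who does not exist.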

A Guide to Resolving the Cross-Database Query Problem with A Single SQL Statement

Recently, an e-commerce user experienced a sharp increase in access volume due to rapid business development, resulting in bottlenecks in database capacity and performance. To reduce the database size and improve performance, the user decided to implement vertical sharding on the architecture. Sharding is performed by table, which results in less of an impact on applications and supports clear and simple sharding rules.

The user vertically divided data into three databases according to members, commodities, and orders. After the vertical sharding was performed, the data was distributed to different database instances, reducing the data volume in each database and increasing the number of instances. This process seems simple but is difficult to implement, because once sharding is introduced, a query that originally ran in one database instance must now run across two database instances.
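To see why such a query becomes awkward, consider this sketch of the problem (not of the solution the article goes on to describe). Two separate in-memory SQLite databases stand in for the members and orders instances; the schemas, sample rows, and the `orders_with_member_names` helper are all hypothetical.

```python
import sqlite3

# Two independent connections stand in for two database instances.
members_db = sqlite3.connect(":memory:")
orders_db = sqlite3.connect(":memory:")

members_db.execute("CREATE TABLE members (member_id INTEGER PRIMARY KEY, name TEXT)")
members_db.executemany("INSERT INTO members VALUES (?, ?)", [(1, "Li"), (2, "Wang")])

orders_db.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, member_id INTEGER, amount REAL)"
)
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(100, 1, 25.0), (101, 1, 10.0), (102, 2, 7.5)],
)

def orders_with_member_names():
    """Application-side join: a plain query on either connection
    cannot see the other instance's tables, so the code stitches
    the two result sets together itself."""
    names = dict(members_db.execute("SELECT member_id, name FROM members"))
    result = []
    for order_id, member_id, amount in orders_db.execute(
        "SELECT order_id, member_id, amount FROM orders ORDER BY order_id"
    ):
        result.append((order_id, names.get(member_id), amount))
    return result

print(orders_with_member_names())
```

Before the split, this would have been a single JOIN. After it, the application has to fetch from both instances and combine the rows in memory, which is the extra work a cross-database query layer aims to hide.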