DistSQL Applications: Building a Dynamic Distributed Database

Background

Ever since the release of ShardingSphere 5.0.0, DistSQL has been providing strong dynamic management capabilities to the ShardingSphere ecosystem.

Thanks to DistSQL, users have been empowered to do the following:

  • Create logical databases online.
  • Dynamically configure rules (i.e. sharding, data encryption, read/write splitting, database discovery, shadow DB, and global rules).
  • Adjust storage resources in real-time.
  • Switch transaction types instantly.
  • Turn SQL logs on and off at any time.
  • Preview the SQL routing results.

At the same time, as application scenarios become increasingly diversified, more and more DistSQL features are being created, and a variety of valuable syntaxes have been gaining popularity among users.

Overview

This post takes data sharding as an example to illustrate DistSQL's application scenarios and the related sharding methods.

A series of DistSQL statements are strung together through a practical case to give you a complete DistSQL sharding management scheme, creating and maintaining a distributed database through dynamic management.

Practical Case

Required scenarios

  • Create two sharding tables: t_order and t_order_item.
  • For both tables, database sharding is performed by the user_id field, and table sharding by the order_id field.
  • The number of shards is 2 databases * 3 tables.

Setting up the environment

1.Prepare a MySQL database instance for access. Create two new databases: demo_ds_0 and demo_ds_1.

Here we take MySQL as an example, but you can also use PostgreSQL or openGauss databases.

2.Deploy Apache ShardingSphere-Proxy 5.1.2 and Apache ZooKeeper. ZooKeeper acts as a governance center and stores ShardingSphere metadata information.

3.Configure server.yaml in the Proxy conf directory as follows:

mode:
  type: Cluster
  repository:
    type: ZooKeeper
    props:
      namespace: governance_ds
      server-lists: localhost:2181  # ZooKeeper address
      retryIntervalMilliseconds: 500
      timeToLiveSeconds: 60
      maxRetries: 3
      operationTimeoutMilliseconds: 500
  overwrite: false
rules:
  - !AUTHORITY
    users:
      - root@%:root

4.Start ShardingSphere-Proxy and connect to the Proxy using a client, for example:

mysql -h 127.0.0.1 -P 3307 -u root -p

Creating a distributed database

CREATE DATABASE sharding_db;
USE sharding_db;

Adding storage resources
1.Add storage resources corresponding to the prepared MySQL database.

ADD RESOURCE ds_0 (
    HOST=127.0.0.1,
    PORT=3306,
    DB=demo_ds_0,
    USER=root,
    PASSWORD=123456
), ds_1(
    HOST=127.0.0.1,
    PORT=3306,
    DB=demo_ds_1,
    USER=root,
    PASSWORD=123456
);

2.View storage resources

mysql> SHOW DATABASE RESOURCES\G;
*************************** 1. row ***************************
                           name: ds_1
                           type: MySQL
                           host: 127.0.0.1
                           port: 3306
                             db: demo_ds_1
                            -- some attributes omitted
*************************** 2. row ***************************
                           name: ds_0
                           type: MySQL
                           host: 127.0.0.1
                           port: 3306
                             db: demo_ds_0
                            -- some attributes omitted

Adding \G to the query statement makes the output format easier to read; it is optional.

Creating sharding rules
ShardingSphere's sharding rules support both regular sharding and automatic sharding. The two methods have the same effect; the difference is that automatic sharding is more concise to configure, while regular sharding is more flexible and independent.
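
For comparison, automatic sharding lets you declare a complete rule in a single statement and have ShardingSphere distribute the shards across the given resources. The sketch below is for illustration only and is not used in this example; the table name t_order_auto, the built-in MOD algorithm, and the sharding count of 6 are assumptions:

CREATE SHARDING TABLE RULE t_order_auto (
RESOURCES(ds_0,ds_1),
SHARDING_COLUMN=order_id,
TYPE(NAME=MOD,PROPERTIES("sharding-count"=6)),
KEY_GENERATE_STRATEGY(COLUMN=order_id,TYPE(NAME=SNOWFLAKE))
);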

Please refer to the following links for more details on automatic sharding:

Intro to DistSQL-An Open Source and More Powerful SQL

AutoTable: Your Butler-Like Sharding Configuration Tool

Next, we'll adopt regular sharding and use the INLINE expression algorithm to implement the sharding scenario described in the requirements.

Primary key generator

The primary key generator can generate a secure and unique primary key for a data table in a distributed scenario. For details, refer to the document Distributed Primary Key.

1.Create the primary key generator.

CREATE SHARDING KEY GENERATOR snowflake_key_generator (
TYPE(NAME=SNOWFLAKE)
);

2.Query primary key generator

mysql> SHOW SHARDING KEY GENERATORS;
+-------------------------+-----------+-------+
| name                    | type      | props |
+-------------------------+-----------+-------+
| snowflake_key_generator | snowflake | {}    |
+-------------------------+-----------+-------+
1 row in set (0.01 sec)

Sharding algorithm

1.Create a database sharding algorithm, shared by t_order and t_order_item.

-- Modulo 2 based on user_id in database sharding
CREATE SHARDING ALGORITHM database_inline (
TYPE(NAME=INLINE,PROPERTIES("algorithm-expression"="ds_${user_id % 2}"))
);

2.Create separate table sharding algorithms for t_order and t_order_item.

-- Modulo 3 based on order_id in table sharding
CREATE SHARDING ALGORITHM t_order_inline (
TYPE(NAME=INLINE,PROPERTIES("algorithm-expression"="t_order_${order_id % 3}"))
);
CREATE SHARDING ALGORITHM t_order_item_inline (
TYPE(NAME=INLINE,PROPERTIES("algorithm-expression"="t_order_item_${order_id % 3}"))
);

3.Query sharding algorithm

mysql> SHOW SHARDING ALGORITHMS;
+---------------------+--------+---------------------------------------------------+
| name                | type   | props                                             |
+---------------------+--------+---------------------------------------------------+
| database_inline     | inline | algorithm-expression=ds_${user_id % 2}            |
| t_order_inline      | inline | algorithm-expression=t_order_${order_id % 3}      |
| t_order_item_inline | inline | algorithm-expression=t_order_item_${order_id % 3} |
+---------------------+--------+---------------------------------------------------+
3 rows in set (0.00 sec)

Default sharding strategy

A sharding strategy consists of a sharding column and a sharding algorithm; please refer to Sharding Strategy for the concept.

Sharding strategies are divided into databaseStrategy and tableStrategy.

Since t_order and t_order_item have the same database sharding field and sharding algorithm, we create a default strategy that will be used by all shard tables with no sharding strategy configured:

1.Create a default database sharding strategy

CREATE DEFAULT SHARDING DATABASE STRATEGY (
TYPE=STANDARD,SHARDING_COLUMN=user_id,SHARDING_ALGORITHM=database_inline
);

2.Query default strategy

mysql> SHOW DEFAULT SHARDING STRATEGY\G;
*************************** 1. row ***************************
                    name: TABLE
                    type: NONE
         sharding_column:
 sharding_algorithm_name:
 sharding_algorithm_type:
sharding_algorithm_props:
*************************** 2. row ***************************
                    name: DATABASE
                    type: STANDARD
         sharding_column: user_id
 sharding_algorithm_name: database_inline
 sharding_algorithm_type: inline
sharding_algorithm_props: {algorithm-expression=ds_${user_id % 2}}
2 rows in set (0.00 sec)

Since no default table sharding strategy has been configured, the default strategy for TABLE is NONE.

Sharding rules

The primary key generator and sharding algorithm are both ready. Now create sharding rules.

1.t_order

CREATE SHARDING TABLE RULE t_order (
DATANODES("ds_${0..1}.t_order_${0..2}"),
TABLE_STRATEGY(TYPE=STANDARD,SHARDING_COLUMN=order_id,SHARDING_ALGORITHM=t_order_inline),
KEY_GENERATE_STRATEGY(COLUMN=order_id,KEY_GENERATOR=snowflake_key_generator)
);

DATANODES specifies the data nodes of the shard table.
TABLE_STRATEGY specifies the table sharding strategy; its SHARDING_ALGORITHM references the previously created algorithm t_order_inline.
KEY_GENERATE_STRATEGY specifies the primary key generation strategy for the table. Skip this configuration if primary key generation is not required.

2.t_order_item

CREATE SHARDING TABLE RULE t_order_item (
DATANODES("ds_${0..1}.t_order_item_${0..2}"),
TABLE_STRATEGY(TYPE=STANDARD,SHARDING_COLUMN=order_id,SHARDING_ALGORITHM=t_order_item_inline),
KEY_GENERATE_STRATEGY(COLUMN=order_item_id,KEY_GENERATOR=snowflake_key_generator)
);

3.Query sharding rules

mysql> SHOW SHARDING TABLE RULES\G;
*************************** 1. row ***************************
                            table: t_order
                actual_data_nodes: ds_${0..1}.t_order_${0..2}
              actual_data_sources:
           database_strategy_type: STANDARD
         database_sharding_column: user_id
 database_sharding_algorithm_type: inline
database_sharding_algorithm_props: algorithm-expression=ds_${user_id % 2}
              table_strategy_type: STANDARD
            table_sharding_column: order_id
    table_sharding_algorithm_type: inline
   table_sharding_algorithm_props: algorithm-expression=t_order_${order_id % 3}
              key_generate_column: order_id
               key_generator_type: snowflake
              key_generator_props:
*************************** 2. row ***************************
                            table: t_order_item
                actual_data_nodes: ds_${0..1}.t_order_item_${0..2}
              actual_data_sources:
           database_strategy_type: STANDARD
         database_sharding_column: user_id
 database_sharding_algorithm_type: inline
database_sharding_algorithm_props: algorithm-expression=ds_${user_id % 2}
              table_strategy_type: STANDARD
            table_sharding_column: order_id
    table_sharding_algorithm_type: inline
   table_sharding_algorithm_props: algorithm-expression=t_order_item_${order_id % 3}
              key_generate_column: order_item_id
               key_generator_type: snowflake
              key_generator_props:
2 rows in set (0.00 sec)

So far, the sharding rules for t_order and t_order_item have been configured.

A bit complicated? Well, you can also skip the steps of creating the primary key generator, sharding algorithm, and default strategy, and complete the sharding rules in one step. Let's see how to make it easier.

Syntax
Now, if we have to add a shard table t_order_detail, we can create sharding rules as follows:

CREATE SHARDING TABLE RULE t_order_detail (
DATANODES("ds_${0..1}.t_order_detail_${0..1}"),
DATABASE_STRATEGY(TYPE=STANDARD,SHARDING_COLUMN=user_id,SHARDING_ALGORITHM(TYPE(NAME=INLINE,PROPERTIES("algorithm-expression"="ds_${user_id % 2}")))),
TABLE_STRATEGY(TYPE=STANDARD,SHARDING_COLUMN=order_id,SHARDING_ALGORITHM(TYPE(NAME=INLINE,PROPERTIES("algorithm-expression"="t_order_detail_${order_id % 3}")))),
KEY_GENERATE_STRATEGY(COLUMN=detail_id,TYPE(NAME=snowflake))
);

Note: The above statement specifies the database sharding strategy, table sharding strategy, and key generation strategy, but it does not reference any existing algorithms.

Therefore, the DistSQL engine automatically creates the algorithms from the input expressions for the sharding rule of t_order_detail. The primary key generators, sharding algorithms, and sharding rules are now as follows:

1.Primary key generator

mysql> SHOW SHARDING KEY GENERATORS;
+--------------------------+-----------+-------+
| name                     | type      | props |
+--------------------------+-----------+-------+
| snowflake_key_generator  | snowflake | {}    |
| t_order_detail_snowflake | snowflake | {}    |
+--------------------------+-----------+-------+
2 rows in set (0.00 sec)

2.Sharding algorithm

mysql> SHOW SHARDING ALGORITHMS;
+--------------------------------+--------+-----------------------------------------------------+
| name                           | type   | props                                               |
+--------------------------------+--------+-----------------------------------------------------+
| database_inline                | inline | algorithm-expression=ds_${user_id % 2}              |
| t_order_inline                 | inline | algorithm-expression=t_order_${order_id % 3}        |
| t_order_item_inline            | inline | algorithm-expression=t_order_item_${order_id % 3}   |
| t_order_detail_database_inline | inline | algorithm-expression=ds_${user_id % 2}              |
| t_order_detail_table_inline    | inline | algorithm-expression=t_order_detail_${order_id % 3} |
+--------------------------------+--------+-----------------------------------------------------+
5 rows in set (0.00 sec)

3.Sharding rules

mysql> SHOW SHARDING TABLE RULES\G;
*************************** 1. row ***************************
                            table: t_order
                actual_data_nodes: ds_${0..1}.t_order_${0..2}
              actual_data_sources:
           database_strategy_type: STANDARD
         database_sharding_column: user_id
 database_sharding_algorithm_type: inline
database_sharding_algorithm_props: algorithm-expression=ds_${user_id % 2}
              table_strategy_type: STANDARD
            table_sharding_column: order_id
    table_sharding_algorithm_type: inline
   table_sharding_algorithm_props: algorithm-expression=t_order_${order_id % 3}
              key_generate_column: order_id
               key_generator_type: snowflake
              key_generator_props:
*************************** 2. row ***************************
                            table: t_order_item
                actual_data_nodes: ds_${0..1}.t_order_item_${0..2}
              actual_data_sources:
           database_strategy_type: STANDARD
         database_sharding_column: user_id
 database_sharding_algorithm_type: inline
database_sharding_algorithm_props: algorithm-expression=ds_${user_id % 2}
              table_strategy_type: STANDARD
            table_sharding_column: order_id
    table_sharding_algorithm_type: inline
   table_sharding_algorithm_props: algorithm-expression=t_order_item_${order_id % 3}
              key_generate_column: order_item_id
               key_generator_type: snowflake
              key_generator_props:
*************************** 3. row ***************************
                            table: t_order_detail
                actual_data_nodes: ds_${0..1}.t_order_detail_${0..1}
              actual_data_sources:
           database_strategy_type: STANDARD
         database_sharding_column: user_id
 database_sharding_algorithm_type: inline
database_sharding_algorithm_props: algorithm-expression=ds_${user_id % 2}
              table_strategy_type: STANDARD
            table_sharding_column: order_id
    table_sharding_algorithm_type: inline
   table_sharding_algorithm_props: algorithm-expression=t_order_detail_${order_id % 3}
              key_generate_column: detail_id
               key_generator_type: snowflake
              key_generator_props:
3 rows in set (0.01 sec)

Note: In the CREATE SHARDING TABLE RULE statement, DATABASE_STRATEGY, TABLE_STRATEGY, and KEY_GENERATE_STRATEGY can all reuse existing algorithms.

Alternatively, they can be declared inline in the statement; the difference is that additional algorithm objects are created automatically. Users can choose either approach based on their scenario.

After the rules are created, you can verify the configuration in the following ways:

Checking node distribution

DistSQL provides SHOW SHARDING TABLE NODES for checking node distribution, so users can quickly learn how shard tables are distributed.
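
Given the rules created above, a quick check might look like the following sketch; the node lists in the comments are derived from the DATANODES configured earlier, and the exact output layout may differ:

SHOW SHARDING TABLE NODES;
-- t_order:        ds_0.t_order_0, ds_0.t_order_1, ds_0.t_order_2, ds_1.t_order_0, ds_1.t_order_1, ds_1.t_order_2
-- t_order_item:   ds_0.t_order_item_0, ds_0.t_order_item_1, ds_0.t_order_item_2, ds_1.t_order_item_0, ds_1.t_order_item_1, ds_1.t_order_item_2
-- t_order_detail: ds_0.t_order_detail_0, ds_0.t_order_detail_1, ds_1.t_order_detail_0, ds_1.t_order_detail_1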

We can see that the node distribution of the shard tables is consistent with what is described in the requirements.

SQL Preview

Previewing SQL is also an easy way to verify configurations. Its syntax is PREVIEW SQL:

1.Query with no sharding key, routed to all shards

2.Specify user_id to query with a single-database route

3.Specify both user_id and order_id to query with a single-table route

Single-table routes scan the fewest shard tables and offer the highest efficiency, as sketched below.
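
The sketches below illustrate the three cases; the routing noted in the comments follows from the inline algorithms created earlier, and the real output is a result set listing the rewritten SQL per data source:

PREVIEW SELECT * FROM t_order;                                    -- no sharding key: routed to every t_order shard in ds_0 and ds_1
PREVIEW SELECT * FROM t_order WHERE user_id = 1;                  -- user_id % 2 = 1: routed to ds_1 only
PREVIEW SELECT * FROM t_order WHERE user_id = 1 AND order_id = 2; -- routed to the single table ds_1.t_order_2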

DistSQL auxiliary query

During system maintenance, you may need to release algorithms or storage resources that are no longer in use, or find that resources you want to release are still referenced and cannot be deleted. The following DistSQL statements can help in these situations.

Query unused resources

1.Syntax: SHOW UNUSED RESOURCES

2.Sample:
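
A minimal sample: at this point both ds_0 and ds_1 are referenced by the sharding rules, so the result set would be empty.

SHOW UNUSED RESOURCES;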

Query unused primary key generator

1.Syntax: SHOW UNUSED SHARDING KEY GENERATORS

2.Sample:
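
A minimal sample: snowflake_key_generator and t_order_detail_snowflake are both referenced by table rules, so nothing would be listed here.

SHOW UNUSED SHARDING KEY GENERATORS;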

Query unused sharding algorithm

1.Syntax: SHOW UNUSED SHARDING ALGORITHMS

2.Sample:
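
A minimal sample: all five algorithms are currently referenced; if the t_order_detail rule were dropped first, its two auto-created algorithms would appear in the result.

-- hypothetically: DROP SHARDING TABLE RULE t_order_detail;
SHOW UNUSED SHARDING ALGORITHMS;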

Query rules that use the target storage resources

1.Syntax: SHOW RULES USED RESOURCE

2.Sample:
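
A minimal sample: the statement takes a resource name; here it would list the sharding rules of t_order, t_order_item, and t_order_detail, which all reference ds_0.

SHOW RULES USED RESOURCE ds_0;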

All rules that use the resource can be queried, not limited to the Sharding Rule.

Query sharding rules that use the target primary key generator

1.Syntax: SHOW SHARDING TABLE RULES USED KEY GENERATOR

2.Sample:
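
A minimal sample: passing the generator name, the result here would contain t_order and t_order_item.

SHOW SHARDING TABLE RULES USED KEY GENERATOR snowflake_key_generator;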

Query sharding rules that use the target algorithm

1.Syntax: SHOW SHARDING TABLE RULES USED ALGORITHM

2.Sample:
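
A minimal sample: only the t_order rule references this algorithm, so it would be the single row returned.

SHOW SHARDING TABLE RULES USED ALGORITHM t_order_inline;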

Conclusion

This post takes the data sharding scenario as an example to introduce DistSQL's application scenarios and related methods.

DistSQL provides flexible syntax to help simplify operations. In addition to the INLINE algorithm, DistSQL supports standard sharding, compound sharding, Hint sharding, and custom sharding algorithms. More examples will be covered in future posts.

If you have any questions or suggestions about Apache ShardingSphere, please feel free to post them on the GitHub Issue list.

Project Links:

ShardingSphere Github

ShardingSphere Twitter

ShardingSphere Slack

Contributor Guide

GitHub Issues

References
  1. Concept-DistSQL

  2. Concept-Distributed Primary Key

  3. Concept-Sharding Strategy

  4. Concept-INLINE Expression

  5. Built-in Sharding Algorithm

  6. User Manual: DistSQL

Author

Jiang Longtao
SphereEx Middleware R&D Engineer & Apache ShardingSphere Committer.

Longtao focuses on the R&D of DistSQL and related features.

Benefits of Elastic Databases And Common Migration Mistakes to Avoid

Introduction

If you're like most data scientists, analysts, or developers, you probably spend a lot of time worrying about your ever-growing store of data. After all, it's the lifeblood of your business. You need to ensure that it's properly stored and accessible when you need it. This article will discuss the benefits of using elastic databases and common migration mistakes to avoid. We'll also provide some tips on optimizing your database for performance.

A recent survey from IDC found that one-third of enterprises have already migrated or are migrating their applications and data centers to public cloud infrastructures such as Amazon Web Services (AWS) or Microsoft Azure. This is an ongoing trend for companies looking to optimize costs while also preparing for future demands on capacity and efficiency.

However, these migrations don't come without challenges, chief among them ensuring that data integrity is maintained during the transfer between environments.

Elastic Databases To Mitigate Data Risks

One method to mitigate these risks is by using elastic database technologies that support the portability and movement of data sets between SQL Server, Oracle, MySQL, or other data stores. This is where elastic databases come into play. An elastic database is a distributed system that makes it easy to store and access large amounts of data by spreading it across many servers, also known as nodes. It provides transparent scalability with no downtime or interruptions. As a result, you can easily use elastic databases to build web applications in the cloud.

Elastic databases are designed for fast retrieval of data using caching, sharding, and partitioning techniques where data is divided into chunks called elastic units. These elastic units are written to separate partitions on different machines to be read independently. Each elastic unit has metadata that contains the type of unit, its owner (which machine it was assigned), and time of creation.

Using elastic units, elastic databases can scale to process more data at once by creating additional partitions and assigning units to them. The concept behind an elastic database is to take the idea of virtualization, which has led to significant efficiencies for server storage and use in data centers and apply it to databases.

By applying this concept, elastic databases can support multiple concurrent users or workloads while delivering strong performance levels. In addition, with this capability, organizations can migrate their entire databases or subsets of data into the cloud, manage it all within a single cluster, and then move that information back to an on-premise infrastructure.

The application of elastic databases has become increasingly popular because they make migration projects simpler, faster and less expensive than previously possible. This is good news for data professionals looking to take advantage of advances in cloud-based infrastructures while maintaining good performance.

The Benefits Of Elastic Databases

Here are some of the key benefits that elastic database technologies offer during migration projects:

Access to Cloud Environments: By moving data sets into the cloud, businesses can achieve faster time-to-market, higher competitive advantage on new products and services, and lower costs through more efficient resource utilization.

Increased Data Portability: Elastic databases can transfer data between cloud-based environments with little to no performance degradation during the process. As a result, organizations can test their applications in the cloud without the risk of disrupting production accounts.

Increased Database Elasticity: The ability to scale out across multiple cloud environments allows for improved performance, higher levels of data availability, and faster time-to-market. Elastic databases also provide load balancing capabilities that enable companies to automatically distribute workloads across systems without building complex infrastructures.

Reduced Risk of Data Loss: When migrating data, organizations can lose information. However, by moving primary databases into the cloud, this risk is significantly reduced because businesses can have their entire database in one place with its information replicated to other locations. Therefore, if anything happened with one database, all others would remain available.

Greater Database Availability: By placing multiple databases within a single cluster, organizations can achieve higher uptime because more systems are available to ensure workloads are always handled regardless of demand. This also means that there is no need to worry about double provisioning or creating environments for testing purposes. As a result, companies can significantly reduce their operating costs because they will spend less time managing databases and more time working with customers.

Better Organizational Agility: Elasticity allows organizations to adjust their resource levels without downtime, enabling businesses to achieve faster time-to-market and increase overall productivity by using existing resources better. In addition, they can easily create new development environments for testing purposes, which is a big advantage when building and releasing new products and services.

Common Mistakes When Migrating

Migrating your data after elastic has started

Migration to an elastic database should always be done when the system is near or at idle. Once you start migrating, you will see where your current limits are for what the elastic database can ingest, so you can plan for the peaks. It is beneficial to start the elastic database in production under low CPU load to see how it will perform during peak hours.

Excessive vertical scaling

Sometimes you can have too much elasticity in your application architecture. For example, this can happen when you design a system with only one computing unit because you do not want any downtime. Now you have only one server to scale vertically until it reaches maximum capacity.

This approach gives limited availability and no redundancy in hardware failures unless you implement replication or failover systems yourself.

Elastic databases are designed to scale horizontally and perform better that way. So follow the elastic approach and implement a multi-node elastic architecture for your applications.

Replicating instead of failover

In case of failure on one node, replication is often implemented instead of failover for simplicity. However, this approach slows down your application because it has to reset and reload upon recovery, especially if multiple nodes are in production.

Use elastic databases with failover scripts. Your architecture will be highly available and scalable, so you can just add another node in case of hardware failures or traffic peaks.

Not using backup/recovery

Elastic databases offer a reliable way of backing up data with point-in-time recovery. At the same time, elasticity makes storage space nearly infinite by adding more nodes on demand.

These days, the most commonly-used backup method for elastic databases is simply copying units from one node to another. Backups work very quickly since each unit contains metadata about its size and location on disk, so only changed portions will get copied during a restore. This makes restoring elastic databases reasonably easy if you have recent backups of all your nodes.

Backup/recovery is included in elastic databases, so you don't have to do it manually or with external tools.

The best way to deal with data loss is to prevent it from happening in the first place. Implement backups at different points in time into your application stack, configure elasticity properly, and elastic databases will take care of the rest.

Incorrect server deployments

During elastic database migrations, it is highly recommended to use indexing and create mapping documents for elastic so that no data is lost during migration. During the migration process, elastic should be hosted on separate servers, and elastic should not use the same host as the elasticsearch node. Elastic should always have dedicated volumes and dedicated network interfaces.

Moving all your data at once

One common mistake when moving to elastic databases is moving all data from the start. While this might seem like a good idea, it often leads to the server database lacking the required capacity and performance when new applications need to use elastic databases.

Once the business value has been identified for elastic databases, you need only migrate the necessary data while also ensuring that it does not impact running applications before or during any migration process.

Not using a migration tool

While elastic databases are similar to relational database management systems, many differences exist. This means that converting existing relational technology into elastic technologies can be extremely difficult, if not impossible, given the level of complexity in some organizations.

Therefore, you must convert your data using elastic database tools that are elastic database ready. ShardingSphere's ElasticJob is an excellent example of an open-source tool that helps you migrate your data to a flexible database. ElasticJob is a distributed scheduling solution consisting of two separate projects, ElasticJob-Lite and ElasticJob-Cloud.

ElasticJob-Lite is a lightweight, decentralized solution that provides distributed task sharding services; ElasticJob-Cloud uses Mesos to manage and isolate resources. In addition, both projects share a unified job API. As a result, developers only need to code once and can deploy at will.

Conclusion

As more data is generated every day, organizations need to rethink managing their data infrastructure. This is especially true for Database Management Systems (DBMS), which can be complex and time-consuming to maintain.

As a result, companies must consider solutions that offer more flexibility and additional benefits such as lower operating costs and higher levels of availability. Cloud-based providers are offering these types of database technologies. They have been referred to as elastic databases because they can expand with changing business needs that traditional database solutions cannot.

This allows companies to respond more quickly to market demands without sacrificing the levels of service that their customers expect. In addition, organizations can save money when it comes to purchasing server hardware and software licenses because elastic databases are offered as pay-as-you-go services which are very attractive to businesses of all sizes.

This allows companies to use existing budgetary allocations more efficiently and free up time and funds to focus on other areas that can help them grow their business.

Overview of SQL Programming Language And Trends For 2022

Introduction

As we move into 2022, businesses will continue to look for ways to become more data-driven. This means that knowing how to use Structured Query Language, commonly called SQL or 'Sequel', will remain an essential skill for data scientists, analysts, and developers. In this post, we'll provide an overview of the access language and take a look at some of the most important SQL trends for 2022. So if you're looking to add SQL to your skill set and become a more complete data professional in 2022, read on!

This overview is designed for those with some technical knowledge but no specific, formal SQL training. After reading this overview, you should be more familiar with the state of SQL today, plus where the language and development community are going in 2022.

History of SQL

IBM researchers Raymond Boyce and Donald Chamberlin created the SQL programming language in the 1970s. Following Edgar F. Codd's seminal publication of "A Relational Model of Data for Large Shared Data Banks," SEQUEL, the programming language, was born.

Codd's idea was that all data in a database should be represented as relations. Based on this idea, Boyce and Chamberlin came up with SQL. According to author Malcolm Coxall in "Oracle Quick Guides" (Cornelio Books, 2013), the original SQL version was intended to modify and retrieve data from IBM's first relational database, known as "System R."

A few years later, the SQL language became available to developers at large with Oracle V2, a proprietary variant of the SQL language that was commercially released in 1979. Other major database software companies soon followed and developed commercial relational databases using SQL access, such as Ingres, Informix, and Sybase.

It took another 17+ years, in January 1997, for the first open-source database project, PostgreSQL, to be released. It was initially named POSTGRES, referring to its origins as a successor to the Ingres database developed at the University of California, Berkeley, and commercialized in the 80s. Then, in 1996, the project was renamed PostgreSQL to reflect its SQL support.

The SQL language has since been designated the preferred language for relational database communication by the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO). While vendors modify the language to suit their needs, most SQL applications are based on the ANSI-approved version.

Why Is it Important To Learn SQL If Working With Data?

Easy to Learn and Use
SQL is praised for its simplicity by using declarative statements like SELECT, in contrast to other programming languages that demand a high-level conceptual knowledge and memorization of the steps required to complete an activity.
Its English-like keywords are easier to comprehend than the complex syntax of other languages, with no need to memorize cryptic strings of numbers and letters.
SQL is a great language to start with if you're new to programming and data science. The small syntax allows you to query data quickly and extract insights from it.

Understanding your dataset
Learning SQL will give you a solid understanding of relational databases and enable you to master the foundations of data science.
SQL will assist you in researching your data thoroughly, visualizing it, and knowing how your data is structured.
It will allow you to discover missing values, identify anomalies, NULLS, and your data format.
SQL will let you play with your data, get thoroughly acquainted with it, and learn how the values are distributed and organized due to slicing, filtering, aggregations, and sorting.

Integrates with scripting languages
As a data scientist, you must take care to present your data in a way that is comprehensible to your team or organization, displaying it so that others can readily understand it.
Other programming languages like R and Python work well with SQL. You may simply integrate a code package as a stored procedure, easily mixing SQL and Python to get your task done comfortably.
Also, specialized connection libraries for SQL such as SQLite and MySQLdb can be quite beneficial in connecting a client app to your database engine, allowing you to interact with your data.

Manages huge volumes of data
Today, most data analytics deals with massive amounts of data stored in relational databases. Working with such quantities of data necessitates using higher-level solutions to manage it rather than simple spreadsheets. As the amount of datasets grows, using spreadsheets becomes unreasonable. Data in the relational model and managed by SQL is the best solution for dealing with vast volumes of data.
When working with relational databases, you don't have to be concerned about pools of data; SQL can query the data and surface useful insights from it.

Get Started Learning SQL

GitHub is a great place to find a free, open-source database to install, practice, and learn SQL. There are many open-source SQL database management systems available. For example, SQLite is a popular free option that uses fewer system resources than a full implementation of Oracle.

Once you have installed an open-source SQL database, choose a SQL tutorial that suits your needs. You can find some online courses on platforms like Udemy or Coursera. In addition, you can watch free videos on YouTube. Another option is to buy a book about SQL programming. It will be easier if the book is written for your particular open source SQL database management system. Here is a SQLite book on Amazon.

Then, try to write queries on your own. It is not easy at first, but you will get better with practice. Do not hesitate to ask questions on GitHub discussion forums, Slack channels, or Reddit if you have any doubts. Open-source communities are known for being generous in sharing knowledge and 'paying it forward'.

High-Level Features And Commands

Here is an overview of frequently used SQL commands and features; a short combined example follows the list.

  1. Data Definition Language (DDL): It contains commands that define the data. The commands are:
    create: used to create a table.
    drop: used to delete a table, including all of its attributes.
    alter: a reserved word used to modify the table's structure.
    rename: a table name can be changed using the reserved word 'rename'.

  2. Data Manipulation Language (DML): It contains commands to manipulate the data. The commands are:
    select: gets data from a database table.
    insert: generally used after the create command to insert a set of values into a table.
    delete: deletes particular tuples or rows from a table.
    update: updates the tuples in a table.
    Triggers: actions performed when certain conditions are met on the data.
    A trigger contains three parts:
    event: the change in the database that activates the trigger.
    condition: a query or test that is run when the trigger is activated.
    action: a procedure executed when the trigger is activated and its condition is true.

  3. Client-server execution and remote database access: Separates the data (server) from the application (client). A database will generally have one or more servers, which provide the storage space, manage the data, create backups, etc. The clients are software programs running on a user's computer.

  4. Security and authentication: Rather than allowing the user to access every aspect of the database, SQL allows users to perform operations on specific tables such as retrieving information about certain fields, sorting it into different orders, updating specific records, or deleting them altogether. This limits data access to specifically authorized tables rather than the whole database.

  5. Embedded SQL: SQL can be embedded in host languages such as C, COBOL, and Java, so that queries can be issued from those languages at runtime.

  6. Transaction Control Language (TCL): Transactions are a critical element of a DBMS, and TCL controls them with commands like commit, rollback, and savepoint.
    commit: saves the state of the database at any point at which it is consistent.
    rollback: rolls back/undoes changes to the previous commit or savepoint.
    savepoint: marks a point within a transaction to which you can later roll back without undoing the entire transaction.

  7. Advanced SQL: Current features include object-oriented extensions, recursive queries, decision-support queries, and query support for areas like data mining, spatial data, and XML (eXtensible Markup Language).
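
The short script below ties several of these categories together. It is a generic sketch: the table t_user and its columns are made up for illustration, and exact transaction syntax varies slightly between database products.

CREATE TABLE t_user (id INT PRIMARY KEY, name VARCHAR(50));  -- DDL: create a table

BEGIN;                                                        -- TCL: start a transaction
INSERT INTO t_user (id, name) VALUES (1, 'Alice');            -- DML: insert a row
UPDATE t_user SET name = 'Alice Lee' WHERE id = 1;            -- DML: update the row
SAVEPOINT after_update;                                       -- TCL: mark a savepoint
DELETE FROM t_user WHERE id = 1;                              -- DML: delete the row
ROLLBACK TO SAVEPOINT after_update;                           -- TCL: undo only the delete
COMMIT;                                                       -- TCL: persist the remaining changes

SELECT * FROM t_user;                                         -- DML: query; returns the updated row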

Future of SQL

Agnostic Queries
SQL agnostic query tools are becoming more popular due to their flexibility when working with data in multiple formats across multiple databases. ShardingSphere's open-source SQL Parser database plug-in is an excellent example of this trend.

It allows users to work with data regardless of the underlying open-source SQL database. In addition, you can use this tool to parse SQL queries, allowing developers to easily analyze and modify their code. ShardingSphere's SQL Parser is designed with any open-source SQL database in mind, making it a versatile tool for developers.

It can be used to quickly parse complex SQL queries, making it easier to understand and modify the code. Additionally, the tool can generate execution plans for queries, helping developers optimize their queries for performance.

Open-Source SQL
Open-source versions of SQL and SQL databases are continuing to outpace industry giants Microsoft and Oracle for the new class of cloud-based, data-powered apps being built. This is due, in part, to the open-source community development model, which can quickly adapt and improve products using the collective 'wisdom of the crowd'. Additionally, open-source solutions are often more affordable than their proprietary counterparts.

For example, the open-source database PostgreSQL has seen significant growth in recent years. According to the latest figures from DB-Engines, PostgreSQL is now the fourth most popular database in the world, after MySQL, Microsoft SQL Server, and Oracle Database. This growth is largely because PostgreSQL is open-source, mature, and freely available for download. But it's also just as powerful and feature-rich as a commercial Oracle database.

Democratization of Data Access
SQL is a powerful language that can help you collaborate and write reports. While it can be intimidating for beginners, more non-technical users are starting to learn SQL to get the most out of their data.

For example, product managers and analysts use SQL to analyze their data. They might still lean on Excel spreadsheets or Google Docs, but they know it's time to learn SQL if they find themselves creating more complicated queries against ever-larger datasets.

SQL is open source, which means it's available for anyone to use and customize. This makes it a popular choice for analysts and developers who want to build their reports and applications.

SQL Collaboration
Open source tools like DBeaver are increasing in popularity for collaborative SQL development. These tools make it easy to share your work with other developers and make changes and corrections collaboratively. This can save time and improve the quality of your SQL code.

DBeaver is a free, open-source tool available for Windows, Mac, and Linux. It supports all of the most popular databases, including MySQL, PostgreSQL, Oracle, Amazon Redshift, H2, Apache Derby, and more.

Collaborative SQL development is an efficient way to work with other developers. Working on open-source tools like DBeaver can save time for programmers by allowing them to share code and corrections easily. Additionally, open-source tools are frequently updated with new features, making it easier for developers to keep up with the latest trends. For these reasons, open-source collaborative SQL development is likely to increase in popularity in the years to come.

Conclusion

As the world of technology continues to change, so do programming languages. SQL is a universal language that can be used by programmers and non-programmers alike. It's easy for people without computer science backgrounds to get started with coding because it does not require any previous experience or knowledge of other programming languages like JavaScript or Python. This article has shown you why there are plenty of reasons to learn how to code in this popular language, and what steps you should take next!

How to Choose the Best Encryption Methods for Databases

Introduction

Encryption is the process of encoding messages or information so that only authorized parties can read it. Encryption has been around for centuries. For example, in World War II, Allied forces sent coded messages using the unwritten Navajo language, which proved impossible for Japanese forces to decode.

Today, encryption is even more important because we live in an era where privacy and security are constantly under attack from hackers who want access to our personal lives. Thanks to modern technology like AES encryption, there's no reason why hackers should be able to read sensitive information.

The Future Trends Driving Open-Source Database Programs

Introduction

There are now thousands of options for deciding what open source project to choose for in-house development or what project to join as a contributor.

According to the open-source Wikipedia page, there are “more than 180,000 open-source projects available and more than 1,400 unique licenses, the complexity of deciding how to manage open-source use within ‘closed-source’ commercial enterprises has dramatically increased.”

There are infinite combinations for what an open-source tech stack might look like for a given project. However, the petabytes of data these apps produce will end up in one location: Databases. And specifically, open-source databases, tools, and middleware to optimize and access that data. 

So, if you want to back a winner in choosing a project to contribute to or build upon, open-source database programs are a good bet. 

What is driving the shift from proprietary, vendor-driven innovation to the open-source model for data management? We outline the top trends shaping the future of open source database programs. And if you have not started by joining an open-source project, we highlight a few example projects where you can help lead the way.

Cloud Computing and SaaS

The one-two punch of open source code repositories and cloud computing has permanently disrupted tech innovation. 'Data is the new oil' has collided with 'software is eating the world' to create the SaaS industry. And it's all in the cloud. Developers can scale services to fit their needs, customize applications, and access cloud services from anywhere, on any device with an internet connection. For users, SaaS software is already developed 'out of the box' and automatically paired with a database, where the expectation is that any action or query performed in the software will get an instant response, whether on desktop or mobile.

In a sense, this all works like magic to users but to developers, it’s a daily challenge to come up with new computing models and code to keep it all running. Open-source communities have become the engine that drives this innovation.

Under this relatively new computing model, developers can get applications to market quickly, without heavy investments in infrastructure costs and maintenance. This was simply not possible 20 years ago. It’s turning virtually every company into a software-driven company. The cloud, paired with open source databases, gives developers access to the innovative storage and access technologies available to the Amazons and Googles of the world, beyond what proprietary vendors can possibly do to keep up. In its latest quarter reported in September 2021, Oracle’s proprietary database license revenue was down 8% over last year, but its cloud business was up 40%.

Take traditional banks with their proprietary databases and code. Suddenly they find themselves competing for share-of-wallet with challenger banks like Chime. Chime exists only in the cloud (it has no branches) and presents itself as a mobile-first banking SaaS app. It has built a billion-dollar company using a cloud-first, open-source strategy that established banks are scrambling to equal.

Hybrid Cloud

Hybrid Cloud is where the open-source databases will shine. As the name implies, hybrid clouds use a combination of on-premises, private cloud, and third-party cloud services with orchestration between the platforms. Common applications work between the models. This configuration has grown recently to support the need by enterprises to keep certain workloads on-premises while enjoying the benefits the cloud has to offer. 

Open source allows them to have a common set of tools, even if the databases are different, that are adapted to all environments for use cases such as disaster and recovery and balancing workloads.

Big Data 

The rapidly increasing volume and complexity of data are due to growing mobile traffic, cloud-computing traffic, and the proliferation of new technologies including IoT and AI. According to Research and Markets' 2020 report on data usage, over 2.5 quintillion (that's 10^18) bytes of data are generated every day. Constant innovation in data storage and retrieval is needed to keep up. Open source is showing itself to be the best approach to innovation in data science, along with hardware optimization and coding efficiency.

Data is the most crucial reason open-source database projects will remain among the most popular (second only to operating systems). It keeps piling up, and new applications like IoT and social media produce vast amounts of it daily, in high volume, at high velocity, and in highly variable formats.

There need to be ways to analyze all this data, at scale. This is where open-source databases are outperforming traditional ones. Even though the dynamics of open source and community-developed projects have changed in recent years, communal development is still the best way to promote innovation in data access for 99% of the applications out there.

Database Agnostic Tools and Middleware

The new reality is a technology universe where firms have a more distributed approach to database services. The need is to explore multiple database instances from various database vendors to create hybrids that can be hosted on-prem, cloud or both, but simplified by a standard set of tools and middleware to access the data.

Whether the database is SQL or NoSQL there is a commonality underpinning modern relational databases in tables, rows, and columns. This evergreen structure allows new access tools to emerge that unite data distributed across hybrid infrastructures. For example, open-source solutions such as ShardingSphere have SQL agnostic tools to query and retrieve data across distributed data stores.

Call it a fear of commitment on the part of customers, but the trend is that customers don’t want to be locked into a single massive vendor like Microsoft or IBM anymore.

Source Code Access

Vendor lock-in was considered a good thing 30 years ago. You had one ‘neck to choke’ if something went wrong, and the vendor had a staff of engineers on hand to patch and update continually. You also paid a fortune for the licenses and a hefty annual fee to support and upgrade the software. A single Oracle-based app in the enterprise can have a lifetime cost running into the millions of dollars, just for the database software and maintenance, not including development.

Vendor lock-in is no longer cool in the era of agile development and ‘break things fast’ code sprints. The key to open-source is the ‘source’ part. Having access to communally curated source code allows developers to make changes based on their needs and priorities, not of the software vendor. 

Open Source Communities

Access to the source code is not just valuable to a single developer or company. It’s valuable to the entire ecosystem of open-source community members. It’s a virtuous cycle where the more people contribute to make the software stable and useful, the more people will join a project, and so on.

Much like how a good blog post or a tweet spreads virally, great open-source software leverages network effects. It is the community that is the source of promotion for that virality. Dozens of contributor communities and thousands of developers worldwide are happily iterating open-source code for database projects built natively as distributed systems from the ground up. 

If worked on by a diverse community, this approach leads to more stable software than a single team of developers hacking away at bugs in a proprietary system. Examples of forward-thinking open-source database communities include Apache ShardingSphere, CockroachDB, YugabyteDB, and ClickHouse. For example, ShardingSphere has over 450 contributors to its open-source codebase in Asia alone and is spreading rapidly around the world. Significant growth in contribution and adoption for database software is certain in the coming years as companies demand ever greater access, speed, and control of their growing data streams.

Momentum is on the side of open-source database projects. There are rich communities of open-source contributors who have worked out the major flaws of 0.x versions and who create libraries, repositories, documentation, and even YouTube videos to 'pay it forward' for new contributors. A 2020 survey by O'Reilly Media and IBM polled 3,400 developers and tech managers. The survey reported:


  • 94% of respondents rated open-source software equal or better than proprietary software. 
  • 70% of respondents preferred open-source cloud providers.
  • 65%  of respondents agreed that contributing to open-source projects results in better professional opportunities.

The Virtual Software Catalog

Open-source apps are easy to find. Do a Google search for something like 'open-source data backup and restore' and the top result, 'The Top 17 Free and Open Source Backup Solutions', shows there are at least that many, and probably many more.

As of January 2020, GitHub reports having 40+ million users and 190+ million repositories (including 28 million public repositories). You can search GitHub in many ways; for example, GitHub supports advanced search in certain fields, like repository title, description, and README. If you want to find a cool repository to learn database topics, you can search like this: in:name database.

You are more likely to find a relevant solution ready to modify in the open-source software universe than in a proprietary product suite. 

CIOs and Security

More than two-thirds of CIOs are concerned about losing their freedom to cloud providers. As a result, this has become another main driver of open-source database adoption.

In the age of ransomware, data security is a matter of survival for enterprises. Open-source technology enables organizations to take complete control over their security needs by providing full access to source code and configuration to extend the software however they like.

There is certainly a counter-argument regarding the security of open source, but rapid adoption by enterprises seems to be settling the argument in open source's favor. No company will remain untouched by the power of open-source database progress.

Emerging Examples

As this multiverse progresses, we will experience a tapestry of emerging technologies that will become the top choice for open-source database programs. 

As an example, the emerging open-source ecosystem provided by Apache ShardingSphere, with its technology-agnostic, plug-and-play distributed database modules, offers ease of use and adoption for developers, whether as contributors or in the enterprise.

Kubernetes (the cloud container orchestration technology originally developed by Google and now open source) is another key platform of choice for open-source database deployments that programmers can benefit from.

In the next 5 years, open-source development will be driven by necessity. Industry trends predict that open-source won’t be an option, it will be the standard. As a result, corporations will have to embrace it to stay relevant.

How to Develop Your Distributed SQL Statement in Apache ShardingSphere

In the previous articles “An Introduction to DistSQL” and “Integrating SCTL Into DistSQL’s RAL— Making Apache ShardingSphere Perfect for Database Management”, the Apache ShardingSphere committers shared the motivations behind the development of DistSQL, explained its syntax system, and impressively showcased how you can use just one SQL to create a sharding table.

Today, to help you gain a better understanding of DistSQL and develop your own DistSQL syntax, our community author analyzes the design & development process of DistSQL and showcases how you can implement a brand new DistSQL grammar in four stages of the development life cycle (i.e. demand analysis, design, development & testing).

Integrating SCTL Into DistSQL’s RAL

In the previous article “An Introduction to DistSQL” written by Haoran Meng, the Apache ShardingSphere Committer shared the motivating reasons behind the design of DistSQL, explained its syntax system, and impressively showcased how you can use one SQL to create a sharding table.

Recently, the ShardingSphere community has redesigned the SCTL grammar and the execution engine, integrating SCTL into the DistSQL syntax system. Now RAL contains the old SCTL function, making ShardingSphere’s command language even more convenient for database management. Today, our community author would like to introduce the changes and elaborate on how you can use the new RAL command lines. We always pursue a better user experience, and the upgrade we developed this time is just another typical example.