What is Cost-based Optimization?

In our previous blog posts (1, 2), we explored how query optimizers may consider different equivalent plans to find the best execution path. But what does it mean for the plan to be the "best" in the first place? In this blog post, we will discuss what a plan cost is and how it can be used to drive optimizer decisions.

Example

Consider the following query:

Memoization in Cost-based Optimizers

Query optimization is an expensive process that needs to explore multiple alternative ways to execute the query. The query optimization problem is NP-hard, and the number of possible plans grows exponentially with the query’s complexity. For example, a typical TPC-H query may have up to several thousand possible join orders, 2–3 algorithms per join, a couple of access methods per table, some filter/aggregate pushdown alternatives, etc. Combined, this could quickly explode the search space to millions of alternative plans.

This blog post will discuss memoization — an important technique that allows cost-based optimizers to consider billions of alternative plans in a reasonable time.

Rule-Based Query Optimization

The goal of the query optimizer is to find the query execution plan that computes the requested result efficiently. In this blog post, we discuss rule-based optimization - a common pattern to explore equivalent plans used by modern optimizers. Then we explore the implementation of several state-of-the-art rule-based optimizers. Then we analyze the rule-based optimization in Apache Calcite, Presto, and CockroachDB.

Transformations

A query optimizer must explore the space equivalent execution plans and pick the optimal one. Intuitively, plan B is equivalent to plan A if it produces the same result for all possible inputs.

Assembling a Query Optimizer with Apache Calcite

Introduction

Apache Calcite is a dynamic data management framework with SQL parser, optimizer, executor, and JDBC driver.

Many examples of Apache Calcite usage demonstrate the end-to-end execution of queries using JDBC driver, some built-in optimization rules, and the Enumerable executor. Our customers often have their own execution engines and JDBC drivers. So how to use Apache Calcite for query optimization only, without its JDBC driver and Enumerable executor?