Configuring RaptorX Multi-Level Caching With Presto

RaptorX and Presto: Background and Context

Meta introduced a multi-level cache at PrestoCon 2021, the open-source community conference for Presto. Code-named the “RaptorX Project,” it aims to make Presto 10x faster on Meta-scale petabyte workloads. This powerful feature is available only in PrestoDB and not in other versions or forks of the Presto project.

Presto is the open-source SQL query engine for data analytics and the data lakehouse. It lets you scale compute and storage independently and reduce costs. However, storage-compute disaggregation also brings new challenges for query latency: scanning huge amounts of data between the storage tier and the compute tier is I/O-bound over the network. As with any database, optimized I/O is a critical concern for Presto. When possible, the priority is to avoid performing I/O at all, which makes memory utilization and caching structures of utmost importance.
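
As an illustration of how these caches are turned on, the sketch below shows the kinds of properties RaptorX exposes in a Hive catalog and in the worker configuration. The exact property names, sizes, and paths here are assumptions drawn from the RaptorX announcement and should be verified against your PrestoDB version:

    # etc/catalog/hive.properties (illustrative values)
    # Soft affinity scheduling: send splits for the same file to the same worker so its cache stays warm
    hive.node-selection-strategy=SOFT_AFFINITY

    # Local data cache on worker SSD (Alluxio-backed)
    cache.enabled=true
    cache.type=ALLUXIO
    cache.base-directory=file:///mnt/flash/data
    cache.alluxio.max-cache-size=500GB

    # Directory-listing (file status) cache
    hive.file-status-cache-expire-time=24h
    hive.file-status-cache-size=100000000
    hive.file-status-cache-tables=*

    # etc/config.properties (illustrative values)
    # Fragment result cache for partially computed leaf-stage results
    fragment-result-cache.enabled=true
    fragment-result-cache.base-directory=file:///mnt/flash/fragment
    fragment-result-cache.cache-ttl=24h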

How to Query Your AWS S3 Bucket With Presto SQL

As more structured, semi-structured, and unstructured data is stored in AWS S3, it becomes harder to use that data to meet critical business needs.

The problem lies in running successful, cost-efficient queries. Querying a mix of structured, semi-structured, and unstructured data in the cloud becomes expensive, and the price jumps when you need to construct, manage, and integrate various types of information to obtain results.
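
One common pattern is to point a Hive catalog at S3, register the files as an external table, and query everything with plain SQL. In the sketch below, the bucket path, schema, and column names are hypothetical:

    -- Register Parquet files already sitting in S3 as an external table
    CREATE TABLE hive.default.customer_events (
        event_id   varchar,
        event_time timestamp,
        payload    varchar
    )
    WITH (
        format = 'PARQUET',
        external_location = 's3a://my-bucket/events/'
    );

    -- Query it like any other table
    SELECT event_id, count(*) AS event_count
    FROM hive.default.customer_events
    GROUP BY event_id
    ORDER BY event_count DESC
    LIMIT 10;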

Best Practices for Resource Management in PrestoDB

Resource management in databases allows administrators to control resources and assign priorities to sessions, ensuring that the most critical transactions get a significant share of system resources. In a distributed environment, resource management makes data more accessible and manages resources across a network of autonomous computers, with resource sharing as its foundation.

PrestoDB is a distributed query engine written by Facebook as the successor to Hive for highly scalable processing of large volumes of data. Built for the Hadoop ecosystem, PrestoDB is designed to scale to tens of thousands of nodes and process petabytes of data. To be usable at production scale, PrestoDB was built to serve thousands of queries to multiple users without bottlenecks or “noisy neighbor” issues. PrestoDB uses resource groups to organize how different workloads are prioritized. This post discusses the paradigms that PrestoDB introduces with resource groups, as well as best practices and considerations to think about before setting up a production system with resource grouping.
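
To make the resource group model concrete, here is a minimal sketch of the file-based configuration. The group names, limits, and selector rules are illustrative assumptions, not tuning recommendations:

    etc/resource-groups.properties:
        resource-groups.configuration-manager=file
        resource-groups.config-file=etc/resource_groups.json

    etc/resource_groups.json:
        {
          "rootGroups": [
            {
              "name": "global",
              "softMemoryLimit": "80%",
              "hardConcurrencyLimit": 100,
              "maxQueued": 1000,
              "subGroups": [
                {"name": "adhoc", "softMemoryLimit": "20%", "hardConcurrencyLimit": 10, "maxQueued": 100},
                {"name": "etl", "softMemoryLimit": "60%", "hardConcurrencyLimit": 50, "maxQueued": 500}
              ]
            }
          ],
          "selectors": [
            {"source": ".*etl.*", "group": "global.etl"},
            {"group": "global.adhoc"}
          ]
        }

Queries whose source matches the ETL pattern land in global.etl; everything else falls through to global.adhoc.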

Scaling With Presto on Spark

This blog was co-written with Shradha Ambekar, Staff Software Engineer at Intuit and Ariel Weisberg, Software Engineer at Facebook.

Overview

Presto was originally designed to run interactive queries against data warehouses, but it has since evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads.

Tutorial: How to Define SQL Functions With Presto Across All Connectors

Presto is the open-source SQL query engine for data lakes. It supports many native functions, which are usually sufficient for most use cases. However, there may be corner cases where you need to implement your own function. To simplify this, Presto allows users to define expressions as SQL functions. These are dynamic functions separated from the Presto source code, managed by a function namespace manager that you can set up with a MySQL database. In fact, this is one of the most widely used features of Presto at Facebook, with thousands of functions defined.
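
For example, once a function namespace is available (mysql.default is an assumed name here), a SQL function can be created and then invoked like any built-in function:

    -- Define a reusable expression as a SQL function in the mysql.default namespace
    CREATE FUNCTION mysql.default.sha256_hex(x varchar)
    RETURNS varchar
    COMMENT 'hex-encoded SHA-256 of a string'
    LANGUAGE SQL
    DETERMINISTIC
    RETURNS NULL ON NULL INPUT
    RETURN to_hex(sha256(to_utf8(x)));

    -- Call it by its fully qualified name
    SELECT mysql.default.sha256_hex('hello');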

Function Namespace Manager

A function namespace is a special catalog.schema that stores functions, for example mysql.test. Each catalog.schema can be a function namespace. A function namespace manager is a plugin that manages a set of these function catalog schemas. The catalog can be mapped to a connector in Presto (a connector for functions only, with no tables or views) and allows the Presto engine to perform actions such as creating, altering, and deleting functions.
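
A minimal sketch of wiring up the MySQL-backed manager is shown below; the JDBC URL and table names are placeholders to adapt to your environment:

    etc/function-namespace/mysql.properties:
        function-namespace-manager.name=mysql
        database-url=jdbc:mysql://localhost:3306/presto?user=root&password=secret
        function-namespaces-table-name=function_namespaces
        functions-table-name=sql_functions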

Tutorial: How to Run SQL Queries With Presto on Amazon Redshift

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries and batch workloads across multiple data sources. This tutorial will show how to run SQL queries with Presto (running on Kubernetes) on Amazon Redshift.

Presto's Redshift connector allows querying the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 
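
A sketch of what the setup typically looks like is below; the cluster endpoint, credentials, and table names are placeholders:

    etc/catalog/redshift.properties:
        connector.name=redshift
        connection-url=jdbc:redshift://redshift-cluster.example.com:5439/dev
        connection-user=awsuser
        connection-password=secret

    -- Query Redshift through Presto, or join it with data in another catalog
    SELECT * FROM redshift.public.orders LIMIT 10;

    SELECT o.order_id, c.segment
    FROM redshift.public.orders o
    JOIN hive.default.customers c ON o.customer_id = c.customer_id;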

How to Run SQL Queries With Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries and batch workloads across multiple data sources. This tutorial will show you how to run SQL queries with Presto (running on Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
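
A sketch of the catalog configuration and a sample query follows; the project ID, credentials path, and dataset/table names are placeholders, and the credentials property may differ by PrestoDB version:

    etc/catalog/bigquery.properties:
        connector.name=bigquery
        bigquery.project-id=my-gcp-project
        bigquery.credentials-file=/etc/presto/bigquery-service-account.json

    -- Query a BigQuery table through Presto
    SELECT * FROM bigquery.my_dataset.page_views LIMIT 10;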