Best Practices for Resource Management in PrestoDB

Resource management in databases gives administrators control over resources and lets them assign priorities to sessions, ensuring that the most critical transactions get a significant share of system resources. In a distributed environment, resource management makes data more accessible and manages resources across a network of autonomous computers (i.e., a distributed system). Resource sharing is also a foundation of resource management in distributed systems.

PrestoDB is a distributed query engine created at Facebook as the successor to Hive for highly scalable processing of large volumes of data. Written for the Hadoop ecosystem, PrestoDB is built to scale to tens of thousands of nodes and process petabytes of data. To be usable at production scale, PrestoDB was built to serve thousands of queries from multiple users without bottlenecks or “noisy neighbor” issues. PrestoDB uses resource groups to organize how different workloads are prioritized. This post discusses some of the paradigms that PrestoDB introduces with resource groups, as well as best practices and considerations to think through before setting up a production system with resource grouping.
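
For illustration, here is a minimal sketch of a resource group definition for the file-based resource group manager (conventionally etc/resource-groups.json). The group names, limits, and selector rules are hypothetical placeholders, not tuning recommendations:

    {
      "rootGroups": [
        {
          "name": "global",
          "softMemoryLimit": "80%",
          "hardConcurrencyLimit": 100,
          "maxQueued": 1000,
          "subGroups": [
            {"name": "adhoc", "softMemoryLimit": "20%", "hardConcurrencyLimit": 10, "maxQueued": 100},
            {"name": "etl", "softMemoryLimit": "60%", "hardConcurrencyLimit": 20, "maxQueued": 200}
          ]
        }
      ],
      "selectors": [
        {"source": "etl-.*", "group": "global.etl"},
        {"group": "global.adhoc"}
      ]
    }

Selectors are evaluated in order, so the catch-all rule that routes everything else to global.adhoc comes last.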

Querying Kafka Topics Using Presto

Presto is a distributed query engine that can query different data sources such as Kafka, MySQL, MongoDB, Oracle, Cassandra, and Hive using SQL. It can analyze big data and query multiple data sources together.

In this article, we will discuss how Presto can be used to query Kafka topics. Below is the step-by-step process to set up Presto and Kafka and connect them together. The walkthrough uses macOS, but a similar setup works on other systems.
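
As a preview of what the connection looks like, a Kafka catalog in Presto is a properties file under etc/catalog/; the broker address and topic names below are placeholders:

    # etc/catalog/kafka.properties
    connector.name=kafka
    kafka.nodes=localhost:9092
    kafka.table-names=orders,users
    kafka.hide-internal-columns=false

Once the coordinator picks up the catalog, each topic can be queried like a table, for example:

    SELECT _partition_id, _message
    FROM kafka.default.orders
    LIMIT 10;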

Data Federation With CockroachDB and Presto

Motivation

A customer inquired whether data federation is possible natively in CockroachDB. Unfortunately, CockroachDB does not support features such as foreign data wrappers. A quick search returned a slew of possibilities, and Presto, being a prominent choice, sparked my interest.

High-Level Steps

  • Install Presto
  • Configure the PostgreSQL catalog (a configuration sketch follows this list)
  • Configure the TPCH catalog
  • Verify
  • Wrap up
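
As the sketch for the catalog steps above, federation boils down to defining one catalog per source. CockroachDB speaks the PostgreSQL wire protocol, which is why the PostgreSQL connector applies; the connection details and table names below are placeholders:

    # etc/catalog/postgresql.properties
    connector.name=postgresql
    connection-url=jdbc:postgresql://localhost:26257/defaultdb
    connection-user=root
    connection-password=

    # etc/catalog/tpch.properties
    connector.name=tpch

With both catalogs in place, a single query can join data across them:

    SELECT n.name, sum(c.acctbal) AS balance
    FROM tpch.tiny.customer c
    JOIN postgresql.public.nations n ON c.nationkey = n.id
    GROUP BY n.name;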

Step-by-Step Instructions

Install Presto

I'm using a Mac, and luckily there's a Homebrew package available.

Scaling With Presto on Spark

This blog was co-written with Shradha Ambekar, Staff Software Engineer at Intuit and Ariel Weisberg, Software Engineer at Facebook.

Overview

Presto was originally designed to run interactive queries against data warehouses, but it has since evolved into a unified SQL engine on top of open data lake analytics for both interactive and batch workloads. Popular workloads on data lakes range from interactive, ad hoc analysis to large batch ETL jobs.

Tutorial: How to Define SQL Functions With Presto Across All Connectors

Presto is the open-source SQL query engine for data lakes. It supports many native functions, which are sufficient for most use cases. However, there may be corner cases where you need to implement your own function. To simplify this, Presto allows users to define expressions as SQL functions. These are dynamic functions, separate from the Presto source code, managed by a function namespace manager that you can set up with a MySQL database. In fact, this is one of the most widely used features of Presto at Facebook, with thousands of functions defined.
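
For a flavor of the syntax, here is a hypothetical function created in a mysql.test namespace (the function itself is a made-up placeholder):

    CREATE FUNCTION mysql.test.epoch_to_date(ts BIGINT)
    RETURNS DATE
    DETERMINISTIC
    RETURNS NULL ON NULL INPUT
    RETURN CAST(from_unixtime(ts) AS DATE);

Because the function lives in a namespace rather than in a connector's code, it can be called from queries against any catalog.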

Function Namespace Manager

A function namespace is a special catalog.schema that stores functions, in a format like mysql.test. Each catalog.schema can be a function namespace. A function namespace manager is a plugin that manages a set of these function catalog schemas. The catalog can be mapped to a connector in Presto (a connector for functions rather than tables or views), allowing the Presto engine to perform actions such as creating, altering, and deleting functions.
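
A minimal sketch of enabling the MySQL-backed function namespace manager; the file name and connection details are placeholders:

    # etc/function-namespace/mysql.properties
    function-namespace-manager.name=mysql
    database-url=jdbc:mysql://localhost:3306/presto
    function-namespaces-table-name=function_namespaces
    functions-table-name=sql_functions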

Hands-On Presto Tutorial: Presto 105

Introduction

This is the fifth tutorial in our Getting Started with Presto series, following tutorials 101 through 104.

Presto is an open-source, distributed, parallel SQL query engine that runs on a cluster of nodes. In this tutorial, we will show you how to run Presto with AWS Glue as a catalog on a laptop.
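
As a preview, pointing the Hive connector at Glue instead of a traditional Hive metastore is a small catalog change; the region below is a placeholder:

    # etc/catalog/glue.properties
    connector.name=hive-hadoop2
    hive.metastore=glue
    hive.metastore.glue.region=us-east-1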

Hands-On Presto Tutorial: Presto 103

Introduction

This tutorial is Part III of our Getting Started with Presto series. As a reminder, PrestoDB is an open-source distributed SQL query engine. In tutorial 101, we showed you how to install and configure Presto locally, and in tutorial 102, we covered how to run a three-node PrestoDB cluster on a laptop. In this tutorial, we’ll show you how to run a PrestoDB cluster in a GCP environment using VM instances and GKE containers.

Environment

This guide was developed on GCP VM instances and GKE containers.

Tutorial: How to Run SQL Queries With Presto on Amazon Redshift

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries and batch workloads across multiple data sources. This tutorial will show how to run SQL queries with Presto (running on Kubernetes) on Amazon Redshift.

Presto's Redshift connector allows querying the data stored in an external Amazon Redshift cluster. This can be used to join data between different systems like Redshift and Hive, or between two different Redshift clusters. 
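
A sketch of the catalog configuration, with placeholder connection details:

    # etc/catalog/redshift.properties
    connector.name=redshift
    connection-url=jdbc:redshift://example.redshift.amazonaws.com:5439/dev
    connection-user=awsuser
    connection-password=secret

A hypothetical cross-system join between a Hive table and a Redshift table might then look like:

    SELECT h.user_id, r.total_spend
    FROM hive.web.clicks h
    JOIN redshift.public.customer_spend r ON h.user_id = r.user_id;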

How to Run SQL Queries With Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes for both interactive queries and batch workloads across multiple data sources. This tutorial will show you how to run SQL queries with Presto (running on Kubernetes) on Google BigQuery.

Presto’s BigQuery connector allows querying the data stored in BigQuery. This can be used to join data between different systems like BigQuery and Hive. The connector uses the BigQuery Storage API to read the data from the tables.
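
A minimal catalog sketch, with a placeholder project ID and credentials path:

    # etc/catalog/bigquery.properties
    connector.name=bigquery
    bigquery.project-id=my-gcp-project
    bigquery.credentials-file=/etc/presto/bigquery-key.json

Tables can then be referenced by their fully qualified names (catalog.dataset.table) and joined with tables from any other catalog, such as Hive.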

Hands-on Presto Tutorial: Presto 101

In this blog, we'll show you how to get started with Presto, the open-source SQL query engine for the data lake. By the end, you'll be able to run Presto locally on your machine.

Presto Installation

Presto can be installed manually or by using Docker images.

How Carbon Uses PrestoDB With Ahana to Power Real-Time Customer Dashboards

The author, Jordan Hoggart, was not compensated by Ahana for this review.

The Background

At the base of Carbon’s real-time, first-party data platform is our analytics component, which combines a range of behavioral, contextual, and revenue data. That data is then displayed within a dashboard as a series of charts, graphs, and breakdowns, giving a visual representation of the most important actionable data. Whilst we pre-calculate as much of the information as possible, different filters allow users to drill deeper into the data, which makes querying critical.

What is Cost-based Optimization?

In our previous blog posts (1, 2), we explored how query optimizers may consider different equivalent plans to find the best execution path. But what does it mean for the plan to be the "best" in the first place? In this blog post, we will discuss what a plan cost is and how it can be used to drive optimizer decisions.

Example

Consider the following query:
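
(The query below is an illustrative stand-in over the TPC-H schema, of the kind a cost-based optimizer must reason about when choosing a join order and join strategy:)

    SELECT c.name, sum(o.totalprice) AS revenue
    FROM orders o
    JOIN customer c ON o.custkey = c.custkey
    WHERE o.orderdate >= DATE '2021-01-01'
    GROUP BY c.name;

Depending on the estimated size of each side after the filter, the optimizer may broadcast the smaller table or repartition both, and the two plans can have very different costs.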

Inside Presto Optimizer

Abstract

Presto is an open-source distributed SQL query engine for big data. Presto provides a connector API to interact with different data sources, including RDBMSs, NoSQL products, Hadoop, and stream processing systems. Created by Facebook, Presto has received wide adoption in the open-source world (Presto, Trino) and among commercial companies (e.g., Ahana, Qubole).

Presto comes with a sophisticated query optimizer that applies various rewrites to the query plan. In this blog post series, we investigate the internals of the Presto optimizer. In this first part, we discuss the optimizer interface and the design of the rule-based optimizer.

Presto and Open Analytics

So you have multiple data sources and most of your data has ended up in the cloud, or will soon. It’s a mix of structured and unstructured, static and streaming, in many different formats, and often fractured across data warehouses, open source databases, proprietary databases, data lakes, document stores and object storage like Amazon S3. How best to unify access to that data and share it with internal and external applications and users to support the ever-widening variety of analytical and operational use cases?  

It sounds challenging, but there are several solutions. One option is to try to consolidate all the data by moving it into a monolithic (or yet another) database, cloud warehouse, or data lake. But this is often impractical and time-consuming, and it is likely to increase cost, effort, and vendor lock-in. Let’s face it: adding another vendor to the mix is not high on the CIO’s to-do list.