How to Run SQL Queries With Presto on Google BigQuery

Presto has evolved into a unified SQL engine on top of cloud data lakes, serving both interactive queries and batch workloads across multiple data sources. This tutorial shows how to run SQL queries with Presto (running on Kubernetes) against Google BigQuery.

Presto’s BigQuery connector lets you query data stored in BigQuery. It can also be used to join data across different systems, such as BigQuery and Hive. The connector uses the BigQuery Storage API to read data from the tables.
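To make the cross-system join concrete, here is a minimal Python sketch that composes such a query. Presto addresses tables as catalog.schema.table, and with the BigQuery connector a BigQuery dataset is exposed as a schema. The catalog names (bigquery, hive), the schemas, the tables, and the columns below are all hypothetical and chosen only for illustration; the sketch builds the SQL text and does not connect to a cluster.

```python
def qualified_name(catalog: str, schema: str, table: str) -> str:
    """Return a Presto table reference in catalog.schema.table form."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query(bq_table: str, hive_table: str) -> str:
    """Compose a join between a BigQuery-backed table and a Hive-backed table."""
    return (
        "SELECT o.order_id, o.amount, c.customer_name\n"
        f"FROM {bq_table} AS o\n"
        f"JOIN {hive_table} AS c\n"
        "  ON o.customer_id = c.customer_id"
    )


# Hypothetical catalogs, schemas, and tables, for illustration only.
orders = qualified_name("bigquery", "sales", "orders")
customers = qualified_name("hive", "crm", "customers")
query = build_federated_query(orders, customers)
print(query)
```

The resulting string could be submitted through any Presto client. The actual catalog names depend on how the catalogs are configured in your deployment.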

Utilizing BigQuery as a Data Warehouse in a Distributed Application

Introduction

Data plays an integral part in any organization. In modern, data-driven organizations, most business and technology decisions are based on the available data. Suppose we have an application distributed across multiple servers in different regions of a cloud service provider, and we need to store that application's data in a centralized location. The natural solution is some type of database. However, traditional databases are ill-suited to extremely large datasets and lack features that support data analysis. In that situation, we need a proper data warehousing solution such as Google BigQuery.

What is Google BigQuery?

BigQuery is an enterprise-grade, fully managed data warehousing solution that is part of the Google Cloud Platform. It is designed to store and query massive datasets, and it lets users manage data through the BigQuery data manipulation language (DML), which is based on the standard SQL dialect.

Using MySQL as a Cache Layer for BigQuery

Cache layer

BigQuery is great at handling large datasets but will never give you a sub-second response, even on small datasets. This leads to wait times on dashboards and charts, especially dynamic ones where users can select different date ranges or change filters. That is almost always acceptable for internal BI tools but not for customer-facing analytics. We tolerate a lot in internal tools, such as poor UI and performance, but not in the products we ship to customers.

But we can still leverage BigQuery’s cheap data storage and its power to process large datasets without giving up performance. BigQuery acts as the single source of truth and stores all the raw data, while MySQL acts as a cache layer on top of it, storing only small, aggregated tables and providing the desired sub-second response.
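The cache-layer pattern can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two in-memory SQLite databases from the standard library stand in for BigQuery (raw data) and MySQL (aggregated cache), and the table names, columns, and helpers (events, daily_revenue, refresh_cache, revenue_between) are hypothetical. A real setup would use the BigQuery and MySQL client libraries and run the refresh on a schedule.

```python
import sqlite3

# Stand-in for BigQuery: holds the raw, event-level data (source of truth).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE events (day TEXT, customer_id INTEGER, amount REAL)")
warehouse.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2024-01-01", 1, 10.0),
        ("2024-01-01", 2, 5.0),
        ("2024-01-02", 1, 7.5),
    ],
)

# Stand-in for MySQL: holds only the small aggregated table.
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE daily_revenue (day TEXT PRIMARY KEY, revenue REAL)")


def refresh_cache(warehouse, cache):
    """Run the heavy aggregation on the warehouse and replace the cached table."""
    rows = warehouse.execute(
        "SELECT day, SUM(amount) FROM events GROUP BY day"
    ).fetchall()
    with cache:  # commits on success, rolls back on error
        cache.execute("DELETE FROM daily_revenue")
        cache.executemany("INSERT INTO daily_revenue VALUES (?, ?)", rows)


def revenue_between(cache, start, end):
    """Serve a dashboard query from the cache only, without touching the warehouse."""
    return cache.execute(
        "SELECT day, revenue FROM daily_revenue "
        "WHERE day BETWEEN ? AND ? ORDER BY day",
        (start, end),
    ).fetchall()


refresh_cache(warehouse, cache)
result = revenue_between(cache, "2024-01-01", "2024-01-02")
print(result)
```

The design choice here is that dashboards only ever read the small aggregated table, so the expensive aggregation runs once per refresh rather than once per user interaction.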