Hands-On Presto Tutorial: Presto 105

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto is an open source, distributed SQL query engine that runs on a cluster of nodes. In this tutorial, we will show you how to run Presto on a laptop with AWS Glue as the catalog.
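As a preview of where we are headed, here is a minimal sketch of what a Glue-backed catalog file might look like, assuming PrestoDB's Hive connector with its Glue metastore support; the file name and region are placeholders you would adjust for your own AWS account.

    # etc/catalog/glue.properties (the file name is just a convention)
    connector.name=hive-hadoop2
    hive.metastore=glue
    # AWS region of the Glue Data Catalog; adjust to your account
    hive.metastore.glue.region=us-east-1

Each .properties file under etc/catalog defines one catalog, so tables registered in Glue show up in Presto under this catalog's name.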

AWS Data Pipeline vs Glue vs Lambda: Who Is a Clear Winner?

AWS provides users with some of the most effective ETL tools for streamlined data management. Whether you are looking to implement a new platform, undertake third-party integrations, or simply move all your data to a warehouse, these ETL tools help you manage your data in a secure and private manner.

However, it is important to select the right AWS ETL tool for your specific requirements. Here, we compare three such tools – AWS Data Pipeline, AWS Glue, and AWS Lambda.

Setting Up Dev Endpoint Using Apache Zeppelin With AWS Glue

AWS Glue is a powerful, fully managed tool that relieves you of the hassle of maintaining infrastructure. It is hosted by AWS and offers serverless ETL, generating the ETL code in Python or Scala and executing it in a Spark environment.

AWS Glue provisions all the required resources (a Spark cluster) at runtime to execute a Spark job, which takes roughly 7-10 minutes before your actual ETL code starts running. To reduce this time, AWS Glue provides a development endpoint, which can be connected to Apache Zeppelin (provisioned with the Spark environment) to interactively run, debug, and test ETL code before deploying it as a Glue job or scheduling the ETL process.
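To make the workflow concrete, here is a minimal sketch of the kind of code you might run from a Zeppelin notebook attached to a development endpoint, assuming the awsglue PySpark library the endpoint provides; the database and table names are hypothetical.

    # Run from a Zeppelin notebook attached to a Glue development endpoint.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Reuse the Spark context supplied by the Zeppelin interpreter.
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Load a table registered in the Glue Data Catalog as a DynamicFrame.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",   # hypothetical Glue database
        table_name="orders"    # hypothetical Glue table
    )

    # Inspect the schema and a few rows before promoting the code to a Glue job.
    orders.printSchema()
    orders.toDF().show(5)

Because the endpoint's cluster is already running, each of these steps executes in seconds rather than waiting out a full job provisioning cycle.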

How to Set Up a Data Lake Architecture With AWS

Before we get down to brass tacks, it’s helpful to quickly list the specific benefits we want an ideal data lake to deliver. These would be:

  • The ability to collect any form of data from anywhere within an enterprise’s numerous data sources and silos, from revenue numbers to social media streams and anything in between.
  • A reduction in the effort needed to analyze or process the same data set for different purposes by different applications.
  • Cost efficiency across the whole operation, with the ability to scale storage and compute capacities as required, independently of each other.

And with those requirements in mind, let’s see how to set up a data lake with AWS.
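As a starting point, here is a minimal sketch of those first steps using boto3: an S3 bucket for raw data, a Glue database for metadata, and a crawler to populate the catalog. The bucket name, database name, crawler name, and IAM role ARN are hypothetical placeholders, and the bucket call assumes the default us-east-1 region.

    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")

    # 1. Storage layer: an S3 bucket to land raw data from any source.
    #    (Outside us-east-1, create_bucket also needs a LocationConstraint.)
    s3.create_bucket(Bucket="example-datalake-raw")

    # 2. Catalog layer: a Glue database to hold table metadata.
    glue.create_database(DatabaseInput={"Name": "datalake_raw"})

    # 3. A crawler that scans the bucket and registers tables in the catalog,
    #    so engines such as Presto or Athena can query the data in place.
    glue.create_crawler(
        Name="datalake-raw-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://example-datalake-raw/"}]},
    )
    glue.start_crawler(Name="datalake-raw-crawler")

Storage (S3) and compute (whatever engine queries the catalog) stay independent, which is what lets each scale on its own.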