Hands-On Presto Tutorial: Presto 105

Introduction

This is the 5th tutorial in our Getting Started with Presto series. To recap, here are the first 4 tutorials:

Presto is an open source, distributed SQL query engine that runs on a cluster of nodes. In this tutorial, we will show you how to run Presto on a laptop with AWS Glue as the catalog.
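As a preview of where we are headed, here is a minimal sketch of what a Glue-backed catalog file might look like, assuming PrestoDB's Hive connector with its Glue metastore support; the file name and region are placeholders you would adjust for your own AWS account.

    # etc/catalog/glue.properties (the file name is just a convention)
    connector.name=hive-hadoop2
    hive.metastore=glue
    # AWS region of the Glue Data Catalog; adjust to your account
    hive.metastore.glue.region=us-east-1

Each .properties file under etc/catalog defines one catalog, so tables registered in Glue show up in Presto under this catalog's name.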

AWS Data Pipeline vs Glue vs Lambda: Who Is a Clear Winner?

AWS provides users with some of the most effective ETL tools for streamlined data management. Whether you are looking to implement a new platform, undertake third-party integrations, or simply move all your data to a warehouse, these ETL tools help you manage your data in a secure and private manner.

However, it is important to select the right AWS ETL tool for your specific requirements. Here, we compare three such tools – AWS Data Pipeline, AWS Glue, and AWS Lambda.

Setting Up Dev Endpoint Using Apache Zeppelin With AWS Glue

AWS Glue is a powerful, fully managed tool that relieves you of the hassle of maintaining infrastructure. It is hosted by AWS and offers serverless ETL, generating the ETL code in Python or Scala and executing it in a Spark environment.

AWS Glue provisions all the required resources (a Spark cluster) at runtime to execute a Spark job, which takes roughly 7-10 minutes before your actual ETL code starts running. To reduce this time, AWS Glue provides a development endpoint, which can be connected to Apache Zeppelin (provisioned with the Spark environment) to interactively run, debug, and test ETL code before deploying it as a Glue job or scheduling the ETL process.
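To make the workflow concrete, here is a minimal sketch of the kind of code you might run from a Zeppelin notebook attached to a development endpoint, assuming the awsglue PySpark library the endpoint provides; the database and table names are hypothetical.

    # Run from a Zeppelin notebook attached to a Glue development endpoint.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Reuse the Spark context supplied by the Zeppelin interpreter.
    glue_context = GlueContext(SparkContext.getOrCreate())

    # Load a table registered in the Glue Data Catalog as a DynamicFrame.
    orders = glue_context.create_dynamic_frame.from_catalog(
        database="sales_db",   # hypothetical Glue database
        table_name="orders"    # hypothetical Glue table
    )

    # Inspect the schema and a few rows before promoting the code to a Glue job.
    orders.printSchema()
    orders.toDF().show(5)

Because the endpoint's cluster is already running, each of these steps executes in seconds rather than waiting out a full job provisioning cycle.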

How to Set Up a Data Lake Architecture With AWS

Before we get down to brass tacks, it’s helpful to quickly list the specific benefits we want an ideal data lake to deliver. These would be:

  • The ability to collect any form of data from anywhere within an enterprise’s numerous data sources and silos, from revenue numbers to social media streams and anything in between.
  • A reduction in the effort needed to analyze or process the same data set for different purposes by different applications.
  • Cost efficiency across the whole operation, with the ability to scale storage and compute capacities as required, independently of each other.

And with those requirements in mind, let’s see how to set up a data lake with AWS.
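As a starting point, here is a minimal sketch of those first steps using boto3: an S3 bucket for raw data, a Glue database for metadata, and a crawler to populate the catalog. The bucket name, database name, crawler name, and IAM role ARN are hypothetical placeholders, and the bucket call assumes the default us-east-1 region.

    import boto3

    s3 = boto3.client("s3")
    glue = boto3.client("glue")

    # 1. Storage layer: an S3 bucket to land raw data from any source.
    #    (Outside us-east-1, create_bucket also needs a LocationConstraint.)
    s3.create_bucket(Bucket="example-datalake-raw")

    # 2. Catalog layer: a Glue database to hold table metadata.
    glue.create_database(DatabaseInput={"Name": "datalake_raw"})

    # 3. A crawler that scans the bucket and registers tables in the catalog,
    #    so engines such as Presto or Athena can query the data in place.
    glue.create_crawler(
        Name="datalake-raw-crawler",
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
        DatabaseName="datalake_raw",
        Targets={"S3Targets": [{"Path": "s3://example-datalake-raw/"}]},
    )
    glue.start_crawler(Name="datalake-raw-crawler")

Storage (S3) and compute (whatever engine queries the catalog) stay independent, which is what lets each scale on its own.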