SQL on Kafka With Presto (Video)

Presto is a state-of-the-art distributed SQL query engine for big data, enabling efficient querying of cold data and a variety of data sources. With an extended SQL dialect and features like geospatial queries, joins between different data sources (SQL to join data from HDFS, Elasticsearch, and Kafka, anyone?), and the ability to run on containers and cheap servers, Presto is slowly becoming the standard ad-hoc query engine for big data.

In this talk, we will present Presto and how it can be used with Kafka. We will discuss data architectures, Presto's features and why it is so good for your data, and finally see how it can be leveraged to query data from Kafka, as well as to execute a single SQL statement that joins data from Kafka with data from SQL databases, Cassandra, Elasticsearch, and more.

Database Fundamentals #21: Using the JOIN Operator, OUTER JOIN

An OUTER JOIN returns every row from one set of data, plus the matching values from the other set. Where a row has no match, the columns from the other set come back as NULL. The syntax is basically the same as INNER JOIN, but you have to specify whether you're dealing with a LEFT or a RIGHT JOIN. The OUTER keyword, just like the INNER keyword, is optional.

OUTER JOIN

Imagine a situation where you have a list of people. Some of those people have financial transactions, but some do not. If you want a query that lists all people in the system, along with their financial transactions where they have any, the query might look like this:
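
The article's own query is not reproduced here, so what follows is only a minimal sketch of the idea. It assumes two hypothetical tables, `person` and `financial_transaction`, with made-up columns and sample rows, and it runs against an in-memory SQLite database through Python's standard `sqlite3` module rather than the database the article uses. It shows that a LEFT OUTER JOIN keeps every person, and that dropping the OUTER keyword gives the same result.

```python
import sqlite3

# Hypothetical schema and sample data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT
    );
    CREATE TABLE financial_transaction (
        transaction_id INTEGER PRIMARY KEY,
        person_id      INTEGER REFERENCES person(person_id),
        amount         REAL
    );
    INSERT INTO person VALUES (1, 'Ana'), (2, 'Ben'), (3, 'Cy');
    INSERT INTO financial_transaction VALUES
        (10, 1, 25.0), (11, 1, 40.0), (12, 3, 5.5);
""")

# person is the LEFT table, so every person is returned;
# people without transactions get NULL (None in Python) for amount.
query = """
    SELECT p.name, t.amount
    FROM person AS p
    LEFT OUTER JOIN financial_transaction AS t
        ON t.person_id = p.person_id
    ORDER BY p.person_id, t.transaction_id
"""
rows = conn.execute(query).fetchall()

# The OUTER keyword is optional: LEFT JOIN means the same thing.
rows_short = conn.execute(
    query.replace("LEFT OUTER JOIN", "LEFT JOIN")
).fetchall()

for name, amount in rows:
    print(name, amount)
```

In this sample data Ben has no transactions, so he still appears in the result, with `None` in the amount column. An INNER JOIN on the same tables would leave him out.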

Amazon Review Data: Spotting Trends and Fake Reviews

Introduction

Amazon is the leading provider of cloud computing and has a number of interesting open data sets that you can experiment with. I wanted to try them out with a new product my company has developed, so I've been looking at those data sets. One of the most recognizable is Amazon's own review data, which is documented at https://registry.opendata.aws/amazon-reviews/

Approach

I wanted to see what sort of questions the review data could answer. At the end of this article, in "How I did it," I'll show the steps to access the data, but for now you will just need to know some SQL to follow the queries used.