Running Alluxio-Presto Sandbox in Docker

The Alluxio-Presto sandbox is a Docker application featuring installations of MySQL, Hadoop, Hive, Presto, and Alluxio. The sandbox lets you easily dive into an interactive environment where you can explore Alluxio, run queries with Presto, and see the performance benefits of using Alluxio in a big data software stack.

In this guide, we’ll be using Presto and Alluxio to showcase how Alluxio can improve Presto’s query performance by caching our data locally so that it can be accessed at memory speed!

Four Different Ways to Write in Alluxio

Alluxio is an open-source data orchestration system for analytics and AI workloads. Distributed applications like Apache Spark or Apache Hive can access Alluxio through its HDFS-compatible interface without code changes. We refer to external storage such as HDFS or S3 as under storage. Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options to read, write, and manage files. This article describes the different ways to write files to Alluxio and the tradeoffs each one makes in performance, consistency, and level of fault tolerance compared to HDFS.

Consider an application, such as a Spark job, that saves its output to an external storage service. Writing the job output to the memory layer of a colocated Alluxio worker achieves the best write performance. Memory is volatile, however: when a node in Alluxio goes down or restarts, any data in that node’s memory is lost. To prevent data loss, Alluxio can write the data to the persistent under storage either synchronously or asynchronously, depending on the client-side Write Type that is configured. Each Write Type has benefits and drawbacks. Applications that write to Alluxio storage should weigh the different write types and perform a cost-benefit analysis to determine which one best suits their requirements.
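The tradeoff described above can be sketched as a small decision helper. This is an illustrative model only, not Alluxio client code. The four write type names and the `alluxio.user.file.writetype.default` property come from Alluxio's public documentation rather than from this excerpt, so check them against the version you run. The `WriteBehavior` fields and the `choose_write_type` and `property_line` helpers are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True)
class WriteBehavior:
    """Simplified view of what one write type does (an assumption-level model)."""
    caches_in_alluxio: bool  # data is stored in Alluxio worker storage
    persists_sync: bool      # write returns only after under storage has the data
    persists_async: bool     # under storage is updated in the background


class WriteType(Enum):
    # Names follow Alluxio's documented write types; behavior is simplified.
    MUST_CACHE = WriteBehavior(True, False, False)
    CACHE_THROUGH = WriteBehavior(True, True, False)
    THROUGH = WriteBehavior(False, True, False)
    ASYNC_THROUGH = WriteBehavior(True, False, True)


def choose_write_type(durable_on_return: bool,
                      eventually_persisted: bool,
                      reread_soon: bool) -> WriteType:
    """Pick a write type from three yes/no requirements (hypothetical helper)."""
    if durable_on_return:
        # Synchronous persistence; cache too only if the data is read again soon.
        return WriteType.CACHE_THROUGH if reread_soon else WriteType.THROUGH
    if eventually_persisted:
        # Fast write to Alluxio, with persistence happening in the background.
        return WriteType.ASYNC_THROUGH
    # Fastest option, but data is lost if the worker holding it goes down.
    return WriteType.MUST_CACHE


def property_line(write_type: WriteType) -> str:
    """Render the client-side property that selects the default write type."""
    return f"alluxio.user.file.writetype.default={write_type.name}"


if __name__ == "__main__":
    # Example: intermediate job output that should be persisted eventually.
    chosen = choose_write_type(durable_on_return=False,
                               eventually_persisted=True,
                               reread_soon=True)
    print(property_line(chosen))
```

In this model, temporary data that can be recomputed maps to the memory-only option, while output that must survive a node failure as soon as the write returns maps to one of the synchronous options.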

Testing Distributed Systems With Docker and AWS for the Cost of a Large Pizza

Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio, we take testing very seriously because organizations across the world rely on our technology, so a problem we want to solve is how to test at scale without breaking the bank. In this blog, we show how the maintainers of the Alluxio open source project build and test the system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using Amazon EC2, we are able to test clusters of 1000+ workers at a cost of about $16 per hour.

This blog is an abbreviated version of the full-length Technical White Paper. Read the white paper if you are interested in the following takeaways: