Getting Started With Alluxio and Spark in 5 Minutes

Co-authored by Alex Ma.

Introduction

Apache Spark has brought significant innovation to Big Data computing, but its results are even more extraordinary when paired with Alluxio. Alluxio provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. Bazaarvoice uses the combination of Spark and Alluxio to provide a real-time big data platform that has the ability to not only handle the intake of 1.5 billion page views during peak events like Black Friday but also provides real-time analytics against it (read more). At this scale, the gain in speed is an enabler for new workloads. We’ve established a clean and simple way to integrate Alluxio and Spark.

Testing Distributed Systems With Docker and AWS for the Cost of a Large Pizza

Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio, we take testing very seriously as organizations across the world rely on our technology, therefore, a problem we want to solve is how to test at scale without breaking the bank. In this blog, we are going to show how the maintainers of the Alluxio open source project build and test our system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using Amazon AWS EC2, we are able to test 1000+ worker clusters, at a cost of about $16 per hour.

Read the full-length Technical White Paper if you are interested in the following takeaways as this blog is an abbreviated version:

Top 10 Tips for Making the Spark + Alluxio Stack Blazing Fast

The Apache Spark + Alluxio stack is getting quite popular particularly for the unification of data access across S3 and HDFS. In addition, compute and storage are increasingly being separated causing larger latencies for queries. Alluxio is leveraged as compute-side virtual storage to improve performance. But to get the best performance, like any technology stack, you need to follow the best practices. This article provides the top 10 tips for performance tuning for real-world workloads when running Spark on Alluxio with data locality, giving the most bang for the buck.

A Note on Data Locality

High data locality can greatly improve the performance of Spark jobs. When data locality is achieved, Spark tasks can read in-Alluxio data from local Alluxio workers at memory speed (when ramdisk is configured) instead of transferring the data over the network. The first few tips are related to locality.