alluxio | The Blog Pros

April 11, 2019

Store One Billion Files in Alluxio 2.0

Alluxio is a virtual distributed file system that lets applications access files and objects in different external storage systems, such as S3 or HDFS, through a unified file system namespace and a single API. Scaling the capacity of the Alluxio metadata service is vital for a couple of reasons:

Alluxio provides a single namespace where multiple storage systems can be mounted. So the size of Alluxio's namespace needs to match the sum of the sizes of all mounted storages.
Object storage is increasing in popularity, and object stores often hold far more small files than file systems like HDFS.

In Alluxio 1.x, the metadata service is limited to around 200 million files in practice. Scaling further would cause garbage collection issues due to the limited JVM heap size of the Alluxio master process. In addition, storing 200 million files requires a large JVM heap (around 200GB) on the single machine running the Alluxio master.
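As a rough back-of-the-envelope check on those figures, the sketch below derives the implied heap cost per file and projects it to one billion files. It assumes heap usage grows linearly with file count and uses decimal gigabytes; both are simplifying assumptions for illustration, not measurements from Alluxio.

```python
# Figures quoted in the text for Alluxio 1.x.
FILES_1X = 200_000_000           # practical file limit (~200 million)
HEAP_1X_BYTES = 200 * 10**9      # ~200GB of JVM heap (decimal GB assumed)


def heap_bytes_per_file(files, heap_bytes):
    """Average heap consumed per file, assuming a linear relationship."""
    return heap_bytes / files


def projected_heap_gb(target_files, per_file_bytes):
    """Heap (in decimal GB) needed for target_files under the same assumption."""
    return target_files * per_file_bytes / 10**9


per_file = heap_bytes_per_file(FILES_1X, HEAP_1X_BYTES)
print(per_file)                                    # 1000.0 bytes, about 1KB per file
print(projected_heap_gb(1_000_000_000, per_file))  # 1000.0 GB, about 1TB
```

Under these assumptions, one billion files would need on the order of 1TB of heap, which is why keeping all metadata on the JVM heap of one master does not scale to that size.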

April 3, 2019

Getting Started With Alluxio and Spark in 5 Minutes

Co-authored by Alex Ma.

Introduction

Apache Spark has brought significant innovation to Big Data computing, but its results are even more extraordinary when paired with Alluxio. Alluxio provides Spark with a reliable data sharing layer, enabling Spark to excel at performing application logic while Alluxio handles storage. Bazaarvoice uses the combination of Spark and Alluxio to provide a real-time big data platform that not only handles the intake of 1.5 billion page views during peak events like Black Friday but also provides real-time analytics against that data (read more). At this scale, the gain in speed is an enabler for new workloads. We’ve established a clean and simple way to integrate Alluxio and Spark.

February 21, 2019

Testing Distributed Systems With Docker and AWS for the Cost of a Large Pizza

Testing distributed systems at scale is typically a costly yet necessary process. At Alluxio, we take testing very seriously because organizations across the world rely on our technology. The problem we want to solve, therefore, is how to test at scale without breaking the bank. In this blog, we show how the maintainers of the Alluxio open source project build and test the system at scale cost-effectively using public cloud infrastructure. We test with the most popular frameworks, such as Spark and Hive, and pervasive storage systems, such as HDFS and S3. Using AWS EC2, we are able to test clusters of 1000+ workers at a cost of about $16 per hour.

This blog is an abbreviated version; read the full-length Technical White Paper if you are interested in the following takeaways:

January 11, 2019 (updated July 10, 2019)

Top 10 Tips for Making the Spark + Alluxio Stack Blazing Fast

The Apache Spark + Alluxio stack is becoming quite popular, particularly for unifying data access across S3 and HDFS. In addition, compute and storage are increasingly being separated, which causes larger latencies for queries. Alluxio is used as compute-side virtual storage to improve performance. But as with any technology stack, you need to follow best practices to get the best performance. This article provides the top 10 tips for tuning the performance of real-world workloads when running Spark on Alluxio with data locality, giving the most bang for the buck.

A Note on Data Locality

High data locality can greatly improve the performance of Spark jobs. When data locality is achieved, Spark tasks can read in-Alluxio data from local Alluxio workers at memory speed (when ramdisk is configured) instead of transferring the data over the network. The first few tips are related to locality.
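To make the idea concrete, here is a deliberately simplified model of the locality decision. It is not Spark or Alluxio code: the function `read_path` and the hostnames are hypothetical, and the sketch only assumes that a read counts as local when a worker holding the data runs on the same host as the task.

```python
def read_path(executor_host, block_worker_hosts):
    """Classify a read as 'local' or 'remote' in this toy model.

    executor_host: hostname where the task runs (hypothetical value).
    block_worker_hosts: hostnames of workers holding the data (hypothetical).
    """
    return "local" if executor_host in set(block_worker_hosts) else "remote"


# Task runs on a host that also has a worker holding the data.
print(read_path("node-1", ["node-1", "node-3"]))    # local

# Same machine identified by IP instead of hostname: the names do not match.
print(read_path("10.0.0.1", ["node-1", "node-3"]))  # remote
```

The second call shows why consistent host naming matters in a model like this: if the two sides identify the same machine differently, the read is treated as remote and the data goes over the network.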