Four Different Ways to Write in Alluxio

Alluxio is an open-source data orchestration system for analytics and AI workloads. Distributed applications like Apache Spark or Apache Hive can access Alluxio through its HDFS-compatible interface without code changes. We refer to external storage such as HDFS or S3 as under storage. Alluxio is a new layer on top of under storage systems that not only improves raw I/O performance but also gives applications flexible options for reading, writing, and managing files. This article describes the different ways to write files to Alluxio and the tradeoffs each one makes among performance, consistency, and fault tolerance compared to HDFS.

Consider an application, such as a Spark job, that saves its output to an external storage service. Writing the job output to the memory layer of a colocated Alluxio worker achieves the best write performance. Memory is volatile, however: when an Alluxio node goes down or restarts, any data in that node's memory is lost. To prevent data loss, Alluxio can write the data to the persistent under storage either synchronously or asynchronously, depending on the write type configured on the client side. Each write type has its own benefits and drawbacks. Applications that write to Alluxio storage should weigh the costs and benefits of each write type and choose the one best suited to their requirements.
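As a rough illustration of that cost-benefit decision, the sketch below maps two requirements (how the data should be persisted, and whether a copy should stay in Alluxio) to one of the four write types Alluxio exposes: MUST_CACHE, CACHE_THROUGH, THROUGH, and ASYNC_THROUGH. It then formats the choice as a JVM option for the client property `alluxio.user.file.writetype.default`. The helper names `choose_write_type` and `jvm_option` are hypothetical and not part of any Alluxio API, and the property name should be checked against the documentation for your Alluxio version.

```python
from enum import Enum


class WriteType(Enum):
    """The four Alluxio client write types."""
    MUST_CACHE = "MUST_CACHE"        # Alluxio storage only, never persisted
    CACHE_THROUGH = "CACHE_THROUGH"  # Alluxio storage and under storage, synchronously
    THROUGH = "THROUGH"              # under storage only, synchronously
    ASYNC_THROUGH = "ASYNC_THROUGH"  # Alluxio storage first, persisted asynchronously


# Client-side property that sets the default write type
# (assumed name; verify for your Alluxio version).
WRITE_TYPE_PROPERTY = "alluxio.user.file.writetype.default"


def choose_write_type(persist: str, keep_in_alluxio: bool = True) -> WriteType:
    """Hypothetical helper: pick a write type from application requirements.

    persist is "none", "async", or "sync". keep_in_alluxio only matters for
    "sync", because the other two modes always write to Alluxio storage.
    """
    if persist == "none":
        return WriteType.MUST_CACHE
    if persist == "async":
        return WriteType.ASYNC_THROUGH
    if persist == "sync":
        return WriteType.CACHE_THROUGH if keep_in_alluxio else WriteType.THROUGH
    raise ValueError(f"unknown persist mode: {persist!r}")


def jvm_option(write_type: WriteType) -> str:
    """Hypothetical helper: format the write type as a JVM system property."""
    return f"-D{WRITE_TYPE_PROPERTY}={write_type.value}"


if __name__ == "__main__":
    # A job whose output must survive node loss and be re-read soon afterwards.
    print(jvm_option(choose_write_type("sync", keep_in_alluxio=True)))
```

For a Spark job, a string like this would typically be supplied through `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions`, though the exact mechanism depends on how the Alluxio client is deployed.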