Turn Cloud Storage or HDFS Into Your Local File System for Faster AI Model Training With TensorFlow

Users today have a variety of cost-effective and scalable storage options for their big data and machine learning applications, from distributed storage systems like HDFS and Ceph to cloud storage like AWS S3, Azure Blob Storage, and Google Cloud Storage. Each of these storage technologies has its own API, which means developers must constantly learn new storage APIs and write their code against them. In some cases, such as machine learning and deep learning workloads, the frameworks do not integrate with all of the needed storage-level APIs, and a lot of data engineering is required just to move data around. It has therefore become common practice to copy data sets from the HDFS data lake to a data scientist's local compute instances, achieving data locality and making the data accessible through the local file system.

This article presents a different approach: the Alluxio POSIX API, which makes distributed file systems like HDFS and cloud storage systems look like a local file system to data processing frameworks. To illustrate the approach, we use the TensorFlow + Alluxio + AWS S3 stack as an example.
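
To make the idea concrete, here is a minimal sketch of what the training-side code looks like once such a stack is in place. It assumes an Alluxio FUSE mount already exposes an S3 bucket as a local directory; the mount point (/mnt/alluxio-fuse), the data directory, and the file names are hypothetical and only for illustration.

```python
import tensorflow as tf

# Assumption: an Alluxio FUSE mount at /mnt/alluxio-fuse exposes an
# S3 bucket (e.g. one holding TFRecord training shards) as a local
# directory, so TensorFlow needs no S3-specific client code.
DATA_DIR = "/mnt/alluxio-fuse/training-data"

# List the TFRecord shards through the ordinary file system interface.
files = tf.data.Dataset.list_files(DATA_DIR + "/*.tfrecord")

# Build an input pipeline exactly as if the data were on local disk.
dataset = (
    tf.data.TFRecordDataset(files)
    .shuffle(buffer_size=10_000)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)

# Sanity check: pull one batch of serialized records.
for batch in dataset.take(1):
    print(batch.shape)
```

Because the path behaves like any local directory, the same pipeline works unchanged whether the underlying data lives in S3, HDFS, or on local disk; only the Alluxio mount configuration changes.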