Petastorm: A Simple Approach to Deep Learning Models in Apache Parquet Format

Petastorm is an open-source data access library that enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets that are already loaded as Apache Spark DataFrames. As Andrey, a U.S.-based Python engineer, notes, it supports popular Python-based machine learning (ML) frameworks, including TensorFlow, PyTorch, and PySpark. For more information about Petastorm, refer to the Petastorm GitHub page and the Petastorm API documentation.

In short, Petastorm covers both single-machine and distributed workflows and works with several Python-based libraries, such as NumPy, TensorFlow, Theano, PyTorch, and PySpark. This makes it a practical choice for training and evaluating deep learning models on datasets stored in Apache Parquet format.
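
To make the idea of reading training data directly from Parquet more concrete, here is a minimal sketch in plain Python. It assumes Petastorm is installed and that a Parquet dataset already exists at a local path. The helper names to_file_url and iter_batches, and the example path, are hypothetical and used only for illustration. make_batch_reader is the Petastorm entry point for reading existing Parquet stores.

```python
from pathlib import Path


def to_file_url(path):
    """Turn a local path into a file:// URL (hypothetical helper).

    Petastorm readers are given dataset URLs rather than bare paths.
    """
    return Path(path).resolve().as_uri()


def iter_batches(dataset_url, num_epochs=1):
    """Yield each batch of a Parquet dataset as a dict of column arrays.

    Hypothetical helper: a thin wrapper around petastorm.make_batch_reader.
    """
    # Imported lazily so this module can be loaded without Petastorm installed.
    from petastorm import make_batch_reader

    with make_batch_reader(dataset_url, num_epochs=num_epochs) as reader:
        for batch in reader:
            # Each batch is a namedtuple whose fields are the dataset columns.
            yield batch._asdict()


if __name__ == "__main__":
    # Assumed example location; replace with a real Parquet dataset.
    url = to_file_url("data/train.parquet")
    for columns in iter_batches(url):
        print({name: len(values) for name, values in columns.items()})
```

The dictionaries produced this way hold NumPy arrays, which can be passed to a framework of choice. Petastorm also ships framework-specific adapters, which are described in its API documentation.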