Optimizing InfiniBand Bandwidth Utilization for NVIDIA DGX Systems Using Software RAID Solutions

Objectives

Modern AI workloads require infrastructure that can keep up with them, especially in data throughput and storage capacity. While GPUs deliver faster results, legacy storage solutions often lag behind, leaving compute resources underutilized and extending project completion times. Traditional enterprise storage and HPC-focused parallel file systems are costly and difficult to manage for AI-scale deployments. High-performance storage systems can significantly reduce AI model training time. Delays in data access can also affect AI model accuracy, which underlines the critical role of storage performance.

Xinnor partnered with DELTA Computer Products GmbH, a leading system integrator in Germany, to build a high-performance solution designed specifically for AI and HPC workloads. By combining high-performance NVMe drives from Micron, efficient software RAID from Xinnor, and 400Gbit InfiniBand controllers from NVIDIA, the system designed by DELTA delivers high read and write performance over NFSoRDMA interfaces. This is crucial for reducing the checkpoint times typical of AI projects and for handling possible drive failures. NFSoRDMA enables multiple nodes to read and write in parallel. The 2U dual-socket server used by DELTA is equipped with 24 Micron 7450 NVMe drives of 15.36TB each, which allows storage of up to 368TB and provides theoretical access speeds of up to 50GBps. In this document, we’ll explain how to set up the system with xiRAID to saturate the InfiniBand bandwidth and provide the best possible performance to NVIDIA DGX H100 systems.
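
The capacity and bandwidth figures quoted above can be checked with some quick arithmetic. The sketch below is illustrative only: the function names are hypothetical, the capacity is raw (before any RAID parity or file system overhead), and it assumes the 50GBps figure corresponds to the line rate of a single 400Gbit link, which the text does not state explicitly.

```python
# Back-of-the-envelope check of the figures quoted in the text.
# Assumptions (not from the source): raw capacity with no RAID or
# file system overhead, and one 400 Gbit/s link at full line rate.

DRIVE_COUNT = 24            # Micron 7450 NVMe drives in the 2U server
DRIVE_CAPACITY_TB = 15.36   # capacity of each drive, in TB
LINK_SPEED_GBIT_S = 400     # InfiniBand link speed, in Gbit/s


def raw_capacity_tb(drive_count: int, drive_capacity_tb: float) -> float:
    """Total raw capacity in TB, before any redundancy overhead."""
    return drive_count * drive_capacity_tb


def link_bandwidth_gbyte_s(link_speed_gbit_s: float) -> float:
    """Convert a link speed from Gbit/s to GB/s (8 bits per byte)."""
    return link_speed_gbit_s / 8


if __name__ == "__main__":
    capacity = raw_capacity_tb(DRIVE_COUNT, DRIVE_CAPACITY_TB)
    bandwidth = link_bandwidth_gbyte_s(LINK_SPEED_GBIT_S)
    print(f"Raw capacity: {capacity:.2f} TB")
    print(f"Link bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions the raw capacity comes to about 368.64TB, in line with the 368TB quoted, and a 400Gbit link works out to 50GBps. Usable capacity and achievable throughput will be lower once RAID parity and protocol overhead are taken into account.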
