Speeding Up the AI Supercomputing Platform – Practice at Unisound

Unisound has built Atlas, an industry-leading platform that combines GPU/CPU heterogeneous computing with a distributed file system. Atlas provides AI applications with high-performance computing and data access at massive scale. Building on the open-source Kubernetes architecture, the Unisound team developed the platform's core features and built an AI supercomputing platform with a floating-point capacity of more than 10 PFLOPS (over 10 quadrillion floating-point operations per second). The platform supports the mainstream machine learning frameworks, so developers can efficiently research and develop core applications in areas such as voice, NLP, big data, and multimodal processing. It also serves external customers, such as SMBs and research institutions, with customized computing and storage capabilities.

Problems and Challenges

On the Atlas platform, computation is decoupled from storage. At present, all interconnects, whether among the storage servers, among the computing servers, or between the computing and storage servers, are 100 Gb/s InfiniBand.
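
To give a feel for what this interconnect means for data access, the sketch below converts a link's line rate into an upper bound on throughput and estimates how long a dataset would take to cross a single link. It assumes the figure above is a 100 Gb/s line rate (as in InfiniBand EDR) and ignores protocol overhead, storage-side limits, and contention. The function names and the 1 TB dataset size are hypothetical and chosen only for illustration.

```python
def link_throughput_gbytes_per_s(line_rate_gbit_per_s: float) -> float:
    """Upper bound on throughput in GB/s for a given line rate in Gb/s.

    Assumes 8 bits per byte and no protocol or encoding overhead.
    """
    return line_rate_gbit_per_s / 8.0


def transfer_time_s(dataset_gbytes: float, line_rate_gbit_per_s: float) -> float:
    """Best-case time in seconds to move a dataset over a single link."""
    return dataset_gbytes / link_throughput_gbytes_per_s(line_rate_gbit_per_s)


if __name__ == "__main__":
    rate = 100.0      # Gb/s, assumed line rate of one InfiniBand link
    dataset = 1000.0  # GB, hypothetical dataset size (1 TB)
    print(f"Peak throughput: {link_throughput_gbytes_per_s(rate)} GB/s")
    print(f"Best-case transfer time: {transfer_time_s(dataset, rate)} s")
```

Real transfer times will be longer than this bound, because the storage system and the file system client usually become the bottleneck before the network does.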