Compare Benefits of CPUs, GPUs, and FPGAs for Different oneAPI Compute Workloads

Introduction

oneAPI is an open, unified programming model designed to simplify development and deployment of data-centric workloads across central processing units (CPUs), graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and other accelerators. In a heterogeneous compute environment, developers need to understand the capabilities and limitations of each compute architecture to effectively match the appropriate workload to each compute device.
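To make the idea of matching workloads to devices concrete, the sketch below shows how a single SYCL kernel can be dispatched to whichever device is available. It assumes a oneAPI DPC++/SYCL compiler, and the vector-addition kernel and variable names are purely illustrative, not taken from the article.

    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        // Enumerate the CPU, GPU, and FPGA/accelerator devices oneAPI can see.
        for (const auto &dev : sycl::device::get_devices())
            std::cout << dev.get_info<sycl::info::device::name>() << "\n";

        // default_selector_v picks the most capable available device;
        // cpu_selector_v, gpu_selector_v, or accelerator_selector_v can be
        // used instead to target a specific device class.
        sycl::queue q{sycl::default_selector_v};

        constexpr size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
        {
            sycl::buffer bufa{a};
            sycl::buffer bufb{b};
            sycl::buffer bufc{c};
            q.submit([&](sycl::handler &h) {
                sycl::accessor pa{bufa, h, sycl::read_only};
                sycl::accessor pb{bufb, h, sycl::read_only};
                sycl::accessor pc{bufc, h, sycl::write_only};
                h.parallel_for(sycl::range<1>{n},
                               [=](sycl::id<1> i) { pc[i] = pa[i] + pb[i]; });
            });
        }   // buffer destruction copies the results back to the host vectors
        std::cout << "c[0] = " << c[0] << "\n";
        return 0;
    }

The same source compiles for any of these targets; which device runs a given kernel fastest depends on the workload characteristics discussed in the article.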

This article compares the benefits of CPUs, GPUs, and FPGAs for different oneAPI compute workloads and offers guidance on matching each type of workload to the most suitable device.

Software AI Accelerators: AI Performance Boost for Free

This article is republished from a VentureBeat blog post by Wei Li, Intel VP/GM, AI and Analytics (AIA).

The exponential growth of data has fed artificial intelligence’s voracious appetite and led to its transformation from niche to omnipresent. An equally important aspect of this AI growth equation is the ever-expanding demands it places on computer systems to deliver higher AI performance. This has not only led to AI acceleration being incorporated into common chip architectures such as CPUs, GPUs, and FPGAs, but has also spawned a class of dedicated hardware AI accelerators designed specifically to accelerate artificial neural networks and machine learning applications. While these hardware accelerators can deliver impressive AI performance improvements, software AI accelerators are required to deliver even higher, orders-of-magnitude AI performance gains across deep learning, classical machine learning, and graph analytics on the same hardware setup. What’s more, this AI performance boost driven by software optimizations is free, requiring almost no code changes or developer time and no additional hardware costs.

Vectorization in LLVM and GCC for Intel CPUs and GPUs

Introduction

Modern CPU and GPU cores use single instruction, multiple data (SIMD) execution units to achieve higher performance and power efficiency. The underlying SIMD hardware is exposed through instruction sets such as SSE, AVX, AVX2, AVX-512, and the Intel® Xe Architecture Gen12 ISA. While these instruction sets can be used directly, their low-level nature severely limits portability and makes them unattractive for most projects.
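For context, here is roughly what using one of these instruction sets directly looks like. This is a sketch assuming an AVX-capable CPU, and the function name add_arrays_avx is made up for illustration; the code is welded to 256-bit registers and would have to be rewritten for SSE, AVX-512, or a GPU ISA, which is the portability problem motivating the approaches below.

    #include <immintrin.h>   // x86 SIMD intrinsics (AVX)
    #include <cstddef>

    // Adds two float arrays eight elements at a time using 256-bit AVX
    // registers. For brevity, n is assumed to be a multiple of 8.
    void add_arrays_avx(const float *a, const float *b, float *c, std::size_t n) {
        for (std::size_t i = 0; i < n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);              // unaligned load
            __m256 vb = _mm256_loadu_ps(b + i);
            _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));  // add and store
        }
    }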

To give programmers a more portable and easier-to-use interface, this article explores three avenues: auto-vectorization, programmer-guided SIMD vectorization through language constructs or hints, and a SIMD data-parallel library approach. We provide an overview of these methods and show the evolution of SIMD vectorization in the LLVM and GCC compilers through code examples. We also examine a couple of vectorization techniques in these compilers that achieve optimal performance on Intel® Xeon® processors and Intel Xe Architecture GPUs.
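As a rough sketch of the three avenues (these loops are illustrative, not taken from the article), the same scaling operation can be left to the auto-vectorizer, annotated with an OpenMP SIMD hint, or written against the std::experimental::simd data-parallel types shipped with recent GCC and LLVM C++ standard libraries:

    #include <cstddef>
    #include <experimental/simd>   // data-parallel types (e.g., libstdc++ in GCC 11+)
    namespace stdx = std::experimental;

    // 1. Auto-vectorization: a plain loop the compiler can vectorize on its
    //    own at -O2/-O3 with an appropriate target option (e.g., -march or -x).
    void scale_auto(float *x, float a, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= a;
    }

    // 2. Programmer-guided vectorization: an explicit OpenMP SIMD hint
    //    (compile with -fopenmp or -fopenmp-simd).
    void scale_omp(float *x, float a, std::size_t n) {
        #pragma omp simd
        for (std::size_t i = 0; i < n; ++i)
            x[i] *= a;
    }

    // 3. Library approach: explicit data-parallel types whose width is chosen
    //    by the implementation for the target ISA.
    void scale_simd(float *x, float a, std::size_t n) {
        using V = stdx::native_simd<float>;
        std::size_t i = 0;
        for (; i + V::size() <= n; i += V::size()) {
            V v(&x[i], stdx::element_aligned);        // vector load
            v *= a;
            v.copy_to(&x[i], stdx::element_aligned);  // vector store
        }
        for (; i < n; ++i)                            // scalar remainder
            x[i] *= a;
    }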

Efficient Heterogeneous Parallel Programming Using OpenMP

In some cases, offloading computations to an accelerator like a GPU means that the host CPU sits idle until the offloaded computations are finished. However, using the CPU and GPU resources simultaneously can improve the performance of an application. In OpenMP® programs that take advantage of heterogeneous parallelism, the master construct can be used to exploit simultaneous CPU and GPU execution. In this article, we will show you how to perform asynchronous CPU+GPU computation using OpenMP.
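The general pattern looks roughly like the sketch below; it is not the benchmark’s code, and the saxpy-style loop, split point, and variable names are invented for illustration. One thread launches a deferred target region for part of the iteration space while the remaining threads share the rest on the CPU.

    #include <cstddef>

    // Hybrid CPU+GPU sketch: offload the first `split` iterations
    // asynchronously and let the host threads handle the remainder.
    void saxpy_hybrid(float a, const float *x, float *y,
                      std::size_t n, std::size_t split) {
        #pragma omp parallel
        {
            // One thread launches the deferred GPU offload for [0, split) ...
            #pragma omp master
            {
                #pragma omp target teams distribute parallel for \
                        map(to: x[0:split]) map(tofrom: y[0:split]) nowait
                for (std::size_t i = 0; i < split; ++i)
                    y[i] = a * x[i] + y[i];
            }

            // ... while the team (including the master thread, once it has
            // queued the offload) works through [split, n) on the CPU.
            #pragma omp for
            for (std::size_t i = split; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }
        // The implicit barrier at the end of the parallel region ensures both
        // the CPU loop and the deferred target task finish before returning.
    }

Whether the offload truly overlaps with the CPU loop depends on the OpenMP runtime’s handling of deferred target tasks, and choosing a good split point is part of what the article and the VTune analysis address.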

The SPEC ACCEL 514.pomriq MRI reconstruction benchmark is written in C and parallelized using OpenMP. It can offload some calculations to accelerators for heterogeneous parallel execution. In this article, we divide the computation between the host CPU and a discrete Intel® GPU so that both processors are kept busy. We’ll also use Intel VTune™ Profiler to measure CPU and GPU utilization and analyze performance.

oneAPI – The Cross-Architecture, Multi-Vendor Path to Accelerated Computing

Accelerator Adoption Will Thrive With Software Standardization

Accelerator technologies are receiving more attention throughout the computing infrastructure, from the endpoint to the data center. User needs, ranging from managing the explosion of data to running time-critical business processes, have driven this demand for exponentially greater computation within a slowly growing energy budget.

While the conversation for the past decade focused on programmable graphics processing units (GPUs), accelerator architectures today are far more diverse. Specialized accelerators for artificial intelligence (AI), high-performance computing (HPC), cryptography, data movement and I/O, and telecommunications, whether currently deployed or in development, use a diverse set of architectures: superscalar, vector, dataflow/spatial, and matrix, as well as emerging neuromorphic and quantum designs. Beyond architectural diversity, accelerator implementations also vary, from traditional I/O-attached devices to coherent-memory devices to devices tightly integrated into the CPU complex.