Background

In the race to exascale computing, energy consumption is one of the biggest challenges. One way of keeping energy consumption within budget is to embrace heterogeneous computing, where more than one type of processor architecture is used. The current world’s fastest supercomputer, IBM® Summit, is a heterogeneous system with 4,608 IBM Power® System AC922 compute nodes (based on IBM POWER9™ processor technology) and 27,648 NVIDIA Volta graphics processing units (GPUs). There is a consistent trend toward heterogeneous, accelerator-based systems: today, 30% of the top 500 supercomputers are accelerator-based heterogeneous systems. Figure 1 shows the trend of accelerator-based supercomputers over the last decade.

Figure 1. Heterogeneous (accelerator-based) systems in Top500 list (Source: https://www.top500.org/statistics/list/)

Over the last decade, many HPC applications have been enabled to run on GPUs. When programmed properly, accelerators (GPUs) can speed up an application by a factor of N, where N could be 2, 5, 10, or more. The range is wide because the speedup depends on the type of application, the algorithm, the hardware, and so on.

Introduction

GPUs get tremendous compute power from their thousands of small processing cores. They can run thousands of threads in parallel, so they are best suited for data-parallel or single-instruction, multiple-data (SIMD) work. NVIDIA CUDA is a popular GPU programming model, but you must be knowledgeable in this type of programming to develop applications that run on GPUs, and it requires restructuring applications to optimize them for GPUs.

OpenMP GPU programming is an alternative to CUDA programming. This directive-based technique is already well known for shared-memory parallelization on CPUs and is easy to learn and apply in application programs. It also offers a path to more portable GPU-accelerated software. One of the goals of the OpenMP standard is to minimize the need for GPU vendor-specific statements in your programs, which makes the code portable across different GPU architectures.
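
For contrast, here is a minimal sketch (not taken from the article) of the familiar host-side, shared-memory form of the triad loop that appears later in Listing 1; the GPU-offloaded version replaces this single pragma with target, teams, and distribute directives:

    /* Host-only, shared-memory parallelization: the loop iterations are divided
       among CPU threads and no data is moved to an accelerator. */
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
        a[j] = b[j] + scalar * c[j];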

The OpenMP standard 4.0 and later versions introduced pragmas for the C, C++, and Fortran programming languages to offload work to general-purpose GPUs. You can find details about the usage of OpenMP GPU offloading in the OpenMP specifications [7] and GPU offloading examples on the internet; papers [1], [2], and [3] explain the usage of the GPU offloading pragmas. In this article, we assess the benefit of GPU offloading using OpenMP for memory-intensive and compute-intensive applications on an IBM Power AC922 server with four NVIDIA Tesla V100 GPUs with 16 GB of memory each. We used a memory-intensive triad code and a compute-intensive matrix multiplication code, both offloaded to the GPU with OpenMP. The matrix multiplication starts showing a clear benefit from GPU offloading only after the matrices reach large sizes, as shown in Figure 3.

The OpenMP device constructs are used to offload work to GPUs. The target construct specifies a region to be launched on the device, and target data maps variables onto the device. The teams construct inside a target region spawns a set of teams, each with multiple OpenMP threads, and the distribute construct partitions the loop iterations and maps them to the teams. The following lists summarize the pragmas required for GPU offloading; a short sketch of some of the OpenMP 4.5 additions follows the lists.

Device constructs:

  • omp target data
  • omp target
  • omp target update
  • omp declare target
  • omp teams
  • omp distribute
  • omp distribute parallel for
  • Combined or nested constructs

Offloading enhancements in 4.5 specification:

  • firstprivate, private, defaultmap added to target construct
  • Enhancement to map clause
  • if clause for combined directives
  • Implicit firstprivate
  • omp target enter data
  • omp target exit data
  • omp target parallel
  • nowait and depend clauses added to target construct
  • omp target simd
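
The following is a minimal sketch (an assumption for illustration, not code from the article) of two of the OpenMP 4.5 additions listed above: unstructured data mapping with target enter data / target exit data, and an asynchronous target region using the nowait and depend clauses. The function and variable names are hypothetical.

    /* Hypothetical example: scale an array on the GPU using OpenMP 4.5
       unstructured data mapping and an asynchronous target region. */
    void scale_on_gpu(double *a, int n, double s)
    {
        /* Map the array to the device once; it stays resident until target exit data. */
        #pragma omp target enter data map(to: a[0:n])

        /* Launch the kernel asynchronously as a deferrable task; the depend clause
           orders it with respect to other tasks that reference a[0:n]. */
        #pragma omp target teams distribute parallel for nowait depend(inout: a[0:n])
        for (int i = 0; i < n; i++)
            a[i] *= s;

        /* Wait for the asynchronous target task to complete. */
        #pragma omp taskwait

        /* Copy the result back to the host and release the device copy. */
        #pragma omp target exit data map(from: a[0:n])
    }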

To offload work to the GPU, the compiler must have support for GPU offloading enabled, and the GPU vendor must provide the necessary interface (libraries) to interact with the GPU. In our experiments, we used the IBM XLC and CLANG compilers to compile the GPU-offloaded OpenMP triad and matrix multiplication codes. IBM XLC 13.1.5 and CLANG 6.0 gained basic support for the OpenMP 4.0 specification of GPU offloading (to NVIDIA GPUs), while IBM XLC 16.1.1 and CLANG 7.0 onwards support the OpenMP 4.5 specification of GPU offloading [9]. Both compilers use NVIDIA interfaces to launch GPU kernels and transfer data between the CPU and GPU as and when required.

Table 1 shows the compilation flags used to compile the OpenMP GPU offload program with the XLC and CLANG compilers. To instruct the compiler to offload work to the GPU, that is, to generate GPU-specific code, use the -qsmp=omp and -qoffload options with XLC and the -fopenmp option with CLANG. The -qtgtarch (XLC) and -fopenmp-targets (CLANG) options specify the target GPU architecture: sm_70 represents the NVIDIA Tesla V100 GPU architecture, while nvptx64-nvidia-cuda selects the NVIDIA device offloading toolchain for the 64-bit platform. When using NVIDIA GPUs, the program should be linked with the CUDA runtime library (libcudart.so), which is normally installed in the CUDA installation directory (/usr/local/cuda/lib64/) during CUDA toolkit installation.

Table 1. Compilation flags used to compile the OpenMP GPU offload program using XLC and CLANG

GPU offloading using XLC compiler:

    #xlc -o ompGO main.c gpuoffload.c -qsmp=omp -qoffload -qtgtarch=sm_70 -lcudart -L/usr/local/cuda/lib64

GPU offloading using CLANG compiler:

    #clang -o ompGO main.c gpuoffload.c -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -lcudart -L/usr/local/cuda/lib64
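
After compilation, a quick way to confirm that the OpenMP runtime can actually see the GPUs is to query the device API. The following is a minimal sketch (an assumption for illustration, not part of the article's benchmark code):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        /* Number of target devices visible to the OpenMP runtime. */
        printf("OpenMP target devices available: %d\n", omp_get_num_devices());

        /* omp_is_initial_device() returns true when the code runs on the host,
           so a value of 0 here means the target region really ran on the GPU. */
        int on_host = 1;
        #pragma omp target map(from: on_host)
        on_host = omp_is_initial_device();
        printf("Target region ran on %s\n", on_host ? "host (fallback)" : "GPU");

        return 0;
    }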

OpenMP GPU offload observations

The triad program is a memory-intensive benchmark originally designed to stress the CPU memory subsystem. Typically, a triad operation is performed over three one-dimensional arrays and involves two floating-point operations per iteration, as shown below.

A[i] = B[i] + constant * C[i]

Listing 1 shows the triad code that uses OpenMP to offload work to the GPU. Figure 2 shows the time required to complete the triad operation, inclusive of CPU and GPU, for array sizes varying from the smallest to the largest that can fit into GPU memory.

Listing 1. Triad operation offloaded to GPU using OpenMP pragma

    #pragma omp target data map (to: c[0:N], b[0:N]) map(tofrom: a[0:N])
    #pragma omp target teams distribute parallel for
    for (j=0; j<N; j++)
    {
        a[j] = b[j]+scalar*c[j];
    }

Figure 2. Triad operation on CPU and GPU

The GPU triad operation is dominated by memory transfers between the CPU and GPU. Each iteration performs only two floating-point operations while three array elements must be moved between host and device, so there is very little compute on the GPU and hardly any overlap of computation and communication. Although the GPU finishes its compute part very quickly, the overall application performance does not benefit from the GPU's compute power.

Unlike the triad program, matrix multiplication is both memory and compute intensive. A typical 2D matrix multiplication requires three arrays to store the two input matrices and the resultant matrix, and involves 2N floating-point operations per element of the result.

Listing 2 shows the matrix multiplication code using OpenMP to offload work on GPU. Figure 3 shows the time required to complete matrix multiplication inclusive of CPU and GPU for different matrix sizes.

Listing 2. Matrix multiplication offloaded to GPU using OpenMP pragma

    #pragma omp target data map (to: pA[0:N*N],pB[0:N*N]) map (tofrom: pC[0:N*N])
    #pragma omp target
    #pragma omp teams distribute parallel for collapse(2) private(i,j,k)
    for(i=0;i<N;i++)
    {
        for(j=0;j<N;j++)
        {
            for(k=0;k<N;k++)
            {
                pC(i,j)+=pA(i,k)*pB(k,j);
            }
        }
    }

Note: All three matrices are allocated as one-dimensional arrays and accessed through a macro for convenience, such as #define pA(i,j) (pA[((i)*N) + (j)])
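
For completeness, the following is a minimal sketch of how such an allocation and the access macros might look (an assumption for illustration; the matrix dimension N and the use of double precision are hypothetical, not taken from the article):

    #include <stdlib.h>

    #define N 2048                           /* matrix dimension (example value) */
    #define pA(i,j) (pA[((i)*N) + (j)])      /* row-major access macros          */
    #define pB(i,j) (pB[((i)*N) + (j)])
    #define pC(i,j) (pC[((i)*N) + (j)])

    double *pA, *pB, *pC;

    void allocate_matrices(void)
    {
        pA = (double *)malloc(N * N * sizeof(double));
        pB = (double *)malloc(N * N * sizeof(double));
        pC = (double *)calloc(N * N, sizeof(double));   /* result starts at zero */
    }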

Figure 3. Matrix multiplication on CPU and GPU

The graph in Figure 3 shows that offloading smaller matrices to the GPU does not benefit the overall performance of the application. But as the problem size increases, the compute part of matrix multiplication grows as O(N³) while the amount of data transferred between CPU and GPU grows only as O(N²). Roughly around a matrix size of 1300 x 1300, the performance of the CPU matches that of the GPU. For bigger matrices, the data transfer between CPU and GPU becomes insignificant compared to the computation, and offloading to the GPU clearly gives a performance benefit.

Monitoring OpenMP GPU offload

Use the top command to monitor jobs running on the CPU. Similarly, use the nvidia-smi command to monitor NVIDIA GPU activity when work is offloaded to the GPUs. The nvidia-smi command shows which GPUs are engaged in computation, their occupancy (utilization), memory consumption, and so on. Figure 4 shows the output of the nvidia-smi command while running the matrix multiplication program.

Note: The nvidia-smi tool is part of the CUDA toolkit and would be available on the system after installing the CUDA toolkit.

Figure 4. nvidia-smi output captured while running GPU offloaded OpenMP matrix multiplication

The numastat tool (part of the numactl package) can also be used to check memory statistics for a process running on the CPU and GPU. It shows the total, used, and free memory and the Non-Uniform Memory Access (NUMA) hits and misses per NUMA node; on a heterogeneous system, that means it reports information for both the CPUs and the GPUs. When used with the -m option, it shows meminfo-like system-wide memory usage information. With the -p option, you can specify a process ID to check the per-node memory allocation for that process. Figure 5 shows the output of the numastat command while running the matrix multiplication program. In the numastat output, Node 0 and Node 8 are the CPU nodes, while the GPUs appear as Nodes 252 to 255. From the system configuration, we know that Node 255 is mapped to GPU0; its memory footprint has increased to 1455 MB because the matrix multiplication process is running on GPU0.
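
For illustration, the monitoring commands described above might be invoked along the following lines while the benchmark is running (a sketch; the binary name ompGO is carried over from Table 1 and <pid> stands for the process ID of the running program):

    #./ompGO &
    #nvidia-smi
    #numastat -m
    #numastat -p <pid>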

Figure 5. numastat output captured while running GPU offloaded OpenMP matrix multiplication

Summary

The era of heterogeneous systems is here, and the IBM Power AC922 GPU-accelerated system is at the leading edge of this new era in high-performance computing (HPC). OpenMP's open standard, portability across different GPU architectures, ease of programming, and wide acceptance (for CPU shared-memory programming) make it a promising programming model for heterogeneous systems. In this article, we explained how to use the OpenMP pragmas, how to compile an OpenMP GPU offload program on an IBM POWER9 processor-based system using the XLC and CLANG compilers, and which tools to use to monitor an OpenMP GPU offload program. We also showcased the behavior of memory-intensive and compute-intensive OpenMP GPU offload programs using the triad and matrix multiplication examples. While writing a GPU-offloaded OpenMP program, programmers should be familiar with the parallel for, teams, and distribute constructs; these are essential to spread parallelism, and collapsing loops increases it further. Try to minimize CPU-GPU data transfer or overlap it with GPU computation.

References

  1. Hands on with OpenMP 4.5 and Unified Memory: Developing applications for IBM’s hybrid CPU + GPU systems (part I)
  2. Hands on with OpenMP 4.5 and Unified Memory: Developing applications for IBM’s hybrid CPU + GPU systems (part II)
  3. Performance Evaluation of OpenMP’s Target Construct on GPUs – Exploring Compiler Optimizations
  4. Best practice guide – GPGPU
  5. CUDA C best practices guide
  6. OpenMP on GPUs, first experiences and best practices
  7. A full description of OpenMP 4.5 data and programming constructs
  8. Code optimization with IBM XL compilers on Power architectures
  9. OpenMP compilers and tools support