## Introduction

Today, we are in the era of heterogeneous systems. The world’s number one supercomputer IBM Summit is the biggest cluster of heterogeneous nodes. Each node is an IBM® Power® System AC922 dual-socket server with 22 cores with three NVIDIA Volta GPUs per processor chip.

With a gradual surge in heterogeneous clusters, high-performance computing (HPC) applications have started focusing on using accelerators (or GPUs) on the heterogeneous node. Figure 1 represents a heterogeneous cluster showing two nodes with CPUs (C0, C1) and GPUs (G0 to G5). Hybrid programming is the general paradigm for most of the HPC applications to scale across multiple nodes in a symmetric multiprocessing (SMP) cluster. A combination of Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) is widely used in hybrid programming to utilize inter-node and intra-node scalability respectively.

##### Figure 1. Nodes in a heterogeneous cluster

Traditionally, CUDA and OpenCL are used for GPU programming. The directive-based programming models such that OpenACC and OpenMP are alternatives to traditional GPU programming. OpenMP 4.0 standard started supporting GPU offloading directives. Because of the early integration by some of the key compiler manufacturers such as IBM, Cray, Intel® and so on, GPU programming using OpenMP has started getting traction in a hybrid programming paradigm in a heterogeneous cluster [15], [16].

In our previous article we showcased ease of programming using OpenMP to offload work on a single GPU. In this article, we demonstrate the ease of programming in regard to data partitioning and GPU scalability using OpenMP GPU offloading of MPI and non-MPI applications. In our case study example, we observed that the MPI and non-MPI codes scale on multiple GPUs and multi-fold speed up can be achieved when compared to CPU-only performance with very minimal code changes. In an MPI application, the communication time dominates the total run time of the application for smaller matrices while its impact reduces in case of bigger matrices due to higher compute intensity.

In the HPC domain, MPI has in fact become a standard for distributed memory models to scale applications on multiple nodes of a cluster. In typical MPI programming, data is partitioned and work is distributed among processes that run on the same or separate nodes in a cluster. Data exchanges among processes are carried out by explicit calls to MPI point-to-point or collective APIs.

OpenMP is a popular programming model for parallel programming on shared memory architecture where work is distributed among threads. You may need to logically divide domain to distribute work among threads but there is no need for explicit data transfer as all memory is accessible to all CPU threads.

The physical or logical data partitioning strategy depends on the type of computational domain, amount of data exchange across a domain boundary, communication frequency, type of data structure used, and so on. Figure 2 is a representational diagram showing different ways of data partitioning of the computational domain where each domain can be assigned to a separate compute unit (CU) (for example, node, core, or accelerator).

##### Figure 2. Different ways to partition computation domain

In our case study, we experimented a simple matrix multiplication algorithm* with GPU offloading using OpenMP within MPI and non-MPI programs. Figure 3 is a diagrammatic representation of matrix multiplication where row (matrix A) is multiplied by column (matrix B) to produce a resultant element of matrix C.

##### Figure 3. Matrix multiplication (diagrammatic representation)

***Note:** Users can refer to Engineering and Scientific Subroutine Library (ESSL) or cuBLAS libraries for standard matrix operations algorithms.

Figure 4 displays the flow of the non-MPI matrix multiplication program using OpenMP. It shows three stages, namely initialization, problem solving, and result collection. It also compares and highlights the changes (in blue) in CPU only, single GPU, and multi-GPU implementations.

In Figure 4, the first column (a) shows a typical shared memory CPU-only OpenMP implementation where matrices A and B are initialized at the beginning. The OpenMP pragma `#pragma omp parallel for collapse(2)`

adds parallelism at the CPU thread level. The resulting matrix C is computed by all threads.

In Figure 4 the second column (b) shows single GPU offloading using the OpenMP pragma. The changes in the OpenMP pragma are highlighted in blue color. The GPU offloading device construct `target`

specifies a region to be launched on the device. The `team`

construct creates teams with one master thread per team, the `parallel for`

construct creates multiple OpenMP threads for each team, and the `distribute`

construct partitions the iterations to distribute among teams and threads. The `data map`

clause copies the variable declared in CPU memory to the device (GPU) memory or the other way round depending on the direction of transfer. All teams of threads run on the GPU, specified by the `device(#GPU ID)`

keyword.

##### Figure 4. Non-MPI matrix multiplication program using OpenMP

##### (a) CPU-only OpenMP implementation (b) Single GPU offloading using OpemMP (c) Multi-GPU offloading using OpenMP

## Scaling code on multiple GPUs (non-MPI)

The current OpenMP specification does not define a mechanism to distribute work among all available GPUs so the programmer needs to explicitly distribute work among multiple GPUs. In our example, we used a data partitioning strategy as shown in Figure 5 for MPI and non-MPI code. We split the input matrix A and the result matrix C into equal sections such that each GPU receives one portion of matrix A and C. The second input matrix B is shared across all GPUs. Listing 1 shows the non-MPI implementation of multiple GPU offloading using the OpenMP `sections`

clause. The code segment under each section runs in parallel asynchronously. At the end of the `sections`

clause, all parallel tasks synchronize. Listing 2 shows a generic implementation of the matrix multiplication function which can map different sections of arrays on multiple GPUs. The `data map`

clause is used to split and map matrices on the respective GPUs and aggregate the result.

##### Figure 5. Matrix multiplication – data partitioned and distributed across 4 different GPUs

In Figure 4, the third column (c) shows multi-GPU (two GPUs) offloading using the OpenMP program. The input matrix A and result matrix C are split into two parts A0, A1 and C0, C1 respectively and mapped to the respective GPUs. All GPUs perform matrix multiplication on their part of matrices. The sections of the result matrix are mapped back on CPU memory through the from clause. No need for explicit aggregation operation. Note that the OpenMP GPU offloading syntax remains the same in single or multiple GPU offloading.

## Scaling code on multiple GPUs (MPI)

Figure 6 displays the flow of the MPI matrix multiplication program using OpenMP. It compares and highlights the changes (in blue) in CPU-only and multi-GPU implementations.

In Figure 6, the first column (a) shows a typical CPU only MPI implementation where each MPI task is mapped to one processor. The master task initializes the input matrices A and B, splits matrix A and C using the `MPI_Scatter`

API to distribute them among the available processors, and broadcasts matrix B using the `MPI_Broadcast`

API. Each MPI task calls the matrix multiplication function and sends part of the resulting matrix to the master task. The `MPI_Gather`

API aggregates data from all of the tasks. The CPU thread-level parallelism is implemented inside the matrix multiplication function using OpenMP. Listing 3 shows the MPI implementation of matrix multiplication.

In Figure 6, the second column (b) shows multi-GPU (two GPUs) offloading using OpenMP in the MPI program. The OpenMP pragma is highlighted in blue. As work distribution is inherent to the MPI programming model, there is no need to write additional work distribution or data partition logic. The MPI tasks are mapped on all available GPUs using the OpenMP `data map`

clause. The `to`

and `from`

keywords in the map clause move data between the CPU and the GPU. All GPUs perform matrix multiplication on their part of matrices.

##### Figure 6. MPI matrix multiplication program using OpenMP

##### (a) CPU-only OpenMP implementation (b) Multi-GPU offloading using OpenMP

## Observations

Table 1 shows the instructions used to compile our code on the Power AC922 server using spectrum MPI and XLC compiler.

##### Table 1. Compilation of MPI and non-MPI matrix multiplication using OpenMP GPU offloading

MPI matrix multiplication using OpenMP GPU offloading |
Non-MPI matrix multiplication using OpenMP GPU offloading |
---|---|

`mpicc -o mpi_ompGO mpi_mm.c mm.c gpuoffload.c -qsmp=omp -qoffload -qtgtarch=sm_70 -lcudart -L/usr/local/cuda/lib64` |
`xlc -o ompGO main.c gpuoffload.c -qsmp=omp -qoffload -qtgtarch=sm_70 -lcudart -L/usr/local/cuda/lib64` |

`-O3 -qhot -mcpu=power9 -mtune=power9 -maltivec` |
`-O3 -qhot -mcpu=power9 -mtune=power9 -maltivec` |

### GPU scalability

We noticed, as against traditional GPU programming, offloading on single GPU is easy and requires minimal efforts and code changes for MPI and non-MPI implementation. In the case of scaling application on more than one GPU, work needs to be distributed among the available GPUs. We think, achieving GPU scalability in MPI application is relatively easier than non-MPI application as data partitioning and work distribution among tasks is inherent to the MPI programming paradigm. Each MPI task can be mapped to a single GPU. The mapped task offloads its portion of work on the respective GPU. In our MPI implementation, with minimal addition of code for the task and GPU mapping, we could scale the code on more than one GPU. In the case of non-MPI implementation, the programmer must implement the logic for data partitioning and distribute the work on GPUs. In our non-MPI implementation, we see approximately 20% additional code changes compared to the single GPU implementation to scale code on multiple GPUs. Figure 7 shows GPU scalability and speed up achieved as compared to the CPU-only performance.

##### Figure 7. MPI matrix multiplication GPU speedup

### Communication time

Figure 8 shows CPU, GPU, and communication time for OpenMP GPU offloaded MPI and non-MPI matrix multiplication code. The MPI implementation takes relatively more time because of the additional time spent in communication, while the time spent on CPU and GPU remain the same in both cases. When we experimented with different matrix sizes (refer to Figure 9), we observed that the relative communication time gradually decreased as matrix size increased due to higher compute intensity.

##### Figure 8. CPU, GPU, and communication time in MPI and non-MPI program

##### Figure 9. MPI communication time with respect to matrix sizes

## Monitoring GPUs in a heterogeneous cluster

The nvidia-smi and numastat tools can be used to monitor GPU activities and memory statistics respectively on individual nodes. We have explained these tools in our previous article.

In a heterogeneous cluster environment, IBM Spectrum® LSF® provides the facility to enable GPU features and monitor GPU resources. The commands, such as bjobs, bhists, andbacct with the -gpu option, display GPU resource usage information from finished jobs. The commands, such as bhosts, lshosts, andlsload with the -gpu option, displays GPU utilization, memory used, state, errors, temperature, and so on [5] [6]. Figure 10, Figure 11, and Figure 12 show the output of lsload and bhosts commands while running GPU-enabled MPI application on two Power AC922 server nodes with NVIDIA V100 GPUs.

Further, IBM Spectrum LSF can integrate and enable NVIDIA’s data center GPU manager (DCGM) [7] [8] to configure, monitor health, and diagnose any GPU-related issues.

##### Figure 10. Monitoring GPUs in a heterogeneous environment using lsload ( -gpu option)

##### Figure 11. Monitoring GPUs in a heterogeneous environment using lsload (-gpuload option)

##### Figure 12. Monitoring GPUs in a heterogeneous environment using bhosts

## Summary

Heterogeneous systems are marking a new epoch in the race to exascale computing. Programmers should develop or enable applications to run on any type of computing unit in heterogeneous systems. The OpenMP community has made conscious efforts to support generic (non-vendor specific) mode of parallel programming using CPUs and GPUs. We showcased benefits and drawbacks while implementing OpenMP GPU offloading for MPI and non-MPI applications. Any application can scale on multiple GPUs using OpenMP GPU offloading pragmas with minimal development effort.

## References

- A full description of OpenMP 4.5 data and programming constructs
- Code optimization with IBM XL compilers on Power architectures
- Engineering and Scientific Subroutine Library (ESSL) version 6.2
- IBM Spectrum MPI
- Monitoring GPU resources using IBM Spectrum LSF
- Enabling GPU features in IBM Spectrum LSF
- Integrating IBM LSF scheduler and NVIDIA DCGM
- NVIDIA Data Center GPU Manager Simplifies Cluster Administration
- CUDA Basic Linear Algebra Subroutine Library (cuBLAS)
- Experiences Porting Mini-applications to OpenACC and OpenMP on Heterogeneous Systems
- Targeting GPUs with OpenMP 4.5 Device Directives
- Benchmarking and Evaluating Unified Memory for OpenMP GPU Offloading
- OpenMP Accelerator Support for GPUs
- OpenMP compilers and tools
- Performance analysis of OpenMP on a GPU using a CORAL proxy application
- Porting and Optimizing Applications for AC922 servers using OpenMP and Unified Memory

## Appendix

This section lists the non-MPI and MPI matrix multiplication application code segments which we used in our experiments.

#### Non-MPI matrix multiplication

##### Listing 1. Matrix multiplication – non-MPI implementation of multi-GPU offloading using OpenMP

```
void matrix_multiplication_multigpuoffload()
{
//distribute work on multiple GPUs
#pragma omp parallel sections
{
#pragma omp section
{
m_multiply(0); // run on GPU 0
}
#pragma omp section
{
m_multiply(1); // run on GPU 1
}
#pragma omp section
{
m_multiply(2); // run on GPU 2
}
#pragma omp section
{
m_multiply(3); // run on GPU 3
}
}
}
```

##### Listing 2. Matrix multiplication function using OpenMP GPU offloading directives

```
// Note: ROWSIZE and COLSIZE are macros which define number of rows and columns in the matrix.
void m_multiply(int gpuid)
{
int i = 0; int j = 0; int k = 0;
// get the number of devices (GPUs) on the node.
int ng = omp_get_num_devices();
// calculate number of elements per GPU
int num_elements = ROWSIZE*COLSIZE/ng;
// find the starting index of array for respective GPU
int startInd = gpuid*num_elements;
// start and end of i-index for square matrix traversal
int is = gpuid * (ROWSIZE/ng);
int ie = is + (ROWSIZE/ng);
// OpenMP map clause to copy data to/from GPU
#pragma omp target data
map (to: pA[startInd:num_elements],pB[0:ROWSIZE*COLSIZE])
map (tofrom: pC[startInd:num_elements]) device(gpuid)
#pragma omp target
map (to: pA[startInd:num_elements],pB[0:ROWSIZE*COLSIZE])
map (tofrom: pC[startInd:num_elements]) device(gpuid)
// distributes the work among team of GPU threads
#pragma omp teams distribute parallel for collapse(2) private(i,j,k)
for(i=is;i<ie;i++)
{
for(j=0;j<COLSIZE;j++)
{
for(k=0; k<COLSIZE; k++)
{
pC(i,j)+=pA(i,k)*pB(k,j);
}
}
}
}
```

#### MPI matrix multiplication

##### Listing 3. Matrix multiplication – MPI implementation with OpenMP GPU offloading

```
// Note: ROWSIZE and COLSIZE are macros which define number of rows and columns in the matrix.
int main(int argc, char** argv)
{
int pid; // process rank
int np; // number of processes
int ng = 4; // number of GPUs
int msize = COLSIZE; // matrix size
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD, &np); // assign total number of tasks
MPI_Comm_rank(MPI_COMM_WORLD, &pid); // assign rank
// initialize all matrices
init_arrays_matmul_mpi(np, pid);
MPI_Barrier(MPI_COMM_WORLD);
double start = MPI_Wtime();
// broadcast matrix size to all tasks
MPI_Bcast ( &msize, 1 , MPI_INT , 0 , MPI_COMM_WORLD ) ;
//0th task split pA and send to np tasks.
MPI_Scatter( pA , ROWSIZE/np * COLSIZE, MPI_DOUBLE , recA , ROWSIZE/np * COLSIZE, MPI_DOUBLE , 0, MPI_COMM_WORLD ) ;
// broadcast copy of pB to all tasks
MPI_Bcast ( pB, ROWSIZE * COLSIZE, MPI_DOUBLE , 0 , MPI_COMM_WORLD );
#ifdef MPI_MM_GPU
gpuid = pid*ng/np;
matrix_multiplication_gpuoffload(gpuid, np, pid, recA, recC);
#else // run on CPU
matrix_multiplication(np, pid, recA, recC);
#endif
//0th task gather result (recA) from all tasks.
MPI_Gather( recC, ROWSIZE/np * COLSIZE, MPI_DOUBLE, pC, ROWSIZE/np * COLSIZE, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Barrier(MPI_COMM_WORLD);
double end = MPI_Wtime();
double time = end - start ;
printf (" ********** process %d : time = %f \n" , pid , time ) ;
MPI_Finalize();
return 0;
}
```

##### Listing 4. Matrix multiplication function using OpenMP GPU offloading

```
void matrix_multiplication_gpuoffload(int gpuid, int np, int pid, double *pA, double *pC)
{
int i = 0; int j = 0; int k = 0;
#pragma omp target data
map (to: pA[0:ROWSIZE/np*COLSIZE],pB[0:ROWSIZE*COLSIZE])
map (tofrom: pC[0:ROWSIZE/np*COLSIZE]) device(gpuid)
#pragma omp target device(gpuid)
#pragma omp teams distribute parallel for collapse(2) private(i,j,k)
for(i=0;i<ROWSIZE/np;i++)
{
for(j=0;j<COLSIZE;j++)
{
for(k=0;k<ROWSIZE;k++)
{
pC(i,j)+=pA(i,k)*pB(k,j);
}
}
}
}
```

## Acknowledgement

The authors would like to thank Tom Heller and Leopold Grinberg for their valuable inputs and technical review.