Linux on IBM Power Systems application porting and tuning guide

Introduction

Applications and software tools with outstanding performance are part of IBM® Power Systems™ accelerated solutions. IBM Power Architecture® provides outstanding performance for many applications in the open source community. However, community developers can run into build errors, runtime errors, or performance issues when porting their applications to Linux on Power. This tutorial provides tips and steps for porting and migrating applications to IBM Power® servers, with a focus on performance tuning and optimization.

Prerequisites

Porting and migrating applications to Power servers requires not only development tools (such as compilers), debugging tools, and performance libraries, but also recommended actions to enhance application performance.

Compilers

IBM XL compiler family: IBM XL C/C++ and Fortran compilers are specifically built for Power Architecture with high-performance capability. Community versions of the following compilers are available for community users to download:
- IBM XL C/C++ for Linux
- IBM XL Fortran for Linux
IBM Advance Toolchain: Power-specific GNU compilers (C/C++ and GFortran) can be downloaded for free. Starting with version 11.0, the compiler can automatically map single instruction, multiple data (SIMD) functions from Intel x86 MMX/SSE/AVX to IBM Power AltiVec/VSX. Check out the following references:
- Installing Advance Toolchain for Linux on IBM Power Systems
- GitHub - Advance Toolchain for Linux on Power build system
Performance tools

IBM Engineering and Scientific Subroutine Library (ESSL) and Mathematical Acceleration Subsystem (MASS) provide a wide range of mathematical functions for scientific and engineering applications, including functions found in BLAS, LAPACK, ScaLAPACK, FFTW, MKL, and so on. Refer to the ESSL Community Edition for more details. The MASS library is part of the IBM XL compiler.
MPI and OpenMP

Message Passing Interface (MPI) software stacks such as OpenMPI and MPICH can be built and used on Power servers. IBM Spectrum® MPI is built primarily for Power servers and uses an optimized Parallel Active Message Interface (PAMI) for interconnect scalability.
SIMD mapping tools and guide

Power AltiVec/VSX vector extensions are completely different from x86 MMX/SSE/AVX extensions. Applications and libraries that use x86 MMX/SSE/AVX functions require these functions to be mapped to VSX. Refer to the following resources for mapping SSE to VSX.
- Starting with GCC version 8.0 or IBM Advance Toolchain version 11.0 (GCC 7), the compiler supports direct mapping of SSE to VSX when the following CPP flag is added:
  -DNO_WARN_X86_INTRINSICS
- OpenPOWER Vector Intrinsics guide
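
As an illustration of the direct mapping, the following minimal sketch uses a few SSE intrinsics that compile unchanged on x86 and, assuming GCC 8+ or Advance Toolchain 11.0+, on ppc64le. The function name add_floats and the HAVE_SSE_API macro are hypothetical and exist only for this example. The macro NO_WARN_X86_INTRINSICS is defined in the source here so that the sketch is self-contained; in a real build it would normally be passed as -DNO_WARN_X86_INTRINSICS.

```c
/* On ppc64le the GCC compatibility headers stop with an #error unless
 * NO_WARN_X86_INTRINSICS is defined. It is normally set on the command
 * line; defining it here is an assumption made to keep the sketch
 * self-contained. */
#if defined(__powerpc64__) && !defined(NO_WARN_X86_INTRINSICS)
#define NO_WARN_X86_INTRINSICS 1
#endif

#if defined(__SSE__) || (defined(__powerpc64__) && defined(__VSX__))
#include <xmmintrin.h>
#define HAVE_SSE_API 1   /* hypothetical helper macro for this example */
#endif

#include <stddef.h>

/* Adds b to a element-wise, four floats at a time where the SSE API exists. */
void add_floats(float *a, const float *b, size_t n)
{
    size_t i = 0;
#ifdef HAVE_SSE_API
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the only path on targets without the SSE API. */
    for (; i < n; ++i)
        a[i] += b[i];
}
```

The same source builds on both architectures. On Power, the intrinsics resolve to the VSX implementations in the compiler's compatibility headers.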
Debugging and profiling tools

Many debugging and profiling tools are available on Power, including perf (or OProfile), gprof, gdb, and valgrind, in addition to Linux system tools. nmon, a widely used system tool for monitoring processor, memory, I/O, and network activity, is also available on Power for analyzing application performance.
NVIDIA CUDA tools (GPU support)

The CUDA development toolkit for Power (ppc64le) is used to build GPU applications on Power. You can download the package from the NVIDIA developer website (select Linux -> ppc64le). Container support is maintained at: https://hub.docker.com/r/nvidia/cuda-ppc64le.

General application migration from x86 to Power

The process of application porting and migration varies from application to application. This section is a general guide to follow when you consider porting an application to Power. The process includes the following steps:

Compiling and source code optimization

Most applications can be compiled and run on Power Systems without modifying the source code. Architecture-specific options can be applied to improve performance, such as:
- compiler flags: -O3 -flto -fpeel-loops -funroll-loops -ftree-vectorize -ffast-math -mcpu=power9 -mtune=power9
- SIMD code: -DNO_WARN_X86_INTRINSICS to map MMX/SSE to VSX (only GCC 8+ or Advance Toolchain 11.0+)
- OpenMP or parallel app: -fopenmp -pthread
- IBM MASS: -mveclibabi=mass -lmassvp9 -lmass_simdp9 -lmass -lm
Because the default char is unsigned on Power, always use the -fsigned-char option to match the signed char default on x86.
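
The effect of the char default can be seen in a small sketch. The function names below are hypothetical. The example reads the same byte through plain char, signed char, and unsigned char. Only the plain char result changes between x86 and Power, or when -fsigned-char or -funsigned-char is used.

```c
#include <limits.h>

/* Returns 1 if plain char is signed with the current compiler and flags. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Reads one byte through plain char. The result depends on the platform
 * default (signed on x86, unsigned on Power) or on -fsigned-char. */
int widen_plain(const void *p)
{
    return *(const char *)p;
}

/* Explicitly signed and unsigned reads give the same result on every
 * platform, so they are the portable way to express the intent. */
int widen_signed(const void *p)
{
    return *(const signed char *)p;
}

int widen_unsigned(const void *p)
{
    return *(const unsigned char *)p;
}
```

For a byte with all bits set, widen_signed yields -1 and widen_unsigned yields 255, while widen_plain follows whichever default is in effect. Code that compares plain char values against negative constants or EOF is a typical place where this difference shows up after porting.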
Application profiling and debugging (perf, gprof, valgrind, nmon, and so on)

When an application fails or performs poorly on the system, debugging and profiling tools help to identify the issues. The most common faults are caused by incompatible data types or by older versions of libraries. Debugging tools make these problems easier to identify. Sometimes it is necessary to insert print statements that emit diagnostic information. Application profiling is used to identify performance bottlenecks (for example, poor parallel scalability or heavy memory load and store traffic). These can be improved by modifying the source code to make better use of system resources. Some applications use x86-specific defines (such as #ifdef __SSE__ or #ifdef __x86_64__) to enable optimized code paths on x86 systems. In this case, manually adding -D__SSE__ or -D__PPC__ is necessary to map or replace x86 functions with Power-specific functions. If there is x86-specific assembly code, it must be replaced with Power assembly code.
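
A guarded code path of this kind might look like the following sketch. The function scale_floats is hypothetical. It keeps the existing __SSE__ branch, adds a Power branch that uses VSX built-ins from altivec.h, and falls back to scalar code elsewhere. This is one possible layout under those assumptions, not a prescribed pattern.

```c
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>      /* x86 path (or Power with -D__SSE__ and the mapping headers) */
#elif defined(__VSX__)
#include <altivec.h>        /* native Power path */
#endif

/* Multiplies every element of x by k. */
void scale_floats(float *x, size_t n, float k)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 vk = _mm_set1_ps(k);
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(x + i, _mm_mul_ps(_mm_loadu_ps(x + i), vk));
#elif defined(__VSX__)
    const __vector float vk = vec_splats(k);
    for (; i + 4 <= n; i += 4)
        vec_xst(vec_mul(vec_xl(0, x + i), vk), 0, x + i);
#endif
    /* Scalar tail, and the full loop on targets with neither extension. */
    for (; i < n; ++i)
        x[i] *= k;
}
```

Keeping a scalar fallback makes it easier to validate the vector branches against a reference result on each platform.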
Tools and libraries for application optimization

Linking applications with performance libraries such as ESSL, OpenBLAS, CuBLAS, LAPACK, ScaLAPACK, FFTW with FMA enabled, and parallel libraries improves application performance on Power if the applications make heavy use of mathematical functions.
Runtime optimization

Each IBM POWER® processor core supports multiple simultaneous multithreading (SMT) threads, which can increase system resource usage through the system’s dynamic scheduling. For example, BWA, a genomic sequence alignment tool, can run two to three times faster when using 160 threads on a 40-core IBM POWER9™ node. Setting process affinity with taskset and numactl, or setting MPI, OpenMP, or LSF binding variables, is also useful for improving application performance. For sequential applications, partitioning the workload to allow parallel execution can take advantage of the scalability of POWER for significant throughput improvement.
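
As a minimal sketch of a thread-enabled loop, the following example uses an OpenMP reduction. The function names are hypothetical. It assumes a build with -fopenmp. Without that flag the pragma is ignored and the loop runs on one thread. The thread count and binding can then be set at run time with the standard OpenMP environment variables OMP_NUM_THREADS, OMP_PLACES, and OMP_PROC_BIND.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Number of threads the OpenMP runtime would use; 1 when the program
 * is built without -fopenmp. */
int worker_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Sum of squares, with the iterations divided among the threads and the
 * partial sums combined by the reduction clause. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;
    long i;
#pragma omp parallel for reduction(+:sum)
    for (i = 0; i < (long)n; ++i)
        sum += x[i] * x[i];
    return sum;
}
```

On an SMT system it is worth measuring several values of OMP_NUM_THREADS, because the best setting depends on how well the workload shares a core.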

Application tuning and optimization tips

The goal of porting and migrating applications to Power is to improve workload performance. Tuning and optimizing applications for Power is always an important step. Different applications require different techniques to increase performance. The following techniques can help:

Mapping MMX/SSE/AVX to VSX

IBM has published tips (https://www.ibm.com/support/pages/vectorizing-fun-and-performance) and a guide to help port code containing MMX/SSE/AVX to VSX (https://openpowerfoundation.org/?resource_lib=linux-power-porting-guide-vector-intrinsics). Additionally, the following tips are recommended for porting applications that involve SSE/AVX mapping:
- Using GCC 8+ and IBM Advance Toolchain 11+:
  
  The new GCC compiler makes it easy to map most of the MMX/SSE functions to AltiVec/VSX without modifying the source code: add -DNO_WARN_X86_INTRINSICS to the compiler flags. When this flag is defined, the compiler searches the header files (such as emmintrin.h, xmmintrin.h, pmmintrin.h, and smmintrin.h) in $COMPILER_HOME/lib/gcc/powerpc64le-linux-gnu/$VERSION/include for the mapping functions.
- Replacing Intel® ASM with Power ASM or generic GCC built-in functions:
  
  Many applications with SSE/AVX enabled include assembly code in the program. This assembly code is platform-specific and must be replaced with GCC built-in functions or Power assembly code. For example,
  
  Intel:
  #define popcnt(x) asm("popcnt %[val], %[val]" : [val] "+r" (x) : : "cc")
  
  Power:
  #define popcnt(x) __builtin_popcount(x)
- Manually replacing SIMD functions:
  
  If POWER processor-compatible definitions of MMX/SSE/AVX functions are not available, these functions (usually prefixed with _mm) need to be ported manually. First, understand the function by looking it up in the Intel intrinsics guide, then map it to an equivalent VSX function. If no equivalent function exists, implement it with multiple VSX or generic functions and validate the result. For example:
```
/* Set the MXCSR control and status register with the value in unsigned 32-bit integer a.
 * MXCSR := a[31:0]
 * Use FPSCR/VSCR for Power */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setcsr (unsigned int val)
{
    union {
        double d;
        unsigned int val[2];
    } u;
    u.val[0] = 0xFFF80000;
    u.val[1] = val;
    /* mtfsf with field mask 255 writes all eight FPSCR fields from the FPR. */
    __asm__ __volatile__("mtfsf 255,%0" : : "f"(u.d));
}

/* New Functions for SSE4.1
 * Ruzhu Chen added 07-26-2019 */
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_min_epi32 (__m128i __A, __m128i __B)
{
    /* vec_vminsw (altivec.h): element-wise minimum of signed 32-bit words. */
    return (__m128i)vec_vminsw((__v4si)__A, (__v4si)__B);
}
```
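For the assembly replacement tip in the list above, a portable version of the population count could be sketched as follows. The function name popcount64 is hypothetical. It relies on the GCC and Clang built-in __builtin_popcountll and includes a plain C fallback for other compilers, so no architecture-specific assembly remains in the source.

```c
#include <stdint.h>

/* Counts the set bits in a 64-bit word without inline assembly. */
unsigned popcount64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* The compiler chooses the instruction sequence for the target. */
    return (unsigned)__builtin_popcountll(x);
#else
    /* Fallback: clear the lowest set bit until none remain. */
    unsigned n = 0;
    while (x) {
        x &= x - 1;
        ++n;
    }
    return n;
#endif
}
```

Unlike the macro form, this function returns the count instead of overwriting its argument, so call sites that relied on the in-place behavior need a small adjustment.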
POWER processor-specific compiler options

The IBM XL compiler family and the IBM Advance Toolchain compilers have specific compiler options that improve performance. The following tips serve as a guide for modifying your build script.

IBM Advance Toolchain compiler or GCC 8+
- Remove -msse or -msse4.1 and other AVX compiler options.
- Add -DNO_WARN_X86_INTRINSICS to compiler flags.
- Find code guarded by the default x86 CPP macros __x86_64__ or __amd64__. Add or replace them with __PPC__ or __PPC64__ in the source, or add -D__x86_64__ to CPPFLAGS.
- Search for __SSE__, __SSE4_1__, … __AVX__ in the source code and define the ones that are used in CPPFLAGS.
- Check whether config.guess recognizes ppc64le. If not, run autoreconf -if to regenerate the configure files.
- By default, the char type is unsigned on POWER, whereas it is signed on x86. Always include -fsigned-char when compiling.
- Add POWER processor-specific options: -mcpu=power9, -mtune=power9 (if no SSE conversion)
- Include OpenMP: -fopenmp
- If using ESSL or MASS, linking with the XL runtime libraries (-lxlsmpopt) or other IBM XL compiler libraries may be needed.
IBM XL compilers
- XL compilers: xlc, xlC, xlf, xlf90 or xlc_r, xlC_r, xlf_r and xlf90_r if threaded.
- For MMX/SSE/AVX code, use Veclib scripts (-Ipath-to/veclib/include) to convert MMX/SSE/AVX functions to VSX, and compile with the -qaltivec option.
- XL compiler options: -O3 -qarch=pwr9 -qtune=pwr9 -qcache=auto
- OpenMP options: -qsmp=omp or -fopenmp
- Use with CUDA: for example, nvcc -ccbin xlC -m64 -Xcompiler -O3 -Xcompiler -q64 -Xcompiler -qsmp=omp -gencode arch=compute_70,code=sm_70.
- Signed char: -qchar=signed
- XL Fortran: replace -Dsome_define with -WF,-Dsome_define for older XL Fortran versions (15 and earlier), and use the -WF,-C! option for C/C++ style comments.

Debugging and profiling tips

Profiling and debugging steps are optionally used for investigating performance issues. Some applications may need performance analysis and tuning to achieve maximum performance on Linux on Power. Profiling helps to identify the performance bottlenecks which can be modified to boost performance on Power.

Profiling application with gprof

To use gprof, compile and link the program with the -pg option, then run the program to generate the output file, gmon.out. Use gprof to analyze the profile data. There are two outputs for analysis: the flat profile and the call graph. The flat profile shows how much time is spent in each function, how many times each function is called, and which functions are called most often. The call graph shows the relationships between the functions in the application as a calling tree, together with timing information.
Profiling application with perf

Compile the application with debug symbols (the -g option) and run it under perf record to collect samples into the perf.data file. The output is then analyzed with perf report --stdio --sort=pid for a binary image summary or perf report --stdio -n for a symbol summary including per-application libraries. To annotate the source code with timing information, use the perf annotate --stdio [-n] command.
Application debugging tips

Valgrind is simple to use because you do not need to modify, recompile, or relink your program. It can debug large programs and works with almost any kind of software written in any language. It also works with the gdb debugger. Refer to the following example.

Figure 1. Source code causes error

Figure 2. Debugging output from Valgrind
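
The kind of defect that Valgrind reports can be illustrated with a short sketch. This is a hypothetical example and not the code shown in Figure 1. The function dup_string copies a string to the heap. The comment shows an undersized allocation that would cause a write past the end of the block, which is the sort of invalid write that Valgrind's memcheck tool flags, and the code below it is the corrected version.

```c
#include <stdlib.h>
#include <string.h>

/* Returns a heap copy of s, or NULL if allocation fails.
 * The caller releases the copy with free(). */
char *dup_string(const char *s)
{
    size_t n = strlen(s);

    /* Defective variant, shown only as a comment:
     *     char *p = malloc(n);
     * The block is one byte too small for the terminating NUL, so the
     * copy below would write past its end. */
    char *p = malloc(n + 1);
    if (p == NULL)
        return NULL;

    memcpy(p, s, n + 1);    /* copies the characters and the NUL */
    return p;
}
```

Running the program under valgrind reports such a write together with the call stack of the allocation, and forgetting the matching free() appears in the leak summary when --leak-check=full is used.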

Application runtime tuning tips

The goal of runtime tuning is to maximize system resource usage to improve application throughput. Some applications have been shown to run up to 10 times faster after tuning for POWER.

Tuning with system tools

Using taskset and numactl to bind processor and memory affinity can significantly improve the performance of multithreaded programs. The htop or top system command is used to monitor the processor affinity. Other commands such as ppc64_cpu, ulimit, cpupower, nmon, vmstat, mpstat, iostat, netstat, and so on can be used to monitor the application execution.
Increase CPU/GPU usage

Many applications are thread-enabled or implemented with MPI/OpenMP. These applications can be tuned with proper system or application parameters, such as number of threads, message or packet sizes, temp execution folder, and batch size to increase CPU or GPU usage.
Increase throughput

Some applications do not scale well beyond a certain number of threads, use only a single thread, or run poorly because of I/O or memory bottlenecks. These can be tuned by running the optimal number of threads or by preloading a large amount of data into memory to improve throughput. For instance, a GATK application pipeline script can be modified to partition input data sets into multiple intervals that run on a 40-core POWER9 processor, reducing the run time from 85 hours to 6 hours.
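
The interval partitioning idea can be sketched with a small helper. The names below are hypothetical and this is not taken from the GATK pipeline script. The helper splits a range of work items into near-equal, non-overlapping intervals so that each interval can be handed to a separate process or thread. It assumes parts is greater than zero and idx is less than parts.

```c
#include <stddef.h>

/* Half-open interval [begin, end) of work items. */
struct interval {
    size_t begin;
    size_t end;
};

/* Returns the idx-th of `parts` near-equal intervals covering [0, total).
 * Assumes parts > 0 and idx < parts. */
struct interval partition_interval(size_t total, size_t parts, size_t idx)
{
    size_t base = total / parts;     /* minimum items per interval */
    size_t extra = total % parts;    /* the first `extra` intervals get one more */
    struct interval iv;

    iv.begin = idx * base + (idx < extra ? idx : extra);
    iv.end = iv.begin + base + (idx < extra ? 1 : 0);
    return iv;
}
```

For example, 10 items split into 4 parts gives the intervals [0, 3), [3, 6), [6, 8), and [8, 10). Keeping the interval sizes within one item of each other helps all workers finish at about the same time.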

Power AppStore - Anaconda community support

Many prebuilt applications in the open source community are now available on anaconda cloud for users to install and use conveniently, but most application developers do not provide prebuilt binaries for the Power platform. Even if binaries are available for Power, they are built generically. Therefore, we need to create our own anaconda channel for Power with tuned and optimized binaries. The BioHPDA channel is an example we created for hosting these applications for community users. You can easily use the applications in the anaconda channel to create container images for the cloud environment.

Build applications for anaconda channel.

Anaconda channel is an app store for Power Systems users to share and contribute their optimized open source applications. To build an application for the store, the conda-build package is needed.
- Install anaconda or miniconda in your user environment and activate it.
- Install conda-build: conda install conda-build
- Download and build your application: port your application to the Power server first. After successful porting, tuning, and execution, create the build.sh and meta.yaml scripts in your recipe folder. If a patch is needed, create the patch and specify it in your YAML file, which contains the dependency requirements and the source code path. Then issue conda-build . to build the application. If successful, a .tar.bz2 file is created.
- Test your build with conda install your-path-to/your-app.tar.bz2
Publish build recipes with optimized scripts and patches.

Your conda build recipe is included in your application build package, which can be shared with the open source community. A place to host these recipes is https://github.com/ppc64le/build-scripts.git. The BioHPDA channel recipes, which contain more than 480 genomics and cryo-EM applications and tools, are available here.
Use the anaconda channel.

The anaconda channel can be used to install individual applications or create an environment to install multiple applications.
- Install individual application:
  
  Before installing an application, the specific anaconda channel needs to be added to your environment using the conda config --add channels channel-url command. Then you can install the application using the conda install your-app=version command.
- Install conda environment: Create an environment YAML file as shown in the following example.
  
  To install an environment, run the conda env create -f environment.yaml command.

Summary

It is possible to make full use of the strengths of the IBM POWER platform by porting, tuning, and optimizing applications. This is a general tuning and optimization guide with tips for users to enable their applications on Power servers. This tutorial provided the resources and techniques necessary for supporting and creating accelerated solutions on Linux on Power.