Linux on IBM Power application porting and tuning guide

Introduction

Applications and software tools with outstanding performance are part of IBM Power accelerated solutions, and IBM Power Architecture provides outstanding performance for many open-source applications. However, community developers can run into build issues, performance problems, or runtime errors when porting their applications to Linux on Power. This tutorial provides tips and steps for porting and migrating applications to IBM Power servers, with a focus on performance tuning and optimization.

Prerequisites

Porting and migrating applications to Power servers requires not only development tools (such as compilers), debugging tools, and performance libraries, but also a set of recommended actions to enhance application performance.

Compilers

IBM XL compiler family: IBM Open XL C/C++ and Fortran compilers are built specifically for Power Architecture and deliver high performance. Community versions of the following compilers are available for download:
- IBM Open XL C/C++ for Linux on Power
- IBM Open XL Fortran for Linux on Power

IBM Advance Toolchain: Power-specific GNU compilers (C/C++ and GFortran) can be downloaded for free. With version 19.0, the compiler can map single instruction, multiple data (SIMD) intrinsics from Intel x86 MMX/SSE/AVX to IBM Power AltiVec/VSX. Check out the following references for more information:
- Installing Advance Toolchain for Linux on IBM Power Systems
- GitHub - Advance Toolchain for Linux on Power build system

Performance tools

IBM Engineering and Scientific Subroutine Library (ESSL), Parallel ESSL (PESSL), and Mathematical Acceleration Subsystem (MASS) provide a wide range of mathematical functions for scientific and engineering applications, covering functionality found in Basic Linear Algebra Subprograms (BLAS), Linear Algebra PACKage (LAPACK), ScaLAPACK, Fastest Fourier Transform in the West (FFTW), Math Kernel Library (MKL), and so on. Refer to Engineering and Scientific Subroutine Library (ESSL / PESSL) for details on ESSL. The MASS library is part of the IBM Open XL compiler; refer to Mathematical Acceleration Subsystem (MASS) Libraries for details.

MPI and OpenMP

Message Passing Interface (MPI) software stacks, such as Open MPI and MPICH, can be built and used on Power servers. IBM Spectrum MPI is built primarily for Power servers and uses the Parallel Active Message Interface (PAMI) for interconnect scalability.

SIMD mapping tools and guide

Power AltiVec/VSX vector extensions are completely different from x86 MMX/SSE/AVX extensions, so applications and libraries that call x86 MMX/SSE/AVX functions need those calls mapped to VSX. Refer to the following resources for mapping SSE to VSX.
- With GCC version 16.1 or IBM Advance Toolchain version 19.0 (GCC 15.1.1), the compiler supports direct mapping of SSE to VSX when -DNO_WARN_X86_INTRINSICS is added to the preprocessor flags (CPPFLAGS).
- OpenPOWER Vector Intrinsics guide

Debugging and profiling tools

Many debugging and profiling tools are available on Power, including gprof, gdb, valgrind, and perf (or OProfile), in addition to Linux system tools. nmon, a widely used system tool for monitoring processor, memory, I/O, and network activity, is also available on Power for analyzing application performance. NUMA awareness tools, such as numactl or taskset, are also available for thread binding.

General application migration from x86 to Power

The process of application porting and migration varies from application to application. This section is a general guide to follow when you consider porting an application to Power. The process includes the following tasks.

Portability audit

Before porting to the new architecture, developers should perform a thorough audit of the codebase to identify non‑portable constructs. Pay particular attention to:

Architecture macros: Search for __x86_64__ or similar preprocessor checks that hard‑code assumptions about x86 platforms. Replace them with more general feature detection or portable alternatives.
Inline assembly: Inline asm blocks are often architecture‑specific. Review and refactor them into portable intrinsics or conditional code paths where possible.
immintrin.h usage: This header provides Intel SIMD intrinsics (SSE/AVX). Code relying on it will not compile on non‑x86 systems. Consider abstracting vector operations or using libraries with cross‑platform SIMD support.
Endian/alignment assumptions: Audit code for assumptions about byte order (little versus big endian) and memory alignment. Use standard macros (htobe32, be32toh, and so on) or portable APIs to handle endianness safely.
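
As a minimal sketch of the endianness and alignment advice above, the following C helpers read and write a 32-bit big-endian field one byte at a time instead of casting a buffer pointer to an integer type. The names read_be32 and write_be32 are hypothetical and do not come from any library mentioned in this tutorial. On glibc systems, be32toh and htobe32 from <endian.h> are an alternative once the bytes have been copied into an integer.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 32-bit big-endian value from a byte buffer.
 * The shifts operate on values rather than on memory layout, so the result
 * is the same on little-endian hosts (x86_64, ppc64le) and big-endian hosts.
 * Reading byte by byte also avoids any alignment requirement on p. */
static inline uint32_t read_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Hypothetical helper: encode v into four bytes, most significant first. */
static inline void write_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

Because neither helper depends on the host byte order, the same source can be shared between the x86 and Power builds without an #ifdef.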

Build system fixes

Perform the following steps to fix the build system:

Update the config.guess file.
- Obtain the latest config.guess script from the GNU project:
```
wget -O config.guess 'https://git.savannah.gnu.org/cgit/config.git/plain/config.guess'
```
- Replace the existing config.guess in your project's build-aux or config directory.
Regenerate the autotools scripts.

Run the autoreconf command to refresh the configure script and the related files. This ensures that the updated config.guess file is properly integrated:
```
autoreconf -fi
```
Verify processor detection using the CMake tool.
- When using CMake, confirm that the system processor is correctly identified using the following script snippet:
```
message(STATUS "Processor: ${CMAKE_SYSTEM_PROCESSOR}")
```
- If CMake does not automatically detect ppc64le, explicitly set it using the following command:
```
cmake -DCMAKE_SYSTEM_PROCESSOR=ppc64le ..
```

Compiling and source code optimization

Most applications can be compiled and run on Power servers without modifying the source code. If you plan to use the GNU Compiler Collection (GCC), refer to the GCC documentation for comprehensive information. You can improve performance using the following architecture-specific options:

Compiler flags: -O3 -flto -fpeel-loops -funroll-loops -ftree-vectorize -ffast-math -mcpu=power11 -mtune=power11
SIMD code: -DNO_WARN_X86_INTRINSICS to map MMX/SSE to VSX (only GCC 16.1+ or AT 19.0)
OpenMP or parallel app: -fopenmp -pthread
IBM MASS: -mveclibabi=mass -lxlsmp -lm

Because the char type is unsigned by default on Power, always use the -fsigned-char option when porting x86 code, so that char matches the signed default on x86.
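
To illustrate why the signedness of char matters, here is a small C sketch. It assumes a common pattern in x86 code, where a byte is stored in a plain char and then compared with a negative value or sign-extended. The helper names plain_char_is_signed and byte_to_signed are hypothetical. The second helper shows a source-level alternative that gives the same result with or without -fsigned-char.

```c
#include <limits.h>

/* Returns 1 when plain char is signed in this build (the x86 default, or
 * any build that uses -fsigned-char) and 0 when it is unsigned (the default
 * on Power). CHAR_MIN is 0 exactly when plain char is unsigned. */
static inline int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Hypothetical helper: interpret a raw byte as a two's-complement signed
 * 8-bit value without relying on the signedness of plain char. */
static inline int byte_to_signed(unsigned char b)
{
    return (b < 128) ? (int)b : (int)b - 256;
}
```

Code that tests `c < 0` on a plain char silently stops matching on Power unless -fsigned-char is used or the code is rewritten along these lines.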

For a brief guide for setup and performance tuning of GCC on IBM Power, refer to the IBM Support documentation.

Application profiling and debugging (perf, gprof, valgrind, nmon, and so on)

When an application fails or performs poorly on the system, debugging and profiling tools can help identify the cause. The most common faults are caused by incompatible data types or by older versions of libraries. Debugging tools make these problems easier to identify; sometimes it is also necessary to insert print statements that emit diagnostic information. Application profiling identifies performance bottlenecks (for example, poor parallel scalability or heavy memory load and store traffic), which can then be addressed by changing the source code to make better use of system resources. Some applications include x86-specific macros (such as #ifdef __SSE__ or #ifdef __x86_64__) to enable optimized code paths on x86 systems. In this case, manually adding a compiler flag, such as -D__PPC64__, is necessary to map or replace x86 functions with Power-specific functions. Any x86-specific assembly code must be replaced with Power assembly code.
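
The macro handling described above can be sketched as follows. This is an illustration under stated assumptions rather than a pattern taken from any particular application: the function names are hypothetical, and the sketch relies on GCC and Clang predefining __powerpc64__ and __PPC64__ on 64-bit Power targets, and __VSX__ when VSX code generation is enabled.

```c
/* Hypothetical helper: report which architecture-specific branch the
 * preprocessor selected, with a generic fallback so that the code still
 * builds on a platform that matches neither check. */
static inline const char *selected_code_path(void)
{
#if defined(__x86_64__) || defined(__i386__)
    return "x86";
#elif defined(__powerpc64__) || defined(__PPC64__)
    return "power";
#else
    return "generic";
#endif
}

/* Hypothetical helper: 1 when the compiler was allowed to emit VSX
 * instructions for this translation unit, 0 otherwise. */
static inline int built_with_vsx(void)
{
#if defined(__VSX__)
    return 1;
#else
    return 0;
#endif
}
```

Keeping a generic branch next to the x86 and Power branches means a missing macro produces slower code instead of a build failure.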

Tools and libraries for application optimization

Linking applications with optimized performance libraries such as ESSL 7.1, OpenBLAS, cuBLAS, LAPACK, ScaLAPACK, and FFTW, along with enabling FMA and parallel libraries, can significantly improve application performance on Power systems, particularly for applications that make extensive use of mathematical computations.

Runtime optimization

Each IBM Power processor core supports multiple simultaneous multithreading (SMT) threads, which can increase system resource usage through the system's dynamic scheduling. IBM Power9, Power10, and Power11 hardware can use up to eight threads per core. For example, Burrows-Wheeler Aligner (BWA), a genomic sequence alignment tool, can run two to three times faster when using 160 threads on a 40-core IBM Power node. Setting process affinity with taskset and numactl, or through the binding variables of IBM Spectrum MPI 7.3, OpenMPI 5.x, MPICH 5.x, or IBM Spectrum LSF, also helps to improve application performance. For sequential applications, partitioning the workload to allow parallel execution can take advantage of the scalability of Power for significant throughput improvement.

NUMA alignment

NUMA tuning keeps compute close to memory. On Power, inspect the topology using the lscpu and numactl --hardware commands and tune SMT using the ppc64_cpu command; test memory interleaving for bandwidth-heavy jobs; and validate with real workload benchmarks, because bad pinning can reduce throughput or raise latency. NUMA awareness tools such as numactl allow you to pin threads to specific CPUs or memory nodes, which avoids costly remote memory access and improves performance consistency on multi-socket, multi-node systems.

Container / CI recommendation

For continuous integration (CI) and performance testing on ppc64le systems, developers should prefer native Linux/ppc64le containers over emulation with QEMU. Native containers are preferred for the following reasons:

Performance accuracy: Native containers run directly on the hardware, ensuring benchmarks and performance tests reflect real-world results.
Improved performance: QEMU introduces significant emulation latency, which can distort timing, throughput, and resource usage metrics.
Simplified debugging: Native containers avoid emulation-related quirks, making it easier to diagnose architecture-specific issues.
CI consistency: Using native containers ensures that CI pipelines produce reproducible results aligned with production environments.

Reserve QEMU usage for functional validation only when native hardware access is unavailable.

Application tuning and optimization tips

The goal of porting and migrating applications to Power is to improve workload performance. Tuning and optimizing applications for Power is always an important step. Different applications require different techniques to increase performance. The following techniques can help.

Mapping MMX/SSE/AVX to VSX

IBM has published tips (https://www.ibm.com/support/pages/vectorizing-fun-and-performance) and a guide (https://openpowerfoundation.org) to help port code containing MMX/SSE/AVX to VSX. As a first step, rely on modern auto-vectorization (-O3 -ftree-vectorize) before attempting manual VSX intrinsic translation. Additionally, the following tips are recommended for porting applications involving SSE/AVX mapping.

Using GCC 16.1+ and IBM Advance Toolchain 19+

The GCC compiler makes it easy to map most of the MMX/SSE functions to AltiVec/VSX without modifying the source code: add the -DNO_WARN_X86_INTRINSICS option to the compiler flags. When this macro is defined, the compiler searches for header files (such as emmintrin.h, xmmintrin.h, pmmintrin.h, and smmintrin.h) in the $COMPILER_HOME/lib/gcc/powerpc64le-linux-gnu/$VERSION/include path to map the functions.
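
The following sketch shows what such unmodified SSE code can look like. It assumes a compiler that ships the x86 compatibility headers for Power, as described above. The add4f helper and the HAVE_SSE_API macro are hypothetical names introduced for illustration. Defining NO_WARN_X86_INTRINSICS in the source is used here only to keep the example self-contained; passing -DNO_WARN_X86_INTRINSICS in the compiler flags has the same effect.

```c
#if defined(__powerpc64__) && !defined(NO_WARN_X86_INTRINSICS)
/* Same effect as -DNO_WARN_X86_INTRINSICS on the command line. */
#define NO_WARN_X86_INTRINSICS 1
#endif

#if defined(__x86_64__) || defined(__powerpc64__)
#include <xmmintrin.h>   /* SSE API: native on x86, mapped to VSX on Power */
#define HAVE_SSE_API 1
#else
#define HAVE_SSE_API 0   /* any other target: use the scalar fallback */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for four floats. */
static inline void add4f(const float *a, const float *b, float *out)
{
#if HAVE_SSE_API
    __m128 va = _mm_loadu_ps(a);            /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb)); /* lane-wise add, unaligned store */
#else
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Results from the mapped intrinsics should still be checked against a scalar reference, because the mapping is a migration aid rather than a guarantee of identical behavior.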

Replacing Intel ASM with Power ASM or generic GCC built-in functions

Many applications with SSE/AVX enabled include assembly code in the program. This assembly code is platform-specific and must be replaced with GCC built-in functions or Power assembly code. For example, a population-count macro can be defined as follows on Intel and Power systems:

Intel:

#define popcnt(x) asm("popcnt %[val], %[val]" : [val] "+r" (x) : : "cc")

Power:

#define popcnt(x) __builtin_popcount(x)

Note that the Intel macro overwrites x in place, whereas the built-in function returns the count, so call sites may need to be adjusted.
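
As a hedged sketch of the same idea in function form, the helper below wraps the 64-bit GCC/Clang built-in and adds a plain C fallback for other compilers. The name popcount64 is hypothetical. Unlike the in-place assembly macro, it returns the count and leaves its argument unchanged.

```c
#include <stdint.h>

/* Hypothetical helper: number of set bits in a 64-bit value.
 * With GCC or Clang, the built-in lets the compiler pick the native
 * instruction for the target; no inline assembly is needed. */
static inline unsigned popcount64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    return (unsigned)__builtin_popcountll(x);
#else
    unsigned n = 0;
    while (x) {          /* clear the lowest set bit on each iteration */
        x &= x - 1;
        ++n;
    }
    return n;
#endif
}
```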

Manually replacing SIMD functions

If Power processor-compatible definitions of MMX/SSE/AVX functions are not available, these functions (usually prefixed with _mm) need to be ported manually. First, understand what the function does by looking it up in the Intel Intrinsics Guide, and then map it to an identical VSX function. If an identical function is not found, multiple VSX or generic functions can be combined to implement it, and the result validated, as shown in the following examples:

/* Set the MXCSR control and status register with the value in unsigned 32-bit integer a.
 * MXCSR := a[31:0]
 * Use FPSCR/VSCR for Power */
extern __inline void __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_setcsr (unsigned int val)
{
  union {
    double d;
    unsigned int val[2];
  } u;
  u.val[0] = 0xFFF80000;
  u.val[1] = val;
  __asm__ __volatile__("mtfsf 255,%0" : : "f"(u.d));
}

/* New Functions for SSE4.1
 * Ruzhu Chen added 07-26-2019 */
extern __inline __m128i __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_min_epi32 (__m128i __A, __m128i __B)
{
  return (__m128i)vec_vminsw((__v4si)__A, (__v4si)__B);
}
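
One way to approach the validation step is to keep a scalar reference next to each manually ported function and compare the two on test inputs. The sketch below does this for a four-lane signed 32-bit minimum. All function names are hypothetical. The Power branch assumes that <altivec.h> and vec_min are available for vector signed int, and other targets fall back to the scalar reference so that the file still builds.

```c
#include <stdint.h>
#include <string.h>

#if defined(__powerpc64__) && defined(__ALTIVEC__)
#include <altivec.h>
#endif

/* Scalar reference: lane-wise minimum of four signed 32-bit integers. */
static inline void min4_i32_ref(const int32_t *a, const int32_t *b, int32_t *out)
{
    for (int i = 0; i < 4; ++i)
        out[i] = (a[i] < b[i]) ? a[i] : b[i];
}

/* Implementation under test: vector code on Power, reference elsewhere. */
static inline void min4_i32(const int32_t *a, const int32_t *b, int32_t *out)
{
#if defined(__powerpc64__) && defined(__ALTIVEC__)
    __vector signed int va, vb, vr;
    memcpy(&va, a, sizeof va);   /* memcpy avoids alignment assumptions */
    memcpy(&vb, b, sizeof vb);
    vr = vec_min(va, vb);
    memcpy(out, &vr, sizeof vr);
#else
    min4_i32_ref(a, b, out);
#endif
}

/* Returns 1 when the implementation under test agrees with the reference. */
static inline int min4_i32_matches_ref(const int32_t *a, const int32_t *b)
{
    int32_t got[4], want[4];
    min4_i32(a, b, got);
    min4_i32_ref(a, b, want);
    return memcmp(got, want, sizeof got) == 0;
}
```

Running such a comparison on edge cases (negative values, equal lanes, INT32_MIN and INT32_MAX) on the Power system itself gives some confidence that a hand-mapped intrinsic behaves like the x86 original.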

Power processor-specific compiler options

The IBM Open XL compiler family and the IBM Advance Toolchain compilers have specific compiler options that improve performance. The following tips serve as a guide for modifying your build script.

IBM Advance Toolchain compiler or GCC 16.1+

Remove the -msse and -msse4.1 options and other AVX compiler options.
Add the -DNO_WARN_X86_INTRINSICS option to compiler flags. This is a compatibility/migration aid that requires unit testing and manual validation, rather than an automatic conversion tool.
Strip out #ifdef __SSE__ blocks or replace them with portable alternatives, such as #ifdef __PPC64__ or -D__PPC64__.
By default, the char type is unsigned on Power, whereas it is signed on x86. Always include the -fsigned-char option when porting applications written for x86.
Add Power processor-specific options: -mcpu=power11 and -mtune=power11 (if no SSE conversion is needed). Binaries built with these options may not run on older processors.
Enable OpenMP: -fopenmp
If using ESSL 7.1 or MASS 5.2, link with the Open XL runtime libraries (for example, by using -fopenmp). XL's proprietary SMP runtime libraries are obsolete; modern builds with ESSL/MASS should rely on the LLVM/OpenMP runtime provided by IBM Open XL.

IBM Open XL compilers

Open XL compilers: ibm-clang, ibm-clang++, ibm-openxl-fortran, or, for threaded code, ibm-clang -pthread, ibm-clang++ -pthread, ibm-openxl-fortran -pthread.
For MMX/SSE/AVX code, use the Veclib scripts (-Ipath-to/veclib/include) to convert MMX/SSE/AVX functions to VSX and compile with the -qaltivec option.
Open XL compiler options: -O3 -mcpu=power11 -mtune=power11
OpenMP option: -fopenmp
Signed char: -fsigned-char

Debugging and profiling tips

Profiling and debugging are optional steps for investigating performance issues. Some applications may need performance analysis and tuning to achieve maximum performance on Linux on Power. Profiling helps to identify the performance bottlenecks, which can then be addressed to boost performance on Power.

Profiling application with gprof

To use gprof, compile and link the program with the -pg option and then run the program to generate the output file, gmon.out. Use gprof to analyze the profile data. There are two outputs for analysis: the flat profile and the call graph. The flat profile shows how much time is spent in each function and how many times each function is called, while the call graph shows the calling relationships between the functions in the application, along with their timing information.

Profiling application with perf

To profile an application with the perf tool, compile and run your program normally, use the perf record command to capture performance data, and use the perf report command to analyze hotspots. This workflow gives you detailed insight into CPU usage, call graphs, and bottlenecks. It shows you where your program spends time, which functions are bottlenecks, and whether CPU and memory usage is efficient.

Application debugging tips

The Valgrind debugger helps developers detect memory errors, threading bugs, and performance issues in programs. Valgrind is simple to use because you do not need to modify, recompile, or relink your program. It can debug large programs and works with almost any kind of software written in any language. It also works with the gdb debugger. For more information, refer to The Valgrind Quick Start Guide.

Application runtime tuning tips

The goal of runtime tuning is to maximize system resource usage to improve application throughput. Some applications have been shown to run up to 10 times faster after tuning for Power.

Tuning with system tools

Using the taskset and numactl utilities to bind processor and memory affinity can significantly improve the performance of multithreaded programs. The htop or top system command can be used to monitor processor affinity. Other commands, such as ppc64_cpu, ulimit, cpupower, nmon, vmstat, mpstat, iostat, and netstat, can be used to monitor the application execution.

Increase CPU/GPU usage

Many applications are thread-enabled or implemented with IBM Spectrum MPI 7.3, OpenMPI 5.x, and MPICH 5.x. These applications can be tuned with proper system or application parameters, such as number of threads, message or packet sizes, temp execution folder, and batch size to increase CPU or GPU usage.

Increase throughput

Some applications do not scale well beyond a certain number of threads, use only a single thread, or run poorly because of an I/O or memory bottleneck. These can be tuned by running the optimal number of threads or by preloading a large amount of data into memory to improve throughput. For instance, a GATK application pipeline script can be modified to partition input data sets into multiple intervals to run on a 40-core Power9 processor, improving the run time from 85 hours to 6 hours.
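
The interval partitioning mentioned above can be sketched generically as follows. This is not the GATK pipeline script itself, only an illustration of splitting a range of work items into near-equal contiguous chunks so that each chunk can be handed to a separate worker. The work_interval type and the partition_interval function are hypothetical names.

```c
#include <stddef.h>

/* Half-open range [begin, end) of work items assigned to one worker. */
typedef struct {
    size_t begin;
    size_t end;
} work_interval;

/* Hypothetical helper: return chunk number `part` (0-based) out of `nparts`
 * near-equal contiguous chunks covering `total` items. The first
 * (total % nparts) chunks receive one extra item.
 * Requires nparts > 0 and part < nparts. */
static inline work_interval partition_interval(size_t total, size_t nparts, size_t part)
{
    size_t base = total / nparts;
    size_t extra = total % nparts;
    work_interval iv;
    iv.begin = part * base + (part < extra ? part : extra);
    iv.end = iv.begin + base + (part < extra ? 1 : 0);
    return iv;
}
```

For example, 10 items split four ways yields the ranges [0, 3), [3, 6), [6, 8), and [8, 10). The number of chunks would typically be chosen by experiment, since the best value depends on the core count, the SMT mode, and the I/O behavior of the workload.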

Python wheel central repository - Optimized Python wheels for Power

To install an IBM Power (ppc64le) specific prebuilt wheel from the IBM Power DevPI repository, use the command:

pip install --prefer-binary <package-name> \
  --extra-index-url=https://wheels.developerfirst.ibm.com/ppc64le/linux

The available wheels support IBM Power9, Power10 and Power11, and Python versions 3.10 - 3.13. For more information, check this readme file.

Pin the wheel sources in the dependency management files (such as requirements.txt and lock files) to ensure reproducible builds.

Verify that the wheels are compatible with the correct Python Application Binary Interface (ABI) tags (for example, cp313, cp314). Mismatched ABIs can cause runtime errors on target systems.

Some Linux distributions use musl instead of glibc. Ensure wheels are built and tested against musl when required and avoid mixing the musl and glibc environments.

Summary

It is possible to make full use of the strengths of the IBM Power platform by porting, tuning, and optimizing applications. This is a general tuning and optimization guide with tips for users to enable their applications on Power servers. This tutorial provided the resources and techniques necessary for supporting and creating accelerated solutions on Linux on Power.
