Introduction

This article is intended to document the “state of the art” for performance tuning, specifically targeted at applications running on Linux on IBM® Power Systems™.  This document is a collection of wisdom from experience, but is not meant to be exhaustive. It is a living document, with content to be added as new information is discovered. Not all sections are complete, nor even guaranteed to be accurate. Feedback on the Forums is welcome and encouraged.

The document is organized into three main sections, treating the application successively as a black box (no changes at all), a gray box (recompiling and relinking only), and a white box (source code changes).

Review the following documents for additional performance tips:

Application (black box)

Use optimal SMT mode

# ppc64_cpu --smt=<n>

Where n is the number of threads per core (for POWER7®:  1, 2, 4; for POWER8®: 1, 2, 4, 8)

  • SMT=1:  1 hardware thread per core; provides the most processor resources per hardware thread; thus, best performance per thread at the cost of fewer threads
  • SMT=2:  2 hardware threads per core; better throughput than SMT=1 at the cost of some individual thread performance
  • SMT=4:  4 hardware threads per core; better throughput than SMT=2 at the cost of some individual thread performance
  • SMT=8:  8 hardware threads per core; better throughput than SMT=4 at the cost of some individual thread performance (available on POWER8 only)

Note that the throughput gain from running more threads per core is offset by higher contention for shared processor resources, and is highly dependent on workload characteristics.  (It may be, for example, that SMT=8 does not perform significantly better than SMT=4.)

ppc64_cpu is a simple command that changes the SMT mode for all cores at once. The SMT mode can also be changed for each individual core, as described below.

Note that on Power, all hardware threads always “exist”, but are either online or offline, depending on the SMT mode. The “primary” (always on) hardware thread of each core is numbered n, where n is a multiple of the maximum number of hardware threads per core (the maximum threads per core on POWER7 is four, so primary threads are 0, 4, 8, and so on; the maximum threads per core on POWER8 is eight, so primary threads are 0, 8, 16, and so on). SMT=2 additionally enables the next thread (n+1) on each core, SMT=4 additionally enables the next two threads (n+2 and n+3) on each core, and SMT=8 enables all eight threads on each core (SMT=8 is available with POWER8 only).

For example, on a POWER7 system, to put all cores in SMT=4 mode, and then put the first core in SMT=1 mode:

# ppc64_cpu --smt=4
# echo 0 > /sys/devices/system/cpu/cpu3/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu1/online

Processor binding

The Linux scheduler attempts to locate processes and their memory in close proximity on the system, but also takes into account other factors, which may or may not be optimal for the critical workload.  If for example, the scheduler determines that a core is too heavily loaded when a task wakes up, it may choose to migrate that task to a new core.  This can have deleterious effects due to the sub-optimal location in the memory hierarchy of the task’s data in the context of running on the new core(s).

It can be advantageous to give the scheduler further guidance by binding tasks to a particular core or set of cores.

To launch a new process bound to the set of all cores in NUMA node 0:

# numactl --cpunodebind=0 <command>

To launch a new process bound to just core 0:

# numactl --physcpubind=0 <command>

Here, taskset may also be used:

# taskset -c 0 <command>

taskset may also be used to change the binding of a running process:

# taskset -p -c 0 <pid>

Note, however, that a task which is already running will have already allocated some memory, so binding a task after it has started may result in a sub-optimal NUMA layout: the task executes on the newly bound core(s), but with memory located elsewhere.
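To avoid that split, both the CPU binding and the memory policy can be set at launch; a minimal sketch, assuming NUMA node 0 has both CPUs and local memory:

# numactl --cpunodebind=0 --membind=0 <command>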

cpusets may also be created as “buckets” of cores on which tasks can be scheduled, and from which new tasks can be excluded.  The cset command is one way to do this (although the project seems to have gone inactive and it’s not clear if it works with modern distributions).  Another, more manual means is by direct interaction with the cgroups filesystem.

To set up a cgroup, “mycgroup”, with a subset of the available CPUs and all memory:

# mount -t tmpfs tmpfs /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpuset
# mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
# cd /sys/fs/cgroup/cpuset
# mkdir mycgroup
# cd mycgroup
# echo 8-95 > cpuset.cpus
# cat /sys/fs/cgroup/cpuset/cpuset.mems > cpuset.mems

To move a task with process ID <pid> into the “mycgroup” cgroup:

# echo <pid> >> /sys/fs/cgroup/cpuset/mycgroup/tasks

Interrupt binding

It can be a good idea to bind the thread that handles I/O to the same core to which the interrupt is delivered. Binding can also help avoid a thread being migrated away from its cached data.

First, turn off the IRQ balancing daemon (irqbalance):

# systemctl stop irqbalance
# systemctl disable irqbalance

Bind selected interrupt to cores.  The first example below uses a hexadecimal bitmask.  The second uses a simple comma-separated list.  Note the different names of each target file.  Both examples are equivalent:

# echo 0xf > /proc/irq/<IRQ>/smp_affinity
# echo 0,1,2,3 > /proc/irq/<IRQ>/smp_affinity_list

Fine-tune irqbalance (Reference: SPECjEnterprise2010 – A performance case study):

If you need the irqbalance service to continue to balance the IRQs that you don’t pin, then you can configure irqbalance not to change the CPU pinnings for IRQs you pinned. In the /etc/sysconfig/irqbalance file, set the IRQBALANCE_ARGS parameter to ban irqbalance from changing the CPU pinnings for your IRQs.

IRQBALANCE_ARGS="--banirq=34 --banirq=35"

You must restart the irqbalance service for the changes to take effect:

# systemctl restart irqbalance

Also, look at IRQBALANCE_BANNED_CPUS to shield a list of CPUs from being chosen for IRQ handling by irqbalance. For example, to prevent CPUs 8-95 from being selected by irqbalance:

IRQBALANCE_BANNED_CPUS=ffffffff,ffffffff,ffffff00

DSCR

DSCR (Data Stream Control Register) controls how far ahead the processor will attempt to pre-fetch when it detects sequential loads.  In a workload with sequential access patterns, even if there is a non-contiguous stride, this can have significant performance benefits by avoiding cache misses.

# ppc64_cpu --dscr=<n>

Refer to the Power ISA for appropriate values of n (0: default prefetching enabled; 1: prefetching disabled)

Untangling memory access measurements is a good article that explains the meaning and impact of the DSCR.

Huge pages

Modern Linux kernels use a default page size of 64 KB on Power.  This is appropriate for most workloads.  Workloads with a large working set of memory may benefit from a larger page size, by avoiding page faults and TLB thrashing.  Power kernels also offer 16 MB pages.

# hugeadm --create-global-mounts
# hugeadm --pool-pages-min=DEFAULT:<size>

To reserve 100 huge pages:

# hugeadm --pool-pages-min=DEFAULT:100
# hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
  16777216      100      100      100        *
17179869184        0        0        0     
# grep ^Huge /proc/meminfo
HugePages_Total:     100
HugePages_Free:      100
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

To use at least some of those 100 huge pages for the dynamic memory area (heap), here using the “sleep” process as a simple example:

# hugectl --heap /bin/sleep 1000 &
[1] 59156
# grep ^Huge /proc/meminfo
HugePages_Total:     100
HugePages_Free:       99
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:      16384 kB

Note that HugePages_Free has dropped to 99.

The behavior of libhugetlbfs is controlled by environment variables, which you could set yourself instead of using the hugectl command.  To see what hugectl is actually doing:

# hugectl --heap --dry-run /bin/sleep 1000
HUGETLB_VERBOSE='2'
LD_LIBRARY_PATH='/usr/lib:/usr/lib64:'
HUGETLB_MORECORE='yes'

It is also possible to have shared memory segments use huge pages:

# hugectl --shm ./shmtest

The hugectl command nicely turns up verbosity, so failures, which result in falling back to using the default pages, are reported:

# hugectl --shm ./shmtest
libhugetlbfs: WARNING: While overriding shmget(16777216) to add SHM_HUGETLB: 
Cannot allocate memory
libhugetlbfs: WARNING: Using small pages for shmget despite HUGETLB_SHM

See the Application (gray box) section for more information on relinking with libhugetlbfs to use huge pages with the program’s text, data, and bss sections.

The Service and Productivity Tools for Linux on Power Servers includes a “Large Page Analysis” tool which can be used to analyze a performance scenario and recommend huge page configuration.

tcmalloc

Refer to this page: http://code.google.com/p/gperftools

According to this reference document, http://gperftools.googlecode.com/svn/trunk/doc/tcmalloc.html:

  • TCMalloc is faster than the glibc 2.3 malloc
  • TCMalloc reduces lock contention for multi-threaded programs
  • TCMalloc has space-efficient representation for small objects

There are two ways to make use of TCMalloc:

  • Relink the application/libraries.
  • Use LD_PRELOAD at runtime. Caveat: LD_PRELOAD is tricky, and we don’t necessarily recommend this mode of usage.
    LD_PRELOAD=/<path to>/libtcmalloc.so
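For example, to run an otherwise unmodified program with TCMalloc preloaded for just that invocation (the library path shown is an assumption; it varies by distribution and install location):

# LD_PRELOAD=/usr/lib64/libtcmalloc.so <command>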


Application (gray box)

No code changes, just compiling and linking.

Compiler options

  • -O3 (full optimization)
  • -mcpu=native
    Use the full instruction set of “this” machine, with the side-effect of being non-portable to prior chip architectures. A binary built with -mcpu=native on POWER8, for example, is not guaranteed to run on POWER7.
  • -funroll-loops (especially for small, tight loops)
  • -fpeel-loops

The IBM Knowledge Center has a list of additional compiler options.
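As a simple illustration (the source and program names are hypothetical), a build using the options above might look like:

gcc -O3 -mcpu=native -funroll-loops -fpeel-loops -o myapp myapp.c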

Link-Time Optimization (LTO)

  • Use -flto for both compilation and linking
  • Especially good for C++ templates
  • Can convert virtual functions to static or inline usage
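A minimal sketch with GCC (file names hypothetical); note that -flto is passed both when compiling each unit and again at the final link:

gcc -O3 -flto -c a.c
gcc -O3 -flto -c b.c
gcc -O3 -flto -o myapp a.o b.o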

Feedback-Directed Optimization (FDO)

  • Instrument linked code
  • Run critical workload, producing a profile
  • Relink using the profile for workload-specific inter-compilation unit optimizations
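With GCC, this is typically done with -fprofile-generate and -fprofile-use; a minimal sketch (program and workload names hypothetical):

gcc -O3 -fprofile-generate -o myapp myapp.c
./myapp critical-workload-input        # produces profile (.gcda) data
gcc -O3 -fprofile-use -o myapp myapp.c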

Feedback-Directed Program Restructuring (FDPR)

  • Instrument linked code
  • Run critical workload
  • Automatically and directly apply optimizations to the binary (applications and libraries)
  • Potentially: iterate with more extreme options
  • Analyze the journal (--journal) to effect suggested code changes

PATH=/opt/ibm/fdprpro/bin:${PATH} LD_LIBRARY_PATH=/opt/ibm/fdprpro/lib fdpr --instrument --train ./train.sh --optimize <program>

The code sample above creates an FDPR-optimized version of <program> as <program>.fdpr.  Note that you have to create the “training” script, shown above as “train.sh”, which accepts the program name as the first and only argument, and will run a suitable “benchmark” to be profiled and for which to apply the optimizations.  If the <program> is the benchmark and takes no arguments, the training script could simply be:

#!/bin/sh
"$1"

IBM XL compilers

Huge pages

Huge pages can be further exploited to hold the text, data, and bss sections of a running program by linking the program with libhugetlbfs.

See this HOWTO on GitHub for more information: libhugetlbfs/HOWTO
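As a hedged sketch only (the exact linker options and environment variables are described in the HOWTO and may differ by libhugetlbfs version; treat the flags below as assumptions to verify against that document), the application is linked through the libhugetlbfs linker support and then run with remapping requested via the environment:

gcc -O3 -B /usr/share/libhugetlbfs -Wl,--hugetlbfs-align -o myapp myapp.c
HUGETLB_ELFMAP=RW ./myapp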

tcmalloc

See the tcmalloc section above; it is referenced here as well because one way to use TCMalloc is to relink the application.

Application (white box)

Optimization and tuning guides

Review the following optimization and tuning guides for the latest best practices:

IBM SDK for Linux on Power

IBM provides a free software development kit (SDK) expressly for working with Linux applications on Power.  The IBM SDK for Linux on Power provides a large set of useful tools, including:

  • Migration Advisor
  • Source Code Analyzer
  • CPI Breakdown Tool
  • IBM Advance Toolchain

See the IBM SDK for Linux on Power tutorial for additional information.

Shared Persistent Heap Data Environment (SPHDE)

The Shared Persistent Heap Data Environment (SPHDE) library has many neat uses, including a very low-overhead timestamp:

#include <stdio.h>
#include <stdlib.h>
#include <sphde/sphtimer.h>

int main(int argc, char *argv[]) {
  sphtimer_t before, after;
  sphtimer_t timer_freq;
  double seconds;
  unsigned long i, iters;

  iters = (argc > 1) ? strtoul(argv[1], 0, 10) : 1000000UL;

  timer_freq = sphfastcpufreq();   /* timebase frequency, in ticks per second */
  before = sphgettimer();          /* read the timebase */
  for (i = 0; i < iters; i++) {
    after = sphgettimer();
  }
  seconds = (double) (after - before) / (double) timer_freq;
  printf("sec spent: %lf\n", seconds);
  return 0;
}

Use threads

Power Systems have a lot of bandwidth to exploit, so effective multithreading can make better use of the available resources.

Vector up

For compute-intensive tasks, there may be opportunities for significant performance improvements by changing code to incorporate vector processing. Vector support in the Power processor family has been known as “AltiVec” (PowerPC), “VMX”, and “VSX” (Power).  VSX is the most modern incarnation.  The VSX instructions are documented in the Power ISA [link]. Both the GCC and IBM XL compilers support VSX through compiler built-in primitives and semantics (GCC, XL C).

For an example, consider finding the maximum value in an array.  An easy way to do this in C is a simple loop which scans the array one element at a time:

max = array[0];
for (i = 1; i < len; i++) {
  if (max < array[i]) max = array[i];
}

This code is obviously compact and easy to understand, but does not fully exploit the capabilities of the hardware.

POWER8 provides 64 128-bit vector-scalar (VSX) registers, each of which can be treated as a vector of four 32-bit single-precision floating-point values, among other data types. This example glosses over many details and capabilities of the hardware.

GCC provides new types for vectors of various native types, including “__vector float”, which is effectively an array of four floats.

Programming using vector types is a bit more complicated, but the performance benefits can be very significant.  If, for example, you had a fixed length array of eight floats, you could load two “vector float” variables, perform a “vec_max” to do an element-by-element comparison in a single instruction, then pick the overall winner using direct comparisons afterwards:

#include <altivec.h>   /* __vector types and vec_max() */

float *data;           /* assumed to point to 8 floats, 16-byte aligned */
float m0, m1;
__vector float v0, v1;

/* Load two vectors of four floats each and reduce them to one vector. */
v0 = *(__vector float *)(&data[0]);
v1 = *(__vector float *)(&data[4]);
v0 = vec_max(v0, v1);

/* Reduce the surviving vector to a single scalar maximum. */
m0 = v0[0] > v0[1] ? v0[0] : v0[1];
m1 = v0[2] > v0[3] ? v0[2] : v0[3];

m0 = m0 > m1 ? m0 : m1;
In this simple example, instead of a loop which would perform seven individual comparisons plus associated conditional to save a new local maximum, the vector code does one vector comparison, three individual comparisons, and three conditionals.  The improvement is more significant with larger arrays.  A size-16 array would do 15 comparisons and 15 conditionals non-vectorized, but only three vector comparisons, three individual comparisons, and three conditionals.

Note that it is very important for the data to be aligned properly.  For the 128-bit (16 byte) vectors, the data should be aligned on 16 byte boundaries.

If you are using the vec_ld intrinsics, consider the vec_vsx_ld intrinsics instead, which have relaxed alignment requirements and allow the use of all 64 VSX registers.
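A minimal sketch of the same reduction using vec_vsx_ld (GCC; include <altivec.h> and compile with VSX enabled, for example -mcpu=power8). The function name and the assumption that data holds at least eight floats are illustrative:

#include <altivec.h>

float max8(const float *data)
{
    /* vec_vsx_ld takes a byte offset plus a base pointer. */
    __vector float v0 = vec_vsx_ld(0, data);
    __vector float v1 = vec_vsx_ld(16, data);   /* the next four floats */

    v0 = vec_max(v0, v1);                       /* element-wise maximum */

    float m0 = v0[0] > v0[1] ? v0[0] : v0[1];
    float m1 = v0[2] > v0[3] ? v0[2] : v0[3];
    return m0 > m1 ? m0 : m1;
}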

Mathematical Acceleration SubSystem (MASS)

MASS is a set of math libraries supporting a common set of elementary functions with single- and double-precision interfaces for scalar, vector, and SIMD arguments. These libraries provide significant performance improvements over the equivalent standard math library (libm) functions, often 5-10x or better for many functions, at the expense of a very slight reduction in accuracy. Versions of the SIMD/vector libraries are available for specific Power architecture levels, including POWER8.

Functions include div, log, exp, pow, and many, many more, with vector versions as well.

For a list of functions, performance and accuracy information, see http://www-01.ibm.com/software/awdtools/mass/aix/mass-aix.html, which is ostensibly for AIX, but should apply identically for Linux.

Avoid I/O

  • Avoid any I/O on the critical path, if at all possible.
  • Delay any I/O until after the critical path completes, if possible.
  • Push any I/O to a different lower-priority thread.
  • Pre-compute and pre-format as much of the I/O data as possible before the critical path.
    Instead of:

    <begin critical section>
    buffer = malloc(21);
    sprintf(buffer, "<BEGIN>%8x<END>",data);
    <end critical section>

    Do this:

    buffer = malloc(21);
    memcpy(buffer, "<BEGIN>        <END>", 21);  /* 8 spaces reserved for the value */
    <begin critical section>
    sprintf(&buffer[7], "%8x", data);  /* sprintf also writes a trailing NUL over the '<' ... */
    buffer[15] = '<';                  /* ... so restore it */
    <end critical section>
  • Use async I/O.  Submit the I/O to the kernel for scheduling, but return from the kernel immediately and continue processing on the critical path as soon as possible.  Note that async I/O uses unbuffered file descriptors (not FILE pointers), so adopting it may be slightly more complex than a plug-and-replace, depending on your current implementation.  Here’s what looks like a decent reference: http://www.ibm.com/developerworks/library/l-async/
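A minimal POSIX AIO sketch (link with -lrt on older glibc; error handling omitted; the function name and buffer handling are illustrative):

#include <aio.h>
#include <string.h>

static struct aiocb cb;   /* must stay valid until the I/O completes */

/* Queue a write and return immediately; poll later with aio_error(&cb). */
int submit_async_write(int fd, void *buf, size_t len)
{
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    return aio_write(&cb);
}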

Minimize and avoid locking

  • Locking is expensive, especially if there is contention forcing actual wait time.
  • Make the critical sections for locks as small as possible; hold a lock for as short a duration as possible.
  • Consider lockless algorithms (user-space RCU).
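A minimal read-side sketch using the userspace RCU library, liburcu (link with -lurcu; readers must have called rcu_register_thread(); the config structure and pointer are illustrative, and the writer side is assumed to update the pointer with rcu_assign_pointer() followed by synchronize_rcu() before freeing the old copy):

#include <urcu.h>

struct config { int value; };
struct config *current_cfg;   /* shared pointer, illustrative */

int read_config_value(void)
{
    int v = 0;
    rcu_read_lock();                                /* very cheap on the read side */
    struct config *c = rcu_dereference(current_cfg);
    if (c)
        v = c->value;
    rcu_read_unlock();
    return v;
}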

Consider Transactional Memory

Transactional Memory (TM, or HTM for “hardware transactional memory”) is a means of telling the processor that certain sections of memory (larger than the intrinsic data types like bytes, shorts, ints, longs, and doubles) should be treated atomically: updates to these areas within critical sections either succeed entirely, or they fail.  Transactional memory can be used instead of heavyweight locking primitives like mutexes, to great performance advantage.

There is a nice example in Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8.
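With GCC on POWER8 (compile with -mhtm), hardware transactions are available through built-ins; a minimal sketch with a lock-based fallback path (the counter and fallback mutex are illustrative, and a production version would also check the fallback lock inside the transaction):

#include <pthread.h>

static long counter;
static pthread_mutex_t fallback = PTHREAD_MUTEX_INITIALIZER;

void increment(void)
{
    if (__builtin_tbegin(0)) {            /* transaction started */
        counter++;                        /* speculative update */
        __builtin_tend(0);                /* commit */
    } else {                              /* transaction failed or aborted */
        pthread_mutex_lock(&fallback);
        counter++;
        pthread_mutex_unlock(&fallback);
    }
}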

Reduce lock contention

Global locks under high contention: consider multi-level locks (chip, node, system).

Avoid context switches

  • Avoid interprocess communication when multiple threads or a direct call can be used instead.
  • Avoid system calls.
  • POWER has a very large register set, which is expensive to save/restore on context switches.

Branch / Switch / If-then-else

  • Small switch/case statements may perform better as a conditional branch tree than as an indirect branch table (-fno-jump-tables).
  • Replace moderately sized switch statements with binary tree if-then-else.
recvmmsg

recvmmsg (which is distinct from recvmsg) can receive multiple messages with one syscall.
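A minimal sketch (a bound UDP socket is assumed; the batch size and buffer sizes are illustrative):

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

#define VLEN  16
#define BUFSZ 1500

/* Drain up to VLEN datagrams from sockfd with a single system call;
   returns the number of messages received, or -1 on error. */
int drain_socket(int sockfd)
{
    static char bufs[VLEN][BUFSZ];
    struct iovec iov[VLEN];
    struct mmsghdr msgs[VLEN];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = BUFSZ;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return recvmmsg(sockfd, msgs, VLEN, 0, NULL);
}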

Align data

  • Page-align moderately sized memory allocations
  • Cache-align fairly small, frequently accessed memory
  • Watch for loads which are not properly aligned (halfwords, words, doublewords)
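For example, posix_memalign() can request a specific alignment (128 bytes matches the POWER7/POWER8 cache line size; the helper name is illustrative):

#include <stdlib.h>

/* Return a buffer aligned to a 128-byte cache line, or NULL on failure. */
void *alloc_cacheline_aligned(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, 128, size) != 0)
        return NULL;
    return p;
}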

Avoid cache loads

  • In a situation where one is completely overwriting data, it can be faster to just zero the cache line first and then write, rather than have the cache line loaded and then overwritten
  • dcbz instruction
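A hedged sketch of issuing dcbz from C via inline assembly (it assumes p is aligned to the start of a 128-byte cache line and that the caller will overwrite the entire line):

/* Zero the cache block containing p without first fetching it from memory. */
static inline void cacheline_zero(void *p)
{
    __asm__ volatile ("dcbz 0,%0" : : "r" (p) : "memory");
}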

Explicit prefetching

  • dcbt, dcbtst instructions
  • You might need to turn off hardware prefetching (ppc64_cpu --dscr=1)
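From C, GCC’s __builtin_prefetch() emits these instructions without hand-written assembly; a minimal sketch (the stride of 8 elements is illustrative and should be tuned):

/* Sum an array while prefetching ahead of the current element. */
double sum_with_prefetch(const double *a, long n)
{
    double sum = 0.0;
    for (long i = 0; i < n; i++) {
        __builtin_prefetch(&a[i + 8], 0, 1);   /* 0 = read; 1 = low temporal locality */
        sum += a[i];
    }
    return sum;
}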

Use fast interprocess communications

  • Shared memory
  • Consider threads instead
  • Not UNIX domain sockets, definitely not internet domain sockets (TCP, UDP)
  • See SPHDE

Use efficient timekeeping

  • To measure time reliably across processors, always use CLOCK_MONOTONIC:
    clock_gettime(CLOCK_MONOTONIC, &start);
  • However, clock_gettime() currently requires a syscall. You could use gettimeofday instead.  The optimized version of gettimeofday uses the VDSO (virtual dynamic shared object) facility to avoid doing an actual syscall and the requisite context switches into and out of the kernel.
  • Consider instead using sphgettimer; see SPHDE (sphgettimer).
  • Postpone pretty-formatting; for low-overhead logging and deferred timestamp processing, see SPHDE.

SMT priority in spin loops

Program Priority Register (PPR) support was added in POWER7 (Power ISA v2.06, Book II, Chapter 3). It is a way to control the priority of each SMT thread relative to the other threads in the same core. For POWER7/ISA 2.06, three settings are available to user programs: low (or r1,r1,r1), medium low (or r6,r6,r6), and medium (or r2,r2,r2). POWER8/ISA 2.07 added two more: very low (or r31,r31,r31) and medium high (or r5,r5,r5). These instructions are treated as no-ops on older chips. Note that until Linux kernel 3.9, the PPR was not saved and restored during interrupts, so on kernels before 3.9 the PPR is reset to the default value (medium) on any kernel interruption (syscall, IRQ, and so on).

In GNU C Library (glibc) 2.18 and later, the header <sys/platform/ppc.h> provides functions for setting the PPR:

#include <sys/platform/ppc.h>
void f() {
  __ppc_set_ppr_med();
  __ppc_set_ppr_med_low();
  __ppc_set_ppr_low();
#ifdef _ARCH_PWR8
  __ppc_set_ppr_very_low();
  __ppc_set_ppr_med_high();
#endif
}

Note: Be careful to reset the PPR to the default value (medium) when resuming normal processing, or severe performance degradation is possible.

C++ templates

Consider using http://stdcxx.apache.org, which can be faster in some scenarios.

Complex C++ template code may exceed the default inline limit, which in turn limits inter-procedural analysis and constant propagation. Raising this limit can improve the code generated for such templated classes: --param max-inline-insns-auto=200 (or 400).

C++ strings

Vstring is the “versatile string class” provided by GCC’s libstdc++ (starting in GCC 4.1).  Read more about it in the GCC 4.1 release notes.  Vstring can help by avoiding reference counting and is optimized for small strings.

Replace mutexes with spinlocks

Consider replacing pthread_mutex instances with pthread_spinlock.  Spinlocks avoid the system call overhead that a pthread_mutex can incur under contention and are faster to lock and unlock, at the potential cost of many more CPU cycles spent spinning idly while a lock is contended.  It is generally unwise to hold a spinlock for a relatively long period.  It is also possible to experience priority inversion.
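A minimal sketch of the pthread spinlock API (error checking omitted; the counter being protected is illustrative):

#include <pthread.h>

static pthread_spinlock_t slock;
static long counter;

void counter_init(void) { pthread_spin_init(&slock, PTHREAD_PROCESS_PRIVATE); }

void counter_bump(void)
{
    pthread_spin_lock(&slock);     /* spins (consuming CPU) while contended */
    counter++;                     /* keep the critical section very short */
    pthread_spin_unlock(&slock);
}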

Use synchronization instructions properly

[Describe the proper use of sync, lwsync, hwsync, isync, eieio, etc. here, including use of compiler built-ins.]
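In the meantime, one hedged starting point: in C, the GCC __atomic built-ins let you request the ordering you need, and on Power the compiler typically emits lwsync (or sync, for sequential consistency) to implement it; the data/flag pair below is illustrative:

static int shared_data;
static int ready;

/* Producer: publish shared_data, then set the flag with release ordering. */
void publish(int value)
{
    shared_data = value;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);   /* barrier before the store */
}

/* Consumer: wait for the flag with acquire ordering before reading the data. */
int consume(void)
{
    while (!__atomic_load_n(&ready, __ATOMIC_ACQUIRE))
        ;                                            /* barrier after the load */
    return shared_data;
}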

Common application performance issues

  • Load-hit-store (avoid or postpone the store, or move it away from the subsequent load; even nops help)
  • isel latency
  • Branch misprediction (use __builtin_expect; see the sketch after this list)
  • SIMD alignment issues
  • Taken-branch impact
  • Function unit imbalance
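For the branch misprediction item above, __builtin_expect() lets the programmer mark the expected direction of a branch so that the compiler lays out the likely path as straight-line code; a minimal sketch (the error handler is hypothetical):

/* rc is expected to be 0 almost always; mark the error path as unlikely. */
if (__builtin_expect(rc != 0, 0)) {
    handle_error(rc);   /* hypothetical error handler */
}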

Analysis tools and techniques

  • perf / oprofile / operf / ocount
  • ftrace
  • strace / ltrace
  • instrumentation
  • SDK
  • CPI Breakdown Tool
