Introduction

When considering a port of an application from Linux on x86 to Linux on Power, it is often said that “Linux is Linux” and that the vast majority of code just needs to be recompiled and relinked. That is indeed true for most code. However, it is unfortunately easy to write code that compiles, links, and runs fine on x86 but does not work on POWER processors. The usual culprits are:

  • Code that has been heavily optimized for x86
  • Code that was never expected to run on non-x86 platforms (and thus written without consideration for portability)
  • Code that has only ever run on x86 (and thus into which x86 “bias” has unintentionally crept)

Problems arise due to any combination of these contexts. In such cases, it is immensely valuable to know the significant and subtle differences between Linux on x86 and Linux on Power, approaches for mitigation, and tools that can help to identify and mitigate issues.

Note that some issues that are identifiable by code inspection can be automatically identified with the Migration Advisor in the IBM Software Development Toolkit for Linux on Power (SDK); running it should be considered a very early step in a porting effort. Subtle performance issues that can be detected at run time can be identified with the SDK's Source Code Advisor. Even subtler performance issues can be investigated with the SDK's CPI Breakdown tool, which identifies areas in the code where the processor is not being used efficiently due to various types of hazards or resource conflicts. More tools in this genre can be expected to appear in the Open Power SDK GitHub organization over time.

Cache line size

                           x86     POWER
Cache line size (bytes)    64      128

The speed of main memory has not kept up with the increase in processor performance. As a result, processor designers incorporate some memory, called caches, that is of much higher speed (lower latency) but also much, much smaller than main memory. There are usually several levels of caches arranged in a hierarchy. A level 1 (L1) cache is the fastest (lowest latency) and smallest. Subsequent levels (L2, L3, and so on) have increasingly higher latencies, but are increasingly larger in size.

Figure 1. Illustration of a cache hierarchy

Note in this example (Figure 1) that simultaneous multithreading (SMT) threads within a core share an L1 cache, and cores within a processor share an L2 cache.

The smallest unit of memory fetched by the processor is called a cache line, because (for the purposes of this discussion) all fetched memory goes through and is stored in all levels of the cache hierarchy. There are also protocols for moving data among the levels of the cache hierarchy, as well as between caches dedicated to one core and caches dedicated to another. The processors must ensure that the view of data in memory is the same for all cores on the system.

The cache line size on x86 processors is 64 bytes; on POWER processors it is 128 bytes. This difference in cache line size will not have an impact on the correctness of the program, but can have a significant impact on the performance of the program in some scenarios.

As previously mentioned, there are protocols for moving data among non-shared caches to ensure a consistent view of memory from all processors on the system. These are called cache coherency protocols. If any data in a cache line changes on one core, and another core attempts to access that data, the entire cache line must be copied to that second core’s cache hierarchy.

Performance issues can occur when two or more cores contend with each other for data in a single cache line, even if the respective ranges of memory do not overlap. For example, imagine an array of 16 64-bit (8 byte) integers, aligned on a cache line boundary. A task running on one core frequently accesses or modifies the 8th integer, and a task running on another core frequently accesses or modifies the 9th integer. There is ostensibly no conflict or contention for those two non-overlapping data locations. Further, on a system with 64-byte cache lines, the 8th integer is at the end of the first cache line, and the 9th integer is at the beginning of the second cache line, so there is no contention between the cores for their respective data. However, on a system with 128-byte cache lines, both the 8th integer and the 9th integer will reside in the same cache line. The cache coherency protocols will need to ensure that the view of this memory will be consistent between the two cores. So, every modification on one core must be reflected on the other core by copying the cache line to the cache hierarchy for the other core before its next access. This can add significant latency to those memory accesses, even though there is no explicit contention for the data.
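
As a rough sketch of this scenario, the following program (names and iteration counts are purely illustrative) has two threads repeatedly update the 8th and 9th integers of a cache-line-aligned array. With 64-byte cache lines the two counters land on separate lines; with 128-byte cache lines they share one line, and each update can force that line to bounce between the cores' caches. Compile with -pthread.

#include <pthread.h>
#include <stdint.h>

// 16 64-bit integers, aligned to a 128-byte cache line boundary
static int64_t counters[16] __attribute__ ((aligned(128)));

static void *bump(void *arg) {
    int64_t *counter = arg;
    for (long i = 0; i < 100000000L; i++)
        __atomic_fetch_add(counter, 1, __ATOMIC_RELAXED);
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump, &counters[7]);  // 8th integer: end of the first 64-byte line
    pthread_create(&t2, NULL, bump, &counters[8]);  // 9th integer: start of the second 64-byte line
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}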

Eliminating these types of issues requires isolating frequently accessed, closely situated, but core-specific or reasonably independent data on separate cache lines. Candidates include arrays of mutexes, per-core counters, mutexes with adjacent data, and so on. One way to achieve static alignment is to use an attribute:


struct {
     int count __attribute__ ((aligned(128)));
} counts[N_CPUS];

On very modern kernels and glibc, there are programmatic means to determine the processor's cache line size. For example:


#include <sys/auxv.h>

unsigned long cache_line_size;
unsigned long cache_geometry = getauxval(AT_L2_CACHEGEOMETRY);
cache_line_size = cache_geometry & 0xFFFF; /* the line size is in the low 16 bits */

A simpler (and preferred) way is:


long cache_line_size;
cache_line_size = sysconf(_SC_LEVEL2_CACHE_LINESIZE);

Then, carefully allocate data such that each processor’s data is on its own (set of) cache lines. For example:


#define ROUND_UP(a,b) ((((a) + (b) - 1) / (b)) * (b))

// calculate the size of each counter (int) when each is aligned to a cache line
unsigned long stride = ROUND_UP(sizeof(int), cache_line_size);

// get the number of CONFigured PROCESSORS
long cpus = sysconf(_SC_NPROCESSORS_CONF);

// allocate an array of counters, one per configured processor,
// such that each counter is on its own cache line
char *counters = calloc(cpus, stride);

long cpu = cpus - 1; // pick a CPU (the last one)

// increment the counter for CPU #<cpu>
(*(int *)(counters + cpu * stride))++;

Page size

                                      x86     POWER
Page size, default (in kilobytes)     4       64

Virtual memory management on most modern operating systems includes segmenting memory into chunks called pages. Each time the program accesses data in memory, the address of that data is mapped to a page of memory. This mapping is managed in a table called a page table, and each entry in the page table is (unsurprisingly) called a page table entry, or PTE. Mapping an address to a page is called translation. The speed of translation is obviously critical, so modern processors include assists for translation such as the translation lookaside buffer (TLB). The size of these buffers is limited, so it is advantageous to limit the number of pages in active use by a program. If a translation fails to be resolved within the TLB (a TLB miss), the page table must be accessed directly, which is considerably slower.

Having a larger page size can be advantageous by reducing the number of pages, and thus PTEs and TLB entries, required for the same amount of memory.
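
As a trivial illustration of that point (the 1 GB buffer size here is arbitrary), the following computes how many pages, and thus PTEs and potential TLB entries, are needed to map the same buffer under each page size:

#include <stdio.h>

int main(void) {
    unsigned long buffer_size = 1UL << 30;                            /* 1 GB */
    printf("4 KB pages:  %lu pages\n", buffer_size / (4UL * 1024));   /* 262144 */
    printf("64 KB pages: %lu pages\n", buffer_size / (64UL * 1024));  /*  16384 */
    return 0;
}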

x86 systems most often use memory pages of 4096 bytes (4 KB). POWER processor-based systems most often use memory pages of 65536 bytes (64 KB). In addition, on many modern systems, memory can be divided among groups of processors (nodes). A program running on one node and accessing memory on another node will wait much longer (higher latency) than for memory on the same node on which the program is running. This effect is called nonuniform memory access (NUMA). Modern kernels may try to migrate memory pages to the nodes from which they are most likely to be accessed, or to migrate tasks to the nodes whose memory they frequently access. This is called automatic NUMA balancing, which is heuristic and can be helpful or harmful depending on workload characteristics.

Thus, it can be advantageous to ensure that a program that makes use of multiple cores simultaneously does not have frequent memory accesses to the same page from different nodes. An obvious technique to prevent page sharing is to isolate the data used by independent processes or tasks on unique pages of memory. A program that wants to isolate data on a page must obviously know the size of a memory page. Many programs assume a page size of 4 KB (as on x86 systems) and align their data accordingly; the isolation they seek will not be realized on a POWER processor-based system with 64 KB pages.

There are programmatic means to determine the (default) memory page size being used by the operating system. For example:

unsigned long page_size = getauxval(AT_PAGESZ);

Or:

long page_size = sysconf(_SC_PAGE_SIZE);

Allocating page-aligned memory is simple. For example (no error checking is performed):


void *data;
size_t data_size = (size_of_data);
posix_memalign(&data, page_size, data_size);

Vector processing: Single-instruction, multiple-data (SIMD)

              x86                                   POWER
Technology    MMX, SSE, AVX                         VMX/Altivec, VSX
C includes    mmintrin.h (MMX)                      altivec.h
              xmmintrin.h (SSE)
              emmintrin.h (SSE2)
              pmmintrin.h (SSE3)
              tmmintrin.h (SSSE3)
              smmintrin.h (SSE4.1)
              nmmintrin.h (SSE4.2)
              immintrin.h (AVX, AVX2, …)
Types         __m64, __m128, __m256, __m512, …      vector signed char (or unsigned)
                                                    vector signed short (or unsigned)
                                                    vector signed int (or unsigned)
                                                    vector signed long long (or unsigned)
                                                    vector float, vector double, …
Intrinsics    _mm_* (_mm_add_ps, …)                 vec_* (vec_add, …)

Many modern processors have the capability of processing a set (vector) of data simultaneously. This can be used for significant performance advantage. Unfortunately, the low-level processor instructions and corresponding C/C++ APIs (compiler built-in functions) are incompatible. There are emerging approaches for adding compatible implementations of the x86 vector intrinsics for POWER. See Porting x86 vector intrinsics code to Linux on Power in a hurry.
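
As a minimal sketch of the incompatibility (the wrapper type and function names here are illustrative, not part of either API), the same four-float addition must be written against different headers, types, and intrinsics on the two platforms. One common mitigation is to hide the difference behind a small architecture-specific abstraction:

#if defined(__x86_64__) || defined(__i386__)
#include <xmmintrin.h>                 /* SSE */
typedef __m128 vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b) { return _mm_add_ps(a, b); }
#elif defined(__powerpc64__)
#include <altivec.h>                   /* VMX/Altivec; compile with -maltivec or an appropriate -mcpu */
typedef vector float vec4f;
static inline vec4f vec4f_add(vec4f a, vec4f b) { return vec_add(a, b); }
#endif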

Simultaneous multithreading

                              x86                            POWER
Threads per core (maximum)    2                              POWER7: 4
                                                             POWER8: 8
                                                             POWER9: 4 or 8
CPU enumeration               primary threads first:         by core:
                              {0,c} {1,c+1} … {c-1,2c-1}     {0,1,…,t-1} {t,t+1,…,2t-1} …
                              (c = number of cores)          (t = threads per core)

Many modern processors, in order to take best advantage of the large set of processor resources available, enable multiple threads of execution to run on a single core. That is, multiple programs (or threads within a program) can be running simultaneously on the same core. This is called simultaneous multithreading (SMT). One obvious advantage of this approach, as just stated, is that processor resources can be used more efficiently; there will be fewer components of a core sitting idle during processing. The disadvantages, not surprisingly, are that there can be contention for processor resources between these simultaneously active threads, or that the processor resources are partitioned among threads, so each thread has fewer resources at its disposal. The net performance impact thus depends on the workloads. In general, more threads enable higher throughput per core whereas fewer threads enable higher single-threaded performance and lower latency.

x86 systems can support up to two threads per core. IBM POWER7® supports up to four threads per core; IBM POWER8® supports up to eight threads per core; IBM POWER9™ supports up to four or eight threads per core, depending on the model.

Single-threaded workloads will work best with SMT disabled. This can often be done on x86 systems by changing BIOS settings. On POWER processor-based systems, you can put every core in the system into a single-threaded (ST, or SMT=off, or SMT=1) mode by running the following command:

# ppc64_cpu --smt=1

Because multi-threaded application performance with SMT varies significantly by workload, it is recommended to test a representative workload to determine optimal configuration. On POWER, you can similarly vary the SMT mode using the following command:

# ppc64_cpu --smt=n

where n can be 1, 2, or 4; POWER8 and some POWER9 processor-based systems also support 8.

For complex workloads, where it may be desired to vary the SMT mode on different cores, it is possible to disable individual threads. For example:

# echo 0 > /sys/devices/system/cpu/cpu0/online

The command above disables cpu0, the first thread (CPU) on the first core. (The difference between a CPU and a core, and how CPUs are enumerated, are described below.) Using echo 1 enables a CPU.

It is thus possible, for example, to:

  • Enable all threads on the first core.
  • Put cores 1 through 4 in the SMT=2 mode to allow a good balance between latency and throughput for a suitable workload (that would be bound to these CPUs).
  • Put the remaining cores in the SMT=1 mode for a low latency, high performance, single-threaded multi-core workload (that would be bound to these CPUs).

This author is not aware of any trivial APIs for determining the maximum SMT mode or for mapping a core number and thread number to a CPU number. Listing 1 shows sample code that performs both in a cross-platform way, along with some example uses in the main routine; a sketch that applies the mapping to per-core SMT control follows the listing.

Listing 1. Determine threads-per-core and map {core, thread} to CPU

#include <stdio.h>
#include <unistd.h>

static int thread_enumeration_contiguous = -1;

int max_smt() {
  static int max_smt_save = 0;
  if (max_smt_save) return max_smt_save;

  FILE *f = fopen("/sys/devices/system/cpu/cpu0/topology/thread_siblings","r");
  if (!f) {
    max_smt_save = 1;
    return 1;
  }

  // thread_siblings is a hexadecimal mask of the CPUs (hardware threads) that
  // share a core with cpu0. Count the set bits (threads per core) and track the
  // longest run of adjacent set bits to detect contiguous CPU numbering.
  int c, b = 0, inarow = 0, maxinarow = 0;
  while ((c = fgetc(f)) != EOF) {
    int v = 0, last = 0, bit;
    if (c >= '0' && c <= '9')
      v = c - '0';
    if (c >= 'a' && c <= 'f')
      v = c - 'a' + 10;
    for (bit = 0x1; bit <= 0x8; bit <<= 1) {
      if (v & bit) {
        b++;
        if (last == 1) inarow++;
        else inarow = 1;
        if (inarow > maxinarow) maxinarow = inarow;
        last = 1;
      } else {
        last = 0;
        inarow = 0;
      }
    }
  }
  fclose(f);

  thread_enumeration_contiguous = (maxinarow > 1) ? 1 : 0;

  max_smt_save = b;
  return b;
}

int core_thread_to_cpu(int core, int thread) {
  int smt = max_smt();
  int cpus = sysconf(_SC_NPROCESSORS_CONF);
  int cores = cpus / smt;
  if (thread >= smt) return -1;
  if (core >= cores) return -1;
  if (thread_enumeration_contiguous)
    return core * smt + thread;
  else
    return core + thread * cores;
}

int main(int argc, const char * const argv[]) {
  int smt = max_smt();
  printf("%d %s\n",smt,thread_enumeration_contiguous ? "contiguous" : "non‑contiguous");
  int core, thread;
  for (core = 0; core < 5; core++) {
    for (thread = 0; thread < 10; thread++) {
      printf("core %d thread %d is CPU%d\n",core,thread,core_thread_to_cpu(core,thread));
    }
  }
  return 0;
}
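
For example, building on Listing 1, a sketch like the following (the helper name is illustrative, error handling is minimal, and root privileges are required) could put a single core into ST mode by taking its secondary threads offline through the sysfs online files shown earlier:

// Assumes max_smt() and core_thread_to_cpu() from Listing 1 are in scope.
#include <stdio.h>

int set_core_smt1(int core) {
  int smt = max_smt();
  int thread;
  for (thread = 1; thread < smt; thread++) {       // leave thread 0 online
    int cpu = core_thread_to_cpu(core, thread);
    if (cpu < 0) return -1;
    char path[64];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d/online", cpu);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fputc('0', f);                                 // writing '1' brings the CPU back online
    fclose(f);
  }
  return 0;
}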

Sheer number of CPUs

As modern systems scale ever larger (CECs per system, sockets per CEC, chips per socket, cores per chip, threads per core), the CPU count of a system scales proportionally. A very large POWER8 processor-based system may have 192 cores; with eight threads per core, such a system has 1,536 CPUs! Applications that attempt to scale across multiple CPUs are often not prepared for that level of scalability in general, and in particular are not prepared for the second-order NUMA effects, where some memory regions have higher latency than others with respect to a given CPU.

Some strategies for enhancing scalability include:

  • Hierarchical or multi-level locking schemes (for example: per core, per node, per CEC)
  • Lockless algorithms
  • Careful placement and binding of tasks (for example: related tasks on the same node; see the sketch after this list)
  • Careful allocation of memory and shared memory segments (favorable NUMA placement)
  • Automatic NUMA balancing (and attention to its benefits and drawbacks)
  • Attention to inadvertent cache-line sharing and page sharing (mentioned above)
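
For the task-binding item above, a minimal sketch (the choice of CPU0 is arbitrary) using the Linux sched_setaffinity() interface might look like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);                                   // bind to CPU0 (illustrative choice)
  if (sched_setaffinity(0, sizeof(set), &set) != 0) { // 0 = the calling task
    perror("sched_setaffinity");
    return 1;
  }
  printf("bound to CPU0\n");
  return 0;
}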

Stay tuned for more

The next part of this series is forthcoming, so stay tuned.