Introduction

I had the good fortune to be selected to present at the recent OpenPOWER Summit on the topic of tools for porting and tuning for Linux on Power. The time slots for the presentations were fairly short (maximum of 30 minutes), but there was a lot I wanted to cover. So, I did my best to highlight the tools I felt had the most value, and perhaps lacked general awareness. A video of the presentation can be found at https://www.youtube.com/watch?v=PJwnfDSHOLI. (There are lots of great presentations from the OpenPOWER Summit US 2018 appearing on the OpenPOWER Foundation’s YouTube channel at: https://www.youtube.com/channel/UCNVcHm09eXVbvUzZkQs0_Sg.)

Instead of listening to me drone on for 30+ minutes, what follows is the basic content of the presentation in textual form. It is not verbatim, and it is slightly more detailed, which should make it a better reference.

As a general note, I classify the use-cases for the tools in the following three ways:

  1. The user has the source code available and is willing and able to change the source code for portability and performance advantage; the code can be recompiled and relinked; there is a representative performance scenario which can be run for analysis.
  2. The user has the source code, but may not be willing or able to change it; the code can be recompiled or relinked; there is a representative performance scenario which can be run for analysis.
  3. Source code is not required; neither recompiling nor relinking is required; there is a representative performance scenario which can be run for analysis.

Advance Toolchain

The IBM® Advance Toolchain is a software suite containing the latest releases of compilers, various libraries, and various tools related to application porting, tuning, and debugging. Recent releases of these components include support for the latest features and the latest optimizations for OpenPOWER and IBM POWER® processor-based platforms. The purpose of the Advance Toolchain is to make these more modern software components available on distributions which only provide significantly older releases. The distributions, justifiably, are reluctant to change major components of the operating system like compilers and system libraries as the risk to stability is not worth the opportunity for better performance. Some distributions have made strides in providing developers with much more recent components as a developer toolset that provides a later compiler and its prerequisites. The Advance Toolchain goes farther in providing not only the latest compilers but also the latest releases of many system libraries. In addition, those system libraries are built with the latest compilers. In this way, applications built with the Advance Toolchain benefit not only from new optimizations in the latest software, but also by having that software compiled with the new compiler.

Further, the Advance Toolchain is where new compatibility features may appear first. For example, there is an ongoing effort under the auspices of the GCC project to provide compatible implementations of the Intel vector intrinsics. Those will not appear in major Linux distributions for a year or more, but are already appearing in the Advance Toolchain.

The Advance Toolchain is supportable through IBM Support Line. Updates with bug fixes and security-related fixes are released often. It is available for free download, and is entirely open source at https://github.com/advancetoolchain/advance-toolchain.

One caveat is that when an application is built with the Advance Toolchain, it then has a dependency on the Advance Toolchain runtime. So, if that application is to be deployed elsewhere, the Advance Toolchain runtime package must be installed there as well. Because the runtime is free, this is not of significant concern, but something of which to be aware.

Best practices:

  • Better: At a minimum, use the latest of any distribution-provided developer toolset to get a recent release of the compiler.
  • Best: Use the Advance Toolchain to get the latest release of the compiler, libraries, and tools; plus, those libraries built with the latest release of the compiler!

XL compilers

IBM XL compilers are IBM’s flagship proprietary compiler suite, used for reporting SPEC benchmark results on IBM AIX®, IBM z/OS®, and Linux on Power. The IBM XL compiler development team works closely with the IBM Research team to incorporate the very best optimization techniques for performance advantage. Recently, the IBM XL C/C++ compiler switched to use a source code parser (front end) based on Clang, allowing the IBM XL C/C++ compiler to significantly improve source code compatibility with GCC and LLVM. Also, most common command-line options for GCC are also supported by the XL C/C++ compilers.

There are two variants of IBM XL C/C++ for Linux: the Community Edition, which is free to download and use, and the fully licensed and supported edition.

The IBM XL compilers can make full use of the Advance Toolchain libraries and tools. In a sense, you get the best of both worlds: IBM's flagship proprietary compiler combined with the latest fully optimized libraries. (Note that this imposes a dependency on the Advance Toolchain runtime.)

Listing 1. XL compilation using IBM Advance Toolchain libraries


$ xlc a.c -F/opt/at11.0/scripts/xlC-16_1_0-AT11_0.dfp.cfg

Beyond standard compilation, the IBM XL compilers also offer several advanced features which can be used to performance advantage, one of which falls in the gray box category: automatic parallelization. When enabled with a command-line option, the compiler generates code that automatically exploits the multithreading capabilities of the Power system.

The IBM XL compilers also include the following advanced features that can be used to further exploit the capabilities of the Power systems with source code changes:

  • Transparent exploitation of GPU resources by taking advantage of the OpenMP 4.5 support
  • High performance optimized math libraries (ESSL, BLAS)
  • Optimization reports that can indicate areas of the code in which optimization opportunities could be increased with changes

Best practice:

  • Try using the IBM XL C/C++ for Linux Community Edition. It is free, and its compatibility with GCC should make it a drop-in replacement simply by changing the PATH. If the performance advantage is significant, consider adopting the fully licensed and supported version for integration into a production build environment.

IBM FDPR

IBM Feedback Directed Program Restructuring (IBM FDPR®) is an offering from IBM Research to optimize an existing binary without the need for source code changes, recompiling, or relinking. This is also known as post-link optimization. The tool uses a three-step process:

  1. Instrument the binary.
  2. Profile a representative performance scenario.
  3. Create a new, optimized binary.

This tool is labeled gray box even though it doesn't require recompiling, because the binary must be linked such that relocation information is preserved. Because this is not the default behavior of the linker, it is usually the case that the binary needs to be relinked with the --emit-relocs linker option (for example, by passing -Wl,--emit-relocs to the compiler driver).

There are open source wrapper scripts in the fdpr_wrap package. Both FDPR and fdpr_wrap packages can be found at https://developer.ibm.com/linuxonpower/sdk-packages.

An example of use:

Listing 2. FDPR post-link optimization


$ gcc -o load -O3 -mcpu=power8 load.c -Wl,--emit-relocs
$ ./load
Total run time for 10000 iterations was: 3.579205 seconds
$ /opt/ibm/fdprpro/bin/fdpr_instr_prof_opt load
FDPR profiling: /home/pc/load-2.1pc/load.instr  ...
Total run time for 10000 iterations was: 26.036891 seconds
$ ./load.fdpr
Total run time for 10000 iterations was: 2.130412 seconds

FDPR can also generate a journal file, which includes suggestions for source code changes or compiler flags that could be implemented for performance improvement. Implementing these changes makes the improvements permanent, in that FDPR would not be required to realize them in subsequent builds of the program. Generating the journal is similar to the optimization example, above. Note that in this case, the compilation step adds the -g flag to generate debugging information so that FDPR can report line numbers in the journal:

Listing 3. FDPR generating journal


$ gcc -o load -Wl,--emit-relocs -O3 -mcpu=power8 -g load.c
$ /opt/ibm/fdprpro/bin/fdpr_instr_prof_jour load
$ ls -1tr | tail -1
load_jour.xml

An example extract of journal contents follows:

Listing 4. FDPR journal


<operation name="Unroll loop">
  <problem>High branch penalty in a small loop</problem>
  <solution>Compiler: the loop is unrolled N times so it fits within a cache line</solution>
  <site>
    <ip>100006a4</ip>
    <dir>/home/pc/load-2.1pc</dir>
    <file>load.c</file>
    <fn>main</fn>
    <line>15</line>
    <xcount>132900736</xcount>
  </site>
</operation>

In this example, FDPR has found a place where unrolling a loop would be advantageous, which suggests adding a compiler option like -funroll-loops.

Best practices:

  • After completing all other source code changes and compilation-related optimization, generate an FDPR journal to see if there are more opportunities to improve performance with source code changes. Validate the results of any changes, as not all suggestions will necessarily result in performance improvement.
  • After completing all other source code changes and compilation-related optimization, use FDPR on the final program binaries to squeeze a bit more performance out of the program.

IBM SDK for Linux on Power

The IBM SDK for Linux on Power (the SDK) is a full-featured integrated development environment (IDE) based on Eclipse with IBM-developed plug-ins aimed at porting and tuning C and C++ applications for Linux on Power.

The SDK runs exclusively on Linux, on either x86 or Power systems. It can also be made to run on Microsoft® Windows® or macOS, or anywhere else a sufficient Linux environment can be established with virtualization or emulation. It can run in several modes:

  • Complete IDE and development environment on a Power system, with remote display to a desktop through VNC, Secure Shell (SSH) tunneling, X-Windows network protocols, or other remote desktop or application display techniques.
  • Complete IDE and development environment on an x86 system, using cross-compilers and IBM POWER® emulation (QEMU, Power Functional Simulator).
  • IDE on x86 system and development environment on Power by way of “remote synchronized project” capabilities of the SDK. In many ways, this is the best of both worlds, with the advantages of better interactivity of a local IDE interface and seamless access to a POWER processor-based development environment.

Information about some of the more commonly used plug-ins is given below. I also created a tutorial for how to make the best use of the SDK at: https://developer.ibm.com/linuxonpower/tutorials/sdk_linux_on_power.

In addition, there are several videos showing how to use many of the plug-ins at: https://www.youtube.com/playlist?list=PLXUHQs-GmUIRbCsNtjcTAIFpZdvmhEPjq.

The SDK is free to download and use at https://developer.ibm.com/linuxonpower/sdk.

Best practices:

  • If you have the source code, try the SDK. It is free, easy to install, and easy to use.
  • For porting, start with the SDK Migration Advisor.
  • Enable the Build Advisor and consider its recommendations.
  • After performing any other performance analysis, including the use of perf or OProfile within the SDK, use the SDK Source Code Advisor as a final pass for performance opportunities.
  • For deeper inspection, use the SDK CPI Breakdown and Drill-down (see below) to look for inefficiencies in the generated code.

SDK Migration Advisor

The SDK Migration Advisor is an included plug-in for the SDK that scans the imported source code of a project looking for potential portability issues. The issues the Migration Advisor looks for include:

  • #ifdef x86
  • Non-portable system calls, APIs, built-ins, assembly
  • long double, Float128
  • Non-portable hardware transactional memory use
  • char default signedness
  • 32/64 bit
  • sync-style built-ins
  • Endian issues

The Migration Advisor is very simple to run: after the source is imported into a new project, right-click the project and click Run Migration Advisor from the project’s context menu.

When the Migration Advisor completes, it produces a report with the issues found.

Figure 1. Migration Advisor results

Double-click an issue to go directly to the source code line on which the issue was detected. Further, when the cursor hovers over that line, a pop-up box appears describing the issue in more detail.

Figure 2. Migration Advisor: Pop-up window

Finally, if you click Built-in quick fix, a portable fix for the issue is implemented directly in the code!

While this is extremely convenient, it can be tedious for code in which a large number of issues are detected. There is a feature to automatically implement all fixes for which the Migration Advisor has high confidence.

Figure 3. Migration Advisor: Automate

The Apply basic fixes automatically option can save a significant amount of time, especially for large projects with many detected issues.

Figure 4. Migration Advisor: Automated results

SDK Source Code Advisor

The SDK Source Code Advisor uses IBM FDPR to perform run-time performance analysis and report potential issues that might be ameliorated with source code or makefile changes.

Figure 5. Source Code Advisor results

Double-clicking the source line in the report displays the source code. Hovering over the source line in the code displays a pop-up box with a similar description and a hyperlink to apply a fix to the code.

Figure 6. Source Code Advisor: Pop-up window

Best practice:

  • Use the SDK Source Code Advisor as a final pass after tuning the application by other means; apply recommended fixes; and validate the results, as not all suggestions will necessarily provide performance benefit.

SDK CPI Breakdown and Drill-down

Cycles per instruction (CPI) is a measurement of the average number of processor cycles used to execute each instruction of a program; it is a measure of the efficiency with which a program runs on a processor or a set of processors. This metric is certainly interesting, but in isolation it does not convey any information about how to improve the efficiency of the program. Most modern processors can monitor a large and diverse set of hardware events which comprise the CPI value. Further, some processors support more specific events which are themselves components of more general events. In this way, a hierarchical breakdown of the CPI value can be generated, providing valuable insight into the types of inefficiency a program is encountering. The SDK CPI Breakdown tool will run a given representative performance scenario for the program and graphically display the CPI and the associated hierarchical breakdown.

Figure 7. CPI Breakdown results

There is a lot of information displayed in a small space, and the image may be difficult to read (it is far more legible in the application). At the far left is a legend explaining which components are measured events (light gray) and which are calculated metrics (darker gray). Also shown are the raw run CPI and stall CPI measurements: run CPI covers cycles in which instructions are proceeding, and stall CPI covers cycles in which instructions have been held up for some reason. Moving to the right, the first column is the overall cycles (PM_RUN_CYC) for which the program was actively running (scheduled, whether instructions were progressing or stalled). The next column breaks the overall number of cycles into subsets for instructions that are progressing and stalling. These categories may vary by processor architecture. Continuing to the right, each of the preceding categories is further broken down into more specific events and metrics, which also vary by processor architecture.

Hovering over any block displays a more detailed description of the event or metric. The top three events or metrics in each level are displayed in red (first), orange (second), and pink (third). The size of each block is scaled to fit in the hierarchy, not by the magnitude of the event or metric. For a better view of relative magnitudes, the data can be displayed as a radar chart by selecting that tab.

Figure 8. CPI Radar Chart view

On the CPI Breakdown tab, double-click any of the events and notice that a fresh run of the representative performance scenario is launched for profiling based on the selected event, allowing you to drill down the most significant events. When this profiled scenario has completed, the results are displayed by source and line number.

Figure 9. CPI Drill-down results

Command-line tools from the SDK

GUI-based tools can be incredibly powerful, but they are not always a good fit, and they can be difficult to get right. Usability, automation, and personal or organizational preference can preclude the use of GUI tools. For those reasons, there are ongoing efforts to create command-line versions of the most valuable tools in the SDK. These are described below.

ma (Migration Advisor)

Similar to the SDK Migration Advisor (described above), the command-line Migration Advisor scans the source code looking for likely portability issues. As of this writing, the list of checkers is slightly shorter for the command-line version:

  • #ifdef x86
  • Non-portable system calls, APIs, built-ins, assembly
  • long double, Float128
  • Non-portable hardware transactional memory use
  • char default signedness

Usage is very simple:

Listing 5. ma results


$ ma run src/.
================
Migration Report
================
Problem type: Non Portable Pthread
Problem description: Reports occurrences of non-portable Pthreads API
   File: ma/many.c
      Line: 3
      Problem: pthread_id_np_t tid
      Line: 4
      Problem: pthread_getthreadid_np()
   File: ma/pthread.c
      Line: 3
      Problem: pthread_id_np_t tid
      Line: 4
      Problem: pthread_getthreadid_np()

Problem type: Performance degradation
Problem description: This preprocessor can contain code without Power optimization
   File: ma/performance.c
      Line: 3
      Problem: #ifdef _x86

Problem type: Inline assembly
Problem description: Possible arch specific assembly
   File: ma/asm.c
      Line: 2
      Problem: asm("mov %ax, 0")
      Line: 3
      Problem: __asm__("mov %ax, 0")

Problem type: Long double usage
Problem description: Potential migration issue due size of long double variables in Power architecture.
   File: ma/t0.c
      Line: 3
      Problem: long double ld
   File: ma/double.c
      Line: 3
      Problem: long double ld

Problem type: Hardware Transactional Memory (HTM)
Problem description: x86 specific HTM calls are not supported in Power Systems
   File: ma/htm.c
      Line: 1
      Problem: include rtmintrin.h
      Solution: replace rtmintrin.h for htmintrin.h
      Line: 4
      Problem: _xbegin()
      Solution: replace xbegin for __builtin_tbegin

Problem type: Decimal Floating Point (DFP) API
Problem description: x86 API not supported in Power
   File: ma/dfp.c
      Line: 1
      Problem: include bid_functions.h
      Line: 6
      Problem: _bid64_pow(dfp0,dfp0)
The command-line Migration Advisor is not yet able to implement any fixes in the source code.

The project is written in Python, is open source, and can be found at: https://github.com/open-power-sdk/migration-advisor.

sca (Source Code Advisor)

Similar to the SDK Source Code Advisor (described above), the command-line Source Code Advisor uses FDPR (described above) to analyze a representative performance scenario and report performance issues which could not be identified during compilation. The results are displayed in a very readable format for inspection and further manual action.

Listing 6. sca results


$ sca ./command
[Problem: UNROLL LOOP]
[Description: High branch penalty in a small loop.]
[Solution:
    Specify the GNU extension "__attribute__((optimize("unroll-loops")))"
    on the function containing the loop.
    Example:
        void __attribute__((optimize("unroll-loops"))) foo(void);
    Note: Unrolling loops can sometimes negatively impact performance,
    so validation of its impact is recommended.]
[Reference: /home/pc/load-2.1pc/command.c:15 | Function: main | Instruction Pointer: 10000694]

The command-line Source Code Advisor is not yet able to implement any fixes in the source code.

The project is written in Python, is open source, and can be found at: https://github.com/open-power-sdk/source-code-advisor.

cpi (CPI Breakdown)

Similar to the SDK CPI Breakdown tool (described above), the command-line CPI Breakdown tool will profile a representative performance scenario and report a hierarchical set of information about where the program is spending its time. Using the command-line CPI Breakdown tool is a two-step process:

  1. record: Profile the performance scenario and record relevant hardware events.
  2. display: Display the collated results in the form of a hierarchical layout of events, metrics, and their respective relative contribution to overall CPI measurement.

Use of the command-line CPI Breakdown tool is simple. The first step is to record the hardware event counts.

Listing 7. cpi record


$ cpi record ./load
[...]
$ ls -tr | tail -1
load_20180416_215506.cpi

Note that the scenario (“./load” in this example) will be run several times in succession in order to collect all relevant hardware performance events, as only a handful are collected during each run.

The second step is to display the CPI breakdown.

Listing 8. cpi display results


$ cpi display -f ./load_20180416_215506.cpi
RUN_CPI: 4.957 (100.00 %)
  STALL_CPI: 2.270 (45.80 %)
    BRU_CRU_STALL_CPI: 0.277 (5.60 %)
      BRU_STALL_CPI: 0.277 (5.59 %)
      CRU_STALL_CPI: 0.000 (0.00 %)
    FXU_STALL_CPI: 0.707 (14.27 %)
      FXU_MULTI_CYC_CPI: 0.000 (0.00 %)
      FXU_STALL_OTHER_CPI: 0.707 (14.27 %)
    VSU_STALL_CPI: 0 (0.0 %)
      VSU_STALL_VECTOR_CPI: 0 (0.0 %)
        VSU_STALL_VECTOR_LONG_CPI: 0 (0.0 %)
        VSU_STALL_VECTOR_OTHER_CPI: 0 (0.0 %)
      VSU_STALL_SCALAR_CPI: 0 (0.0 %)
        VSU_STALL_SCALAR_LONG_CPI: 0 (0.0 %)
        VSU_STALL_SCALAR_OTHER_CPI: 0 (0.0 %)
      VSU_STALL_OTHER_CPI: 0 (0.0 %)
    LSU_STALL_CPI: 1.311 (26.46 %)
      LSU_STALL_DCACHE_MISS_CPI: 1.118 (22.56 %)
        LSU_STALL_DCACHE_MISS_L2L3_CPI: 0 (0.0 %)
          LSU_STALL_DCACHE_MISS_L2L3_CONFLICT_CPI: 0.876 (17.67 %)
          LSU_STALL_DCACHE_MISS_L2L3_NOCONFLICT_CPI: 0 (0.0 %)
        LSU_STALL_DCACHE_MISS_L3MISS_CPI: 0 (0.0 %)
          LSU_STALL_DCACHE_MISS_LMEM_CPI: 0 (0.0 %)
          LSU_STALL_DCACHE_MISS_L21L31_CPI: 0 (0.0 %)
          LSU_STALL_DCACHE_MISS_REMOTE_CPI: 0 (0.0 %)
          LSU_STALL_DCACHE_MISS_DISTANT_CPI: 0 (0.0 %)
      LSU_STALL_REJECT_CPI: 0.029 (0.58 %)
        LSU_STALL_LHS_CPI: 0.000 (0.00 %)
        LSU_STALL_ERAT_MISS_CPI: 0 (0.0 %)
        LSU_STALL_LMQ_FULL_CPI: 0 (0.0 %)
        LSU_STALL_REJECT_OTHER_CPI: 0.029 (0.58 %)
      LSU_STALL_STORE_CPI: 0.002 (0.04 %)
      LSU_STALL_LD_FIN_CPI: 0.470 (9.47 %)
      LSU_STALL_ST_FWD_CPI: 0 (0.0 %)
      LSU_STALL_OTHER_CPI: 0 (0.0 %)
    NTCG_FLUSH_CPI: 0 (0.0 %)
    NO_NTF_STALL_CPI: 0.012 (0.24 %)
    OTHER_STALL_CPI: 0 (0.0 %)
  NTCG_ALL_FIN_CPI: 0 (0.0 %)
  THREAD_BLOCK_STALL_CPI: 0.440 (8.87 %)
    LWSYNC_STALL_CPI: 0.001 (0.01 %)
    HWSYNC_STALL_CPI: 0.000 (0.00 %)
    MEM_ECC_DELAY_STALL_CPI: 0 (0.0 %)
    FLUSH_STALL_CPI: 0.442 (8.92 %)
    COQ_FULL_STALL_CPI: 0.000 (0.00 %)
    OTHER_BLOCK_STALL_CPI: 0 (0.0 %)
  GCT_EMPTY_CPI: 1.245 (25.13 %)
    GCT_EMPTY_IC_MISS_CPI: 0 (0.0 %)
      GCT_EMPTY_IC_MISS_L3MISS_CPI: 0 (0.0 %)
      GCT_EMPTY_IC_MISS_L2L3_CPI: 0 (0.0 %)
    GCT_EMPTY_BR_MPRED_CPI: 0 (0.0 %)
    GCT_EMPTY_BR_MPRED_IC_MISS_CPI: 0 (0.0 %)
    GCT_EMPTY_DISP_HELD_CPI: 0.000 (0.00 %)
      GCT_EMPTY_DISP_HELD_MAP_CPI: 0.000 (0.00 %)
      GCT_EMPTY_DISP_HELD_SRQ_CPI: 0 (0.0 %)
      GCT_EMPTY_DISP_HELD_ISSQ_CPI: 0 (0.0 %)
      GCT_EMPTY_DISP_HELD_OTHER_CPI: 0 (0.0 %)
    GCT_EMPTY_OTHER_CPI: 1.245 (25.12 %)
  COMPLETION_CPI: 0.384 (7.75 %)
  OTHER_CPI: 0.618 (12.46 %)

There is a lot of information displayed. For convenience, there are command flags which can limit the output.

Listing 9. cpi display with filters


$ cpi display -f ./load_20180416_215506.cpi --top-events 5
================
Events Hot Spots
================
    PM_RUN_CYC : 30007450676
    PM_CMPLU_STALL : 13742052138
    PM_CMPLU_STALL_LSU : 7939376272
    PM_GCT_NOSLOT_CYC : 7540013513
    PM_CMPLU_STALL_DCACHE_MISS : 6770379676

$ cpi display -f ./load_20180416_215506.cpi --top-metrics 5
================
Metrics Hot Spots
================
    RUN_CPI : 4.957
    STALL_CPI : 2.270
    LSU_STALL_CPI : 1.311
    GCT_EMPTY_CPI : 1.245
    GCT_EMPTY_OTHER_CPI : 1.245

Because the ultimate goal is to narrow down where in the code adverse events are happening, there is a further convenience function that can drill down on the most frequently occurring events. New profiling runs are launched in which those specific events are recorded, and the profiling information, including source file, line number, and potentially instruction, is included in the command output.

Listing 10. cpi drilldown results


$ cpi drilldown --auto 5 --threshold 0.25 ./load 2000
    Recording CPI Events: 20/20 iterations (elapsed time: 29 seconds)
    Running drilldown with event: PM_RUN_CYC

===============================
Drilldown for event: PM_RUN_CYC
===============================
99.51% in /home/pc/load-2.1pc/load
    99.51% in main [/home/pc/load-2.1pc/load.c]
===============================

    Running drilldown with event: PM_CMPLU_STALL

===================================
Drilldown for event: PM_CMPLU_STALL
===================================
99.07% in /home/pc/load-2.1pc/load
    99.07% in main [/home/pc/load-2.1pc/load.c]
0.9% in /proc/kallsyms
    0.63% in rfi_flush_fallback [??]
===================================

    Running drilldown with event: PM_CMPLU_STALL_LSU

=======================================
Drilldown for event: PM_CMPLU_STALL_LSU
=======================================
98.3% in /home/pc/load-2.1pc/load
    98.3% in main [/home/pc/load-2.1pc/load.c]
1.67% in /proc/kallsyms
    0.92% in rfi_flush_fallback [??]
=======================================

    Running drilldown with event: PM_GCT_NOSLOT_CYC

======================================
Drilldown for event: PM_GCT_NOSLOT_CYC
======================================
99.03% in /home/pc/load-2.1pc/load
    99.03% in main [/home/pc/load-2.1pc/load.c]
0.94% in /proc/kallsyms
    0.47% in rfi_flush_fallback [??]
======================================

    Running drilldown with event: PM_CMPLU_STALL_DCACHE_MISS

===============================================
Drilldown for event: PM_CMPLU_STALL_DCACHE_MISS
===============================================
99.55% in /home/pc/load-2.1pc/load
    99.55% in main [/home/pc/load-2.1pc/load.c]
===============================================

The project is written in Python, is open source, and can be found at: https://github.com/open-power-sdk/cpi-breakdown.

curt

There is a tool called curt on AIX that displays statistics related to system utilization. A new tool for Linux, also called curt, is inspired by the AIX tool (but is otherwise unrelated).

Statistics reported by the Linux curt tool include:

  • Per-task-per-CPU user, system, interrupt, hypervisor, and idle time
  • Per-task, per-process, and system-wide user, system, interrupt, hypervisor, and idle time
  • Per-task, per-process, and system-wide utilization percentage and migration counts
  • Per-task-per-syscall invocation counts, elapsed time, average time, minimum time, and maximum time
  • Per-task-per-HCALL invocation counts, elapsed time, average time, minimum time, and maximum time
  • Per-task-per-interrupt counts, elapsed time, average time, minimum time, and maximum time

Use of the tool is a two-step process:

  1. Use perf record to generate a recording of relevant events:

    
    $ perf record -e '{raw_syscalls:*,sched:sched_switch,sched:sched_migrate_task,
    sched:sched_process_exec,sched:sched_process_fork,sched:sched_process_exit,
    sched:sched_stat_runtime,sched:sched_stat_wait,sched:sched_stat_sleep,
    sched:sched_stat_blocked,sched:sched_stat_iowait,powerpc:hcall_entry,
    powerpc:hcall_exit}' -a -- command [args]
    

  2. Use the curt script to process the data recorded by the perf command: $ ./curt.py perf.data

With the most recent version of curt, both recording and reporting can be done in a single, simple step: $ ./curt.py --record all command

Sample output (heavily edited for brevity and clarity):

Listing 11. curt results


PID :
5020:
  [task] command    cpu      user       sys       irq        hv      busy         idle
  [5092] imjournal    6  0.288924  0.154960  0.000000  0.000000  0.000000  5001.594250
  [5092] imjournal  ALL  0.288924  0.154960  0.000000  0.000000  0.000000  5001.594250



  [task] command    cpu   runtime     sleep      wait   blocked    iowait unaccounted
  [5092] imjournal    6  0.461900  0.000000  0.000000  0.000000  0.000000  997.568960
  [5092] imjournal  ALL  0.461900  0.000000  0.000000  0.000000  0.000000  997.568960



  [task] command    cpu util% moves
  [5092] imjournal    6  0.0%
  [5092] imjournal  ALL  0.0%     0



    ( ID)name  count      elapsed      pending      average      minimum      maximum
    (  3)read      6     0.041416     0.000000     0.006903     0.002252     0.022116
    (167)poll      4  4004.103382   997.585996  1001.025845  1001.018766  1001.029046
    (221)futex     1     0.011118     0.000000     0.011118     0.011118     0.011118
    (106)stat      1     0.007298     0.000000     0.007298     0.007298     0.007298


  [task] command    cpu      user       sys       irq        hv      busy         idle
  [5093] rs:main      7  0.093216  0.072478  0.000000  0.000000  0.000000  5001.872440
  [5093] rs:main    ALL  0.093216  0.072478  0.000000  0.000000  0.000000  5001.872440


  [task] command    cpu   runtime     sleep      wait   blocked    iowait unaccounted
  [5093] rs:main      7  0.145840  0.000000  0.000000  0.000000  0.000000  5001.872440
  [5093] rs:main    ALL  0.145840  0.000000  0.000000  0.000000  0.000000  5001.872440


  [task] command    cpu util% moves
  [5093] rs:main      6  0.0%
  [5093] rs:main    ALL  0.0%     0


    ( ID)name  count      elapsed      pending      average      minimum      maximum
    (  4)write     1     0.036936     0.000000     0.036936     0.036936     0.036936
    (221)futex     1     0.002178  5001.905804     0.002178     0.002178     0.002178


  [task] command    cpu      user       sys       irq        hv      busy          idle
  [ ALL]     ALL    ALL  0.382140  0.227438  0.000000  0.000000  0.000000  10003.466690


  [task] command    cpu   runtime     sleep      wait   blocked    iowait  unaccounted
  [ ALL]     ALL    ALL  0.607740  0.000000  0.000000  0.000000  0.000000  5999.441400


  [task] command    cpu util% moves
  [ ALL]     ALL    ALL  0.0%     0

The project is written in Python, is open source, and can be found at: https://github.com/open-power-sdk/curt.

Power Functional Simulator

The Power Functional Simulator is a full-system simulator for POWER. This very powerful tool provides a complete POWER environment when a POWER processor-based system is otherwise unavailable or impractical. Given an image of an installed file system, it will boot through firmware and operating system to a login prompt for a POWER processor-based development environment.

The Power Functional Simulator can be found at: https://developer.ibm.com/linuxonpower/sdk-packages.

Convenient wrapper scripts, which greatly simplify getting an environment established, can be found at: https://github.com/open-power-sdk/power-simulator.

Listing 12. Power Functional Simulator boot


[x86-laptop]$ mambo -s power9  #(edited for brevity...)
You are starting the IBM POWER9 Functional Simulator
When the boot process is complete, use the following credentials to access it via ssh:
ssh root@172.19.98.109
password: mambo
Licensed Materials - Property of IBM.
(C) Copyright IBM Corporation 2001, 2017
All Rights Reserved.
Using initial run script /opt/ibm/systemsim-p9/run/p9/linux/boot-linux-le-skiboot.tcl
Starting mambo with command: /opt/ibm/systemsim-p9/bin/systemsim-p9 -W -f
/opt/ibm/systemsim-p9/run/p9/linux/boot-linux-le-skiboot.tcl
Found skiboot skiboot.lid in current directory
Found kernel vmlinux in current directory
Found disk image disk.img in current directory
Booting with skiboot ./skiboot.lid.....
Booting with kernel ./vmlinux.....
root disk ./disk.img
INFO: 0: (0): !!!!!! Simulator now in TURBO mode !!!!!!
OPAL v5.7-107-g8fb78ae starting...
[...]
Linux version 4.13.0-rc4+ (pc@moose1.pok.stglabs.ibm.com) (gcc version 4.8.5 20150623 (Red Hat
4.8.5-11) (GCC)) #2 SMP Fri Aug 18
17:01:57 EDT 2017
[...]
Debian GNU/Linux 9 mambo ppc64le 172.19.98.109
mambo login:

The environment provided by the Power Functional Simulator is single-core, single-thread. Simulators are available for both POWER8 and POWER9. A simulator is a great way to start getting experience with Linux on Power, to begin a porting effort, and even to establish a robust cross-compilation and runtime environment.

Performance Simulator

Deploy icon

The Performance Simulator is a cycle-accurate POWER instruction stream reporting tool. It transforms a POWER instruction trace into a report detailing each stage, cycle by cycle, of every instruction's lifetime. The resulting reports can be viewed with one of the viewers included with the Performance Simulator package.

Instruction traces can be captured using the itrace function of the Valgrind tool suite. Valgrind itrace is not included with the Valgrind that comes with Linux distributions. However, itrace is available with the Valgrind that comes with the IBM Advance Toolchain (see above).

Using the Performance Simulator is a three-step process:

  1. Record instruction trace (.vgi file):

    $ valgrind --tool=itrace --binary-outfile=tracefile.vgi --num-K-insns-to-collect=100 --demangle=no command

  2. Create a .qt format file from the .vgi file:

    $ vgi2qt -f tracefile.vgi -o tracefile.qt

  3. Run the Performance Simulator timer (.pipe file):

    $ /opt/ibm/sim_ppc/sim_p8/bin/run_timer tracefile.qt 100000 10000 1 tracefile -scroll_pipe 1 -scroll_begin 1 -scroll_end 100000

You can view the resulting instruction timing report with one of the two included viewers (scrollpv and jviewer).

scrollpv

Deploy icon

The scrollpv viewer displays the instruction cycle report.

Figure 10. scrollpv display

Each phase of each instruction’s lifetime is shown visually in the main pane, lower left. The instruction disassembly is two columns to the right. Hovering over any of the character mnemonics in the main pane will display an explanation in the text area near the top of the window. In the example, the cursor is over an ‘s’ near the center of the pane. The corresponding explanation is “cannot issue sources not ready”. This instruction is currently waiting for its operands to be made available from the processing of a previously executed instruction.
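For intuition, the following small C sketch (illustrative only; the function names are hypothetical and not taken from any traced program) shows the kind of source pattern that produces such stalls: a reduction in which every operation depends on the previous one, next to a variant that keeps two independent dependency chains in flight.

```c
/* Each addition into acc must wait for the previous one to complete,
 * producing the "sources not ready" stalls scrollpv makes visible. */
double sum_squares_chained(const double *x, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * x[i];        /* serial dependency on acc */
    return acc;
}

/* Two independent accumulators let the processor issue several
 * floating-point operations in flight at once. */
double sum_squares_split(const double *x, int n) {
    double a0 = 0.0, a1 = 0.0;
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        a0 += x[i] * x[i];         /* independent of a1's chain */
        a1 += x[i + 1] * x[i + 1];
    }
    if (i < n)
        a0 += x[i] * x[i];         /* odd leftover element */
    return a0 + a1;
}
```

Compilers can often perform this transformation themselves at higher optimization levels, but a long serial dependency chain in the source is a common root cause of the stalls the viewer displays.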

jviewer

Deploy icon

Like scrollpv, jviewer also displays the instruction cycle report.

Figure 11. jviewer display

jviewer is very similar to scrollpv; which viewer to use is a matter of personal preference.

pipestat

Deploy icon

The pipestat tool operates on the output of the Performance Simulator cycle-accurate timer, and performs detailed analysis, reporting the following information:

  • Most executed loops, including misaligned short loops
  • Most executed blocks with long latency instructions, redundant loads
  • Most executed incorrectly hinted branches
  • Most executed mispredicted and frequently mispredicted branches
  • Most executed code paths that have a store followed soon after by a load of the same address
  • Most load-hit-store related events on a particular instruction address
  • Most executed instructions where the result is used a small number of instructions later but takes a large number of cycles before the dependent instruction starts

The pipestat tool produces a lot of useful output. Extracted below are a few snippets.

Listing 13. pipestat results


HOT execution count blocks:
  0x000004025c98-0x000004025cac N:678 6 inst trace inst 297



HOT misaligned short loops:
  0x00000400ea58-0x00000400ea74 N:170 8 inst short misalign32
Loop size summary data:
Instrs  loops  total iter  min iter  max iter  avg iter  total inst  % of trace
   6        1         402       402       402    402.00        2412        2.49



HOT Loop constructs (5 total):
Header blk IA  arch inst static dynamic iter inst/iter nodes taken !taken  BL  BLR
0000000400eb40     112        6    2412  450       5.4     1  0.00   0.00   0    0
                 BkEdge: bc_tk bc_nt fallth     Edge:  bc_tk  bc_nt   br fallth
                           402     0      0                0     0     0      0



HOT long latency instruction blocks:
  0x00000400ea7c-0x00000400ea90 N:48 6 inst badness 240



HOT redundant loads:
 intra+stack: 261 (0.27%) inter+stk: 552 (0.57%) intra: 0 (0.00%) inter: 0 (0.00%)
  0x00000400e0d4-0x00000400e0f4 N:200 redundant loads 487



HOT bad branch hints:
0x000004025dc4 hint likely not taken but was taken 36.47% (450/1234)



branch mispredict summary
  16444 branches 1271 mispredict 1240 penalty samples avg penalty 24.7cy total penalty 30618cy



HOT branch mispredict count:
0x000004025dc4 mispredict 450 (36.5% of 1234) pen 17.0cy avg over 450 LBE 0.0081/0.1153



HOT branch mispredict frequency:
0x00000400e0a4 mispredicted 100.0% of 1 pen 17.0cy avg over 1
0x00000400e824 mispredicted 47.3% of 110 pen 19.8cy avg over 52



HOT branches with high linear branch entropy and executed frequently
  0x00000400e824 N:110      LBE 0.6909/0.0000
  0x00000400e2b4 N:240      LBE 0.4333/0.0455



HOT load hit store separated by less than 100 instructions:
                      of all exec:  Pathlength     Std. red other   AGEN
ST IA   LD IA  Count   %  count min count  Avg max Dev. LDs store regs Store Values
400df64 400e0dc  200 100.0  200  13    87 34.2  68 358   87 2/2/2 st: G1 ld: G1
Total LHS events: 3282 15.1% of loads 20.0% of stores

The information displayed may be a bit cryptic, but there is comprehensive documentation that comes with pipestat. The pipestat tool can be found at https://developer.ibm.com/linuxonpower/sdk-packages.
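To make the load-hit-store hazard that pipestat flags concrete, here is an illustrative C sketch (hypothetical functions, not taken from any traced program). The first routine repeatedly stores to and reloads the same memory location; the second accumulates in a local variable that the compiler can keep in a register.

```c
#include <stdint.h>

/* Each iteration stores to *total and then loads it right back, so the
 * load can be forced to wait for the store to drain (a load-hit-store). */
uint64_t sum_via_memory(const uint64_t *x, int n, uint64_t *total) {
    *total = 0;
    for (int i = 0; i < n; i++)
        *total += x[i];            /* store, then reload of same address */
    return *total;
}

/* Accumulating in a local lets the compiler keep the running sum in a
 * register, with a single store at the end. */
uint64_t sum_via_register(const uint64_t *x, int n, uint64_t *total) {
    uint64_t t = 0;
    for (int i = 0; i < n; i++)
        t += x[i];
    *total = t;                    /* one store at the end */
    return t;
}
```

Optimizing compilers often do this rewrite automatically when they can prove no aliasing; when pipestat still reports load-hit-store events at an address, forcing the value into a local as above is a common fix.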

pveclib

Deploy icon

No Instruction Set Architecture (ISA) can have every possible useful vector instruction and an associated compiler built-in. The pveclib project provides some well-crafted implementations of useful vector functions which are not part of the POWER ISA. A sampling of the functions includes:

  • udiv_qrnnd
  • fxu_bcdadd, fxu_bcd_sub
  • vec_BCD2DFP, vec_DFP2BCD
  • vec_bcdadd, vec_bcdsub, vec_bcdmul, vec_bcddiv
  • vec_shift_leftdo
  • vec_isalpha, vec_isalnum, vec_isdigit
  • vec_toupper, vec_tolower
  • vec_absdub
  • vec_revq, vec_revd, vec_revw, vec_revh
  • vec_clzq, vec_popcntq
  • vec_sldq, vec_srqi, vec_srq, vec_slqi, vec_slq
  • vec_pasted
  • vec_mulouw, vec_muleuw, vec_mulosw, vec_mulesw
  • vec_adduqm, ...

The pveclib project is open source and can be found at https://github.com/open-power-sdk/pveclib.
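To give a flavor of what such helpers compute, here is a portable scalar sketch of a quadword population count, the operation that vec_popcntq's name suggests. The helper names below are hypothetical; pveclib's real implementation operates on vector registers using POWER instructions.

```c
#include <stdint.h>

/* Clear-lowest-set-bit loop: each pass removes one 1-bit. */
static int popcount64(uint64_t v) {
    int c = 0;
    while (v) {
        v &= v - 1;   /* clear the lowest set bit */
        c++;
    }
    return c;
}

/* Population count of a 128-bit quantity held as two 64-bit halves;
 * a scalar stand-in for a vectorized quadword popcount. */
int popcntq(uint64_t hi, uint64_t lo) {
    return popcount64(hi) + popcount64(lo);
}
```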

SPHDE

Deploy icon

The Shared Persistent Heap Data Environment (SPHDE) provides some advanced, high-performance, cross-platform implementations of functionality useful in a multiprocess or multithreaded application environment:

  • Shared Address Space (SAS): A shared-memory implementation in which the virtual addresses of data are common, so interprocess communications can freely pass pointers to data
  • Shared Persistent Heap: A multiprocess environment can allocate memory dynamically in the Shared Address Space
  • Lockless Logger: A multithreaded or multiprocess environment can safely and locklessly use an in-memory logging capability for recording events with minimal application performance impact
  • Lockless producer-consumer queue: Message passing between multiple processes can be fast and efficient, avoiding the use of locks and requiring zero copying
  • Fast timestamps: Instead of using expensive system calls or somewhat less expensive virtual dynamic shared objects (VDSOs), a single instruction can be used to access a system-wide synchronized timebase register, which is significantly faster.
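The fast-timestamps idea in the last bullet can be sketched as follows. On POWER, GCC exposes the timebase read as the __builtin_ppc_get_timebase built-in; the function name and the non-POWER fallback below are illustrative only, and SPHDE's actual API differs.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in strict modes */
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a fast, monotonically increasing timestamp.
 * On POWER this compiles to a single timebase-register read; elsewhere
 * this sketch falls back to clock_gettime purely for illustration. */
uint64_t fast_timestamp(void) {
#if defined(__powerpc64__) || defined(__powerpc__)
    return __builtin_ppc_get_timebase();  /* one timebase-read instruction */
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}
```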

SPHDE is open source and can be found at https://github.com/sphde/sphde.
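To give a flavor of the lockless producer-consumer queue idea, here is a minimal single-producer, single-consumer ring buffer sketch in C11 atomics (illustrative only; SPHDE's implementation is more general and tuned for POWER):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define QSIZE 8   /* capacity; must be a power of two */

typedef struct {
    _Atomic uint32_t head;     /* advanced only by the consumer */
    _Atomic uint32_t tail;     /* advanced only by the producer */
    uint64_t slots[QSIZE];
} spsc_queue;

/* Producer side: publish the slot contents before advancing tail. */
bool spsc_push(spsc_queue *q, uint64_t v) {
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QSIZE)
        return false;                      /* queue full */
    q->slots[tail & (QSIZE - 1)] = v;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: read the slot before releasing it back via head. */
bool spsc_pop(spsc_queue *q, uint64_t *v) {
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;                      /* queue empty */
    *v = q->slots[head & (QSIZE - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}
```

The acquire/release pairing is what makes this safe without locks: the consumer cannot observe the new tail without also observing the slot write, and vice versa for head.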

LPCPU

Deploy icon

The Linux Performance Customer Profiling Utility (LPCPU) is a free download which can be used to collect system information and system performance-related data. Use of LPCPU is a three-step process:

  1. System-wide data is collected for a specified period of time. The information collected is gathered into a compressed .tar file.

    
    # tar -xjf lpcpu.tar.bz2
    # cd lpcpu
    # ./lpcpu.sh duration=150 extra_profilers="perf tcpdump"
    [...]
    Packaging data...data collected is in /tmp/lpcpu_data.hostname.default.2018-02-26_1000.tar.bz2
    

  2. The .tar file can then be offloaded to another system for post-processing and analysis.

    
    $ tar -xf ./lpcpu_data.hostname.default.2018-02-26_1000.tar.bz2
    $ cd ./lpcpu_data.hostname.default.2018-02-26_1000
    $ ./postprocess.sh
    

  3. Point a browser at the resulting summary.html file.

Data collected includes output from the following common tools:

  • iostat
  • mpstat
  • vmstat
  • perf or OProfile
  • meminfo
  • top
  • sar
  • /proc/interrupts
  • tcpdump
  • Kernel trace
  • Hardware performance counters
  • netstat
  • ...

The post-processing steps create numerous charts and interactive graphs.

Figure 12. LPCPU CPU utilization graph 1

Figure 13. LPCPU CPU utilization graph 2

Figure 14. LPCPU Interrupts graph

Figure 15. LPCPU Dirty Memory graph

The LPCPU tool is open source and can be freely downloaded at http://ibm.co/download-lpcpu.

Support

Formal support is available for the Advance Toolchain and the XL compilers.

The Linux on Power Developer Portal is a hub for all sorts of information about hardware and software components in the Linux on Power ecosystem. Refer to: https://developer.ibm.com/linuxonpower. (Note: This portal replaces the IBM developerWorks Linux on Power Community.) Within the portal, informal support can be requested by asking questions on the dW Answers forum at: https://developer.ibm.com/answers/smartspace/linuxonpower/index.html.

Community support options are also available at well-known community websites.

Best practices

Given all the nifty tools mentioned, where should one start when embarking on an effort to port code to Linux on Power?

For interpreted code, including Java, Python, Perl, and shell scripts, just move the code to Linux on Power, and it should run without modification.

For compiled code, with little to no source code analysis:

  1. Start with the SDK Migration Advisor or the command-line version. The Migration Advisor will flag (and sometimes fix) portability issues in the code being ported.
  2. Build with the Advance Toolchain. Measure performance.
  3. Build with the XL compilers, using the Advance Toolchain libraries. Measure performance.
  4. Using the compiler that provides the best performance from steps (2) and (3), use the SDK Source Code Advisor or the command-line version to look for remaining performance opportunities. Evaluate recommendations and implement if desired. Then, rebuild.
  5. Use IBM FDPR to squeeze more performance from the resulting binaries.

For compiled code, with source code analysis, perform the same steps as above, but between steps (3) and (4):

3.1. Use common performance analysis tools such as perf to look for hot spots in the code for careful analysis, and see if any can be explained by architectural differences between prior platforms and POWER. See “Porting to Linux on Power: 5 tips that could turn a good port into a great port” at https://www.ibm.com/developerworks/library/l-port-x86-lop/. Apply mitigation if possible.

3.2. Consider using the SDK CPI Breakdown and Drill-down tool or the command-line version for deeper architecture-specific analysis. Drill-down on higher-frequency hazards. Apply mitigation if possible.

For very deep analysis, consider using the pipestat command for fine-grained instruction-level analysis.

As always, consider alternative approaches to common, important functionality that might be provided by SPHDE or pveclib.

For system-wide performance analysis, use LPCPU and/or curt.

Learn more

Conclusion

Hopefully, the collection of tools explained in this article provides what's needed to get the best results as quickly as possible. If there is a gap in a tool's functionality or documentation, if a new tool would help, or if reality isn't meeting expectations, ask in one of the support channels! We're here to help!