I had the good fortune to be selected to present at the OpenPOWER Summit US 2018 on the topic of tools for porting and tuning for Linux on Power. The time slots for the presentations were fairly short (maximum of 30 minutes), but there was a lot I wanted to cover. So, I did my best to highlight the tools I felt had the most value, and perhaps lacked general awareness. A video of the presentation can be found at https://www.youtube.com/watch?v=PJwnfDSHOLI. (There are lots of great presentations from the OpenPOWER Summit US 2018 appearing on the OpenPOWER Foundation's YouTube channel at: https://www.youtube.com/channel/UCNVcHm09eXVbvUzZkQs0_Sg.)
Instead of listening to me drone on for 30+ minutes, what follows is the basic content of the presentation in textual form, not verbatim and slightly more detailed - a better reference.
As a general note, I classify the use-cases for the tools in the following three ways:
White box: The user has the source code available and is willing and able to change it for portability and performance advantage; the code can be recompiled and relinked; there is a representative performance scenario which can be run for analysis.
Gray box: The user has the source code, but may not be willing or able to change it; the code can be recompiled or relinked; there is a representative performance scenario which can be run for analysis.
Black box: Source code is not required; neither recompiling nor relinking is required; there is a representative performance scenario which can be run for analysis.
Advance Toolchain
The IBM® Advance Toolchain is a software suite containing the latest releases of compilers, various libraries, and various tools related to application porting, tuning, and debugging. Recent releases of these components include support for the latest features and the latest optimizations for OpenPOWER and IBM Power® processor-based platforms. The purpose of the Advance Toolchain is to make these more modern software components available on distributions which only provide significantly older releases. The distributions, justifiably, are reluctant to change major components of the operating system like compilers and system libraries as the risk to stability is not worth the opportunity for better performance. Some distributions have made strides in providing developers with much more recent components as a developer toolset that provides a later compiler and its prerequisites. The Advance Toolchain goes farther in providing not only the latest compilers but also the latest releases of many system libraries. In addition, those system libraries are built with the latest compilers. In this way, applications built with the Advance Toolchain benefit not only from new optimizations in the latest software, but also by having that software compiled with the new compiler.
Further, the Advance Toolchain is where new compatibility features may appear first. For example, there is an ongoing effort under the auspices of the GCC project to provide compatible implementations of the Intel vector intrinsics. Those would not have appeared in major Linux distributions for a year or more, but have already appeared in the Advance Toolchain.
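For illustration, here is a sketch (the helper name is ours, not a library API) of x86 SSE2 intrinsic code of the kind those compatibility headers allow to compile unchanged for Power; when building on Power, defining NO_WARN_X86_INTRINSICS suppresses the advisory warning.

```c
/* Sketch: SSE2 intrinsics that the GCC compatibility headers also
 * accept when compiling for Power (define NO_WARN_X86_INTRINSICS
 * there to silence the advisory warning). add_i32x4 is our own
 * illustrative helper, not a library function. */
#include <emmintrin.h>

/* Element-wise add of two 4 x 32-bit integer vectors. */
static void add_i32x4(const int a[4], const int b[4], int out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}
```

On Power, the compatibility headers map these calls onto the equivalent VSX operations, so the same source serves both architectures.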
The Advance Toolchain is supportable through IBM Support Line. Updates with bug fixes and security-related fixes are released often. It is available for free download, and is entirely open source at https://github.com/advancetoolchain/advance-toolchain.
One caveat is that when an application is built with the Advance Toolchain, it then has a dependency on the Advance Toolchain runtime. So, if that application is to be deployed elsewhere, the Advance Toolchain runtime package must be installed there as well. Because the runtime is free, this is not of significant concern, but something of which to be aware.
Best practices:
Better: At a minimum, use the latest of any distribution-provided developer toolset to get a recent release of the compiler.
Best: Use the Advance Toolchain to get the latest release of the compiler, libraries, and tools; plus, those libraries built with the latest release of the compiler!
XL compilers
IBM XL compilers are IBM's flagship proprietary compiler suite, used for reporting SPEC benchmark results on IBM AIX®, IBM z/OS®, and Linux on Power. The IBM XL compiler development team works closely with the IBM Research team to incorporate the very best optimization techniques for performance advantage. Recently, the IBM XL C/C++ compiler switched to a source code parser (front end) based on Clang, significantly improving source code compatibility with GCC and LLVM. Most common GCC command-line options are also supported by the XL C/C++ compilers.
The IBM XL compilers can very well make use of the Advance Toolchain libraries and tools. In a sense, you can get the best of both worlds by using IBM's flagship proprietary compiler with the latest fully optimized libraries. (Note that this will impose a dependency on the Advance Toolchain runtime.)
Listing 1. XL compilation using IBM Advance Toolchain libraries
Beyond standard compilation, the IBM XL compilers also offer several advanced features which can be used to performance advantage, one of which falls in the gray box category: automatic parallelization. If this is enabled using a command line option, code can be generated to automatically use the multithreading capabilities of the Power system for performance advantage.
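As a sketch, a loop like the following, whose iterations are fully independent, is the sort of candidate the auto-parallelizer (enabled with the -qsmp=auto option) can distribute across hardware threads with no source change; the function itself is ordinary portable C.

```c
/* Iterations of this loop carry no cross-iteration dependence, so an
 * auto-parallelizing compiler (for example, XL with -qsmp=auto) is
 * free to split the index range across threads. */
#include <stddef.h>

static void saxpy(size_t n, float a, const float x[], float y[])
{
    for (size_t i = 0; i < n; i++)   /* each i is independent */
        y[i] = a * x[i] + y[i];
}
```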
The IBM XL compilers also include the following advanced features that can be used to further exploit the capabilities of the Power systems with source code changes:
Transparent exploitation of GPU resources by taking advantage of the OpenMP 4.5 support
High performance optimized math libraries (ESSL, BLAS)
Optimization reports that can indicate areas of the code in which optimization opportunities could be increased with changes
Best practice:
Try using the IBM XL C/C++ for Linux Community Edition. It is free, and its compatibility with GCC should make it a drop-in replacement simply by changing the PATH. If the performance advantage is significant, consider adopting the fully licensed and supported version for integration into a production build environment.
ma (Migration Advisor)
The command-line Migration Advisor (ma) scans source code looking for likely portability issues. As of this writing, the command-line version supports a slightly smaller set of checkers:
#ifdef x86
Non-portable system calls, APIs, built-ins, assembly
long double, Float128
Non-portable hardware transactional memory use
char default signedness
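As a hypothetical illustration, a source file exercising several of these checkers might look like the following (guarded so it still builds and runs on any architecture):

```c
/* Hypothetical source containing constructs the Migration Advisor
 * flags; guarded so it builds and runs everywhere. */

/* Checker: x86-only preprocessor blocks -- everything inside is
 * silently dropped when building for Power. */
static int fast_path(void)
{
#ifdef __x86_64__
    return 1;    /* x86-specific implementation would live here */
#else
    return 0;    /* generic fallback */
#endif
}

/* Checker: 'long double' -- the underlying format (and thus precision
 * and ABI) differs between x86 and Power. */
static long double twice(long double v)
{
    return v * 2.0L;
}

/* Checker: default char signedness -- plain char is signed on x86 but
 * unsigned on Power, so this returns different answers per platform. */
static int char_is_signed(void)
{
    return (char)-1 < 0;
}
```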
Usage is very simple:
Listing 5. ma results
$ ma run src/.
================
Migration Report
================
Problem type: Non Portable Pthread
Problem description: Reports occurrences of non-portable Pthreads API
File: ma/many.c
Line: 3  Problem: pthread_id_np_t tid
Line: 4  Problem: pthread_getthreadid_np()
File: ma/pthread.c
Line: 3  Problem: pthread_id_np_t tid
Line: 4  Problem: pthread_getthreadid_np()
Problem type: Performance degradation
Problem description: This preprocessor can contain code without Power optimization
File: ma/performance.c
Line: 3  Problem: #ifdef _x86
Problem type: Inline assembly
Problem description: Possible arch specific assembly
File: ma/asm.c
Line: 2  Problem: asm("mov %ax, 0")
Line: 3  Problem: __asm__("mov %ax, 0")
Problem type: Long double usage
Problem description: Potential migration issue due size of long double variables in Power architecture.
File: ma/t0.c
Line: 3  Problem: long double ld
File: ma/double.c
Line: 3  Problem: long double ld
Problem type: Hardware Transactional Memory (HTM)
Problem description: x86 specific HTM calls are not supported in Power Systems
File: ma/htm.c
Line: 1  Problem: include rtmintrin.h  Solution: replace rtmintrin.h for htmintrin.h
Line: 4  Problem: _xbegin()  Solution: replace xbegin for __builtin_tbegin
Problem type: Decimal Floating Point (DFP) API
Problem description: x86 API not supported in Power
File: ma/dfp.c
Line: 1  Problem: include bid_functions.h
Line: 6  Problem: _bid64_pow(dfp0,dfp0)
cpi (CPI Breakdown)
The command-line CPI Breakdown tool profiles a representative performance scenario and reports a hierarchical set of information about where the program is spending its time. Using the command-line CPI Breakdown tool is a two-step process:
record: Profile the performance scenario and record relevant hardware events.
display: Display the collated results in the form of a hierarchical layout of events, metrics, and their respective relative contribution to overall CPI measurement.
Use of the command-line CPI Breakdown tool is simple. The first step is to record the hardware event counts.
Listing 7. cpi record
$ cpi record ./load
[…]
$ ls -tr | tail -1
load_20180416_215506.cpi
Note that the scenario ("./load" in this example) will be run several times in succession in order to collect all relevant hardware performance events, as only a handful are collected during each run.
Because the ultimate goal is to narrow down where in the code adverse events occur, there is a further convenience function that can drill down on the most frequently occurring events. New profiling runs are launched in which those specific events are recorded, and the profiling information, including source file, line, and potentially instruction, is included in the command output.
Listing 10. cpi drilldown results
$ cpi drilldown --auto 5 --threshold 0.25 ./load 2000
Recording CPI Events: 20/20 iterations (elapsed time: 29 seconds)
Running drilldown with event: PM_RUN_CYC
===============================
Drilldown for event: PM_RUN_CYC
===============================
99.51% in /home/pc/load-2.1pc/load
99.51% in main [/home/pc/load-2.1pc/load.c]
===============================
Running drilldown with event: PM_CMPLU_STALL
===================================
Drilldown for event: PM_CMPLU_STALL
===================================
99.07% in /home/pc/load-2.1pc/load
99.07% in main [/home/pc/load-2.1pc/load.c]
0.9% in /proc/kallsyms
0.63% in rfi_flush_fallback [??]
===================================
Running drilldown with event: PM_CMPLU_STALL_LSU
=======================================
Drilldown for event: PM_CMPLU_STALL_LSU
=======================================
98.3% in /home/pc/load-2.1pc/load
98.3% in main [/home/pc/load-2.1pc/load.c]
1.67% in /proc/kallsyms
0.92% in rfi_flush_fallback [??]
======================================
Running drilldown with event: PM_GCT_NOSLOT_CYC
======================================
Drilldown for event: PM_GCT_NOSLOT_CYC
======================================
99.03% in /home/pc/load-2.1pc/load
99.03% in main [/home/pc/load-2.1pc/load.c]
0.94% in /proc/kallsyms
0.47% in rfi_flush_fallback [??]
======================================
Running drilldown with event: PM_CMPLU_STALL_DCACHE_MISS
===============================================
Drilldown for event: PM_CMPLU_STALL_DCACHE_MISS
===============================================
99.55% in /home/pc/load-2.1pc/load
99.55% in main [/home/pc/load-2.1pc/load.c]
===============================================
Note: Recent versions of the perf command can generate detailed CPI breakdown information:
$ perf stat --metrics cpi_breakdown ./command
The required hardware events are multiplexed during the run, so only a single run of the command is required, unlike with the cpi command. The output from perf, however, is not as readable as that of the cpi command.
curt
There is a tool called curt on AIX that displays statistics related to system utilization. A new tool for Linux, also called curt, is inspired by the AIX tool (but is otherwise unrelated).
Statistics reported by the Linux curt tool include:
Per-task-per-CPU user, system, interrupt, hypervisor, and idle time
Per-task, per-process, and system-wide user, system, interrupt, hypervisor, and idle time
Per-task, per-process, and system-wide utilization percentage, and migration counts
Per-task-per-syscall invocation counts, elapsed time, average time, minimum time, and maximum time
Per-task-per-HCALL invocation counts, elapsed time, average time, minimum time, and maximum time
Per-task-per-interrupt counts, elapsed time, average time, minimum time, and maximum time
Use of the tool is a two-step process:
Use perf record to generate a recording of relevant events:
$ perf record -e '{raw_syscalls:*,sched:sched_switch,sched:sched_migrate_task,
sched:sched_process_exec,sched:sched_process_fork,sched:sched_process_exit,
sched:sched_stat_runtime,sched:sched_stat_wait,sched:sched_stat_sleep,
sched:sched_stat_blocked,sched:sched_stat_iowait,powerpc:hcall_entry,
powerpc:hcall_exit}' -a command --args
Use the curt script to process the data recorded by the perf command:
$ ./curt.py perf.data
With the most recent version of curt, both recording and reporting can be done in a single, simple step:
$ ./curt.py --record all command
Sample output (heavily edited for brevity and clarity):
Listing 11. curt results
PID 5020:
[task] command      cpu      user       sys       irq        hv      busy         idle
[5092] imjournal      6  0.288924  0.154960  0.000000  0.000000  0.000000  5001.594250
[5092] imjournal    ALL  0.288924  0.154960  0.000000  0.000000  0.000000  5001.594250

[task] command      cpu   runtime     sleep      wait   blocked    iowait  unaccounted
[5092] imjournal      6  0.461900  0.000000  0.000000  0.000000  0.000000   997.568960
[5092] imjournal    ALL  0.461900  0.000000  0.000000  0.000000  0.000000   997.568960

[task] command      cpu  util%  moves
[5092] imjournal      6   0.0%
[5092] imjournal    ALL   0.0%      0

(  ID)name   count      elapsed      pending      average      minimum      maximum
(   3)read       6     0.041416     0.000000     0.006903     0.002252     0.022116
( 167)poll       4  4004.103382   997.585996  1001.025845  1001.018766  1001.029046
( 221)futex      1     0.011118     0.000000     0.011118     0.011118     0.011118
( 106)stat       1     0.007298     0.000000     0.007298     0.007298     0.007298

[task] command      cpu      user       sys       irq        hv      busy         idle
[5093] rs:main        7  0.093216  0.072478  0.000000  0.000000  0.000000  5001.872440
[5093] rs:main      ALL  0.093216  0.072478  0.000000  0.000000  0.000000  5001.872440

[task] command      cpu   runtime     sleep      wait   blocked    iowait  unaccounted | util%  moves
[5093] rs:main        7  0.145840  0.000000  0.000000  0.000000  0.000000  5001.872440 |  0.0%
[5093] rs:main      ALL  0.145840  0.000000  0.000000  0.000000  0.000000  5001.872440 |  0.0%      0

[task] command      cpu  util%  moves
[5093] rs:main        6   0.0%
[5093] rs:main      ALL   0.0%      0

(  ID)name   count      elapsed      pending      average      minimum      maximum
(   4)write      1     0.036936     0.000000     0.036936     0.036936     0.036936
( 221)futex      1     0.002178  5001.905804     0.002178     0.002178     0.002178

[task] command      cpu      user       sys       irq        hv      busy          idle
[ ALL] ALL               0.382140  0.227438  0.000000  0.000000  0.000000  10003.466690

[task] command      cpu   runtime     sleep      wait   blocked    iowait   unaccounted
[ ALL] ALL               0.607740  0.000000  0.000000  0.000000  0.000000   5999.441400

[task] command      cpu  util%  moves
[ ALL] ALL                0.0%      0
Power Functional Simulator
The Power Functional Simulator is a full-system simulator for Power. This very powerful tool provides a complete Power environment when a Power processor-based system is otherwise unavailable or impractical. Given an image of an installed file system, it will boot through firmware and operating system to a login prompt for a Power processor-based development environment.
[x86-laptop]$ mambo -s power9 #(edited for brevity...)
You are starting the IBM Power9 Functional Simulator
When the boot process is complete, use the following credentials to access it via ssh:
ssh root@172.19.98.109
password: mambo
Licensed Materials - Property of IBM.
(C) Copyright IBM Corporation 2001, 2017
All Rights Reserved.
Using initial run script /opt/ibm/systemsim-p9/run/p9/linux/boot-linux-le-skiboot.tcl
Starting mambo with command: /opt/ibm/systemsim-p9/bin/systemsim-p9 -W -f
/opt/ibm/systemsim-p9/run/p9/linux/boot-linux-le-skiboot.tcl
Found skiboot skiboot.lid in current directory
Found kernel vmlinux in current directory
Found disk image disk.img in current directory
Booting with skiboot ./skiboot.lid.....
Booting with kernel ./vmlinux.....
root disk ./disk.img
INFO: 0: (0): !!!!!! Simulator now in TURBO mode !!!!!!
OPAL v5.7-107-g8fb78ae starting...
[...]
Linux version 4.13.0-rc4+ (pc@moose1.pok.stglabs.ibm.com) (gcc version 4.8.5 20150623 (Red Hat
4.8.5-11) (GCC)) #2 SMP Fri Aug 18 17:01:57 EDT 2017
[...]
Debian GNU/Linux 9 mambo ppc64le 172.19.98.109
mambo login:
The environment provided by the Power Functional Simulator is single core, single thread. There are simulators available for Power8, Power9, and Power10 processors. A simulator is a great way to begin getting experience with Linux on Power, beginning a porting effort, and even for establishing a robust cross-compilation and runtime environment.
Performance Simulator
The Performance Simulator is a cycle-accurate Power instruction stream reporting tool. It transforms a Power instruction trace into a report showing each stage of every instruction's lifetime, cycle by cycle. The resulting reports can be viewed with one of the viewers included in the Performance Simulator package.
Instruction traces can be captured using the itrace function of the Valgrind tool suite. Valgrind itrace is not included with the Valgrind that comes with Linux distributions. However, itrace is available with the Valgrind that comes with the IBM Advance Toolchain (see above).
Using the Performance Simulator is a three-step process:
Capture an instruction trace of a representative performance scenario (for example, with Valgrind's itrace tool, as described above).
Run the trace through the cycle-accurate timer model for the target processor.
View the resulting instruction timing report with the jviewer.
jviewer
The jviewer tool displays the instruction cycle report.
Figure 10. jviewer display
Each phase of each instruction’s lifetime is shown visually in the main pane, lower left. The instruction disassembly is in the rightmost pane. Hovering over any of the character mnemonics in the main pane will display more details about it. In the example, the cursor is over a ‘u’ near the center of the pane. The corresponding explanation is “cannot issue unit not free”. This instruction is currently waiting for the branch unit to be made available from the processing of a previously executed instruction.
pipestat
The pipestat tool operates on the output of the Performance Simulator cycle-accurate timer, and performs detailed analysis, reporting the following information:
Most executed loops, including misaligned short loops
Most executed blocks with long latency instructions, redundant loads
Most executed incorrectly hinted branches
Most executed mispredicted and frequently mispredicted branches
Most executed code paths that have a store followed soon after by a load of the same address
Most load-hit-store related events on a particular instruction address
Most executed instructions where the result is used a small number of instructions later but takes a large number of cycles before the dependent instruction starts
The pipestat tool produces a lot of useful output. Extracted below are a few snippets.
Listing 13. pipestat results
HOT execution count blocks:
0x000004025c98-0x000004025cac N:6786 inst trace inst 297

HOT misaligned short loops:
0x00000400ea58-0x00000400ea74 N:1708 inst short misalign 32

Loop size summary data:
Instrs  loops  total iter  min iter  max iter  avg iter  total inst  % of trace
     6      1         402       402       402    402.00        2412        2.49

HOT Loop constructs (5 total):
Header blk IA   arch inst  static  dynamic  iter  inst/iter  nodes  taken  !taken  BL  BLR
0000000400eb40          1       1     2624     1       2450   5.41   0.00    0.00   0    0
  BkEdge: bc_tk bc_nt fallth  Edge: bc_tk bc_nt br fallth
             402     0      0           0     0  0      0

HOT long latency instruction blocks:
0x00000400ea7c-0x00000400ea90 N:486 inst badness 240

HOT redundant loads:
intra+stack: 261 (0.27%)  inter+stk: 552 (0.57%)  intra: 0 (0.00%)  inter: 0 (0.00%)
0x00000400e0d4-0x00000400e0f4 N:200 redundant loads 487

HOT bad branch hints:
0x000004025dc4 hint likely not taken but was taken 36.47% (450/1234)

branch mispredict summary
16444 branches 1271 mispredict 1240 penalty samples avg penalty 24.7cy total penalty 30618cy

HOT branch mispredict count:
0x000004025dc4 mispredict 450 (36.5% of 1234) pen 17.0cy avg over 450 LBE 0.0081/0.1153

HOT branch mispredict frequency:
0x00000400e0a4 mispredicted 100.0% of 1 pen 17.0cy avg over 1
0x00000400e824 mispredicted 47.3% of 110 pen 19.8cy avg over 52

HOT branches with high linear branch entropy and executed frequently:
0x00000400e824 N:110 LBE 0.6909/0.0000
0x00000400e2b4 N:240 LBE 0.4333/0.0455

HOT load hit store separated by less than 100 instructions:
                          %ofall  exec:  Pathlength       Std.  red  other  AGEN
ST IA    LD IA    Count  % count  min count  Avg  max    Dev.  LDs  store   regs  Store Values
400df64  400e0dc    200    100.0  200  13  87  34.2  68  35  88  7  2/2/2  st: G1 ld: G1

Total LHS events: 3282  15.1% of loads  20.0% of stores
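The load-hit-store events pipestat reports arise from patterns like this hypothetical pair of functions: storing to an address and reloading it soon afterward forces the load to wait for the store's data, while keeping the value in a register avoids the round trip through memory.

```c
/* Hypothetical illustration of a load-hit-store (LHS) hazard; not
 * code from the article's workload. */
struct counter { long val; };

/* Store to c->val, then reload it almost immediately: the load must
 * wait for the preceding store -- an LHS hazard. */
static long bump_lhs(struct counter *c, long delta)
{
    c->val = c->val + delta;    /* store */
    return c->val;              /* load of the same address soon after */
}

/* Equivalent logic that keeps the result in a register instead. */
static long bump_reg(struct counter *c, long delta)
{
    long v = c->val + delta;
    c->val = v;                 /* store */
    return v;                   /* no dependent reload */
}
```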
pveclib
No Instruction Set Architecture (ISA) can include every possible useful vector operation, each with an associated compiler built-in. The pveclib project provides well-crafted implementations of useful vector functions that are not part of the Power ISA.
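This is not pveclib's actual API, but a sketch in the same spirit using GCC's portable vector extension: a small helper (a horizontal sum of a 4 x int vector) of the kind such a library packages up with a well-tuned Power implementation.

```c
/* Not pveclib's API -- an illustrative helper built on GCC's portable
 * vector extension, so it compiles on any architecture. */
typedef int v4si __attribute__((vector_size(16)));

/* Horizontal sum across all four lanes of the vector. */
static int vec_sum4(v4si v)
{
    return v[0] + v[1] + v[2] + v[3];
}
```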
SPHDE
The Shared Persistent Heap Data Environment (SPHDE) provides some advanced, high-performance, cross-platform implementations of functionality useful in a multiprocess or multithreaded application environment:
Shared Address Space (SAS): A shared-memory implementation in which the virtual addresses of data are common, so interprocess communications can freely pass pointers to data
Shared Persistent Heap: A multiprocess environment can allocate memory dynamically in the Shared Address Space
Lockless Logger: A multithreaded or multiprocess environment can safely and locklessly use an in-memory logging capability for recording events with minimal application performance impact
Lockless producer-consumer queue: Message passing between multiple processes can be fast and efficient, avoiding the use of locks and requiring zero copying
Fast timestamps: Instead of using expensive system calls or somewhat less expensive virtual dynamic shared objects (VDSOs), a single instruction can be used to access a system-wide synchronized timebase register, which is significantly faster.
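This is not SPHDE's API, but a minimal sketch of the lockless producer-consumer idea using C11 atomics: a single-producer/single-consumer ring buffer in which acquire/release ordering replaces locks entirely.

```c
/* Not the SPHDE API -- a minimal single-producer/single-consumer
 * lock-free ring buffer sketch using C11 atomics. */
#include <stdatomic.h>
#include <stdbool.h>

#define QCAP 8u  /* capacity; power of two so t - h wraps cleanly */

struct spsc {
    _Atomic unsigned head;   /* advanced only by the consumer */
    _Atomic unsigned tail;   /* advanced only by the producer */
    int buf[QCAP];
};

static bool spsc_push(struct spsc *q, int v)
{
    unsigned t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    unsigned h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == QCAP)
        return false;                       /* full */
    q->buf[t % QCAP] = v;
    /* release: publish the slot before the new tail becomes visible */
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return true;
}

static bool spsc_pop(struct spsc *q, int *v)
{
    unsigned h = atomic_load_explicit(&q->head, memory_order_relaxed);
    unsigned t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (t == h)
        return false;                       /* empty */
    *v = q->buf[h % QCAP];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return true;
}
```

The acquire load of the peer's index pairs with its release store, so the consumer never reads a slot before the producer has finished writing it.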
LPCPU
The Linux Performance Customer Profiler Utility (LPCPU) collects system-wide data for a specified period of time. The information collected is gathered into a compressed .tar file.
# tar -xjf lpcpu.tar.bz2
# cd lpcpu
# ./lpcpu.sh duration=150 extra_profilers="perf tcpdump"
[…]
Packaging data ... data collected is in /tmp/lpcpu_data.hostname.default.2018-02-26_1000.tar.bz2
The .tar file can then be offloaded to another system for post-processing and analysis.
$ tar -xf ./lpcpu_data.hostname.default.2018-02-26_1000.tar.bz2
$ cd ./lpcpu_data.hostname.default.2018-02-26_1000
$ ./postprocess.sh
Point a browser at the resulting summary.html file.
Data collected includes output from the following common tools:
iostat
mpstat
vmstat
perf or OProfile
meminfo
top
sar
/proc/interrupts
tcpdump
Kernel trace
Hardware performance counters
netstat
...
The post-processing steps create numerous charts and interactive graphs.
Summary
Build with the XL compilers, using the Advance Toolchain libraries. Measure performance.
Additionally, for compiled code:
Use common performance analysis tools such as perf to look for hot spots in the code for careful analysis, and see if any can be explained by architectural differences between prior platforms and Power. See "Porting to Linux on Power: 5 tips that could turn a good port into a great port" at https://developer.ibm.com/articles/l-port-x86-lop/. Apply mitigation if possible.
Consider using the cpi-breakdown tool for deeper architecture-specific analysis. Drill down on higher-frequency hazards. Apply mitigation if possible.
For very deep analysis, consider using the pipestat command for fine-grained instruction-level analysis.
As always, consider alternative approaches to common, important functionality that might be provided by SPHDE or pveclib.
For system-wide performance analysis, use LPCPU and/or curt.
Hopefully, the collection of tools described in this article provides what's needed to get the best results as quickly as possible. If there is a gap in a tool's functionality or documentation, if a new tool would help, or if reality isn't meeting expectations, ask in one of the support channels! We're here to help!
Take the next step
Join the Power Developer eXchange Community (PDeX). PDeX is a place for anyone interested in developing open source apps on IBM Power. Whether you're new to Power or a seasoned expert, we invite you to join and begin exchanging ideas, sharing experiences, and collaborating with other members today!