This document describes the “state of the art” for IBM® Power Systems™ performance tuning, with a specific focus on systems tuning. It is a collection of wisdom from experience and is not meant to be exhaustive. It is a living document, with content added as new information is discovered. Not all sections are complete or guaranteed to be accurate. Feedback on the Forums is welcome and encouraged.
Review the following documents for additional performance tips:
- Read Performance Optimization and Tuning Techniques for IBM Processors, including IBM POWER8
- Go at a measured pace, change one thing at a time, document what you do and the results.
- Decide what is most important (mean or median latency, maximum latency, standard deviation, and so on).
- Effective measurement: make sure the measurements are measuring the right thing(s), the whole thing(s), nothing but the thing(s), and that the measurements are not impacting, well, the measurements (see “Heisenberg”).
- Know exactly what you are comparing against.
- Carefully document a consistent baseline.
- Use “finer grained” approaches for best results, for example:
- It could be that putting some cores in SMT=1 mode (for single-threaded performance) and leaving some in SMT=8 mode (for high bandwidth) is better than SMT=1 uniformly.
- Isolate paths, if possible. (For example, assuming such-and-such processing is infinitely fast, where is the bottleneck?)
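To make the choice of metric concrete, the following is a minimal sketch that summarizes a set of latency samples (one number per line on standard input) as mean, median, maximum, and standard deviation. The function name `latency_stats` and the file name in the usage line are illustrative assumptions, not part of any tool referenced here. The standard deviation is the population form.

```shell
# Summarize numeric samples read from standard input, one per line.
# Prints: n, mean, median, max, and population standard deviation.
latency_stats() {
    sort -n | awk '
        { v[NR] = $1; sum += $1; sumsq += $1 * $1 }
        END {
            if (NR == 0) { print "no samples"; exit 1 }
            mean = sum / NR
            # Samples arrive sorted, so the median is the middle element
            # (or the average of the two middle elements).
            median = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
            var = sumsq / NR - mean * mean
            if (var < 0) var = 0    # guard against rounding error
            printf "n=%d mean=%.2f median=%.2f max=%.2f stddev=%.2f\n", NR, mean, median, v[NR], sqrt(var)
        }'
}
```

Running `latency_stats < baseline_latencies.txt` once for the baseline and once after each change gives a like-for-like comparison of whichever metric was chosen as most important.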
Start with fast
Use the fastest frequency processor available.
Spread the load
Exploit the multi-threading capabilities. Power® processors and systems are designed for bandwidth and throughput, more so than single-threaded performance. Workloads which can (or can be made to) exploit the multi-threading capabilities of the Power platform will see the best performance.
Use sufficient resources
- Use enough memory to avoid swapping. If swapping is unavoidable, consider using memory compression techniques like zswap or frontswap, which can help considerably at the cost of some available memory.
- Use enough processors and hardware threads to avoid contention.
- Use enough network bandwidth to avoid bottlenecks.
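One quick way to check whether a system has been pushed into swap is to compare `SwapTotal` and `SwapFree` in `/proc/meminfo`. A minimal sketch follows; the function name `swap_used_kb` and its optional file argument (handy for inspecting a saved copy) are illustrative assumptions.

```shell
# Print the amount of swap currently in use, in kB.
# $1 (optional): meminfo-format file; defaults to /proc/meminfo.
swap_used_kb() {
    awk '
        /^SwapTotal:/ { total = $2 }
        /^SwapFree:/  { free = $2 }
        END { print total - free }
    ' "${1:-/proc/meminfo}"
}
```

A nonzero value only shows that pages were swapped out at some point. To see whether swapping is happening during the workload, watch the `si` and `so` columns of `vmstat` while it runs.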
Optimize memory configuration
Refer to the “Redpaper” for the specific system, available from IBM Redbooks. These documents should have a “memory bandwidth” section describing the expected peak memory bandwidth for each memory configuration. Some systems reach peak bandwidth when all DIMM slots are populated; others trade some bandwidth for higher memory capacity when fully populated.
For the less risk-averse, overclocking may be an option. The IBM Power processors have built-in operating margins which are large enough for both IBM and its customers to have very high confidence that a system will run for a long, long time without failure, years or decades. It is possible to squeeze these margins significantly (if you are willing to accept the associated risk, which may include abandoning the warranty). Power Systems actually have tremendous flexibility in increasing frequencies for different portions of the chips and system for significant performance advantage (at increased risk of failure due to chip timings, environmental conditions (heat) or power draw). Contact your IBM representative for more information.
Update to the latest firmware
It is generally recommended that you update to the latest firmware available for the specific system. Read the release notes for any caveats about applying the firmware (including required HMC firmware levels, whether concurrent updates are possible, and so on). The latest firmware can be found at IBM Support: Fix Central.
Disable Idle Power Saver
Availability: IBM PowerVM®
- Log into ASM
- Expand System Configuration
- From Power Management, select Idle Power Saver
- Under New value, set Idle Power Saver Enable to Disabled
- Select Save settings
EnergyScale: Favor Performance
- Log into ASM
- Expand System Configuration
- Expand Power Management
- Select Power Mode Setup
- Select Enable Dynamic Power Saver (favor performance) mode
- Select Continue
- Select Save settings
Dynamic Platform Optimization (DPO)
Availability: PowerVM partitioned systems
Question: What OS levels are DPO aware?
Answer: Red Hat Enterprise Linux (RHEL) 7 and SUSE Linux Enterprise Server (SLES) 12.
- For IBM Power Systems logical partitions (LPARs), the hypervisor (PowerVM) attempts to optimize partition placement to mitigate non-uniform memory access (NUMA) effects. Such effects occur, for example, when the LPAR is running on one chip but its memory is more closely associated with a different chip, so that the time to access that memory is much higher than it would be if the memory were “local” to the running chip.
- The purpose of Dynamic Platform Optimization (DPO) is to place “co-processing” elements in close proximity, avoiding NUMA effects by carefully placing threads, processes, and virtual servers on a single chip, DCM, book, or NUMA node.
- It can also help to size partitions (processors and memory) carefully. An oversized partition can negatively impact the placement of other partitions, and can hurt its own performance by spanning more NUMA nodes across which both the hypervisor and the operating system scheduler must decide placement.
- Verify NUMA characteristics of the partition. Use “numactl --hardware” to view processors, memory, and their layout in NUMA nodes.
Note that IVM does not support DPO, and placement must be done with more advanced techniques.
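To check the placement a partition actually received when `numactl` is not installed, the processor-to-node layout can also be read from sysfs. The following is a minimal sketch, assuming the standard Linux `/sys/devices/system/node/node<N>/cpulist` files are present; the function name `numa_cpu_layout` and its optional directory argument are illustrative.

```shell
# Print the CPU list of every NUMA node, one node per line.
# $1 (optional): node sysfs directory; defaults to /sys/devices/system/node.
numa_cpu_layout() {
    numa_base="${1:-/sys/devices/system/node}"
    for numa_node in "$numa_base"/node[0-9]*; do
        # Skip anything without a readable cpulist (including an unmatched glob).
        [ -r "$numa_node/cpulist" ] || continue
        printf '%s: cpus %s\n' "${numa_node##*/}" "$(cat "$numa_node/cpulist")"
    done
}
```

A partition whose processors are spread over more nodes than its size requires is a candidate for resizing or for a DPO run.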
Disable processor frequency scaling
Availability: IBM PowerKVM®, non-virtualized
Set the maximum frequency available for scaling to the maximum possible frequency:
# frequencies=$(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies)
# max_freq=$(for i in $frequencies; do echo $i; done | sort -nr | head -1)
# for cpu_max_freq in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq; do echo "$max_freq" > "$cpu_max_freq"; done
Set the frequency governor to “performance”:
# for governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > "$governor"; done
Also, on PowerKVM host:
# cpupower frequency-info
# cpupower frequency-set --governor performance
On Ubuntu, non-virtualized:
# cpufreq-info
# cpufreq-set --governor performance
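After changing the governor, it is worth confirming that the setting took effect on every CPU. The following is a minimal sketch that lists any CPU whose governor is not “performance”, reading the same `scaling_governor` files used above. The function name `non_performance_cpus` and its optional directory argument are illustrative assumptions.

```shell
# List CPUs whose cpufreq governor is not "performance".
# $1 (optional): CPU sysfs directory; defaults to /sys/devices/system/cpu.
non_performance_cpus() {
    cpu_base="${1:-/sys/devices/system/cpu}"
    for gov_file in "$cpu_base"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$gov_file" ] || continue
        gov=$(cat "$gov_file")
        if [ "$gov" != "performance" ]; then
            # Strip the trailing path to recover the cpuN directory name.
            cpu_dir=${gov_file%/cpufreq/scaling_governor}
            printf '%s %s\n' "${cpu_dir##*/}" "$gov"
        fi
    done
}
```

Empty output means every CPU that exposes a cpufreq interface is using the performance governor.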
Consider dedicated resources
- Dedicated processors
- Dedicated memory
- Dedicated network
- Investigate single root input/output virtualization (SRIOV)
- With KVM, use PCI passthrough
Use optimal SMT mode
There is a simple command to change the SMT mode for all cores at once.
# ppc64_cpu --smt=<n>
where n is the number of threads per core (POWER7: 1, 2, or 4; POWER8: 1, 2, 4, or 8).
- SMT=1: 1 hardware thread per core; provides the most processor resources per hardware thread; thus, best performance per thread at the cost of fewer threads
- SMT=2: 2 hardware threads per core; better throughput than SMT=1 at the cost of some individual thread performance
- SMT=4: 4 hardware threads per core; better throughput than SMT=2 at the cost of some individual thread performance
- SMT=8: 8 hardware threads per core; better throughput than SMT=4 at the cost of some individual thread performance (available on POWER8 only)
Note that the higher throughput from more threads per core is offset by higher contention for shared processor resources, and the net effect is highly dependent on workload characteristics. (It may be, for example, that SMT8 does not perform significantly better than SMT4.)
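Because the best mode depends on the workload, the most reliable approach is to measure each one. The following is a minimal sketch that switches the SMT mode with `ppc64_cpu --smt=<n>` and runs a benchmark command after each switch. The function name `smt_sweep`, the `SMT_SET_CMD` override (intended only for dry runs), and the benchmark script in the usage line are all hypothetical names introduced for illustration.

```shell
# Run a benchmark command once per SMT mode and label each result.
# $1: space-separated list of SMT modes, for example "1 2 4 8"
# remaining arguments: the benchmark command to run
# SMT_SET_CMD (hypothetical, for dry runs): command used instead of
# ppc64_cpu to switch modes.
smt_sweep() {
    smt_modes=$1
    shift
    for smt_mode in $smt_modes; do
        # Stop the sweep if the mode cannot be set (for example, not root).
        ${SMT_SET_CMD:-ppc64_cpu} --smt="$smt_mode" || return 1
        printf 'SMT=%s: ' "$smt_mode"
        "$@"
    done
}
```

For example, `smt_sweep "1 2 4 8" ./run_benchmark.sh` on a POWER8 system, run as root, prints one labeled result per mode. The system is left in the last mode tested.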
There is also a way to change the SMT mode for each individual core. Note that all hardware threads always “exist”, but are either on-line or off-line, depending on the SMT mode. The “primary” hardware threads are numbered “n” where “n” is a multiple of the maximum number of hardware threads per core (0, 4, 8, … on POWER7; 0, 8, 16, … on POWER8). SMT=2 will enable another thread on each core. SMT=4 will enable 4 threads on each core, and SMT=8 will enable 8 threads on each core (available with POWER8 only).
For example, on a POWER7 system, to put all cores in SMT=4 mode, and then put the first core in SMT=1 mode:
# ppc64_cpu --smt=4
# echo 0 > /sys/devices/system/cpu/cpu3/online
# echo 0 > /sys/devices/system/cpu/cpu2/online
# echo 0 > /sys/devices/system/cpu/cpu1/online
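The CPU numbers to take offline for any core follow from the numbering scheme described above. The following is a minimal sketch of that arithmetic, assuming that the threads kept online are always the lowest-numbered ones on the core (as in the POWER7 example, where cpu1 through cpu3 go offline and cpu0 remains). The function name `smt_offline_cpus` is illustrative.

```shell
# Print the CPU numbers to take offline so that one core runs at a given SMT mode.
# $1: core index (0-based)
# $2: maximum hardware threads per core (4 on POWER7, 8 on POWER8)
# $3: desired SMT mode for that core
smt_offline_cpus() {
    core=$1
    threads_per_core=$2
    smt=$3
    first=$((core * threads_per_core))   # the core's "primary" thread
    t=$smt
    out=""
    while [ "$t" -lt "$threads_per_core" ]; do
        out="$out${out:+ }$((first + t))"
        t=$((t + 1))
    done
    echo "$out"
}
```

The result can be fed to the same sysfs interface, for example: `for c in $(smt_offline_cpus 0 4 1); do echo 0 > /sys/devices/system/cpu/cpu$c/online; done`.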
Disable unnecessary services
The set of running services tends to change over time and between distributions.
- Carefully analyze the set of running services and evaluate which are required.
- Shut down unneeded services, either temporarily (a reboot will restart the service) or permanently (the service will not automatically start on reboot).
To show all currently running services:
# systemctl list-units --type=service --state=running
To show which services are started at boot:
# systemctl list-unit-files --type=service --state=enabled
To stop a service now (if it is enabled, it will start again at the next boot):
# systemctl stop <service>
To prevent a service from starting at boot (this does not stop an instance that is already running):
# systemctl disable <service>
If not in use, IPv6 can be disabled to eliminate some processing in the network stack.
To disable it immediately, at run time:
# sysctl -w net.ipv6.conf.all.disable_ipv6=1
To disable it at boot time, add the following to the Linux kernel command line, either manually at boot time (which is not persistent) or in the bootloader configuration file (yaboot uses /etc/yaboot.conf; GRUB uses /boot/grub2/grub.cfg):
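The kernel's IPv6 module accepts `ipv6.disable=1` on the command line. Assuming that is the parameter intended here, the following minimal sketch checks whether a given parameter is present on the running kernel's command line, which is a quick way to confirm that a bootloader change took effect. The function name `cmdline_has_param` and its optional file argument are illustrative.

```shell
# Succeed (exit status 0) if a parameter appears on the kernel command line.
# $1: exact parameter to look for, for example ipv6.disable=1 (assumed here)
# $2 (optional): file holding the command line; defaults to /proc/cmdline.
cmdline_has_param() {
    # Put each parameter on its own line, then look for an exact, whole-line match.
    tr -s ' ' '\n' < "${2:-/proc/cmdline}" | grep -Fxq -- "$1"
}
```

After a reboot, `cmdline_has_param ipv6.disable=1 && echo "IPv6 disabled at boot"` reports whether the parameter reached the kernel.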
Disable “access time” accounting
By default, filesystem drivers will update a file’s “last accessed time” every time that file is accessed. This is generally used during archiving to determine which files were not recently accessed, and thus are candidates for archival. There is a cost associated with maintaining this information, with a write operation (to file cache, at least) for every read operation.
In most cases, it is safe to disable this feature. Add “noatime” to the mount options for the filesystem(s), as in this example from /etc/fstab:
UUID=[...uuid...] / ext4 defaults,noatime 1 1
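To confirm that a filesystem is actually mounted with the option, the active mount options can be read from `/proc/mounts`. The following is a minimal sketch; the function name `mount_has_option` and its optional file argument are illustrative assumptions.

```shell
# Succeed (exit status 0) if a mount point is mounted with a given option.
# $1: mount point, for example /
# $2: option to look for, for example noatime
# $3 (optional): mounts-format file; defaults to /proc/mounts.
mount_has_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            # Field 4 holds the comma-separated mount options.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "${3:-/proc/mounts}"
}
```

For example, `mount_has_option / noatime || echo "noatime not active on /"`. After editing /etc/fstab, `mount -o remount,noatime /` applies the option without a reboot.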
Linux distribution kernels are necessarily built to run on the entire range of systems supported by the release. The compiler options are thus restricted to the instruction set (and possibly the scheduling semantics) of the oldest supported systems. It may be advantageous to rebuild the kernel with optimal compiler flags (-mcpu=native -O3), and possibly a later version of the compiler (Advance Toolchain). Note that running a custom-built kernel usually is not a configuration that conforms to the requirements of a support contract.