Introduction

This document is intended to document the “state of the art” for low latency tuning, specifically targeted at Linux running on IBM Power Systems. This document is a collection of wisdom from experience, but is not meant to be exhaustive. It is a living document, with content added as new information is discovered. Not all sections are complete, nor even guaranteed to be accurate. Feedback on the Forums is welcome and encouraged.

The document is divided into the following sections:

  • Hardware
  • System tuning for latency
  • Operating system tuning for latency
  • Application tuning (black box)
  • Application tuning (white box: source code changes)
  • Analysis

Separate articles cover more general performance tuning topics. You should review those articles first, as this article merely builds on top of a well-tuned system and application.

Hardware

Network: Ethernet

Several Solarflare Ethernet cards are supported on Power and designed for low latency. Use the OpenOnload™ drivers to interact with the hardware directly from user space.

Network: InfiniBand

InfiniBand is designed for low latency, with applications interacting with the adapter directly from user space. IBM supports several InfiniBand options; the cards are supported on Power and designed for low latency.

System tuning for latency

First, tune for performance. See Linux on Power – System tuning.

Disable Floating Point Unit Computation Test

Availability: IBM PowerVM, all firmware levels

The “Floating Point Unit Computation Test” (a.k.a. CompuGuard) is a hardware test routine that runs periodically on the system when enabled. Its purpose is to periodically verify that the floating point operations on the processor are performing correctly. This test periodically introduces latency on the running system, so it is recommended that it be disabled for real-time or low latency environments. It can be disabled via the ASM menus on the system as described below:

  1. Log in to system’s ASM interface
  2. Expand System Configuration
  3. Select Floating Point Unit Computation Test
  4. Change Policy setting to Disabled
  5. Click Save settings

Enable Static Turbo mode

Availability: IBM PowerVM

Even in EnergyScale Dynamic Power Saver “Favor Performance” mode, the processors will still go to lower frequency (lower power) states when idle.  This transition can be disabled by setting the Dynamic Power Saver “Tuning Parameters” appropriately:

  1. Log into the system’s ASM interface
  2. Expand System Configuration
  3. Expand Power Management
  4. Select Tuning Parameters
  5. Leaving all other parameters at default values:
    1. For Algorithm Selector, set New value to 1
    2. For Step size for going up in frequency, set New value to 100
    3. For Utilization threshold for decreasing frequency, set New value to 0
    4. For Delta percentage for determining active cores, set New value to 0
    5. For Utilization threshold to determine active cores with slack, set New value to 0
  6. Click Save settings

Disable Memory Sleep States

Availability: IBM PowerVM AL770_063 (FW770.31, 2014-01-14); not currently necessary on POWER8.

Note: Memory Power Management functionality is not yet available on POWER8. Memory on POWER8 systems always runs in full-power mode.

Memory will enter a low-power sleep state if it has not been accessed for a certain period of time. The duration/enablement of this action is controllable by firmware.

  1. Log into system’s ASM interface
  2. Expand System Configuration
  3. Expand Power Management
  4. Select Memory Low Power State Control
  5. Set Requested mode to Disabled
  6. Click Save settings
  7. Expand Power/Restart Control
  8. Select Power On/Off System
  9. Click Save settings and power off
  10. Wait for the system to reach “power off” state.  The Real-time Progress Indicator can be of use here.
  11. Select Power On/Off System
  12. Click Save settings and power on

Run non-virtualized (bare-metal)

Availability: Power systems that support bare-metal or KVM

All IBM Power Systems have historically run with a hypervisor present, even if the only job of the hypervisor was to stay out of the way of a partition which owned all of the system resources.  Recent systems have added support for IBM PowerKVM, which is essentially a Linux kernel providing the KVM host instance.  The firmware was enhanced to support a mode without the IBM PowerVM built-in hypervisor.  That firmware mode is called OPAL.  OPAL supports running a Linux instance natively, and has a sophisticated boot loader (“petitboot”), which enables easy installation.  It is therefore possible to install Linux natively on the most recent IBM Power Systems (those that support PowerKVM).

Operating system tuning for latency

Disable unnecessary services

The set of running services tends to change over time and between distributions. Carefully analyze the list of services enabled on the system and permanently disable all that are not required.

To get the list of running services:

# systemctl list-units --type=service

To stop a service immediately:

# systemctl stop <service>

To permanently disable a service (upon next boot):

# systemctl disable <service>

Disable snooze

Availability: IBM PowerVM, SLES11SP2, RHEL6.5, RHEL7 (kernel 3.7), powerpc-utils 1.2.20

“snooze_delay” is the (millisecond) delay before an idle processor yields itself back to the hypervisor.  A latency-sensitive virtual server may want to make this as high as possible, to avoid losing the processor any earlier than necessary (like the end of any timeslice for shared processors).  Other non-latency-sensitive virtual servers sharing the processors may want to set this value low or to zero, to yield the processor back as soon as possible.

# ppc64_cpu --snooze_delay=<n>

However, the strongest recommendation is to use dedicated processors.  This postpones giving control to the hypervisor.

Note: support for setting snooze_delay=-1 (infinite: do not snooze) went into kernel 3.7, and as of this writing (Jan 2013) is not in RHEL 6.3, whereas SLES11 SP2 does appear to have support. powerpc-utils 1.2.20 or above is required.

Disable CPU frequency scaling

Availability: x86, PowerKVM, and non-virtualized

Set the maximum frequency available for scaling to the maximum possible frequency:

# max_freq=$(tr ' ' '\n' < /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies | sort -nr | head -1)
# for cpu_max_freq in /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq; do echo $max_freq > $cpu_max_freq; done

Set the frequency governor to performance:

# for governor in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do echo performance > $governor; done

Disable WAKEUP_PREEMPT

Availability: All platforms

Reference: SPECjEnterprise2010 – A performance case study

The Linux Completely Fair Scheduler (CFS) first appeared in the 2.6.23 release of the Linux kernel in October 2007. The algorithms used in the CFS provide efficient scheduling for a wide variety of systems and workloads. However, for some workloads there is one behavior of the CFS that can cost a few percent of CPU utilization.

In the CFS, a thread that submits I/O, blocks and then is notified of the I/O completion preempts the currently running thread and is run instead. This behavior is great for applications such as video streaming that need to have low latency for handling the I/O, but it can actually hurt performance in some cases. For example, when a thread submits I/O, such as sending a response out on the network, the I/O thread is in no hurry to handle the I/O completion. Upon I/O completion, the thread is simply finished with its work. Moreover, when an I/O completion thread preempts the current running thread, it prevents the current thread from making progress. And when it preempts the current thread it can ruin some of the cache warmth that the thread has created. Since there is no immediate need to handle the I/O completion, the current thread should be allowed to run. The I/O completion thread should be scheduled to run just like any other process.

The CFS has a list of scheduling features that can be enabled or disabled. The setting of these features is available through the debugfs file system. One of the features is WAKEUP_PREEMPT. It tells the scheduler that an I/O thread that was woken up should preempt the currently running thread, which is the default behavior as described above. To disable this feature, you set NO_WAKEUP_PREEMPT (not to be confused with NO_WAKEUP_PREEMPTION) in the scheduler’s features.

# mount -t debugfs debugfs /sys/kernel/debug
# echo NO_WAKEUP_PREEMPT > /sys/kernel/debug/sched_features
# umount /sys/kernel/debug

Mitigate scheduling preemption

Availability: All platforms

Reference: SPECjEnterprise2010 – A performance case study

sched_min_granularity_ns is the number of nanoseconds a process is guaranteed to run before it can be preempted. sched_latency_ns is the period over which CFS tries to fairly schedule all the tasks on the runqueue. All of the tasks on the runqueue are guaranteed to be scheduled once within this period. So, the greatest amount of time a task can be given to run is inversely correlated with the number of tasks; fewer tasks means they each get to run longer.

The parameter sched_wakeup_granularity_ns is similar to the sched_min_granularity_ns parameter. The documentation is a little fuzzy on how this parameter actually works. It controls the ability of tasks being woken to preempt the current task. The smaller the value, the easier it is for the task to force the preemption.
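
For example (the values below are illustrative only, not recommendations), these parameters can be adjusted with sysctl:

# sysctl -w kernel.sched_min_granularity_ns=10000000
# sysctl -w kernel.sched_wakeup_granularity_ns=15000000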

Disable multipath

Availability: All platforms

Multipath is a means of using multiple paths to the same storage device as a single device, providing redundancy, higher bandwidth, higher availability, or a combination of these. Two or more logical storage devices (one per path) are bound together into a single virtual storage device.

multipathd is a service daemon that monitors the multipath devices for failure, and by default, it polls the devices every 5 seconds.

Ideally you should avoid this polling altogether to avoid the disruption. However, if multipathd must run, it is possible to increase the “polling_interval” in the /etc/multipath.conf file.
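
For example, a sketch of the relevant /etc/multipath.conf fragment (merge with your existing configuration; the value is illustrative):

defaults {
    # poll the paths every 300 seconds instead of the default 5
    polling_interval 300
}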

In some configurations, it is difficult or nearly impossible to install without multipath being enabled automatically by the installation process.  However, it is not impossible to remove multipath from an installed system:

  1. Remove the device from /etc/multipath.conf:
    • Remove the device from the blacklist_exceptions section if already there, or
    • Add the device to the blacklist section
  2. Disable multipathd at reboot:
    • On RHEL, # systemctl disable multipathd
  3. Adjust /etc/fstab, if necessary:
    • LVM will rescan at boot and needs no changes in /etc/fstab
    • For physical partitions, entries in /etc/fstab should be changed from paths like /dev/mapper/mpathx to /dev/sdy
  4. Disable multipath during boot:
    1. On RHEL, dracut --verbose --force --omit multipath
    2. Reboot

Mitigate pdflush

Availability: All platforms

pdflush wakes up periodically to flush dirty superblocks to storage.  According to “R.I.P. pdflush”, ext4 filesystems (among others) do not even take advantage of the mechanism that pdflush uses, so on systems using ext4 exclusively, pdflush will always wake up and find nothing to do.

It is impossible to turn off pdflush, as it is an integral kernel thread, but you can tell it to wake up a lot less often, as in this example, every 360000 centiseconds, or one hour:

# echo 360000 > /proc/sys/vm/dirty_writeback_centisecs

Disable RTAS event scanning

Availability: Power, non-virtualized only

The Power Architecture Platform Reference (PAPR), a standard to which IBM Power Systems conform, dictates that the platform’s Run-Time Abstraction Services (RTAS) must be periodically scanned for new event reports.  In the Linux kernel, this is implemented as a daemon, of sorts (in reality, it’s a self-rescheduling workqueue item).  Unfortunately, it is also defined in PAPR to do subsequent scans from different cores.  As of this writing, the current implementation in the Linux kernel will schedule itself on the next online core, regardless of any other restrictions like cgroups or isolcpus settings, so all online cores will eventually be hit.

At present, there is no trivial method (other than changes to the Linux kernel source code) to disable this scan for PowerVM or PowerKVM guests.

You could disable it by recompiling the kernel and disabling the appropriate code in rtasd.c.  At the time of this writing, it is not clear if there are negative side-effects to disabling the RTAS event scans, so this is not recommended.

Note that when running non-virtualized (the operating system is running directly on OPAL firmware), RTAS event scanning is not performed.

Mitigate decrementer overflow interrupt

Availability: Power, but still under development

The decrementer register on IBM Power Systems servers is a 32-bit quantity, and is decremented at the frequency of the time-base, which on IBM Power Systems servers is 512 MHz.  The maximum value of the register is 2^31 - 1, or 2,147,483,647.  At 512 MHz, this would decrement to zero in about 4.2 seconds.  So, a completely idle processor will still see a decrementer interrupt every 4.2 seconds.  There is currently no way to eliminate this interrupt.

The following patch will mitigate the interrupt by avoiding some optional housekeeping done when the interrupt occurs when the only action necessary is to reset it: https://lists.ozlabs.org/pipermail/linuxppc-dev/2014-October/121795.html

On a POWER8 system, the interrupt duration was reduced from about 30 microseconds to about 200 nanoseconds.

Remove processors from scheduling eligibility permanently

Availability: All platforms

Keep the Linux kernel scheduler from even considering a set of CPUs for tasks.

Use the kernel command line parameter isolcpus=<cpu list> to isolate a set of CPUs from consideration for scheduling tasks. For example, add isolcpus=1-63 to reserve all but CPU 0 (on a system with 64 CPUs) for specific application use. Use cgroups/cset, taskset, or numactl commands to force tasks explicitly on those CPUs.
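
For example (the application name is a placeholder), a latency-critical task can then be placed explicitly on the isolated CPUs:

# taskset -c 1-63 ./latency_app
# numactl --physcpubind=1-63 --localalloc ./latency_app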

Remove processors from scheduling eligibility dynamically

Availability: All platforms

Also known as CPU shielding. See this page for additional information: cpuset

# cset shield --cpu=4-63 --kthread=on

All existing tasks, including kernel threads, that can be migrated off the shielded CPUs (4-63 in the example above) will be moved to the unshielded CPUs (everything else).

cset is a front-end to the cgroup infrastructure.  It is also possible to manipulate cgroups manually.

To set up a cgroup, “mycgroup”, with a subset of the available CPUs and all memory:

# mount -t tmpfs tmpfs /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpuset
# mount -t cgroup -ocpuset cpuset /sys/fs/cgroup/cpuset
# cd /sys/fs/cgroup/cpuset
# mkdir mycgroup
# cd mycgroup
# echo 8-95 > cpuset.cpus
# cat /sys/fs/cgroup/cpuset/cpuset.mems > cpuset.mems
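
To place the current shell (and any latency-critical processes subsequently started from it) into the new cgroup, write its PID to the tasks file:

# echo $$ > /sys/fs/cgroup/cpuset/mycgroup/tasks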

Disable scheduler domain rebalancing

Availability: All platforms

Set the SD_LOAD_BALANCE bit to zero for all of the critical CPUs' scheduling domains, for example:

for cpu in $(seq 1 63); do
    for domain in /proc/sys/kernel/sched_domain/cpu$cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the low-order bit of flags (SD_LOAD_BALANCE)
        echo $((flags & ~1)) > $domain
    done
done

Disable watchdogs

Availability: All platforms

Watchdog kernel threads wake up regularly. To disable the watchdog kernel threads:

# sysctl kernel.nmi_watchdog=0

Kernel parameters:

nmi_watchdog=0 nowatchdog nosoftlockup

Use static network configuration

The DHCP client must periodically renew the lease for its IP address, at an interval specified by the DHCP server.  It is preferable to avoid this by using static network configuration.
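
As one example (RHEL-style ifcfg syntax; the interface name and addresses are placeholders), a static configuration in /etc/sysconfig/network-scripts/ifcfg-eth0 might look like:

DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
PREFIX=24
GATEWAY=192.0.2.1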

Disable inactive network interfaces

It has been observed that physically disconnected interfaces that were nevertheless configured as “UP” were generating system interrupts. Make sure that no unnecessary network interfaces are configured, even if they are physically disconnected.
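
For example (eth1 is a placeholder interface name), an unused interface can be brought down immediately:

# ip link set dev eth1 down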

Use static and/or one-shot IRQ balancing

To stop immediately (will restart on reboot):

# systemctl stop irqbalance

(the service name may vary by distribution: irqbalance, irqbalancer, or…)

To prevent the service from starting at the next boot (does not stop a running instance):

# systemctl disable irqbalance

To balance IRQs once only and exit:

# IRQBALANCE_ONESHOT=1 IRQBALANCE_BANNED_CPUS=ffffffff,fffffff0 irqbalance

Fragment high-cost TCE-related hypervisor calls

Availability: PowerVM

Add kernel command line parameters:

  • bulk_remove=off
  • multitce=off
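
On RHEL 7, for example, these can be added by appending them to the GRUB_CMDLINE_LINUX line in /etc/default/grub and regenerating the configuration (the grub.cfg path may vary by platform):

# grub2-mkconfig -o /boot/grub2/grub.cfg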

Go tickless (or not)

Eliminating scheduler clock interrupts would be ideal, but is currently difficult in practice.  Work associated with the tick can “back up”, making the less-frequent interrupts take longer.  If unable to eliminate ticks altogether, it may be better to keep them (nohz=off) and/or increase their frequency (recompile kernel with CONFIG_HZ_250 or 300 or 1000).

Availability: RHEL7

To eliminate most ticks, get a kernel with CONFIG_NO_HZ_FULL.  This enables the kernel to significantly reduce timer interrupts on CPUs where there is a single runnable task and thus no need for scheduling.  CPUs thus enabled are called “adaptive ticks” CPUs.  The capability is enabled in RHEL7 kernels.  However, by default, no CPUs are defined as adaptive ticks CPUs.  To enable a set of CPUs to be adaptive ticks CPUs, add nohz_full=<cpulist> to the kernel command line.

To eliminate more ticks, recompile the kernel with CONFIG_NO_HZ_FULL_ALL.

Reduce scheduling migrations

Availability: All platforms

Disable SD_WAKE_AFFINE, SD_WAKE_BALANCE or both.
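
A hedged sketch, following the same approach as the SD_LOAD_BALANCE example above; the SD_WAKE_AFFINE bit value (0x20 here) depends on the kernel version and must be verified against the kernel's scheduling domain flag definitions:

for cpu in /proc/sys/kernel/sched_domain/cpu*; do
    for domain in $cpu/domain*/flags; do
        flags=$(cat $domain)
        # clear the assumed SD_WAKE_AFFINE bit (0x20); verify for your kernel
        echo $((flags & ~0x20)) > $domain
    done
done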

vmstat offloading

Availability: Submitted for upstream

Utilization statistics, commonly displayed with the vmstat command, are gathered at the beginning and end of interrupt processing, lengthening the interrupt processing time.  Some of this processing can be offloaded to a sacrificial thread, reducing interrupt latencies.  This is currently being added to upstream kernels, and not in any enterprise distributions.  (a.k.a. “Lameter patches”)

RCU offloading

Availability: Targeted upstream

RCU performs some housekeeping during interrupt processing.  There are some recent patches being pushed into upstream kernels to move this processing to a sacrificial core, thus reducing interrupt latencies accordingly.

Kernel parameter: rcu_nocbs=<cpu list>

Or, kernel config CONFIG_RCU_NOCB_CPU_ALL=y

Application tuning (black box)

Real-time priorities

The Linux kernel has two tiers of task priorities: real-time and “other.”  The vast majority of tasks, both user and kernel, run at “other” priorities.  Critically important tasks are given real-time priorities.  The lowest real-time priority is above the highest “other” priority, so all tasks with real-time priorities take precedence in scheduling over tasks with “other” priorities.  It is even possible to assign user tasks real-time priorities that are higher than those of important kernel tasks, although this should be done with extreme caution.

Use the chrt command to launch a process with real-time priorities:

# chrt -f 50 <command>

Note that you must have the ability to do so, either by being root, or by having privileges defined in /etc/security/limits.conf.  To allow the @realtime group to use real-time priorities:

@realtime hard rtprio 90

If a soft limit is set for the maximum real-time priority that is less than the hard limit and needs to be raised, use the ulimit -r command to do so:

$ ulimit -r 90

Make sure to disable real-time bandwidth throttling (by default, 5% of each period is reserved for non-real-time tasks), or even real-time tasks will be asked to step aside for a bit.  Setting sched_rt_runtime_us to -1 removes the limit:

# echo -1 > /proc/sys/kernel/sched_rt_runtime_us

Application tuning (white box: source code changes)

Avoid page faults

Permit user(s) and group(s) to lock memory by modifying the /etc/security/limits.conf file. The following example uses the @realtime group:

@realtime    -    memlock     $MEMLOCK

Add a call to mlockall() to lock current and future memory for the process:

mlockall(MCL_CURRENT|MCL_FUTURE)
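
A minimal C sketch (error handling shown for illustration; the process must have a sufficient memlock limit):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    /* Lock all current and future pages of this process into RAM. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        return 1;
    }
    /* ... latency-critical work ... */
    return 0;
}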

Somewhat related: see the huge pages section in the Linux on Power application tuning document.

Coming soon: mpin(). Refer to “Locking and pinning” for more details.

One of the problems with memory locking is that it doesn’t quite meet the needs of all users. A page that has been locked into memory with a call like mlock() is required to always be physically present in the system’s RAM. At a superficial level, locked pages should thus never cause a page fault when accessed by an application. But there is nothing that requires a locked page to always be present in the same place; the kernel is free to move a locked page if the need arises. Migrating a page will cause a soft page fault (one that is resolved without any I/O) the next time an application tries to access that page. Most of the time, that is not a problem, but developers of hard real-time applications go far out of their way to avoid even the smallest amount of latency caused by a soft fault. These developers would like a firmer form of locking that is guaranteed to never cause page faults. The kernel does not currently provide that level of memory locking.

A new mpin() system call would function like mlock(), but with the additional guarantee that the page would never be moved and, thus, would never generate page faults on access.

Real-time priorities

You can programmatically change to real-time priorities using the sched_setscheduler API call.
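
A minimal C sketch (the priority value 50 is only an example; the caller must be root or have an appropriate rtprio limit):

#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param sp = { .sched_priority = 50 };

    /* Move the calling process into the SCHED_FIFO real-time class. */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        return 1;
    }
    /* ... latency-critical work ... */
    return 0;
}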

Busy wait

To avoid the latency inherent in “waking up”, use busy waiting, but avoid having all waiting threads thrashing on a single memory location, or “false sharing” on a small set of cache lines.
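
A minimal C11 sketch of the idea (the 128-byte alignment assumes the POWER cache line size): each waiter spins on its own cache-line-aligned flag, so waiters do not falsely share a line.

#include <stdatomic.h>

struct waiter {
    _Alignas(128) atomic_int ready;   /* one flag per waiter, one cache line each */
};

void busy_wait(struct waiter *w)
{
    while (!atomic_load_explicit(&w->ready, memory_order_acquire))
        ;   /* spin */
}

void signal_ready(struct waiter *w)
{
    atomic_store_explicit(&w->ready, 1, memory_order_release);
}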

Use epoll

The epoll interface is preferred over select(), because select() must scan the entire set of file descriptor bitmasks on every call, whereas epoll maintains a simple list of ready file descriptors.
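
A minimal C sketch (fd is assumed to be an already-open descriptor):

#include <stdio.h>
#include <sys/epoll.h>

/* Wait until fd becomes readable; returns the number of ready descriptors. */
int wait_readable(int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    int epfd = epoll_create1(0);

    if (epfd < 0 || epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0) {
        perror("epoll setup");
        return -1;
    }
    return epoll_wait(epfd, &ev, 1, -1);   /* -1: block indefinitely */
}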

Pre-allocate

It may go without saying, but avoid long-path API calls in the critical section.  One notable set of long-path API calls is memory management: malloc(), free(), and so on. Allocate all required memory as early as is reasonable and possible.  If a size is unknown, allocate a worst-case size, trading memory inefficiency for critical-section path length.  If the worst case is unbounded, pick a reasonable bound (“fast path”) and take a path-length hit only for exceptionally large needs (“slow path”), as in the sketch below.
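
A minimal C sketch of the fast path/slow path split (the names and the 4096-byte bound are illustrative):

#include <stdlib.h>

#define FAST_PATH_SIZE 4096            /* reasonable bound for the common case */

static char scratch[FAST_PATH_SIZE];   /* pre-allocated before the critical section */

void *get_buffer(size_t n)
{
    if (n <= FAST_PATH_SIZE)
        return scratch;                /* fast path: no allocation at all */
    return malloc(n);                  /* slow path: rare, accepted latency hit */
}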

Pre-compute

Avoid unnecessary computation in the critical path.  This may seem obvious, but can be as low a granularity as pointer dereferences.  Dereference pointers outside of the critical section if possible, and keep the results in registers.
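
A minimal C sketch (the names are hypothetical): the pointer is dereferenced once, outside the hot loop, so the value can stay in a register. Compilers can sometimes do this automatically, but possible pointer aliasing may prevent it.

struct config { long scale; };

void hot_loop(const struct config *cfg, long *out, int n)
{
    const long scale = cfg->scale;     /* dereference outside the critical loop */
    for (int i = 0; i < n; i++)
        out[i] *= scale;
}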

Pre-fetch

Many modern processors already attempt to predict memory access patterns, and will “pre-fetch” portions of memory which seem likely to be accessed in the future.  Random or isolated access patterns cannot be predicted.  It may be advantageous to explicitly pre-fetch memory locations that will be needed in the near future.  Bring as much data to be used within the critical section into the local caches as possible before entering the critical section.

  • Use: __builtin_prefetch
  • Assembly instruction: dcbt (Data Cache Block Touch)
  • Assembly instruction: dcbtst (Data Cache Block Touch for Store)
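
A minimal C sketch (struct item and process() are hypothetical): while the current item is processed, the next one is prefetched so that it is already in cache when needed.

struct item;
void process(struct item *it);

void process_all(struct item **items, int n)
{
    for (int i = 0; i < n; i++) {
        if (i + 1 < n)
            __builtin_prefetch(items[i + 1], 0, 3);  /* 0 = read, 3 = high locality */
        process(items[i]);
    }
}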

Procrastinate

The converse of doing things before the critical path is to postpone doing things until after the critical path.  This might include heavyweight things like logging, medium things like formatting time-stamps, or simple things like storing to memory.

Other application-level techniques to consider:

  • epoll
  • Low latency messaging APIs
  • Real-time priorities (SCHED_FIFO/SCHED_RR, priority)
  • Task binding (sched_setaffinity)

Analysis

ftrace

ftrace is enabled in RHEL7, but kernel function tracing is not.  A kernel can be built with CONFIG_FUNCTION_TRACER=y to enable kernel function tracing.

Event tracing

Use the perf command.

Writing to /sys/kernel/debug/tracing/trace_marker provides a possible user-mode trace event capability.
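
For example, a script or application can emit a marker into the trace buffer:

# echo "application checkpoint" > /sys/kernel/debug/tracing/trace_marker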

Profiling

Use the perf command for profiling, keeping in mind the limitations of sampling-based approaches for low latency analysis.

PAPI

http://icl.cs.utk.edu/papi/overview/index.html

Old-fashioned instrumentation

Use sphgettimer() to get time-stamps efficiently.
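
If sphgettimer() is not available, GCC on PowerPC also provides a builtin that reads the time base register directly (a sketch; requires GCC 4.8 or later):

#include <stdint.h>

static inline uint64_t timestamp(void)
{
    return __builtin_ppc_get_timebase();   /* raw time base ticks (512 MHz) */
}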
