Overview

An IBM® POWER9™ processor introduces several features compared to previous generations of IBM POWER® processors. One such feature is the nest performance monitoring unit (PMU), which enables precise measurement of socket-level resource utilization. This tutorial introduces the nest PMUs available in IBM POWER9 processors and their Linux perf integration. It also describes the hardware and software components of the nest PMU and walks through a sample IBM POWER9 nest unit, its events, and the useful metrics that can be derived from them.

IBM nest PMU

IBM POWER9 processor-based systems have many units, called nest units, that are not in the core but elsewhere on the chip (un-core). These units are closely connected to the core to achieve higher performance. Nest units include the symmetric multiprocessing (SMP) interconnect, memory controller synchronous (MCS), coherently attached processor proxy (CAPP), NVLink unit, and on-chip accelerators.

The nest units play an important role and work together to improve the overall socket performance. But at times, they could become a bottleneck.

For example, scheduling a memory-intensive application, such as an in-memory database, on a socket that already has high memory bandwidth usage could decrease the application throughput and increase the processing time of each transaction request.

Why nest PMU

Memory bandwidth usage can be derived from per-CPU PMU events. However, there is a cost associated with programming all the processors with bandwidth events and post-processing the counter data to derive the bandwidth. In addition, this approach may not provide details on all possible memory transactions.

Fig 1: IBM POWER9 processor

In a high-performance computing (HPC) application, it is important to understand the bandwidth usage of the SMP interconnect. Using per-CPU PMU events for these measurements could increase OS jitter because of the frequent event counter collection.

IBM POWER9 processors implement nest PMUs, which enable measurement of socket-level resource utilization. Each nest PMU has dedicated performance monitoring counters and hardware events. Unlike traditional processor PMU events, nest PMU events focus on data that goes off-core. IBM POWER9 processors also implement accumulation logic in hardware, which periodically updates event counter data from the nest units to memory.

In-memory collection

The end-to-end Linux software stack for nest instrumentation is called the in-memory collection (IMC) counter infrastructure. IBM POWER9 nest PMUs and events are exposed through the Linux perf APIs. Linux kernel 4.14 and later versions support nest PMUs.
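As a quick sanity check before trying the examples, you can confirm that the running kernel is recent enough and that perf already sees the nest events. A minimal sketch (run as the root user; the exact list of events varies with the system configuration):

# uname -r
# perf list nest

The first command should typically report a 4.14 or later kernel, and the second lists the nest events that the kernel has registered.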

Fig 2: Linux interface for in-memory collection

Because nest units are socket-level resources, a higher privilege level (such as the root user) is required to monitor them. Also, nest PMUs do not profile individual programs and cannot indicate what happens within a program.

Not all nest PMUs are enabled during system boot; system configuration plays a key role in enabling them. Most of the basic nest PMUs, such as the PowerBus, MCS, and interconnect buses such as Xlinks, are enabled. However, to enable some PMUs, such as NVLink, CAPP, and memory buffer chips, specific cards must be plugged into the system slots.

One of the key advantages of the nest PMU counters is that the counter data is accumulated in memory by the hardware accumulation logic. This reduces the latency of programming and of reading the counter data.
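To check which nest PMUs were actually registered on a particular system, one simple approach is to list the perf event sources in sysfs. A minimal sketch (the PMUs you see depend on the system configuration and on which cards are plugged in):

# ls /sys/bus/event_source/devices/ | grep nest_

Each directory that this command lists is a nest PMU; its events subdirectory (shown for nest_mcs01_imc later in this tutorial) contains the event and .scale files.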

Nest memory controller synchronous (MCS) PMU

Memory controller synchronous (MCS), a chip-level unit, manages the flow of data going to and from main memory. The MCS in the POWER9 processor supports attaching industry-standard dual inline memory modules (DIMMs) with or without a buffer chip. The POWER9 nest MCS PMU is registered as nest_mcs*_imc. The number of logical memory controllers varies based on the system configuration. A nest MCS PMU provides many interesting events, such as socket-level read and write counts, that are important for calculating total memory bandwidth.

For example, to get the list of supported nest PMU events of the MCS, run the following perf command:

# perf list nest_mcs
List of pre-defined events (to be used in -e):
nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT01/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT23/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT01/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT23/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_64B_RD_DISP_PORT01/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_64B_RD_DISP_PORT23/ [Kernel PMU event]
nest_mcs01_imc/PM_MCS01_AMO_OP_DISP_PORT01/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT01/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT23/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT01/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT23/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_64B_RD_DISP_PORT01/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_64B_RD_DISP_PORT23/ [Kernel PMU event]
nest_mcs23_imc/PM_MCS23_AMO_OP_DISP_PORT01/ [Kernel PMU event]
#

Observe the above output of the perf list command for the nest_mcs*_imc PMUs.

Some of the nest PMU events also have a scale factor that must be applied. The scale value for each event is exported through sysfs along with the event definition.

For example, run the following commands to get the scale value for the nest PMU events of the MCS:

# pwd
/sys/bus/event_source/devices/nest_mcs01_imc/events
# ls
PM_MCS01_128B_RD_DISP_PORT01 PM_MCS01_128B_WR_DISP_PORT23.scale
PM_MCS01_128B_RD_DISP_PORT01.scale PM_MCS01_64B_RD_DISP_PORT01
PM_MCS01_128B_RD_DISP_PORT23 PM_MCS01_64B_RD_DISP_PORT01.scale
PM_MCS01_128B_RD_DISP_PORT23.scale PM_MCS01_64B_RD_DISP_PORT23
PM_MCS01_128B_WR_DISP_PORT01 PM_MCS01_64B_RD_DISP_PORT23.scale
PM_MCS01_128B_WR_DISP_PORT01.scale PM_MCS01_AMO_OP_DISP_PORT01
PM_MCS01_128B_WR_DISP_PORT23 PM_MCS01_AMO_OP_DISP_PORT01.scale
# cat PM_MCS01_128B_RD_DISP_PORT01.scale
256

By adding the 128-byte read and write event counts from both ports of each logical MCS, you can obtain the total memory bandwidth data in bytes.

The following procedure describes how you can obtain the total memory bandwidth data:

  1. Start counting the following events by running this perf command at the shell prompt:

     `$` perf stat -a -e nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT01/ -e nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT23/ -e nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT01/ -e nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT23/ -e nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT01/ -e nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT23/ -e nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT01/ -e nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT23/ -I 1000 --per-socket
    

    Notes:

    • The -a option enables system-wide collection, which is required for per-socket aggregation.
    • The -e option specifies an event to be counted.
    • The -I option prints the counter data at a fixed interval, specified in milliseconds. In this case, the requested interval is 1000 ms, that is, every second.
    • The --per-socket option aggregates the counter data per socket. Output similar to the following is displayed (the complete output is not shown here):

      [……]
      3.010493181 S0        1                768      nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT01/
      3.010493181 S0        1                  0      nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT23/
      3.010493181 S0        1                256      nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT01/
      3.010493181 S0        1                  0      nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT23/
      7.769641499 S8        1                  0      nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT23/
      7.769641499 S8        1              5,376      nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT01/
      7.769641499 S8        1                  0      nest_mcs23_imc/PM_MCS23_128B_RD_DISP_PORT23/
      7.769641499 S8        1                256      nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT01/
      7.769641499 S8        1                  0      nest_mcs23_imc/PM_MCS23_128B_WR_DISP_PORT23/
      [……]
      
  2. Calculate the total memory bandwidth per socket for each one-second interval (in GBps) using the following formula, as illustrated in the sketch after this procedure:

    ((Sum of all RD counts) * 64) / (1024 * 1024 * 1024) + ((Sum of all WR counts) * 64) / (1024 * 1024 * 1024)
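
As a rough illustration, the following shell sketch applies the formula to the socket S0 values from the 3.010493181 sample interval shown above (the counter values are copied from that sample output, and the bc calculator is assumed to be installed; substitute your own per-socket counts):

# Read and write dispatch counts for socket S0 in the sample interval above
RD_PORT01=768; RD_PORT23=0
WR_PORT01=256; WR_PORT23=0

# Total memory bandwidth for the 1-second interval, in GB
echo "scale=9; (($RD_PORT01 + $RD_PORT23) * 64 + ($WR_PORT01 + $WR_PORT23) * 64) / (1024 * 1024 * 1024)" | bc

For this sample interval the result is (768 + 256) * 64 = 65,536 bytes, or roughly 0.00006 GB. Repeating the same arithmetic for every interval and socket produces a bandwidth-over-time series.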

The following sample graph illustrates the memory bandwidth metric derived from the nest MCS counters and plotted for every second. The workload used here is the STREAM benchmark.

Fig 3: Memory bandwidth metric data from nest MCS counters
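To reproduce a graph like this, one option is to let perf stat write its interval data in CSV form while the benchmark runs and then plot the file with any tool you prefer. A minimal sketch, assuming the STREAM binary is built locally as ./stream and showing only two of the eight events from step 1 for brevity (add the remaining events the same way; mcs_bw.csv is just an example file name):

# perf stat -a --per-socket -I 1000 -x, -o mcs_bw.csv \
      -e nest_mcs01_imc/PM_MCS01_128B_RD_DISP_PORT01/ \
      -e nest_mcs01_imc/PM_MCS01_128B_WR_DISP_PORT01/ \
      ./stream

Each row of mcs_bw.csv contains the interval timestamp, the socket, and the counter value for one event, which can then be fed into the bandwidth formula above.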

Conclusion

Nest counters are a new and welcome addition to the IBM POWER9 processor. You can easily obtain a number of interesting and important metrics from the system using the perf command. You can use this data to troubleshoot issues, improve performance, and improve the efficiency of workloads running on an IBM Power Systems™ server.