Introducing IBM Power10 Functional Simulator

IBM Power10 processors were one of the hot topics at the Hot Chips 2020 conference. Power10 processor-based servers are now available, but did you know that you can download a simulator from the IBM website and try out the new Power10 instructions before you buy?

In this article, Masanori Mitsugi of IBM Systems Lab Services and Yoshikazu Kugii of IBM Technical Sales introduce Power10 Functional Simulator, a simulator environment for the Power10 processor.

What is Power10 Functional Simulator?

When processors and their systems are developed, it is common practice to develop compilers and operating systems on a simulator in parallel with the development of the processor chips. IBM does this internally as well, using different types of simulator environments depending on the simulation level. Many of the development tools for recent IBM Power processors and their systems are open to the public to promote the ecosystem.

The Functional Simulator is a tool that has been used by multiple IBM internal teams (compiler development, firmware, Linux OS development, database development, performance, research, chip verification, and so on) for more than 20 years. The Power10 Functional Simulator is the Power10 version of the Functional Simulator, a full instruction set simulator that can simulate all instructions implemented on Power10 processors. It can run Power10 firmware, bare-metal software, Linux, and applications on Linux.

Note that Power10 Functional Simulator is a full instruction set simulator for the Power10 processor. It may not model every aspect of an IBM Power10 system outside the processor itself, such as the network or storage I/O subsystem, and thus may not exactly reflect the behavior of Power10 hardware.

The Power10 Functional Simulator includes the following features:

  • Power10 hardware reference model
  • Full instruction set simulator for Power Instruction Set Architecture (ISA) as implemented in Power10
  • Models of complex symmetric multiprocessing (SMP) effects
  • Architectural modeled areas:
    • Functional behavior of all units: Load/Store, fixed-point unit (FXU), floating-point unit (FPU), Decimal Floating Point (DFP), Vector Multimedia Extension (VMX), Vector Scalar Extension (VSX), and so on
    • Exceptions and interrupt handling
    • Address translation, both paravirtualized hardware page table (HPT) and two-level radix tree
    • Memory and basic translation cache modeling: Segment lookaside buffer (SLB), translation lookaside buffer (TLB), effective-to-real address translation (ERAT)
    • Instruction prefix support
    • VSX Matrix-Multiply Assist (MMA) instructions for AI
    • Reduced-precision instructions to accelerate matrix multiplication
    • Copy-paste facility
    • New alternate interrupt location (AIL) and hypervisor alternate interrupt location (HAIL) programmability feature for Linux/hybrid cloud
  • Linux and hypervisor development and debug platform
  • TCL command-line interface provides:
    • Custom user initialization scripts
    • Processor state control for debug: Step, run, cycle run-to, stop, and so on
    • Register and memory read/write interaction

If required, download Power10 Functional Simulator.

How Power10 Functional Simulator works

The Power10 Functional Simulator works with the stack as shown in Figure 1.

Note: Only the x86_64 simulator is publicly available. The ppc64le simulator is for internal use.

Figure 1. Power10 Functional Simulator stack

To set up the Power10 Functional Simulator, download the rpm or deb package and install it on Linux, and then install the required images (firmware, Linux kernel, and Linux disk) by following the procedure in the user guide and the README file (/opt/ibm/systemsim-p10/examples/linux/README). After installing the images, you can start the simulator in an X Window System environment. (The X Window System is required because the simulator uses xterm and Tcl interfaces.) A helper application called callthru is available in the simulator to exchange files between the host and the simulated system.

Figure 2 shows a screen capture of the simulator execution windows. The front window on the left is the console of the simulator, and the back window on the right is the command window of the simulator.

Figure 2. Power10 Functional Simulator execution screen

Power10 new features to try with Functional Simulator

Because the Power10 Functional Simulator is a simulator environment for Power10 processors, you can use it to check the new Power10 instructions. Power10 support has already been published for the LLVM and GCC compilers and for the Linux kernel, and is available in relatively recent versions of each. The Linux image of the Functional Simulator also includes a kernel and compiler for Power10. In addition, the IBM Advance Toolchain is available; it lets you try out new features of Power processors with better tuning than the mainstream GCC compiler. This section explains some of the new instructions that you can try using these compilers.

Prefix new instruction format

Power10 supports a new instruction format with a prefix. Power is a reduced instruction set computer (RISC) architecture with a fixed 32-bit instruction length. The new format places a 32-bit prefix in front of a 32-bit instruction, giving a 64-bit prefixed instruction and room to extend what a single instruction can encode. This is a historic change for the Power Architecture, which has used a fixed instruction length for more than 30 years.

As an easy-to-understand example, a 64-bit instruction length leaves room for larger immediate values, so a large constant can be handled with one instruction. In addition, because the opcode and register number fields can also be extended, as in the PERMUTE instruction for vector operations, more complicated processing can be expressed with one instruction.
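To make the immediate-size point concrete, here is a minimal sketch in portable C. It assumes the field widths given in the Power ISA for the add-immediate instructions: a 16-bit signed immediate for the classic addi and a 34-bit signed immediate for the prefixed paddi. The function names are hypothetical and are used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if v fits in a signed immediate field that is `bits` wide. */
bool fits_signed(int64_t v, unsigned bits)
{
    int64_t lo = -((int64_t)1 << (bits - 1));
    int64_t hi = ((int64_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Assumption: the classic 32-bit addi carries a 16-bit signed immediate. */
bool fits_addi(int64_t v)
{
    return fits_signed(v, 16);
}

/* Assumption: the prefixed paddi carries a 34-bit signed immediate
 * (18 bits in the prefix word plus 16 bits in the suffix word). */
bool fits_paddi(int64_t v)
{
    return fits_signed(v, 34);
}
```

For example, the constant 0x12345 does not fit in the 16-bit field but fits comfortably in the 34-bit field, which is why a single prefixed instruction can add it.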

Let’s check how it works on the Power10 Functional Simulator using a small example taken from the GCC test code.

When you compile this code on the Power10 Functional Simulator, you can see that the addition becomes one instruction when compiled for Power10 and two instructions when compiled for Power9, as follows:

Code compiled on Power10 Functional Simulator
root@ubuntu1804mambo:~# cat paddi.c
unsigned long add (unsigned long a) { return a + 0x12345U; }
root@ubuntu1804mambo:~# /opt/at14.0/bin/gcc -O2 -mcpu=power10 -c paddi.c
root@ubuntu1804mambo:~# /opt/at14.0/bin/objdump -S paddi.o
paddi.o:     file format elf64-powerpcle
Disassembly of section .text:
0000000000000000 <add>:
   0:   01 00 00 06     paddi   r3,r3,74565
   4:   45 23 63 38 
   8:   20 00 80 4e     blr
        ...
root@ubuntu1804mambo:~# /opt/at14.0/bin/gcc -O2 -mcpu=power9 -c paddi.c
root@ubuntu1804mambo:~# /opt/at14.0/bin/objdump -S paddi.o
paddi.o:     file format elf64-powerpcle
Disassembly of section .text:
0000000000000000 <add>:
   0:   01 00 63 3c     addis   r3,r3,1
   4:   45 23 63 38     addi    r3,r3,9029
   8:   20 00 80 4e     blr
        ...

In this way, Power10 makes it possible to write programs with fewer instructions. Furthermore, in addition to this prefix instruction, Power10 has an enhancement called instruction fusion that allows two instructions to be executed together as one micro operation in the decoder, and therefore, it is possible to operate at high speed even for programs other than those compiled for Power10.
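The two encodings in the listing above can be checked by hand. The following sketch is plain C that runs on any host, not simulator code. It assumes the paddi layout described in the Power ISA (the low 18 bits of the prefix word and the low 16 bits of the suffix word form a 34-bit signed immediate) and shows how the same constant is split for the addis/addi pair. The function names are hypothetical.

```c
#include <stdint.h>

/* Rebuild the paddi immediate from its two 32-bit words.
 * Assumption: si0 is the low 18 bits of the prefix and si1 is the
 * low 16 bits of the suffix; the 34-bit result is sign-extended. */
int64_t paddi_immediate(uint32_t prefix, uint32_t suffix)
{
    uint64_t si0 = prefix & 0x3FFFFu;
    uint64_t si1 = suffix & 0xFFFFu;
    uint64_t imm = (si0 << 16) | si1;

    if (imm & (1ULL << 33))            /* sign bit of the 34-bit field */
        imm |= ~((1ULL << 34) - 1);    /* extend it to 64 bits */
    return (int64_t)imm;
}

/* Split a constant for addis (adds high * 65536) followed by addi
 * (adds the sign-extended low half). Because addi sign-extends, the
 * high half is adjusted so that high * 65536 + low == value.
 * The caller must check that high fits in 16 bits. */
void split_addis_addi(int64_t value, int32_t *high, int32_t *low)
{
    int64_t lo = value & 0xFFFF;
    if (lo >= 0x8000)
        lo -= 0x10000;
    *low = (int32_t)lo;
    *high = (int32_t)((value - lo) / 65536);
}
```

Reading the little-endian bytes in the listing as the words 0x06000001 and 0x38632345 gives the immediate 74565 (0x12345), and splitting the same value gives 1 and 9029, which match the addis and addi operands in the Power9 disassembly.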

Matrix-Multiply Assist (MMA) instruction

Power10 provides matrix operation instructions called Matrix-Multiply Assist (MMA) instructions, in addition to the VMX/VSX instructions, which are vector operation instructions that can be used on Power8 and Power9 processors.

Four dedicated hardware units are installed in each CPU core on Power10, which enables high-speed processing of the matrix outer product operation A ← {±}A {±} XYᵀ. In addition to double precision and single precision as shown in the following figure, reduced-precision formats such as bfloat16, half precision, and Int4 are also supported. These make it possible to run AI inference workloads more efficiently and quickly without an extension card such as a GPU.

Figure 3. Supported precision and peak [FL]OPS performance

To keep the four dedicated MMA units in each core fully utilized, the read/write bandwidth of memory and of the L3, L2, and L1 caches has also been redesigned in Power10, and each has double the bandwidth compared to Power9.

Figure 4. Read/write performance of MMA unit and memory/L3/L2/L1 cache

These MMA instructions can be fully simulated with Power10 Functional Simulator, so you can test them with the assembler or with the built-in functions (intrinsics) of the compiler. The MMA implementation has high affinity with the conventional VMX/VSX instructions and issues its instructions using the same registers as the VMX/VSX instructions.
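Before trying the intrinsics, it can help to see the outer product update A ← {±}A {±} XYᵀ written in plain scalar C. The following is only a portable model of the arithmetic for a 4×4 single precision accumulator, not how the hardware or the compiler implements it. The function name and the two sign flags are hypothetical.

```c
/* Scalar model of one 4x4 outer-product update:
 *   acc[i][j] = (+/-) x[i] * y[j]  +  (+/-) acc[i][j]
 * negate_prod and negate_acc select the two signs. */
void outer_product_update_4x4(float acc[4][4],
                              const float x[4], const float y[4],
                              int negate_prod, int negate_acc)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float prod = x[i] * y[j];
            float prev = acc[i][j];
            acc[i][j] = (negate_prod ? -prod : prod)
                      + (negate_acc ? -prev : prev);
        }
    }
}
```

With both flags set to 0, this is the plain accumulate form A ← A + XYᵀ, which is the form a matrix multiplication kernel repeats once per step of the inner dimension.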

As an example, a single precision 4×4 matrix product can be written with the built-in functions of the compiler as follows.

Example of a single precision 4×4 matrix product
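#include <altivec.h>

typedef vector unsigned char    vec_t;  /* 128-bit vector, used here to carry four floats */
typedef __vector_quad   acc_t;          /* 512-bit MMA accumulator */

/* Computes a 4x4 block of C as the sum of K outer products. */
void sgemm_kernel_4x4(float* a, float* b, float* c, int K, int lda, int ldb, int ldc)
{
    int i;
    vec_t vec_A, vec_B, vec_C[4];
    acc_t acc_0;
    __builtin_mma_xxsetaccz(&acc_0);    /* zero the accumulator */
    for (i = 0; i < K; i++) {
        /* load four floats of A and four floats of B for step i */
        vec_A = *((vec_t *)(a + (i * lda)));
        vec_B = *((vec_t *)(b + (i * ldb)));
        /* accumulate the outer product of vec_A and vec_B */
        __builtin_mma_xvf32gerpp(&acc_0, vec_A, vec_B);
    }
    /* copy the accumulator into four vectors and store them as rows of C */
    __builtin_mma_disassemble_acc(vec_C, &acc_0);
    *((vec_t *)(c)) = vec_C[0];
    *((vec_t *)(c + ldc)) = vec_C[1];
    *((vec_t *)(c + (2 * ldc))) = vec_C[2];
    *((vec_t *)(c + (3 * ldc))) = vec_C[3];
}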
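When running a kernel like this on the simulator, a scalar reference is useful for comparing results. The following sketch is a hypothetical helper that is not part of the original example. It assumes that element i of the A vector maps to row i of the accumulator and element j of the B vector to column j, so that row i of the result is stored at c + i * ldc.

```c
/* Scalar reference for the 4x4 kernel (hypothetical helper):
 *   c[i * ldc + j] = sum over k of a[k * lda + i] * b[k * ldb + j]
 * Assumes the row/column mapping described in the text above. */
void sgemm_kernel_4x4_ref(const float *a, const float *b, float *c,
                          int K, int lda, int ldb, int ldc)
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float sum = 0.0f;
            for (int k = 0; k < K; k++)
                sum += a[k * lda + i] * b[k * ldb + j];
            c[i * ldc + j] = sum;
        }
    }
}
```

Because floating-point addition is not associative, compare the two results with a small tolerance rather than bit for bit if the summation order differs.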

For more about creating and running programs that use MMA instructions, see the IBM Redbooks publication, Matrix-Multiply Assist Best Practices Guide. It covers a variety of code examples in detail, and all of the code can be run on the simulator.

You can try the MMA instructions directly as already described, but for real AI applications, MMA can be leveraged more easily through optimized libraries. As with the Linux kernel and compilers, Power10 MMA support has already been implemented and released in several optimized libraries.

In addition to optimized libraries, deep learning frameworks such as TensorFlow and PyTorch can also be accelerated by MMA instructions by using Eigen for TensorFlow and OpenBLAS for PyTorch internally. MMA optimization for other runtimes such as Open Neural Network Exchange (ONNX) is also underway.

Try new Power10 instructions quickly!

In this article, we have briefly introduced the overview, usage, and examples of Power10 Functional Simulator, which allows you to quickly try out Power10 instructions. Now that Power10 servers have been released, we plan to publish more detailed information, such as performance information, as needed, so look forward to it!

Finally, the creation of this article was supported by many IBM members: Vincent Lim, the lead of the Power10 Functional Simulator development team; Kazunori Ogata and Kazuaki Ishizaki of IBM Tokyo Research Laboratory; and Hiroyuki Tanaka and Tamara K Deedrick of IBM Lab Service. We would also like to thank all the members of the Functional Simulator team who are involved in the development.