Learning objectives

In this article, you learn how profile-guided optimization (PGO) works and how it can be used with GCC to accelerate an application. You also learn how to apply PGO to the Facebook Zstandard compression library (zstd).

Profile-guided optimization

Compilers perform optimizations by using known heuristics, making best guesses about how code will execute. For example, a compiler may predict which branch is taken based on where it is located in a loop, and it may choose to inline a function based on its size. To extract good performance from a program, the programmer needs intimate knowledge of the compiler and system design so that they can provide hints or annotations to the compiler. Unfortunately, this is not always possible. Another performance optimization technique that offers a better advantage is profile-guided optimization (PGO).

PGO is a method used by compilers to produce optimal code by using application runtime data. Because this data comes directly from the application, the compiler can make much more accurate guesses.

How does PGO work?

1. Instrumentation

The first phase is the instrumented compilation. Instrumented compilation typically produces an executable program that contains probes in each of the basic blocks of the program. Each probe counts the number of times a basic block runs. If the block is a branch, the probe records the direction taken by that branch.

gcc -fprofile-generate=<profile_dir> progname.c

2. Profiled execution

The second phase is the profiled execution. When run, the instrumented program generates a data file that contains the execution counts for the specific run of the program.

3. Optimization

Information from the profiled execution of the program is fed back to the compiler. This data is used to make a better estimate of the program’s control flow. The compiler uses this information to produce an executable file, relying on this data rather than the static heuristics.

gcc -fprofile-use=<profile_dir> progname.c

A key factor in deciding whether to use PGO is knowing which sections of your code are the most heavily used. If the data set provided to your program is consistent and elicits similar behavior on every execution, PGO can probably help optimize your program. However, different data sets can cause different algorithms to be invoked, making the behavior of your program vary from one execution to the next.

What optimizations are performed?

A compiler performs various optimizations based on a profile. Some of the most common and effective techniques are explained in this section.

Function inlining

Function inlining is a technique where high-frequency functions are inlined into their calling functions to reduce function call overhead. Normally, a compiler makes these decisions based on factors such as the size of the code block. Inlining decisions are most effective when the compiler has a clear idea of the call pattern. Let us look at an example.

The following diagram shows part of a program's flow graph. Circles represent functions and the edges represent the frequency of calls. When a compiler has no runtime data, it may make poor inlining choices or not inline at all. The story is different when runtime heuristics are available.

Figure 1. A typical program flow graph where edges represent frequency of calls

In Figure 1, it is not difficult to observe that function D gets called several times. Clearly, we gain the most by inlining D into C and E into D. This is an easy decision when function call frequencies are available.

Block ordering

A code block is a sequence of consecutive instructions where execution enters at the top and exits at the bottom. A typical program consists of code blocks and branches. Although the easiest approach to code generation is to lay out code blocks in the same sequence as written by the programmer, it is not the most performant approach. Let us look at an example to see why.

Consider the following code.

       if (condition == TRUE) {
           ...
       } else {
           ...
       }

Without any knowledge about the condition, the compiler generates the following layout of code blocks.

Figure 2. Default layout

A compiler's aim is to achieve code locality, that is, to perform the minimum number of instruction fetches from memory. Blocks that are tightly coupled should be collocated to minimize instruction cache misses. In the given example, if the profile suggests that the condition is FALSE most of the time, it would make sense to keep func_c closer to func_a. Similarly, func_d will be placed close to func_b.

Figure 3. Optimized layout

Register allocation

Although less important for reduced instruction set computing (RISC) architectures than for their complex instruction set computing (CISC) counterparts, optimal use of the available registers makes a big difference in application performance. With sample heuristics available, the compiler can easily find the hot, frequently used variables. Keeping these in registers prevents unnecessary cache (and perhaps memory) accesses, speeding up the application.

Typically, the register allocator uses a static heuristic, preferring to keep variables used in loops in registers, for example. This results in register allocation that is locally optimized within the loop, but it may not produce the best total performance. With profile information, this decision is made based on the actual usage.

There are other performance enhancement techniques that gain immensely from a profile. They include optimizing virtual calls, dead code separation, and so on.

Using GCC PGO to accelerate zstd on AIX

The effect of PGO is most easily observed in processor-intensive applications. We chose the Facebook zstd compression/decompression library to demonstrate the benefits of PGO. This experiment requires an IBM® AIX® logical partition (LPAR) with GCC 7.0 or a later version installed.

Configuring IBM AIX and GCC

The GCC compiler suite can be installed on AIX from various sources; however, it is easiest to do so using YUM. Refer to the YUM on AIX guide.

Obtaining sources

To get the latest source, clone the zstd library from GitHub using the following command, or download a compressed version.

git clone https://github.com/facebook/zstd.git

Benchmarking using zstd library

The zstd source also includes a benchmark program that performs compression and decompression on in-memory files. Because we do not have to deal with files on disk, factors such as I/O devices do not affect our results. The source for this benchmark program is available in tests/fullbench.c. Perform the following steps to build and run the benchmark.

  1. Edit Makefile in the tests directory as follows.

    zstdDIR      = ../lib
    PRGDIR       = ../programs
    PYTHON      ?= python3
    TESTARTEFACT := versionsTest
    CC           = gcc
    CPPFLAGS    += -I$(zstdDIR) -I$(zstdDIR)/common -I$(zstdDIR)/compress \
                   -I$(zstdDIR)/dictBuilder -I$(zstdDIR)/deprecated -I$(PRGDIR)
    ifeq ($(OS),Windows_NT)   # MinGW assumed
    CPPFLAGS    += -D__USE_MINGW_ANSI_STDIO   # compatibility with %zu formatting
    endif
    CFLAGS      ?= -O3
    CFLAGS      += -Wall -Wextra -Wcast-qual -Wcast-align -Wshadow \
                   -Wstrict-aliasing=1 -Wswitch-enum -Wdeclaration-after-statement \
                   -Wstrict-prototypes -Wundef -Wformat-security \
                   -Wvla -Wformat=2 -Winit-self -Wfloat-equal -Wwrite-strings \
                   -Wredundant-decls -Wmissing-prototypes
    PGOFLAGS    +=
    FLAGS        = $(CPPFLAGS) $(CFLAGS) $(LDFLAGS) $(PGOFLAGS)
  2. Run the gmake command to produce fullbench in the tests directory.

    rm fullbench; gmake ; ./fullbench -b2
  3. Obtain initial results by running the following command.

    # ./fullbench -b2
    *** Zstandard speed analyzer 1.3.8 32-bits, by Yann Collet (Nov 26 2018) ***
     Sample 10MiB :
     2#decompress                   :  1215.8 MB/s  (10000000)
  4. Edit PGOFLAGS as follows:

    PGOFLAGS     = -fprofile-generate=/tmp/pgo
  5. Run the following command to generate profile data in /tmp/pgo.

    rm fullbench; gmake ; ./fullbench -b2
    *** Zstandard speed analyzer 1.3.8 32-bits, by Yann Collet (Nov 26 2018) ***
    Sample 10MiB :
    2#decompress                   :   770.5 MB/s  (10000000)

    Note that this run is considerably slower. This is because of the probes inserted by the -fprofile-generate option. Let us look at the /tmp/pgo directory. It contains *.gcda files. Each file holds the execution counts for the hot code blocks, collected during our last test run.

    # ls /tmp/pgo
    benchfn.gcda                fse_compress.gcda           huf_compress.gcda           xxhash.gcda                    zstd_decompress.gcda        zstd_lazy.gcda
    datagen.gcda                fse_decompress.gcda         huf_decompress.gcda         zstd_common.gcda            zstd_decompress_block.gcda  zstd_ldm.gcda
    entropy_common.gcda         fullbench.gcda              pool.gcda                   zstd_compress.gcda          zstd_double_fast.gcda       zstd_opt.gcda
    error_private.gcda          hist.gcda                   util.gcda                   zstd_ddict.gcda             zstd_fast.gcda              zstdmt_compress.gcda
  6. Use the profile data from the previous step and perform the following tasks.

    1. Edit PGOFLAGS as follows.

      PGOFLAGS = -fprofile-use=/tmp/pgo

    2. Run the following command to use profile data in /tmp/pgo.

      rm fullbench; gmake ; ./fullbench -b2
      *** Zstandard speed analyzer 1.3.8 32-bits, by Yann Collet (Nov 26 2018) ***
      Sample 10MiB :
      2#decompress                   :  1452.7 MB/s  (10000000)


It should be noted that profile-based optimization suits applications where a sample run mimics the general behavior of the program. Also, ensure that the data set used to produce the profile information closely resembles the data set in actual use. If the application behaves differently for different data sets, you may observe a regression in performance. However, for large applications that run frequently, PGO can prove to be a significant source of performance gains.