So you have a C/C++ application running on Linux on x86 and you want to quickly port it to Linux on POWER? Read on.
What is “vectorized code”?
“Vectorized code”, in this context, is code that takes advantage of facilities in modern processors for processing multiple sources of data in a single instruction, also known as “Single Instruction, Multiple Data”, or SIMD. Special wide registers are provided that can be set with multiple data values, and a single instruction can manipulate all data items in a register simultaneously. This can provide significant performance advantages.
What are vector intrinsics?
Over the years, x86 processors have added more and more “vector” capabilities, layered one on the other, starting with MMX through several versions of SSE to AVX. POWER has its own sets of SIMD instructions, VMX (a.k.a. “AltiVec™”) and VSX, which are different from those on x86 processors. At the most basic level, these capabilities are new processor instructions that make use of the wide registers described above. Processor instructions are not easily utilized directly by C/C++ code. There are efforts to add automatic vectorization during compilation, but this approach is challenging for compiler developers, and is perhaps not as effective as many application developers desire. Thus, application developers often demand direct access to the instructions without the awkward overhead of embedding assembly code within their C/C++ code. Intrinsics provide that mechanism, making the functionality of the vector instructions available for operating on C/C++ data types.
Example (x86 intrinsics):
__m128 a, b; _mm_addsub_ps ( a, b );
Similarly, POWER has support for its vector capabilities through functions provided with GCC and other compilers for POWER. The POWER implementation uses overloading to reduce the set of function names.
vector signed int i,j; i = vec_add ( i, j );
vector unsigned long l,m; l = vec_add ( l, m );
First, how do you know you have code with “x86 vector intrinsics”?
C/C++ code that makes use of x86 vector intrinsics will include one or more of the following files (source: Intel® Intrinsics Guide):
#include "mmintrin.h" // MMX #include "xmmintrin.h" // SSE #include "emmintrin.h" // SSE2 #include "pmmintrin.h" // SSE3 #include "tmmintrin.h" // SSSE3 #include "smmintrin.h" // SSE4.1 #include "nmmintrin.h" // SSE4.2 #include "immintrin.h" // AVX, AVX2, AVX-512
And contain function calls like “_mm*”.
__mm_addsub_ps ( a, b );
When one attempts to compile code that makes use of x86 vector intrinsics on a system without any compatibility support (described below in “Approach 1”), errors similar to the following are reported:
$ gcc -o max-x86 max-x86.c max-x86.c:2:23: fatal error: immintrin.h: No such file or directory #include "immintrin.h" ^ compilation terminated.
So what do I do with C/C++ code that uses x86 vector intrinsics?
The right answer is to carefully analyze the code and rewrite it to use the POWER vector capabilities (see References, below) as provided by GCC and other compilers for POWER. This approach will yield the best performance results.
What follows are simpler approaches for those in a hurry, but it should be carefully noted and accepted that this will very likely not produce code that performs well. Performance may be unacceptable. Restating: the best approach is to rewrite the code.
Approach 1: intrinsics compatibility implementation in GCC
Work is ongoing to incorporate compatible (but possibly poorly performing) implementations of “x86 vector intrinsics” with GCC. This will allow code containing “x86 vector intrinsics” to compile and run on POWER, greatly enhancing portability, possibly at the expense of performance. The concerns about performance are significant enough that the implementation is protected by a
#error preprocessor macro, so the compilation will not proceed:
$ gcc -o max-x86 max-x86.c In file included from max-x86.c:2:0: /opt/at11.0/lib/gcc/powerpc64le-linux-gnu/7.2.1/include/xmmintrin.h:54:2: error: #error "Please read comment above. Use -DNO_WARN_X86_INTRINSICS to disable this error." #error "Please read comment above. Use -DNO_WARN_X86_INTRINSICS to disable this error." ^~~~~
This error raises awareness of the implications of using the compatibility implementation, encouraging the developer to read an explanatory message, by forcing a slight modification to the compilation steps. Defining
NO_WARN_X86_INTRINSICS is sufficient to avoid the preprocessor
#error and allow the compilation to proceed:
$ gcc -o max-x86 max-x86.c -DNO_WARN_X86_INTRINSICS
The x86 intrinsics compatibility implementation for POWER will first appear in GCC 8. At the time of this writing, GCC 8 is not yet released. Fortunately, the same work has been backported to a version of GCC 7 that is available in the IBM Advance Toolchain. The Advance Toolchain is a highly recommended method to get the best performance on POWER systems.
Approach 2: Power Vector Library
The “Power Vector Library” is another compatible implementation of the “x86 vector intrinsics” for use on POWER, but with an incompatible API. One would need to replace each function call with the matching API call in the Power Vector Library.
This Library is expected to eventually be deprecated in favor of the GCC implementation in Approach 1. At the time of this writing, the implementation in Approach 1 is not yet a superset of this approach.
Approach 3: “veclib”
An independent implementation, called “veclib”, has taken the liberty of providing the missing mapping from the “x86 vector intrinsics” API to the Power Vector Library API. With a few simple steps of preparation, the compilation process is very simple, and does not require modifying every API call as in Approach 2.
The author does not have experience with this approach.
Approach 4: “SIMD Everywhere”
Yet another wrapper API, “SIMD Everywhere”, or “SIMDe”, takes the unique approach of providing a cross-platform API for utilizing SIMD capabilities of supported processors. There is at least some support for x86, ARM, and POWER. If the code to be ported does not already make use of SIMDe, then it would have to be ported to SIMDe with the advantage of being more portable as a result. How comprehensive the support is and how performant have not been evaluated.
The author does not have experience with this approach.
Complementary library: “pveclib”
A complementary set of vector-accelerated functions for POWER is available with a newly-released library called “pveclib“, also known as “Power Vector Library” (unfortunately just like the distinct “veclib” in Approach 3). pveclib makes it easier to use POWER vector intrinsics by providing higher level operations and filling in functional gaps between older and newer processor generations.
vec_revq: byte reverse a quadword
vec_revd: byte reverse two doublewords
vec_revw: byte reverse four words
vec_revh: byte reverse eight halfwords
vec_clzq: count leading zeros in a quadword
vec_sldq: shift left double quadword
vec_slqi: shift left quad immediate
vec_srqi: shift right quadword immediate
vec_srq: shift right quadword
vec_pasted: doubleword paste
vec_mulouw: multiply odd unsigned words
- and many, many more
Over time, expect that the x86 vector intrinsics compatibility implementation will become more comprehensive and more performant. However, it is very unlikely that the implementation will achieve parity in performance with an actual x86 processor, nor is this a goal. The current goals are:
- to aid in portability with the caveat that performance will likely be sub-optimal
- to get something “working” quickly in order to allow for more deeper performance analysis
- to get something “working” to give time for analysis and code rewrite as a follow-on effort
Hopefully, the x86 vector intrinsics compatibility implementation will be found useful. Feel free to ask questions in the “Linux on Power” dW Answers forum in this portal.