Auto-vectorization
What is it?
One way of increasing the performance of loops is by triggering the auto-vectorization passes built into the gcc compiler. Vectorization will speed up your program when the operations are simple and are executed many times over the elements of an array. At the lowest level, gcc exposes built-in functions (intrinsics) that match one-to-one with the vector instructions on the CPU.
Vector instructions are instructions that operate on a one-dimensional array of values (a vector) at once, rather than on a single value. Because the intrinsics map directly onto them, it is possible to write code using them in a higher-level language such as C/C++ and still have nearly complete control over the output.
Successfully activating auto-vectorization isn’t an easy process and most often requires code to be rewritten. By default, the compiler will not vectorize the program unless you compile with the -O3 flag or pass -ftree-vectorize explicitly (GCC 12 and later also enable a more conservative form of vectorization at -O2). To get an idea of how vectorization works, the comparison below demonstrates that without vectorization enabled, each operation uses only one 32-bit integer of a 128-bit register, leaving 3 x 32-bit integers unused.
How do I enable it?
The goal is to write a program that:
- creates two 1000-element integer arrays
- fills both arrays with random numbers in the range -1000 to +1000
- sums both arrays element-by-element into a third array
- sums the third array and displays the result
- is auto-vectorized by gcc
Confirming a successful auto-vectorization can be a little tricky. Using the gcc compiler flag -O3 will enable the optimizations, but it won’t output any messages to the console.
After a little digging around the gcc docs I came across the following flags:
-fopt-info-vec-missed //display information about loops that couldn’t be vectorized
-fopt-info-loop-optimized //display successfully auto-vectorized loops
Demo Files
Download the files I used for the following examples: c_loop_autovect_lab.zip
unzip c_loop_autovect_lab.zip
cd c_loop_autovect_lab
make //build the files
make run //execute all binaries and create objdump output
make miss //display all the missed optimizations
make opt //display all vectorized loops
make clean //cleans up binaries, *.miss, *.opt, *.txt
Examples
Let’s start with a simple for loop with statically defined arrays and a long data type to store the result.
When we look at the loop_vect_v0.miss file we notice the gcc compiler complaining that there are not enough data references to auto-vectorize the loop.
What if we re-code our program and create a separate for loop to handle the additions?
Success! It looks like the compiler auto-vectorized the second for loop.
The console returns the following messages after compiling:
loop_vect_v2.c:21:1: note: loop vectorized
loop_vect_v2.c:21:1: note: loop turned into non-loop; it never loops.
loop_vect_v2.c:21:1: note: loop with 3 iterations completely unrolled
loop_vect_v2.c:7:5: note: loop turned into non-loop; it never loops
If we compile with the -fopt-info-loop-vec-all flag we can view complete details of all the optimizations.
loop_vect_v1.c:19:1: note: ===== analyze_loop_nest =====
loop_vect_v1.c:19:1: note: === vect_analyze_loop_form ===
loop_vect_v1.c:19:1: note: === get_loop_niters ===
loop_vect_v1.c:19:1: note: === vect_analyze_data_refs ===
loop_vect_v1.c:19:1: note: got vectype for stmt: _9 = array1[i_41];
vector(4) int
loop_vect_v1.c:19:1: note: got vectype for stmt: _10 = array2[i_41];
vector(4) int
loop_vect_v1.c:19:1: note: Cost model analysis:
Vector inside of loop cost: 4
Vector prologue cost: 1
Vector epilogue cost: 2
Scalar iteration cost: 4
Scalar outside cost: 0
Vector outside cost: 3
prologue iterations: 0
epilogue iterations: 0
It looks like the simpler the loop is, the higher the chance of auto-vectorization. In the last example, the compiler confirmed that the arrays were aligned and was able to completely unroll the loop.
Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program’s execution speed at the expense of its binary size, which is an approach known as the space–time tradeoff. -Wikipedia
Reflection
Although gcc’s auto-vectorization can increase run-time performance, it may not be practical for certain applications. There are many restrictions and conditions to consider. gcc needs confirmation that the data in the arrays is aligned. Also, code will most likely have to be rewritten to simplify the loops, and even then auto-vectorization isn’t guaranteed.