Intrinsics are C functions that let developers issue specific assembly instructions without the complexity of inline assembly. See this link for a full list of the functions available for different processors. Note that it is the developer's responsibility to make sure that an intrinsic is actually supported by the machine the program is running on; calling one that is not will crash the program with an illegal-instruction fault.
In this post I am going to explain how to use AVX intrinsics to implement a vectorized version of the same polynomial function from the previous post on auto vectorization.
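As a warm-up, a minimal sketch of such a function might look like the following (the function name is illustrative):

```c
#include <immintrin.h>

#pragma GCC target("avx")  // enable AVX code generation without -mavx

// Sketch: add the constant 6.0f to each of the 8 floats in a
// 256-bit vector. The name add_six is my own.
__m256 add_six(__m256 x) {
    __m256 six = _mm256_set1_ps(6.0f); // broadcast 6.0f into all 8 lanes
    return _mm256_add_ps(x, six);      // lane-wise single-precision add
}
```

`_mm256_set1_ps` broadcasts a single scalar into all 8 lanes, which is how the constant is materialized before the add.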
This function takes a vector of 8 floats and adds a constant value of 6 to each one. The type __m256 is a 256-bit vector containing 8 32-bit floats. There are similar types for integer vectors (__m256i) and for vectors of 4 64-bit doubles (__m256d).
This generates the following assembly with g++ 4.8
Notice that it's not a direct mapping of the calls that we made. The compiler has opted to use a version of the vector add that adds a register to the contents of a memory location where the constant is stored. The constant is stored as 8 values at location LC0.
Full Intrinsic Implementation
To vectorize this function, we need access to the full array of input values, so the interface to the polynomial function must change. I have elected to use an interface similar to that of the clang 3.4 transformation at the end of the recent post on auto vectorization. Given an array of r_values, return ret[i] = polynomial(r_values[i]) for all i.
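The implementation can be sketched along these lines; the exact polynomial comes from the previous post, so the placeholder p(r) = r*r + r used here is illustrative only, and num is assumed to be a multiple of 8:

```c
#include <immintrin.h>
#include <stddef.h>

#pragma GCC target("avx")  // enable AVX code generation without -mavx

// Sketch of the vectorized loop. The polynomial evaluated here,
// p(r) = r*r + r, is a placeholder for the one from the previous post.
// Both pointers must be 32-byte aligned for the aligned load/store
// intrinsics used below, and num must be a multiple of 8.
void polynomial(float* ret, const float* r_values, size_t num) {
    for (size_t i = 0; i < num; i += 8) {
        __m256 r  = _mm256_load_ps(r_values + i); // load 8 aligned floats
        __m256 r2 = _mm256_mul_ps(r, r);          // r * r
        __m256 p  = _mm256_add_ps(r2, r);         // r*r + r
        _mm256_store_ps(ret + i, p);              // store 8 results
    }
}
```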
There are a few new intrinsic functions used in this implementation that were not used in the previous example. The first is _mm256_load_ps. This function copies 8 floats from a memory location to a local variable. Note that for this particular method of loading, the pointer must be 32-byte aligned. If it is not, the load will fault and crash the program.
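As an illustration of the alignment requirement (the identifiers here are my own): a stack buffer can be given 32-byte alignment explicitly, while the unaligned variant _mm256_loadu_ps accepts any address:

```c
#include <immintrin.h>

#pragma GCC target("avx")  // enable AVX code generation without -mavx

// Illustrative only: _mm256_load_ps requires a 32-byte-aligned pointer;
// _mm256_loadu_ps tolerates any alignment (at some cost on older hardware).
float load_demo(void) {
    float data[16] __attribute__((aligned(32))) =
        {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};
    __m256 a = _mm256_load_ps(data);       // fine: data is 32-byte aligned
    __m256 u = _mm256_loadu_ps(data + 1);  // data + 1 is unaligned, so use loadu
    float out[8];
    _mm256_storeu_ps(out, _mm256_add_ps(a, u)); // out[i] = data[i] + data[i+1]
    return out[0];
}
```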
Multiplication of two vectors uses the same interface as addition.
Once the computation is finished, the result needs to be stored to the output array. The final call _mm256_store_ps performs a copy from a local variable to a memory location.
The following is the assembly generated for the loop by g++ 4.8
In this case, a pretty straightforward mapping of the intrinsics is produced.
The updated code for the driver follows. Of note are the two intrinsics used for memory management. _mm_malloc operates like a standard malloc, but it takes an additional parameter that specifies the desired alignment, in this case a 32-byte alignment. When this allocation method is used, memory must be freed by the corresponding _mm_free call.
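The driver logic can be sketched as follows, wrapped in a function with illustrative sizes and input values (the placeholder polynomial p(r) = r*r + r stands in for the one from the previous post):

```c
#include <immintrin.h>
#include <stddef.h>

#pragma GCC target("avx")  // enable AVX code generation without -mavx

// The vectorized kernel (placeholder polynomial p(r) = r*r + r).
static void polynomial(float* ret, const float* r_values, size_t num) {
    for (size_t i = 0; i < num; i += 8) {
        __m256 r = _mm256_load_ps(r_values + i);
        __m256 p = _mm256_add_ps(_mm256_mul_ps(r, r), r);
        _mm256_store_ps(ret + i, p);
    }
}

// Sketch of the driver: allocate 32-byte-aligned buffers with _mm_malloc
// so the aligned load/store intrinsics in the kernel are safe, run the
// kernel, and release the memory. num must be a multiple of 8.
float run_driver(size_t num) {
    float* r_values = _mm_malloc(num * sizeof(float), 32);
    float* ret      = _mm_malloc(num * sizeof(float), 32);

    for (size_t i = 0; i < num; ++i)
        r_values[i] = (float)i;           // illustrative input values

    polynomial(ret, r_values, num);
    float sample = ret[2];                // keep one result before freeing

    // Memory from _mm_malloc must be released with _mm_free, not free().
    _mm_free(r_values);
    _mm_free(ret);
    return sample;
}
```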
Intrinsics can be used to produce vectorized code across many different compilers. They are not difficult to use if the target machine is known ahead of time. But if the capabilities of the target machines can vary, things get more complicated for the developer. For example, to support machines with SSE but without AVX, a second function for SSE would have to be implemented and maintained. One potential solution to this is to use a third-party tool such as ispc. I don't have any personal experience with it yet; a possible topic for a future post.
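As a sketch of one way to cope with varying targets at runtime, GCC and clang provide __builtin_cpu_supports, which can select between an AVX path and a scalar fallback; the function names and the placeholder polynomial p(r) = r*r + r here are illustrative:

```c
#include <immintrin.h>
#include <stddef.h>

// AVX path; the target attribute lets it compile without -mavx.
__attribute__((target("avx")))
static void polynomial_avx(float* ret, const float* r, size_t num) {
    size_t i = 0;
    for (; i + 8 <= num; i += 8) {
        __m256 v = _mm256_loadu_ps(r + i);
        _mm256_storeu_ps(ret + i, _mm256_add_ps(_mm256_mul_ps(v, v), v));
    }
    for (; i < num; ++i)                  // scalar tail for leftover elements
        ret[i] = r[i] * r[i] + r[i];
}

// Scalar fallback for machines without AVX.
static void polynomial_scalar(float* ret, const float* r, size_t num) {
    for (size_t i = 0; i < num; ++i)
        ret[i] = r[i] * r[i] + r[i];
}

// Dispatch once per call based on what the CPU actually supports.
void polynomial(float* ret, const float* r, size_t num) {
    if (__builtin_cpu_supports("avx"))
        polynomial_avx(ret, r, num);
    else
        polynomial_scalar(ret, r, num);
}
```

This avoids the crash mentioned at the top of the post, at the price of maintaining both implementations, which is exactly the burden tools like ispc aim to remove.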