Kyle Hegeman

Programming and Stuff.

Using Intrinsics

Introduction

Intrinsic functions are a C API that allows developers to issue assembly instructions without the complexity of writing inline assembly. See this link for a full list of the current and future functions available for different processors. Note that it is the developer's responsibility to make sure that a function being used is available on the machine the program is running on; calling a function that is not available will crash the program (typically with an illegal-instruction fault).

Polynomial Function

In this post I am going to explain how to use AVX intrinsics to implement a vectorized version of the same polynomial function from the previous post on auto vectorization.

static float polynomial(float r)
{
   return r*r*r*(10+r*(-15+r*6));
}

Simple example

This function takes a vector of 8 floats and adds a constant value of 6 to each one. The type __m256 is a 256-bit vector containing 8 32-bit floats. There are similar types for integers (__m256i) and for vectors of 4 64-bit doubles (__m256d).

__m256 test(__m256 a)
{
  //load the value 6.0f into all 8 values of vector const_6
  __m256 const_6 = _mm256_set1_ps(6.0f);

  //add 6.0f to all values of vector a
  return _mm256_add_ps(const_6, a);
}

This generates the following assembly with g++ 4.8:

vaddps  ymm0, ymm0, YMMWORD PTR .LC0[rip]
ret

Notice that it's not a direct mapping of the calls that we made. The compiler has opted to use the form of the vector add that takes a register and a memory operand holding the constant. The constant is stored as 8 values at location .LC0; each .long value 1086324736 is 0x40C00000, the IEEE-754 bit pattern of 6.0f.

    .align 32
.LC0:
    .long   1086324736
    .long   1086324736
    .long   1086324736
    .long   1086324736
    .long   1086324736
    .long   1086324736
    .long   1086324736
    .long   1086324736

Full Intrinsic Implementation

To vectorize this function, we need access to the full array of input values, so the interface to the polynomial function must change. I have elected to use an interface similar to that of the clang 3.4 transformation at the end of the recent post on auto vectorization: given an array of r_values, return ret[i] = polynomial(r_values[i]) for all i.

void polynomial(float *ret, const float *const r_values, int num) {
  // r*r*r*(10+r*(-15+r*6));

  __m256 const_6 = _mm256_set1_ps(6.0f);
  __m256 const_neg_15 = _mm256_set1_ps(-15.0f);
  __m256 const_10 = _mm256_set1_ps(10.0f);
  // constants

  // process 8 floats per iteration; assumes num is a multiple of 8
  const int loop_factor = 8;

  for (int i = 0; i < num; i += loop_factor) {
    __m256 r;
    __m256 left;
    __m256 right;
    // aligned load of 256 bits r
    r = _mm256_load_ps(&r_values[i]);
    left = _mm256_mul_ps(r, r); // r * r

    right = _mm256_mul_ps(r, const_6); // r * 6
    left = _mm256_mul_ps(left, r); // r * r * r
    right = _mm256_add_ps(right, const_neg_15); //-15 + r * 6
    right = _mm256_mul_ps(right, r); //r * (-15 + r * 6)
    right = _mm256_add_ps(right, const_10); //10 + (r * (-15 + r * 6))

    right = _mm256_mul_ps(right, left); // r*r*r *(10 + r * (-15 + r * 6))

    _mm256_store_ps(&ret[i], right); // store 8 values to ret[i]
  }
}

There are a few new intrinsic functions used in this implementation that were not used in the previous example. The first is _mm256_load_ps. This function copies 8 floats from a memory location to a local variable. Note that for this particular method of loading, the pointer must be 32-byte aligned. If it is not, the load faults and the program crashes.

Multiplication of two vectors uses the same interface as addition.

Once the computation is finished, the result needs to be stored to the output array. The final call _mm256_store_ps performs a copy from a local variable to a memory location.

Generated assembly

The following is the assembly generated for the loop by g++ 4.8:

.L20:
        vmovaps ymm0, YMMWORD PTR [rsi+rax]
        vmulps  ymm2, ymm0, ymm5
        vaddps  ymm2, ymm2, ymm4
        vmulps  ymm1, ymm0, ymm0
        vmulps  ymm1, ymm1, ymm0
        vmulps  ymm0, ymm2, ymm0
        vaddps  ymm0, ymm0, ymm3
        vmulps  ymm0, ymm0, ymm1
        vmovaps YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, rdx
        jne     .L20

In this case, a fairly direct mapping of the intrinsics is produced.

Driver changes

The updated code for the driver follows. Of note are the two intrinsics used for memory management. _mm_malloc operates like a standard malloc, but it takes an additional parameter that specifies the desired alignment, in this case 32 bytes. When this allocation method is used, memory must be freed with the corresponding _mm_free call.

#include <immintrin.h>
#include <algorithm>
#include <functional>
#include <iostream>
#include <numeric>
#include <random>

static std::mt19937 rng;

int main(int argc, char *argv[]) {
  const int test_size = 4096 * 2;
  std::normal_distribution<float> dist(0.5f, 0.5f);
  rng.seed(0);
  float *r_values = nullptr;
  float *ret = nullptr;
  double sum = 0.0;
  std::function<float()> rnd = std::bind(dist, rng);


  r_values = static_cast<float*>(_mm_malloc(sizeof(float) * test_size, 32));

  ret = static_cast<float*>(_mm_malloc(sizeof(float) * test_size, 32));

  if (!(r_values && ret))
    goto exit;

  std::generate_n(r_values, test_size, rnd);

  for (int i = 0; i < 100000; ++i) {
    polynomial(ret, r_values, test_size);

    sum = std::accumulate(ret, ret + test_size, sum);
  }
  std::cout << sum << std::endl;

exit:
  _mm_free(r_values);
  _mm_free(ret);


}

Conclusion

Intrinsics can be used to produce vectorized code across many different compilers. They are not difficult to use if the target machine is known ahead of time. But if the capabilities of the target machines can vary, things get more complicated for the developer. For example, to support machines with SSE but without AVX, a second SSE implementation would have to be written and maintained. One potential solution is a third-party tool such as ispc (the Intel SPMD Program Compiler). I don't have any personal experience with it yet; a possible topic for a future post.
