How to vectorise int8 multiplication in C (AVX2)


How do I vectorize this C function with AVX2?

static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}

CodePudding user response:

You need to add the restrict qualifier to c, to promise that it cannot alias with b.

The issue is that int8_t is likely a typedef for signed char, which can alias any other type under the strict-aliasing rule. Therefore the compiler cannot be sure that storing to c[i] will not modify b[i]. This forces it to re-load the data on every iteration.

The presence of const does not help here, because it only stops the programmer from modifying the data through the pointer b.

After changing the prototype to:

void propogate_neuron(const short a, const int8_t *b, int * restrict c)

the code gets vectorized. See Godbolt.
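
For reference, a minimal sketch of the full function with restrict applied (the loop body is unchanged from the question; only the prototype differs):

#include <stdint.h>

static void propogate_neuron(const short a, const int8_t *b, int *restrict c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];   // the compiler may now assume stores to c[] don't change b[]
    }
}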

CodePudding user response:

GCC already auto-vectorizes that with a check for overlap. Promising that there's no overlap by using int *restrict c lets GCC remove that check, and gets clang to decide to auto-vectorize.

However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later. (Although it's fully efficient on Zen.) GCC uses vpmullw and vpmulhw to get the low and high halves of 16-bit full multiplies, and shuffles those together. (Godbolt) This is a pretty clunky strategy, especially with -march=znver2 where vpmulld is a single uop.
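
For illustration, here is roughly what that widen / multiply-halves / re-interleave strategy looks like when written by hand with intrinsics. This is only a sketch of the approach, not GCC's exact output, and the function name gcc_style_propogate is made up:

#include <immintrin.h>
#include <stdint.h>

// Sketch of the vpmullw / vpmulhw strategy: 16 elements per iteration,
// 2 multiplies plus 2 unpacks and 2 lane permutes each time.
static void gcc_style_propogate(const short a, const int8_t *restrict b, int *restrict c) {
    __m256i va = _mm256_set1_epi16(a);
    for (int i = 0; i < 32; i += 16) {
        // sign-extend 16 bytes of b to 16-bit elements
        __m256i bw = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&b[i]));
        __m256i lo = _mm256_mullo_epi16(bw, va);     // low 16 bits of each product  (vpmullw)
        __m256i hi = _mm256_mulhi_epi16(bw, va);     // high 16 bits, signed         (vpmulhw)
        // interleave low/high halves into full 32-bit products (within 128-bit lanes)
        __m256i p0 = _mm256_unpacklo_epi16(lo, hi);  // products of elements 0-3 and 8-11
        __m256i p1 = _mm256_unpackhi_epi16(lo, hi);  // products of elements 4-7 and 12-15
        // extra lane shuffles to put the products back into memory order
        __m256i first  = _mm256_permute2x128_si256(p0, p1, 0x20);  // elements 0-7
        __m256i second = _mm256_permute2x128_si256(p0, p1, 0x31);  // elements 8-15
        __m256i sum0 = _mm256_add_epi32(first,  _mm256_loadu_si256((const __m256i*)&c[i]));
        __m256i sum1 = _mm256_add_epi32(second, _mm256_loadu_si256((const __m256i*)&c[i + 8]));
        _mm256_storeu_si256((__m256i*)&c[i],     sum0);
        _mm256_storeu_si256((__m256i*)&c[i + 8], sum1);
    }
}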

GCC's strategy does use only four single-uop multiply instructions, but it costs a lot of shuffles to reassemble the 32-bit results. We can do better:


Since we only need 8x16 => 32-bit multiplies, we can instead use vpmaddwd which is single-uop on Haswell/Skylake as well as Zen. https://uops.info/table.html

Unfortunately we can't take advantage of the horizontal-add part, since we need to add to a full 32-bit value in c[]. We need zeros in the high half of every pair of 16-bit elements to use it as just a 16x16 => 32-bit multiply within each 32-bit element.

#include <immintrin.h>
#include <stdint.h>

void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
    __m256i va = _mm256_set1_epi32( (uint16_t)a );   // [..., 0, a, 0, a]  16-bit elements

    for (int i = 0; i < 32; i += 8) {
        // sign-extend 8 bytes of b to 32-bit elements
        __m256i vb = _mm256_cvtepi8_epi32( _mm_loadl_epi64((const __m128i*)&b[i]) );
        __m256i prod = _mm256_madd_epi16(va, vb);    // 0*high + a*low = a*b in each 32-bit element
        __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
        _mm256_storeu_si256((__m256i*)&c[i], sum);
    }
}

Godbolt:

# clang13.0 -O3 -march=haswell
        movzx   eax, di
        vmovd   xmm0, eax                     # 0:a  16-bit halves
        vpbroadcastd    ymm0, xmm0            # repeated to every element

        vpmovsxbd       ymm1, qword ptr [rsi]  # xx:b 16-bit halves
        vpmaddwd        ymm1, ymm0, ymm1       # 0 + a*b in each 32-bit element
        vpaddd  ymm1, ymm1, ymmword ptr [rdx]
        vmovdqu ymmword ptr [rdx], ymm1

... repeated 3 more times, 8 elements per vector

        vpmovsxbd       ymm1, qword ptr [rsi + 8]
        vpmaddwd        ymm1, ymm0, ymm1
        vpaddd  ymm1, ymm1, ymmword ptr [rdx + 32]
        vmovdqu ymmword ptr [rdx + 32], ymm1

If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
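
If you do go that route, it's easy to sanity-check the intrinsics version against the scalar loop from the question. A minimal test (this main is just an illustration; it assumes both functions are defined in the same translation unit) could look like:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int8_t b[32];
    int c_scalar[32], c_simd[32];
    for (int i = 0; i < 32; i++) {
        b[i] = (int8_t)(i * 7 - 100);           // arbitrary values, including negatives
        c_scalar[i] = c_simd[i] = i * 1000;     // pre-existing accumulator contents
    }
    const short a = -1234;

    propogate_neuron(a, b, c_scalar);           // scalar reference from the question
    propogate_neuron_avx2(a, b, c_simd);        // intrinsics version

    if (memcmp(c_scalar, c_simd, sizeof(c_scalar)) != 0) {
        puts("mismatch");
        return 1;
    }
    puts("ok");
    return 0;
}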

It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.

If anyone wants to report this, leave a comment here. Otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for them. Especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks (32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit with horizontal add).
