It is known that GCC/Clang auto-vectorize loops well using SIMD instructions.
It is also known that there exists the standard C++ alignas() specifier, which among other uses allows aligning a stack variable, as in the following code:
#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);
    std::cout << uint64_t(&x) % 1024 << " "
              << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
              << uint64_t(&y) % 16384 << std::endl;
}
Outputs:
0 9216
0 9216
which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
Let's now look at another piece of code:
#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
If compiled with the -std=c++20 -O3 -mavx512f options on GCC, it produces the following assembly (relevant part shown):
vmovdqu64 zmm1, ZMMWORD PTR [rdi]
vpxorq zmm0, zmm1, ZMMWORD PTR [rsi]
vmovdqu64 ZMMWORD PTR [rdi], zmm0
vmovdqu64 zmm0, ZMMWORD PTR [rsi+64]
vpxorq zmm0, zmm0, ZMMWORD PTR [rdi+64]
vmovdqu64 ZMMWORD PTR [rdi+64], zmm0
which twice performs an AVX-512 unaligned load, xor, and unaligned store. So our 64-bit array-xor operation was auto-vectorized by GCC into AVX-512 registers, and the loop was unrolled as well.
My question is: how do I tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) seen in the code above, GCC uses an aligned load (vmovdqa64)? It is known that aligned loads/stores can be considerably faster.
My first attempt to force GCC to use aligned loads/stores was the following code:
#include <cstdint>

void g(uint64_t (&x_)[16],
       uint64_t const (&y_)[16]) {
    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
but this code still produces unaligned loads (vmovdqu64), the same as the assembly of the previous snippet. Hence this alignas(64) hint gives GCC nothing useful for improving the generated code.
My question is: how do I force GCC to perform aligned auto-vectorization, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?
If possible, I need solutions for all of GCC/Clang/MSVC.
CodePudding user response:
Though not entirely portable across compilers, __builtin_assume_aligned will tell GCC to assume the pointers are aligned.
I often use a different, more portable strategy based on a helper struct:

#include <array>
#include <cstddef>
#include <cstdint>

template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;
    std::array<uint64_t,size> v;

    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }
    bool get_bit(size_t c) const { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c) { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c) { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c) { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }
    // mccl::hammingweight is a popcount helper from my own library
    size_t hammingweight() const { size_t w = 0; for (size_t i = 0; i < size; ++i) w += mccl::hammingweight(v[i]); return w; }
    bool parity() const { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return mccl::hammingweight(x)%2; }
};
and then convert the pointer to uint64_t into a pointer to this struct using reinterpret_cast.
Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
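For concreteness, here is a minimal, self-contained sketch of that pattern; the block size of 512 bits and the wrapper name f are my own choices for illustration, and only operator^= from the full struct is kept:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Cut-down version of the helper: a 64-byte-aligned block of 8 uint64_t.
template<std::size_t Bits>
struct alignas(Bits / 8) uint64_block_t {
    static const std::size_t size = Bits / 64;
    std::array<uint64_t, size> v;
    uint64_block_t& operator^=(const uint64_block_t& v2) {
        for (std::size_t i = 0; i < size; ++i) v[i] ^= v2.v[i];
        return *this;
    }
};

// xor 16 words, i.e. two 512-bit blocks. x and y must really be 64-byte
// aligned, otherwise the reinterpret_cast invokes undefined behavior.
void f(uint64_t* x, const uint64_t* y) {
    auto* xb = reinterpret_cast<uint64_block_t<512>*>(x);
    auto* yb = reinterpret_cast<const uint64_block_t<512>*>(y);
    xb[0] ^= yb[0];
    xb[1] ^= yb[1];
}
```

Because the operand type carries alignas(64), the compiler may emit aligned loads here; note that the cast technically skirts strict aliasing, which this approach accepts as a practical trade-off.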
CodePudding user response:
As I infer from your own answer, you're interested in an MSVC solution too.
MSVC understands the proper use of alignas as well as its own __declspec(align), and it also understands __builtin_assume_aligned, but it intentionally does not do anything with the known alignment.
My report was closed as "Duplicate":
The related reports were closed as "Not a bug":
- [MSConnect 3068950] - C++: MOVUPS is generated for alignof(16) data instead of MOVAPS
- Regression (from VS 2015) in SSSE/AVX instruction generation ((V)MOVUPS instead of (V)MOVAPS)
MSVC still takes advantage of the alignment of global variables, if it can see that the pointer points to a global variable. Even this does not work in every case.
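A sketch of the distinction being drawn; the names g_x, g_y, g, and h are made up for illustration:

```cpp
#include <cstdint>

// Globals whose 64-byte alignment is visible at the point of use.
alignas(64) uint64_t g_x[16];
alignas(64) uint64_t g_y[16];

// Here the compiler can prove both operands are 64-byte-aligned globals,
// so MSVC may emit aligned moves.
void g() {
    for (int i = 0; i < 16; ++i)
        g_x[i] ^= g_y[i];
}

// Here it only knows what the pointer type guarantees (8-byte alignment),
// so MSVC falls back to unaligned moves even if callers always pass g_x/g_y.
void h(uint64_t* x, const uint64_t* y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
```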
CodePudding user response:
Just now @MarcStevens suggested a working solution to my question, using __builtin_assume_aligned:
#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
It actually produces code with the aligned vmovdqa64 instruction.
But only GCC produces the aligned instruction. Clang still uses unaligned loads (see here); also, Clang uses AVX-512 registers only with more than 16 elements.
So Clang and MSVC solutions are still welcome.