It is known that GCC/Clang auto-vectorize loops well using SIMD instructions.
It is also known that there exists the standard C++ alignas() specifier, which among other uses allows aligning a stack variable, as in the following code:
#include <cstdint>
#include <iostream>

int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);
    std::cout << uint64_t(&x) % 1024 << " "
              << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
              << uint64_t(&y) % 16384 << std::endl;
}
Outputs:
0 9216
0 9216
which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
Let's now look at another piece of code:
#include <cstdint>

void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
If compiled with the -std=c++20 -O3 -mavx512f options on GCC, it produces the following assembly (relevant part shown):
vmovdqu64 zmm1, ZMMWORD PTR [rdi]
vpxorq zmm0, zmm1, ZMMWORD PTR [rsi]
vmovdqu64 ZMMWORD PTR [rdi], zmm0
vmovdqu64 zmm0, ZMMWORD PTR [rsi+64]
vpxorq zmm0, zmm0, ZMMWORD PTR [rdi+64]
vmovdqu64 ZMMWORD PTR [rdi+64], zmm0
which twice performs an AVX-512 unaligned load, xor, and unaligned store. So our 64-bit array-xor operation was auto-vectorized by GCC into AVX-512 registers, and the loop was unrolled as well.
My question is: how do I tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned load (vmovdqu64) seen in the code above, GCC uses an aligned load (vmovdqa64)? It is known that aligned loads/stores can be considerably faster.
My first attempt to force GCC to use aligned loads/stores was the following code:
#include <cstdint>

void g(uint64_t (&x_)[16],
       uint64_t const (&y_)[16]) {
    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
but this code still produces unaligned loads (vmovdqu64), the same as the assembly of the previous snippet. Hence this alignas(64) hint gives GCC nothing useful for improving the generated code.
My question is: how do I force GCC to perform aligned auto-vectorization, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?
If possible, I need solutions for all of GCC/Clang/MSVC.
CodePudding user response:
Though not entirely portable across compilers, __builtin_assume_aligned will tell GCC to assume the pointers are aligned.
I often use a different, more portable strategy based on a helper struct:

#include <array>
#include <cstddef>
#include <cstdint>

template<size_t Bits>
struct alignas(Bits/8) uint64_block_t
{
    static const size_t bits = Bits;
    static const size_t size = bits/64;
    std::array<uint64_t,size> v;

    uint64_block_t& operator&=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] &= v2.v[i]; return *this; }
    uint64_block_t& operator^=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] ^= v2.v[i]; return *this; }
    uint64_block_t& operator|=(const uint64_block_t& v2) { for (size_t i = 0; i < size; ++i) v[i] |= v2.v[i]; return *this; }
    uint64_block_t operator&(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp &= v2; }
    uint64_block_t operator^(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp ^= v2; }
    uint64_block_t operator|(const uint64_block_t& v2) const { uint64_block_t tmp(*this); return tmp |= v2; }
    uint64_block_t operator~() const { uint64_block_t tmp; for (size_t i = 0; i < size; ++i) tmp.v[i] = ~v[i]; return tmp; }
    bool operator==(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return false; return true; }
    bool operator!=(const uint64_block_t& v2) const { for (size_t i = 0; i < size; ++i) if (v[i] != v2.v[i]) return true; return false; }
    bool get_bit(size_t c) const { return (v[c/64]>>(c%64))&1; }
    void set_bit(size_t c) { v[c/64] |= uint64_t(1)<<(c%64); }
    void flip_bit(size_t c) { v[c/64] ^= uint64_t(1)<<(c%64); }
    void clear_bit(size_t c) { v[c/64] &= ~(uint64_t(1)<<(c%64)); }
    void set_bit(size_t c, bool b) { v[c/64] &= ~(uint64_t(1)<<(c%64)); v[c/64] |= uint64_t(b ? 1 : 0)<<(c%64); }
    // mccl::hammingweight is a popcount helper from my own library
    size_t hammingweight() const { size_t w = 0; for (size_t i = 0; i < size; ++i) w += mccl::hammingweight(v[i]); return w; }
    bool parity() const { uint64_t x = 0; for (size_t i = 0; i < size; ++i) x ^= v[i]; return mccl::hammingweight(x)%2; }
};
and then convert the pointer to uint64_t into a pointer to this struct using reinterpret_cast.
Converting a loop over uint64_t into a loop over these blocks typically auto-vectorizes very well.
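For concreteness, here is a minimal, self-contained sketch of that pattern; the block size of 512 bits and the wrapper name f are my own choices for illustration, and only operator^= from the full struct is kept:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Cut-down version of the helper: a 64-byte-aligned block of 8 uint64_t.
template<std::size_t Bits>
struct alignas(Bits / 8) uint64_block_t {
    static const std::size_t size = Bits / 64;
    std::array<uint64_t, size> v;
    uint64_block_t& operator^=(const uint64_block_t& v2) {
        for (std::size_t i = 0; i < size; ++i) v[i] ^= v2.v[i];
        return *this;
    }
};

// xor 16 words, i.e. two 512-bit blocks. x and y must really be 64-byte
// aligned, otherwise the reinterpret_cast invokes undefined behavior.
void f(uint64_t* x, const uint64_t* y) {
    auto* xb = reinterpret_cast<uint64_block_t<512>*>(x);
    auto* yb = reinterpret_cast<const uint64_block_t<512>*>(y);
    xb[0] ^= yb[0];
    xb[1] ^= yb[1];
}
```

Because the operand type carries alignas(64), the compiler may emit aligned loads here; note that the cast technically skirts strict aliasing, which this approach accepts as a practical trade-off.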
CodePudding user response:
As I infer from your own answer, you're interested in an MSVC solution too.
MSVC understands the proper use of alignas as well as its own __declspec(align), and it also understands __builtin_assume_aligned, but it intentionally does not do anything with the known alignment.
My report was closed as "Duplicate":
The related reports were closed as "Not a bug":
- [MSConnect 3068950] - C++: MOVUPS is generated for alignof(16) data instead of MOVAPS
- Regression (from VS 2015) in SSSE/AVX instruction generation ((V)MOVUPS instead of (V)MOVAPS)
MSVC still takes advantage of the alignment of global variables, if it can see that the pointer points to a global variable. Even this does not work in every case.
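A sketch of the distinction being drawn; the names g_x, g_y, g, and h are made up for illustration:

```cpp
#include <cstdint>

// Globals whose 64-byte alignment is visible at the point of use.
alignas(64) uint64_t g_x[16];
alignas(64) uint64_t g_y[16];

// Here the compiler can prove both operands are 64-byte-aligned globals,
// so MSVC may emit aligned moves.
void g() {
    for (int i = 0; i < 16; ++i)
        g_x[i] ^= g_y[i];
}

// Here it only knows what the pointer type guarantees (8-byte alignment),
// so MSVC falls back to unaligned moves even if callers always pass g_x/g_y.
void h(uint64_t* x, const uint64_t* y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
```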
CodePudding user response:
Just now @MarcStevens suggested a working solution to my question, using __builtin_assume_aligned:
#include <cstdint>

void f(uint64_t * x_, uint64_t * y_) {
    uint64_t * x = (uint64_t *)__builtin_assume_aligned(x_, 64);
    uint64_t * y = (uint64_t *)__builtin_assume_aligned(y_, 64);
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
It actually produces code with the aligned vmovdqa64 instruction.
But only GCC produces the aligned instruction. Clang still uses unaligned loads (see here); also, Clang uses AVX-512 registers only with more than 16 elements.
So Clang and MSVC solutions are still welcome.