I have an AVX2 (256-bit) SIMD vector of bytes that is padded with zeros at the front and at the back, like this:
[0, 2, 3, ..., 4, 5, 0, 0, 0]
The number of zeros at the front is not known at compile time.
How can I efficiently shift/rotate the zeros so that the vector looks like this?
[2, 3, 4, 5, ..., 0, 0, 0, 0]
CodePudding user response:
AVX2 has no way to do a lane-crossing shuffle with granularity smaller than 4 bytes. In this case, you'd want AVX-512 VBMI vpermb (in Ice Lake). If you had that, perhaps use vpcmpeqb / vpmovmskb / tzcnt on the inverted mask (so that set bits mark non-zero bytes), and use the result as an offset to load a window of 32 bytes from a constant array such as alignas(64) int8_t shuffles[64] = {0, 1, 2, ..., 31, 0, 1, 2, ..., 31}; that window is your shuffle-control vector for vpermb.
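A minimal sketch of that vpermb idea, under a few assumptions: the function name rotate_zeros_vpermb is made up for illustration, the pointer-based interface is only there to keep the example self-contained, and __attribute__((target(...))) and __builtin_ctz are GCC/Clang-specific. It needs a CPU with AVX-512 VBMI and AVX-512 VL to run.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Identity byte indices 0..31, written twice, so that any 32-byte window
// starting at offset 0..31 is a rotated identity permutation.
alignas(64) static const std::int8_t shuffles[64] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31};

// Hypothetical helper: rotates the leading zero bytes of `in` to the back.
__attribute__((target("avx512vbmi,avx512vl")))
void rotate_zeros_vpermb(const std::uint8_t in[32], std::uint8_t out[32]) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));

    // Bit i of the compare mask is set when byte i is zero; invert it so
    // that set bits mark non-zero bytes.
    __m256i is_zero = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    std::uint32_t nonzero_mask =
        ~static_cast<std::uint32_t>(_mm256_movemask_epi8(is_zero));

    // Index of the first non-zero byte (0 if the whole vector is zero).
    unsigned offset =
        nonzero_mask ? static_cast<unsigned>(__builtin_ctz(nonzero_mask)) : 0u;
    assert(offset < 32);

    // Load the shuffle-control window and apply the byte permute.
    __m256i ctrl = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(shuffles + offset));
    __m256i rotated = _mm256_permutexvar_epi8(ctrl, v);

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), rotated);
}
```

In real code you would probably keep everything in __m256i registers and compile the whole translation unit with the matching -m flags instead of using a per-function target attribute.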
Without AVX-512 VBMI, it might make sense to store the vector twice, back to back, and do an unaligned reload that spans both stores, despite the store-forwarding stall. That would be good for throughput if you need this for one vector between lots of other work, but bad for doing this in a loop without much other work.
Store-forwarding stalls don't pipeline with each other, but can pipeline with successful store-forwarding. So if you just need this for one vector occasionally, and out-of-order exec can hide the latency, it's not many uops to vpcmpeqb/tzcnt or lzcnt to get a load offset.
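The store/reload approach could look roughly like the sketch below. The function name rotate_zeros_store_reload and the pointer-based interface are assumptions made for illustration, and __attribute__((target("avx2"))) and __builtin_ctz are GCC/Clang-specific.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: rotates the leading zero bytes of `in` to the back
// with two stores and one unaligned reload (plain AVX2).
__attribute__((target("avx2")))
void rotate_zeros_store_reload(const std::uint8_t in[32], std::uint8_t out[32]) {
    __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));

    // Bit i of the compare mask is set when byte i is zero; invert it so
    // that set bits mark non-zero bytes.
    __m256i is_zero = _mm256_cmpeq_epi8(v, _mm256_setzero_si256());
    std::uint32_t nonzero_mask =
        ~static_cast<std::uint32_t>(_mm256_movemask_epi8(is_zero));

    // Number of leading zero bytes (0 if the whole vector is zero).
    unsigned offset =
        nonzero_mask ? static_cast<unsigned>(__builtin_ctz(nonzero_mask)) : 0u;
    assert(offset < 32);

    // Store the vector twice, back to back, then reload a 32-byte window
    // starting at `offset`. For offset > 0 the reload overlaps both stores,
    // so it cannot be store-forwarded.
    alignas(32) std::uint8_t buf[64];
    _mm256_store_si256(reinterpret_cast<__m256i*>(buf), v);
    _mm256_store_si256(reinterpret_cast<__m256i*>(buf + 32), v);
    __m256i rotated = _mm256_loadu_si256(
        reinterpret_cast<const __m256i*>(buf + offset));

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), rotated);
}
```

Because the bytes that wrap around are the leading zeros, the rotate gives the same result as a shift that fills with zeros.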
CodePudding user response:
If your elements are 32 bits or wider, this is easier.
I can't quite follow the documentation for _mm256_permutevar8x32_epi32, but in practice adding an offset to the identity permutation does a rotate (the instruction only looks at the low 3 bits of each index, so the indices wrap around modulo 8), which is what you want once you have the number of leading zeros.
#include <immintrin.h>

// Element i of the result is element (i + offset) % 8 of w.
__m256i rotate_i32(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}
Godbolt: https://godbolt.org/z/Kv8oxs6oY
Output:
(-1, -2, -3, -4, -5, -6, -7, -8)
(-2, -3, -4, -5, -6, -7, -8, -1)
(-3, -4, -5, -6, -7, -8, -1, -2)
(-4, -5, -6, -7, -8, -1, -2, -3)
(-5, -6, -7, -8, -1, -2, -3, -4)
(-6, -7, -8, -1, -2, -3, -4, -5)
(-7, -8, -1, -2, -3, -4, -5, -6)
(-8, -1, -2, -3, -4, -5, -6, -7)
The same trick works for 64-bit elements, but you need to multiply the offset by 2.
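// Element i of the result is 64-bit element (i + offset) % 4 of w.
__m256i rotate_i64(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    // Each 64-bit element spans two 32-bit lanes, hence offset * 2.
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset * 2));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}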
Godbolt: https://godbolt.org/z/85h6aWPsW
Output:
(-1, -2, -3, -4)
(-2, -3, -4, -1)
(-3, -4, -1, -2)
(-4, -1, -2, -3)
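Neither function above computes the offset itself. For 32-bit elements, a rough sketch of getting it from the vector and then rotating might look like this. The name rotate_zeros_i32 and the pointer-based interface are made up for illustration, and __attribute__((target("avx2"))) and __builtin_ctz are GCC/Clang-specific.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>

// Hypothetical helper: rotates the leading zero elements of eight 32-bit
// integers to the back, combining a leading-zero count with the
// identity-plus-offset permutation.
__attribute__((target("avx2")))
void rotate_zeros_i32(const std::int32_t in[8], std::int32_t out[8]) {
    __m256i w = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));

    // One mask bit per 32-bit element: set when the element is zero.
    __m256i is_zero = _mm256_cmpeq_epi32(w, _mm256_setzero_si256());
    unsigned zero_mask = static_cast<unsigned>(
        _mm256_movemask_ps(_mm256_castsi256_ps(is_zero)));

    // Invert so that set bits mark non-zero elements, keeping only 8 bits.
    unsigned nonzero_mask = ~zero_mask & 0xFFu;

    // Number of leading zero elements (0 if the whole vector is zero).
    int offset = nonzero_mask ? __builtin_ctz(nonzero_mask) : 0;
    assert(offset < 8);

    // Same permutation as rotate_i32: identity indices plus the offset.
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset));
    __m256i rotated = _mm256_permutevar8x32_epi32(w, shuffle);

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), rotated);
}
```

With the input (0, 0, 5, 6, 7, 8, 0, 0) the offset is 2 and the result is (5, 6, 7, 8, 0, 0, 0, 0).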