Quickest way to shift/rotate byte vector with SIMD-CodePudding

I have a avx2(256 bit) SIMD vector of bytes that is padded with zeros in front and in the back that looks like this: [0, 2, 3, ..., 4, 5, 0, 0, 0]. The amount of zeros in the front is not known compile-time.

How would I efficiently shift/rotate the zeros such that it would look like this: [2, 3, 4, 5, ..., 0, 0, 0, 0]?

CodePudding user response：

AVX2 has no way to do a lane-crossing shuffle with granularity smaller than 4 bytes. In this case, you'd want AVX-512 VBMI vpermb (in Ice Lake). If you had that, perhaps vpcmpeqb / vpmovmskb / tzcnt on the mask, and use that as an offset to load a window of 32 bytes from a constant array of alignas(64) int8_t shuffles = {0,1,2,...,31, 0, 1, 2, ... 31};. That's your shuffle-control vector for vpermb.

Without AVX-512 VBMI, it might make sense to store twice and do an unaligned reload spanning them, despite the store-forwarding stall. That would be good for throughput if you need this for one vector between lots of other work, but bad for doing this in a loop without much other work.

Store-forwarding stalls don't pipeline with each other, but can pipeline with successful store-forwarding. So if you just need this for one vector occasionally, and out-of-order exec can hide the latency, it's not many uops to vpcmpeqb/tzcnt or lzcnt to get a load offset.

CodePudding user response：

If your types are bigger than 32bits.

I can't quite understand the documentation on _mm256_permutevar8x32_epi32 but in practise, adding offset to identity permutation does a rotate - which is what you want (when you already got the number of leading 0s).

__m256i rotate_i32(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}

Here is the godbolt: https://godbolt.org/z/Kv8oxs6oY

(-1, -2, -3, -4, -5, -6, -7, -8)
(-2, -3, -4, -5, -6, -7, -8, -1)
(-3, -4, -5, -6, -7, -8, -1, -2)
(-4, -5, -6, -7, -8, -1, -2, -3)
(-5, -6, -7, -8, -1, -2, -3, -4)
(-6, -7, -8, -1, -2, -3, -4, -5)
(-7, -8, -1, -2, -3, -4, -5, -6)
(-8, -1, -2, -3, -4, -5, -6, -7)

The same trick works for 64 bits, but you need to mutliply offset by 2.

__m256i rotate_i64(__m256i w, int offset) {
    __m256i identity = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i shuffle = _mm256_add_epi32(identity, _mm256_set1_epi32(offset * 2));
    return _mm256_permutevar8x32_epi32(w, shuffle);
}

Godbolt: https://godbolt.org/z/85h6aWPsW

Output:

(-1, -2, -3, -4)
(-2, -3, -4, -1)
(-3, -4, -1, -2)
(-4, -1, -2, -3)