Efficiently shift-or large bit vector


I have a large in-memory array behind a pointer uint64_t * arr (plus a size), which represents plain bits. I need to very efficiently (as fast as possible) shift these bits to the left by some amount from 0 to 63.

By shifting the whole array I mean not shifting each element independently (like a[i] <<= Shift), but shifting it as a single large bit vector. In other words, for each intermediate position i (except the first and last elements) I can do the following in a loop:

dst[i] = w | (src[i] << Shift);
w = src[i] >> (64 - Shift);

where w is a temporary variable holding the carried-out (right-shifted) bits of the previous array element.
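
For reference, a complete version of that baseline could look like the following sketch (the function name and the shift == 0 guard are mine, purely illustrative):

#include <cstddef>
#include <cstdint>

// Baseline sketch: shift the whole bit vector left by 0 <= shift < 64 bits.
void shift_left_scalar(uint64_t* dst, const uint64_t* src, size_t n, unsigned shift) {
    if (shift == 0) {               // guard: src[i] >> 64 would be undefined
        for (size_t i = 0; i < n; ++i)
            dst[i] = src[i];
        return;
    }
    uint64_t w = 0;                 // bits carried over from the previous word
    for (size_t i = 0; i < n; ++i) {
        dst[i] = w | (src[i] << shift);
        w = src[i] >> (64 - shift);
    }
    // the final carry w is dropped here; a real implementation would either
    // grow dst by one word or discard it deliberately
}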

The solution above is simple and obvious. But I need something more efficient, as I have gigabytes of data.

Ideally I would use some SIMD instructions for that, so I'm looking for SIMD suggestions from experts. I need to implement the shifting code for all four popular instruction-set families: SSE-SSE4.2 / AVX / AVX2 / AVX-512.

But as far as I know, for SSE2 for example there exists only the _mm_slli_si128() intrinsic/instruction, which shifts only by a multiple of 8 bits (in other words, byte shifting). And I need shifting by an arbitrary bit count, not only whole bytes.
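
For what it's worth, an arbitrary bit shift of one whole 128-bit register can be composed from the per-lane 64-bit shifts plus that byte shift. A sketch of mine, SSE2 only, for a shift amount 0 < n < 64 (the carry between neighbouring 128-bit blocks of a long array would still have to be handled separately):

#include <emmintrin.h>  // SSE2

// Shift a whole __m128i left by n bits, 0 < n < 64 (illustrative sketch).
static inline __m128i sse2_shift_left_bits(__m128i v, int n) {
    __m128i cnt_lo = _mm_cvtsi32_si128(n);        // count for the per-lane left shift
    __m128i cnt_hi = _mm_cvtsi32_si128(64 - n);   // count for extracting the carried bits
    __m128i lanes  = _mm_sll_epi64(v, cnt_lo);    // each 64-bit lane shifted left by n
    __m128i carry  = _mm_srl_epi64(_mm_slli_si128(v, 8), cnt_hi);  // low lane's spilled bits, moved into the high lane
    return _mm_or_si128(lanes, carry);
}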

Without SIMD I can also shift by 128 bits at once using the shld instruction, which performs a double-width (128-bit) shift of a register pair. It is exposed as the intrinsic __shiftleft128() in MSVC, and the assembly it produces can be seen here.
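
A hedged sketch of that 128-bit double shift for both compiler families (the helper name is mine; on GCC/Clang the same result can be obtained via the unsigned __int128 extension):

#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Returns the high 64 bits of ((hi:lo) << shift), shift in [0, 63] (sketch).
static inline uint64_t shl128_high(uint64_t lo, uint64_t hi, unsigned char shift) {
#if defined(_MSC_VER) && defined(_M_X64)
    return __shiftleft128(lo, hi, shift);   // typically compiles to a single shld
#else
    unsigned __int128 v = ((unsigned __int128)hi << 64) | lo;   // GCC/Clang extension
    return (uint64_t)((v << shift) >> 64);
#endif
}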

BTW, I need solutions for all of MSVC/GCC/Clang.

Also, inside a single loop iteration I can shift 4 or 8 words with sequential operations; this lets the CPU pipeline the work and execute several independent instructions out of order.
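
A sketch of what such a 4-way unroll could look like (illustrative only; assumes n is a multiple of 4 and 1 <= shift <= 63, remainder handling omitted):

#include <cstddef>
#include <cstdint>

// 4-way unrolled inner loop: the four dst stores are independent, which gives
// the CPU more room for out-of-order execution.
void shift_left_unrolled4(uint64_t* dst, const uint64_t* src, size_t n, unsigned shift) {
    uint64_t w = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint64_t s0 = src[i], s1 = src[i + 1], s2 = src[i + 2], s3 = src[i + 3];
        dst[i]     = w                    | (s0 << shift);
        dst[i + 1] = (s0 >> (64 - shift)) | (s1 << shift);
        dst[i + 2] = (s1 >> (64 - shift)) | (s2 << shift);
        dst[i + 3] = (s2 >> (64 - shift)) | (s3 << shift);
        w = s3 >> (64 - shift);
    }
}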

If needed, my bit vector can be aligned to any number of bytes in memory, if that helps, for example, to improve SIMD speed with aligned reads/writes. Also, the source and destination bit vectors are in different (non-overlapping) memory.

In other words, I'm looking for any suggestions about how to solve my task most efficiently (most performantly) on different Intel CPUs.

Note, to clarify: I actually have to do several shift-ors, not just a single shift. I have a large bit vector X and several hundred shift sizes s0, s1, ..., sN, where each shift size is different and can also be large (for example a shift by 100K bits), and I want to compute the resulting large bit vector Y = (X << s0) | (X << s1) | ... | (X << sN). I just simplified my question for StackOverflow to shifting a single vector. But this detail about the original task is probably very important.

As requested by @Jake'Alquimista'LEE, I decided to implement a ready-made toy minimal reproducible example of what I want to do: computing shift-ors of the input bit vector src to produce the OR-ed final dst bit vector. This example is not optimized at all, just a straightforward simple variant of how my task can be solved. For simplicity this example has a small input vector, not gigabytes as in my case. It is a toy example; I didn't check whether it solves the task correctly, and it may contain minor bugs:

Try it online!

#include <algorithm>
#include <cstdint>
#include <vector>
#include <random>

#define bit_sizeof(x) (sizeof(x) * 8)

using u64 = uint64_t;
using T = u64;

int main() {
    std::mt19937_64 rng{123};

    // Random generate source bit vector
    std::vector<T> src(100'000);
    for (size_t i = 0; i < src.size(); ++i)
        src[i] = rng();

    size_t const src_bitsize = src.size() * bit_sizeof(T);

    // Destination bit vector, for example twice bigger in size
    std::vector<T> dst(src.size() * 2);

    // Random generate shifts
    std::vector<u64> shifts(200);
    for (size_t i = 0; i < shifts.size(); ++i)
        shifts[i] = rng() % src_bitsize;

    // Right-shift that handles overflow
    auto Shr = [](auto x, size_t s) {
        return s >= bit_sizeof(x) ? 0 : (x >> s);
    };

    // Do actual Shift-Ors
    for (auto orig_shift: shifts) {
        size_t const
            word_off = orig_shift / bit_sizeof(T),
            bit_off = orig_shift % bit_sizeof(T);

        if (word_off >= dst.size())
            continue;
        
        size_t const
            lim = std::min(src.size(), dst.size() - word_off);

        T w = 0;
        
        for (size_t i = 0; i < lim; ++i) {
            dst[word_off + i] |= w | (src[i] << bit_off);
            w = Shr(src[i], bit_sizeof(T) - bit_off);
        }

        // Special case of handling for last word
        if (word_off + lim < dst.size())
            dst[word_off + lim] |= w;
    }
}

My real project's current code is different from the toy example above. The project already correctly solves a real-world task; I just need to do extra optimizations. Some optimizations I have already done, like using OpenMP to parallelize the shift-or operations across all cores. Also, as said in the comments, I created specialized templated functions for each shift size, 64 functions in total, and I choose one of these 64 functions to do the actual shift-or. Each function has a compile-time value of the shift size, hence the compiler does extra optimizations taking the compile-time value into account.
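
A sketch of how such a dispatch on the bit offset could be organized (C++17; the names and the table-building trick are mine, the project's actual code surely differs):

#include <array>
#include <cstddef>
#include <cstdint>
#include <utility>

// One instantiation per bit offset, so the compiler sees Shift as a constant.
template <unsigned Shift>
void shift_or_impl(uint64_t* dst, const uint64_t* src, size_t n) {
    if constexpr (Shift == 0) {
        for (size_t i = 0; i < n; ++i) dst[i] |= src[i];
    } else {
        uint64_t w = 0;
        for (size_t i = 0; i < n; ++i) {
            dst[i] |= w | (src[i] << Shift);
            w = src[i] >> (64 - Shift);
        }
    }
}

using ShiftOrFn = void (*)(uint64_t*, const uint64_t*, size_t);

// Build a table of all 64 instantiations once, then index it at runtime.
template <size_t... I>
constexpr std::array<ShiftOrFn, sizeof...(I)> make_shift_or_table(std::index_sequence<I...>) {
    return { &shift_or_impl<I>... };
}

inline void shift_or_dispatch(uint64_t* dst, const uint64_t* src, size_t n, unsigned bit_off) {
    static constexpr auto table = make_shift_or_table(std::make_index_sequence<64>{});
    table[bit_off & 63](dst, src, n);
}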

CodePudding user response:

You can, and possibly you don't even need to use SIMD instructions explicitly. The target compilers GCC, Clang and MSVC, and other compilers like ICC, all support auto-vectorization.

For instance, a simple shiftvec version

void shiftvec(uint64_t* dst, uint64_t* src, int size, int shift)
{
    for (int i = 0; i < size; ++i, ++src, ++dst)
    {
        *dst = ((*src)<<shift) | (*(src+1)>>(64-shift));
    }
}

compiled with a recent GCC (Clang works as well) and -O3 -std=c++11 -mavx2 leads to SIMD instructions in the core loop of the assembly

.L5:
  vmovdqu ymm2, YMMWORD PTR [rcx+rax]
  vmovdqu ymm3, YMMWORD PTR [rcx+8+rax]
  vpsllq ymm0, ymm2, 5
  vpsrlq ymm1, ymm3, 59
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rdi+rax], ymm0
  add rax, 32
  cmp rax, rdx
  jne .L5

See on godbolt.org: https://godbolt.org/z/WafqGTTW4

This also generalizes if you want to combine multiple shifts into dst:

void shiftvec2(uint64_t* dst, uint64_t* src1, uint64_t* src2, int size1, int size2, int shift1, int shift2)
{
    int size = size1<size2 ? size1 : size2;
    for (int i = 0; i < size; ++i, ++src1, ++src2, ++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
        *dst |= ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
    }
    for (int i = size; i < size1; ++i, ++src1, ++dst)
    {
        *dst = ((*src1)<<shift1) | (*(src1+1)>>(64-shift1));
    }
    for (int i = size; i < size2; ++i, ++src2, ++dst)
    {
        *dst = ((*src2)<<shift2) | (*(src2+1)>>(64-shift2));
    }
}

compiles to this core loop:

.L38:
  vmovdqu ymm7, YMMWORD PTR [rsi+rcx]
  vpsllq ymm1, ymm7, xmm4
  vmovdqu ymm7, YMMWORD PTR [rsi+8+rcx]
  vpsrlq ymm0, ymm7, xmm6
  vpor ymm1, ymm1, ymm0
  vmovdqu YMMWORD PTR [rax+rcx], ymm1
  vmovdqu ymm7, YMMWORD PTR [rdx+rcx]
  vpsllq ymm0, ymm7, xmm3
  vmovdqu ymm7, YMMWORD PTR [rdx+8+rcx]
  vpsrlq ymm2, ymm7, xmm5
  vpor ymm0, ymm0, ymm2
  vpor ymm0, ymm0, ymm1
  vmovdqu YMMWORD PTR [rax+rcx], ymm0
  add rcx, 32
  cmp r10, rcx
  jne .L38

How many shifts you can combine is of course limited by the number of available registers.

CodePudding user response:

I would attempt to rely on the x64 ability to read from unaligned addresses, which comes with almost no visible penalty when the stars are properly (un)aligned. One would only need to handle a few cases of (shift % 8) or (shift % 16), all doable with the SSE2 instruction set: fill the remainder with zeros, address the data vector at an unaligned byte offset, and avoid the UB with memcpy.

That said, the inner loop would look like:

uint16_t const *ptr;
// c and c16 are assumed to hold the per-element shift counts as __m128i values
// (e.g. _mm_cvtsi32_si128(shift) and _mm_cvtsi32_si128(16 - shift))
auto a = _mm_loadu_si128((__m128i*)ptr);
auto b = _mm_loadu_si128((__m128i*)(ptr + 1));
a = _mm_srl_epi16(a, c);
b = _mm_sll_epi16(b, c16);
_mm_storeu_si128((__m128i*)ptr, _mm_or_si128(a, b));
ptr += 8;

Unrolling this loop a few times, one might be able to use _mm_alignr_epi8 on SSSE3 to reduce memory bandwidth pressure (and relieve those pipeline stages that need to combine results from unaligned memory accesses):

auto a0 = w;
auto a1 = _mm_load_si128(m128ptr + 1);
auto a2 = _mm_load_si128(m128ptr + 2);
auto a3 = _mm_load_si128(m128ptr + 3);
auto a4 = _mm_load_si128(m128ptr + 4);
auto b0 = _mm_alignr_epi8(a1, a0, 2);
auto b1 = _mm_alignr_epi8(a2, a1, 2);
auto b2 = _mm_alignr_epi8(a3, a2, 2);
auto b3 = _mm_alignr_epi8(a4, a3, 2);
// ... do the computation as above ...
w = a4;   // rotate the context
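
The "avoid the UB with memcpy" remark above refers to the usual idiom for aliasing-safe unaligned loads; a minimal sketch:

#include <cstdint>
#include <cstring>

// Aliasing-safe unaligned 64-bit load from an arbitrary byte address; mainstream
// compilers turn this memcpy into a single unaligned load instruction.
static inline uint64_t load_u64_unaligned(const void* p) {
    uint64_t v;
    std::memcpy(&v, p, sizeof v);
    return v;
}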

CodePudding user response:

#include <cstdint>
#include <immintrin.h>

template<unsigned Shift>
void foo(uint64_t* __restrict pDst, const uint64_t* __restrict pSrc, intptr_t size)
{
    const uint64_t *pSrc0, *pSrc1, *pSrc2, *pSrc3;
    uint64_t *pDst0, *pDst1, *pDst2, *pDst3;
    __m256i prev, current;
    intptr_t i, stride;

    stride = size >> 2;
    i = stride;

    pSrc0 = pSrc;
    pSrc1 = pSrc + stride;
    pSrc2 = pSrc + 2 * stride;
    pSrc3 = pSrc + 3 * stride;

    pDst0 = pDst;
    pDst1 = pDst + stride;
    pDst2 = pDst + 2 * stride;
    pDst3 = pDst + 3 * stride;

    prev = _mm256_set_epi64x(0, pSrc1[-1], pSrc2[-1], pSrc3[-1]);

    while (i--)
    {
        current = _mm256_set_epi64x(*pSrc0++, *pSrc1++, *pSrc2++, *pSrc3++);
        prev = _mm256_srli_epi64(prev, 64 - Shift);
        prev = _mm256_or_si256(prev, _mm256_slli_epi64(current, Shift));
        *pDst0++ = _mm256_extract_epi64(prev, 3);
        *pDst1++ = _mm256_extract_epi64(prev, 2);
        *pDst2++ = _mm256_extract_epi64(prev, 1);
        *pDst3++ = _mm256_extract_epi64(prev, 0);

        prev = current;
    }
}

You can do the operation on up to four 64-bit elements at once with AVX2 (up to eight with AVX-512).

If size isn't a multiple of four, there will be up to 3 remaining elements to deal with, as in the scalar sketch below.
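
A hedged scalar sketch of that tail handling (mine, not part of the answer's code; begin would be 4 * (size >> 2)):

#include <cstdint>

// Handle the (size & 3) trailing words the four-stride AVX2 loop does not cover.
static void shift_tail(uint64_t* dst, const uint64_t* src,
                       intptr_t begin, intptr_t size, unsigned shift) {
    for (intptr_t i = begin; i < size; ++i) {
        uint64_t prev = (i > 0) ? src[i - 1] : 0;   // word whose top bits spill into dst[i]
        dst[i] = (src[i] << shift) | (shift ? (prev >> (64 - shift)) : 0);
    }
}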

PS: Auto vectorization is never a proper solution.

CodePudding user response:

No, you can't

Both NEON and AVX(-512) support per-element (barrel) shift operations only up to 64-bit elements.

You can however "shift" the whole 128-bit vector by n bytes (not bits) with the ext instruction on NEON and alignr on AVX.
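
For reference, a hedged x86 sketch of that byte-granularity shift (palignr is SSSE3; the 256-bit _mm256_alignr_epi8 works per 128-bit lane):

#include <tmmintrin.h>  // SSSE3

// Treat hi:lo as one 256-bit window and shift it right by BYTES bytes,
// returning the low 128 bits; BYTES must be a compile-time constant.
template <int BYTES>
static inline __m128i byte_shift_across(__m128i hi, __m128i lo) {
    return _mm_alignr_epi8(hi, lo, BYTES);
}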

And you should avoid using the vector class for performance reasons, since it's nothing other than a linked list, which is bad for performance.
