How to create a left-packed vector of indices of the 0s in one SIMD vector?


Please help me out, I can't figure this out myself:

I have a __m128i SIMD vector whose 16 bytes contain the following values:

1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1

Is it possible to transform this vector so that all the ones are removed, and each zero is replaced by the index of that element in the vector? That is, like this:

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15

1   0   1   1   0   1   0   1   1   1   0   1   0   1   0   1

    1           4       6               10      12      14

And finally get a vector with only these values:

1  4  6  10  12  14

What logic can be to obtain such a result? What SIMD instructions should be used?

PS: I'm just starting to learn SIMD, so I don't know or understand much yet.

CodePudding user response:

If you have BMI2, use the following version.

#include <cstdint>
#include <bit>           // std::popcount (C++20)
#include <immintrin.h>   // SSE intrinsics and BMI2 _pext_u64

__m128i compressZeroIndices_bmi2( __m128i v )
{
    const __m128i zero = _mm_setzero_si128();
    // Replace zeros with 0xFF
    v = _mm_cmpeq_epi8( v, zero );

    // Extract low/high pieces into scalar registers for PEXT instruction
    uint64_t low = (uint64_t)_mm_cvtsi128_si64( v );
    uint64_t high = (uint64_t)_mm_extract_epi64( v, 1 );

    // Count payload bytes in the complete vector
    v = _mm_sub_epi8( zero, v );
    v = _mm_sad_epu8( v, zero );
    v = _mm_add_epi64( v, _mm_srli_si128( v, 8 ) );
    v = _mm_shuffle_epi8( v, zero );
    // Make a mask vector filled with 0 for payload bytes, 0xFF for padding
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_max_epu8( v, identity );
    __m128i mask = _mm_cmpeq_epi8( v, identity );

    // The following line requires C++20
    // If you don't have it, use #ifdef _MSC_VER to switch between __popcnt64() and _popcnt64() intrinsics.
    uint64_t lowBits = std::popcount( low );
    // Use BMI2 to gather these indices
    low = _pext_u64( 0x0706050403020100ull, low );
    high = _pext_u64( 0x0F0E0D0C0B0A0908ull, high );

    // Merge payload into a vector
    // Note: if lowBits is 0 or 64 (no zero bytes, or only zero bytes, in the low half),
    // the variable shifts below are by 64 bits, which is undefined behaviour in C/C++;
    // handle those edge cases separately if they can occur in your data.
    v = _mm_cvtsi64_si128( low | ( high << lowBits ) );
    v = _mm_insert_epi64( v, high >> ( 64 - lowBits ), 1 );

    // Apply the mask to set unused elements to -1, enables pmovmskb   tzcnt to find the length
    return _mm_or_si128( v, mask );
}

Here’s another version without BMI2. Probably slower on most CPUs, but the code is way simpler, and doesn’t use any scalar instructions.

inline __m128i sortStep( __m128i a, __m128i perm, __m128i blend )
{
    // The min/max are independent and their throughput is 0.33-0.5 cycles,
    // so this whole function only takes 3 (AMD) or 4 (Intel) cycles to complete
    __m128i b = _mm_shuffle_epi8( a, perm );
    __m128i i = _mm_min_epu8( a, b );
    __m128i ax = _mm_max_epu8( a, b );
    return _mm_blendv_epi8( i, ax, blend );
}

__m128i compressZeroIndices( __m128i v )
{
    // Replace zeros with 0-based indices, ones with 0xFF
    v = _mm_cmpgt_epi8( v, _mm_setzero_si128() );
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_or_si128( v, identity );

    // Sort bytes in the vector with a network
    // https://demonstrations.wolfram.com/SortingNetworks/
    // Click the "transposition" algorithm on that demo
    const __m128i perm1 = _mm_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 );
    const __m128i blend1 = _mm_set1_epi16( (short)0xFF00 );
    const __m128i perm2 = _mm_setr_epi8( 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15 );
    const __m128i blend2 = _mm_setr_epi8( 0, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0 );
    for( size_t i = 0; i < 8; i++ )
    {
        v = sortStep( v, perm1, blend1 );
        v = sortStep( v, perm2, blend2 );
    }
    return v;
}

P.S. If you want the length of the output vector, use this function:

uint32_t vectorLength( __m128i v )
{
    uint32_t mask = (uint32_t)_mm_movemask_epi8( v );
    mask |= 0x10000;
    return _tzcnt_u32( mask );
}
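
For illustration, here is a minimal sketch (mine, not part of the answer) of how compressZeroIndices and vectorLength fit together; the wrapper name getZeroIndices and the raw-buffer output convention are assumptions:

// Writes the left-packed indices of the zero bytes of v into out[0 .. len-1] and returns len.
// Assumes out points to at least 16 writable bytes.
uint32_t getZeroIndices( __m128i v, uint8_t *out )
{
    __m128i packed = compressZeroIndices( v );     // or compressZeroIndices_bmi2( v )
    uint32_t len = vectorLength( packed );         // number of valid index bytes
    _mm_storeu_si128( (__m128i *)out, packed );    // bytes past len are 0xFF padding
    return len;
}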

CodePudding user response:

Horizontal data-dependent stuff is hard. This is not something traditional SIMD building blocks are good at. This is a tricky problem to start learning SIMD with.

If you had AVX512VBMI2 (Ice Lake), vpcompressb could do this in one instruction on a constant. (Well two, counting the test-into-mask of the input.)
Or with AVX-512BW (Skylake-avx512), you could use vpcompressd on a constant vector of 16x uint32_t and then pack that __m512i down to bytes after compressing with vpmovdb. (After the same test-into-mask of the byte vector).
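
For reference, a hedged sketch of what that could look like using the 128-bit VL forms of those instructions (untested, requires AVX512VBMI2 + AVX512BW + AVX512VL, e.g. Ice Lake and later; the function name is mine):

__m128i compressZeroIndices_avx512( __m128i v )
{
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    // Test into a mask: one bit set per byte of v that equals zero
    __mmask16 zeros = _mm_cmpeq_epi8_mask( v, _mm_setzero_si128() );
    // Left-pack the selected identity bytes; elements past the count are zeroed.
    // The number of valid indices is the popcount of the `zeros` mask.
    return _mm_maskz_compress_epi8( zeros, identity );
}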


16 separate elements means a single table-lookup is not viable; 2^16 x __m128i would be 64K x 16-bytes = 1 MiB, most accesses would miss in cache. (The code would be simple though; just _mm_cmpeq_epi8 against zero or _mm_slli_epi32(v, 7) / _mm_movemask_epi8 / use that 16-bit bitmask as an array index).
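
Just to make the indexing idea concrete, a sketch of that (impractical) version; the table name zeroIdxLut is hypothetical, and actually building and keeping the 1 MiB table resident is the problem:

extern const __m128i zeroIdxLut[65536];   // hypothetical precomputed 1 MiB table

__m128i compressZeroIndices_lut( __m128i v )
{
    unsigned mask = (unsigned)_mm_movemask_epi8( _mm_cmpeq_epi8( v, _mm_setzero_si128() ) );
    return zeroIdxLut[ mask ];             // one load, but it would usually miss in cache
}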

Possibly 4 lookups of 4-byte chunks, using 4 mask bits at a time, could work (with a SWAR add of 0x04040404 / 0x08080808 / 0x0c0c0c0c to offset the results). Your table could also store offset values, or you could use _lzcnt_u32 or something to figure out how much to increment a pointer until the next store, or _popcnt_u32(zpos&0xf).

#include <stdint.h>
#include <immintrin.h>
#include <stdalign.h>
#include <string.h>

// untested but compiles ok
char *zidx_SSE2(char *outbuf, __m128i v)
{
   alignas(64) static struct __attribute__((packed)) {
       uint32_t idx;
       uint8_t count;  // or make this also uint32_t, but still won't allow a memory-source add unless it's uintptr_t.  Indexing could be cheaper in a PIE though, *8 instead of *5 which needs both base and idx
   }lut[] = { // 16x 5-byte entries
      /*[0b0000]=*/ {0, 0}, /* [0b0001]= */ {0x00000000, 1}, /* [0b0010]= */ {0x00000001, 1 },
      //...  left-packed indices, count of non-zero bits
              /* [0b1111]=*/ {0x03020100, 4}
    };
    // Maybe pack the length into the high 4 bits, and mask?  Maybe not, it's a small LUT

   unsigned zpos = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
   for (int i=0 ; i<16 ; i+=4){
       uint32_t idx = lut[zpos & 0xf].idx;
       idx += (0x01010101 * i);  // this strength-reduces or is a constant after fully unrolling.  GCC -O2 even realizes it can use add reg, 0x04040404 *as* the loop counter; clang -fno-unroll-loops doesn't
       // idxs from bits 0..3, bits 4..7, bits 8..11, or bits 12..15
       memcpy(outbuf, &idx, sizeof(idx));   // x86 is little-endian.  Aliasing safe unaligned store.
       outbuf += lut[zpos & 0xf].count;  // or popcount(zpos&0xf)
       zpos >>= 4;
   }
   return outbuf;  // pointer to next byte after the last valid idx
}

https://godbolt.org/z/59Ev1Tz37 shows GCC and clang without loop unrolling. gcc -O3 does fully unroll it, as does clang at -O2 by default.

It will never store more than 16 bytes into outbuf, but stores fewer than that for inputs with fewer zero bytes. (But every store to outbuf is 4 bytes wide, even if there were zero actual indices in this chunk.) If all the input vector bytes are 0, the 4 stores won't overlap at all, otherwise they will (partially or fully) overlap. This is fine; cache and store buffers can absorb this easily.

SIMD vectors are fixed width, so IDK exactly what you meant about your output only having those values. The upper bytes have to be something; if you want zeros, then you could zero the outbuf first. Note that reloading from it into a __m128i vector would cause a store-forwarding stall (extra latency) if done right away after being written by 4x 32-bit stores. That's not a disaster, but it's not great. Best to do this into whatever actual output you want to write in the first place.
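
As a usage sketch (my own; the 16-byte buffer size follows from the "never more than 16 bytes" guarantee above):

#include <stdio.h>

int main(void)
{
    // The example vector from the question: zeros at positions 1, 4, 6, 10, 12, 14
    __m128i v = _mm_setr_epi8( 1,0,1,1, 0,1,0,1, 1,1,0,1, 0,1,0,1 );
    char idx[16];                          // worst case: all 16 input bytes are zero
    char *end = zidx_SSE2( idx, v );
    for( char *p = idx; p != end; p++ )
        printf( "%d ", *p );               // prints: 1 4 6 10 12 14
    return 0;
}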


BMI2 pext is a horizontal pack operation

You said in comments you want this for an i7 with AVX2.
That also implies you have fast BMI2 pext / pdep (Intel since Haswell, AMD since Zen 3). Earlier AMD CPUs support those instructions, but they're slow there (microcoded). pext / pdep do the bitwise equivalent of vpcompressb / vpexpandb on a uint64_t in an integer register.

This could allow a trick similar to the one in "AVX2 what is the most efficient way to pack left based on a mask?" (a related Q&A).
After turning your vector into a mask of 0 / 0xf nibbles, we can extract the corresponding nibbles with values 0..15 into the bottom of an integer register with one pext instruction.

Or maybe keep things packed to bytes at the smallest to avoid having to unpack nibbles back to bytes, so then you'd need two separate 8-byte left-pack operations and need a popcnt or lzcnt to figure out how they should overlap.

Your pext operands would be the 0 / 0xff bytes from a _mm_cmpeq_epi8(v, _mm_setzero_si128()), extracted in two uint64_t halves with lo = _mm_cvtsi128_si64(cmp) and hi = _mm_extract_epi64(cmp, 1).

Use memcpy as an unaligned aliasing-safe store, as in the LUT version.
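
Putting those pieces together, a minimal untested sketch of the byte-granularity pext variant described above (the function name zidx_pext and the output-buffer interface are mine, not from the answer; it needs SSE4.1, POPCNT and BMI2, and like the LUT version the caller's outbuf must have room for 16 bytes):

char *zidx_pext(char *outbuf, __m128i v)
{
    __m128i cmp = _mm_cmpeq_epi8(v, _mm_setzero_si128());   // 0xFF where the input byte == 0
    uint64_t lo = (uint64_t)_mm_cvtsi128_si64(cmp);
    uint64_t hi = (uint64_t)_mm_extract_epi64(cmp, 1);

    // Left-pack the identity bytes 0..7 and 8..15 according to the two mask halves
    uint64_t packedLo = _pext_u64(0x0706050403020100ull, lo);
    uint64_t packedHi = _pext_u64(0x0F0E0D0C0B0A0908ull, hi);

    unsigned countLo = (unsigned)_mm_popcnt_u64(lo) / 8;    // zero bytes in the low half
    unsigned countHi = (unsigned)_mm_popcnt_u64(hi) / 8;    // zero bytes in the high half

    memcpy(outbuf, &packedLo, 8);                // aliasing-safe unaligned stores;
    memcpy(outbuf + countLo, &packedHi, 8);      // the second overlaps the unused tail of the first
    return outbuf + countLo + countHi;           // pointer to next byte after the last valid idx
}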
