C Loop vectorization - counting matches of 7-byte records with masking


I have a fairly simple loop:


auto indexRecord = getRowPointer(0);
bool equals;
// recordCount is about 6 000 000
for (int i = 0; i < recordCount; ++i) {
    equals = BitString::equals(SelectMask, indexRecord, maxBytesValue);
    rowsFound += equals;
    indexRecord += byteSize; // byteSize is 7
}
}

Where BitString::equals is:

static inline bool equals(const char * mask, const char * record, uint64_t maxVal) {
    return !(((*( uint64_t * ) mask) & (maxVal & *( uint64_t * ) record)) ^ (maxVal & *( uint64_t * ) record));
}

This code is used to simulate Bitmap Index querying in databases. My question is whether there's a way to vectorize this loop as it goes through all the records. When compiling with GCC and -fopt-info-vec-missed -O3, I get: missed: couldn't vectorize loop.

I am new to this kind of optimization and would like to learn more; it just feels like I am missing something.
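
For reference, the invocation is essentially the following (the file name is just a placeholder):

g++ -O3 -fopt-info-vec-missed index_scan.cpp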

CodePudding user response:

If the record size were 8, both GCC and Clang would autovectorize, for example (hopefully a sufficiently representative stand-in for the actual context in which your code occurs):

int count(char * indexRecord, const char * SelectMask, uint64_t maxVal)
{
    bool equals;
    uint64_t rowsFound = 0;
    // some arbitrary number of records
    for (int i = 0; i < 1000000; ++i) {
        // tequals is the question's BitString::equals, renamed here so it
        // doesn't collide with the local bool
        equals = tequals(SelectMask, indexRecord, maxVal);
        rowsFound += equals;
        indexRecord += 8; // record size padded out to 8
    }
    return rowsFound;
}

The important part of it, as compiled by GCC, looks like this:

.L4:
    vpand   ymm0, ymm2, YMMWORD PTR [rdi]
    add     rdi, 32
    vpcmpeqq        ymm0, ymm0, ymm3
    vpsubq  ymm1, ymm1, ymm0
    cmp     rax, rdi
    jne     .L4

Not bad. It uses the same ideas that I would use manually: vpand the data with a mask, compare to zero, and subtract the results of the comparisons (subtract, because a True result is indicated with -1) from 4 counters packed in a vector; the four separate counts are added together after the loop. The single vpand is a simplification of your bitwise logic: with x = record & maxVal, the test !((mask & x) ^ x) is equivalent to (x & ~mask) == 0, so one AND with maxVal & ~mask followed by a compare against zero suffices.

By the way, note that I made rowsFound a uint64_t. That's important: if rowsFound is not 64-bit, then both Clang and GCC will try very hard to narrow the count as soon as possible, which is exactly the opposite of a good approach; it costs many more instructions in the loop and has no benefit. If the count is meant to be a 32-bit int in the end, it can simply be narrowed after the loop, where doing so is probably not merely cheap but actually free.

Something equivalent to that code would not be difficult to write manually with SIMD intrinsics. That would make the code less brittle (it wouldn't rely on hoping that compilers do the right thing), but it would no longer work on non-x86 platforms.
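
For instance, a manually vectorized AVX2 equivalent of the 8-byte loop might look like the sketch below. It is untested; the function name and the choice of passing SelectMask by value are mine, not from the code above, and the count stays 64-bit until the final return, per the previous paragraph.

#include <immintrin.h>
#include <stdint.h>

int count8(const char * indexRecord, uint64_t SelectMask, uint64_t maxVal)
{
    // a record matches when (record & maxVal & ~SelectMask) == 0
    __m256i mask = _mm256_set1_epi64x(~SelectMask & maxVal);
    __m256i counts = _mm256_setzero_si256();
    __m256i zero = _mm256_setzero_si256();
    for (int i = 0; i < 1000000; i += 4) {         // 4 records per 32-byte load
        __m256i records = _mm256_loadu_si256((const __m256i*)indexRecord);
        indexRecord += 8 * 4;
        __m256i isZero = _mm256_cmpeq_epi64(_mm256_and_si256(records, mask), zero);
        counts = _mm256_sub_epi64(counts, isZero); // matching lanes are -1
    }
    // add the 4 packed counters after the loop
    __m128i lo = _mm256_castsi256_si128(counts);
    __m128i hi = _mm256_extracti128_si256(counts, 1);
    lo = _mm_add_epi64(lo, hi);
    uint64_t rowsFound = (uint64_t)_mm_cvtsi128_si64(lo) + (uint64_t)_mm_extract_epi64(lo, 1);
    return (int)rowsFound;                         // narrowing here is free
}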

If the records are supposed to be 7-byte, that's a more annoying problem to deal with. GCC gives up; Clang actually goes ahead with its auto-vectorization, but not in a good way: the 8-byte loads are all done individually and the results are then put together in a vector, which is all a big waste of time.

When doing it manually with SIMD intrinsics, the main problem would be unpacking the 7-byte records into qword lanes. An SSE4.1 version could use pshufb to do this easily (pshufb is from SSSE3, but pcmpeqq is from SSE4.1, so it makes sense to target SSE4.1); a sketch follows below. An AVX2 version could do two loads, combine them, and then vpshufb. That's not as nice as having just one load, but anything else I could think of seemed worse, and I think it's a bit better than loading each record individually. Unfortunately, the vpshufb that AVX2 has acts like two 128-bit pshufbs side by side, so doing one load and one vpshufb does not work: there would be a record crossing the boundary between the 128-bit halves, and vpshufb cannot move data across those halves. AVX512VBMI, on the other hand, does have a "full" byte permute, so the AVX512VBMI version could be nicer (see the sketch after the AVX2 code).
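
Here is what that SSE4.1 approach could look like; untested, with an invented function name, and like the AVX2 version it reads slightly past the last record (16 bytes loaded per 14 consumed), so it needs a couple of bytes of padding at the end.

#include <smmintrin.h>   // SSE4.1 (pshufb itself comes via SSSE3)
#include <stdint.h>

int count_sse41(const char * indexRecord, uint64_t SelectMask, uint64_t maxVal)
{
    __m128i mask = _mm_set1_epi64x(~SelectMask & maxVal);
    __m128i count = _mm_setzero_si128();
    __m128i zero = _mm_setzero_si128();
    // spread two consecutive 7-byte records into two qword lanes;
    // a -1 index makes pshufb write a zero byte
    __m128i shufmask = _mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1, 7, 8, 9, 10, 11, 12, 13, -1);
    for (int i = 0; i < 1000000; i += 2) {        // 2 records per iteration
        __m128i records = _mm_loadu_si128((const __m128i*)indexRecord);
        indexRecord += 7 * 2;
        records = _mm_shuffle_epi8(records, shufmask);
        __m128i isZero = _mm_cmpeq_epi64(_mm_and_si128(records, mask), zero);
        count = _mm_sub_epi64(count, isZero);
    }
    // combine the two packed counters
    count = _mm_add_epi64(count, _mm_unpackhi_epi64(count, count));
    return (int)_mm_cvtsi128_si64(count);
}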

For example, an AVX2 version with manual vectorization and 7-byte records could look something like this. BTW this requires a bit more padding past the end of the data than the original code (which already requires one byte of padding past the end). Not tested, but it would at least give you some idea of how code with manual vectorization would work.

#include <immintrin.h>
#include <stdint.h>

int count(char * indexRecord, uint64_t SelectMask, uint64_t maxVal)
{
    // a record matches when (record & maxVal & ~SelectMask) == 0
    __m256i mask = _mm256_set1_epi64x(~SelectMask & maxVal);
    __m256i count = _mm256_setzero_si256();
    __m256i zero = _mm256_setzero_si256();
    // spreads pairs of 7-byte records into qword lanes; -1 indexes zero the top byte
    __m256i shufmask = _mm256_broadcastsi128_si256(_mm_setr_epi8(0, 1, 2, 3, 4, 5, 6, -1, 7, 8, 9, 10, 11, 12, 13, -1));
    for (int i = 0; i < 1000000; ++i) {   // 4 records per iteration
        __m128i recordsA = _mm_loadu_si128((__m128i*)indexRecord);
        indexRecord += 7 * 2;
        __m128i recordsB = _mm_loadu_si128((__m128i*)indexRecord);
        indexRecord += 7 * 2;
        __m256i records = _mm256_inserti128_si256(_mm256_castsi128_si256(recordsA), recordsB, 1);
        records = _mm256_shuffle_epi8(records, shufmask);
        __m256i isZero = _mm256_cmpeq_epi64(_mm256_and_si256(records, mask), zero);
        count = _mm256_sub_epi64(count, isZero);   // matching lanes are -1
    }
    // add the 4 packed counters
    __m128i countA = _mm256_castsi256_si128(count);
    __m128i countB = _mm256_extracti128_si256(count, 1);
    countA = _mm_add_epi64(countA, countB);
    return _mm_cvtsi128_si64(countA) + _mm_extract_epi64(countA, 1);
}
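
Since AVX512VBMI came up above, a version built around its full-width byte permute might look like the following. This one is even more of a sketch: untested, invented function name, assumes AVX512VBMI (plus popcnt) is available, and it loads 64 bytes while consuming 56, so it needs 8 bytes of padding past the end.

#include <immintrin.h>
#include <stdint.h>

int count_avx512vbmi(const char * indexRecord, uint64_t SelectMask, uint64_t maxVal)
{
    __m512i mask = _mm512_set1_epi64(~SelectMask & maxVal);
    __m512i zero = _mm512_setzero_si512();
    // indices that gather eight consecutive 7-byte records into the low 7 bytes
    // of eight qword lanes; the 8th byte of each lane is zeroed by the write-mask
    static const uint8_t idx[64] = {
         0, 1, 2, 3, 4, 5, 6, 0,    7, 8, 9,10,11,12,13, 0,
        14,15,16,17,18,19,20, 0,   21,22,23,24,25,26,27, 0,
        28,29,30,31,32,33,34, 0,   35,36,37,38,39,40,41, 0,
        42,43,44,45,46,47,48, 0,   49,50,51,52,53,54,55, 0,
    };
    __m512i perm = _mm512_loadu_si512(idx);
    uint64_t rowsFound = 0;
    for (int i = 0; i < 1000000; i += 8) {              // 8 records per iteration
        __m512i raw = _mm512_loadu_si512(indexRecord);  // 64 bytes loaded, 56 used
        indexRecord += 7 * 8;
        // vpermb moves bytes across the whole 512-bit register; the 0x7F.. mask
        // zeroes the top byte of every qword lane
        __m512i records = _mm512_maskz_permutexvar_epi8(0x7F7F7F7F7F7F7F7FULL, perm, raw);
        __mmask8 isZero = _mm512_cmpeq_epi64_mask(_mm512_and_si512(records, mask), zero);
        rowsFound += (unsigned)_mm_popcnt_u32(isZero);  // count matches via the bitmask
    }
    return (int)rowsFound;
}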

CodePudding user response:

First, your code is not a complete example. You're missing definitions and types of many variables, which makes it difficult to answer. You also did not indicate which platform you're compiling on/for.

Here are reasons why vectorization might fail:

  • Your reads are overlapping! You're reading 8 bytes at 7-byte intervals. That alone might confuse the vectorization logic.
  • Your pointers may not be __restrict'ed, meaning the compiler must assume they might alias and may need to re-read from the address on every access (see the sketch after this list).
  • Your equals() function pointer parameters are definitely not __restrict'ed (although the compiler could be seeing through that with inlining).
  • Alignment. On x86_64 this generally does not matter, but on some platforms, wider instructions need to know they operate on properly aligned memory.
  • Why don't you load *SelectMask into a local variable? That would make it obvious to the compiler that the mask cannot change between iterations.
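
Putting the last two points together, a hedged scalar rewrite could look like this. The signature, the __restrict qualifiers (a common compiler extension), and the memcpy-based loads are my additions; memcpy also sidesteps the strict-aliasing problem of casting char pointers to uint64_t pointers in equals().

#include <stdint.h>
#include <string.h>

uint64_t countMatches(const char * __restrict indexRecord,
                      const char * __restrict SelectMask,
                      uint64_t maxVal, int recordCount, int byteSize)
{
    uint64_t mask;
    memcpy(&mask, SelectMask, sizeof mask);            // hoist *SelectMask out of the loop
    uint64_t rowsFound = 0;
    for (int i = 0; i < recordCount; ++i) {
        uint64_t record;
        memcpy(&record, indexRecord, sizeof record);   // 8-byte load without aliasing UB
        rowsFound += ((record & maxVal) & ~mask) == 0; // same test as BitString::equals
        indexRecord += byteSize;
    }
    return rowsFound;
}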