Home > Back-end >  FAST way to determine if first n bits of byte array are "0"
FAST way to determine if first n bits of byte array are "0"

Time:12-16

I have a function that computes the checkLeadingZeroBits as follows:


typedef std::array<uint8_t, 32> SHA256Hash;

bool checkLeadingZeroBits(SHA256Hash& hash, unsigned int challengeSize) {
    int bytes = challengeSize / 8;
    const uint8_t * a = hash.data();
    if (bytes == 0) {
        return a[0]>>(8-challengeSize) == 0;
    } else {
        for (int i = 0; i < bytes; i  ) {
            if (a[i] != 0) return false;
        }
        int remainingBits = challengeSize - 8*bytes;
        if (remainingBits > 0) return a[bytes   1]>>(8-remainingBits) == 0;
        else return true;
    }
    return false;
}

I've also tried the following which is approximately the same run time:

    bool checkLeadingZeroBits(SHA256Hash& hash, unsigned int challengeSize) {
        int bytes = challengeSize / 8;
        const uint8_t * a = hash.data();
        if (bytes == 0) {
            return a[0]>>(8-challengeSize) == 0;
        } else {
            if (memcmp(NULL_SHA256_HASH.data(), a, bytes) != 0) return false;
            int remainingBits = challengeSize - 8*bytes;
            if (remainingBits > 0) return a[bytes   1]>>(8-remainingBits) == 0;
            else return true;
        }
        return false;
    }

I'd like to optimize this function to run as quickly as possible. Is there a simpler way to check the leading order bits than using the for loops shown in my implementation?

CodePudding user response:

As written, no. A raw byte array can't be scanned at the bit level faster than a byte by byte comparison operation.

However, if you're targeting a platform with a count leading zeros instruction or equivalent, you can speed up the process by checking blocks the size of a native processor word (32 bit for x86/ARM, 64 bit for AMD64/AA64). This would give you a slight improvement, but overall is likely not significant enough to be useful in this instance.

To do this, you would need to transform the data into a C pointer, then cast that to a pointer the same size as a word on the machine (uintptr_t*). From there, you could read out blocks of memory, and pass them to either the compiler intrinsic that would count the leading zeros of that word.

There are a few caveats to this. For one, you need to be very careful that your last read from the new pointer doesn't overstep into memory you don't own. Additionally, for x86 architectures prior to the introduction of lzcnt (before AMD ABM or Intel Haswell), you need to handle the case where the bsr (bit scan reverse, MSB to LSB) instruction finds no bits, which sets the zero flag instead of returning a real result (this will probably be invisible to you, or automatically transformed by the intrinsic, but I know MSVC's _BitScanReverse function returns true or false based on whether the instruction found a bit or not).

Ultimately, you're likely going to gain a marginal improvement in performance, which will likely be counterbalanced by the memory accesses you'll need to do to read data out of the buffer. Overall, it's not really worth it - the compiler will do a much better job of optimizing the loops than you will with all the instructions in the world. Better to write code that's clear and easily maintained than code that's fragile, obtuse, and reliant on processor specific instructions.

Which is a shame because I really like the lzcnt, bsr/bsf, and clz instructions. They can make scanning bitmaps significantly faster, but they aren't really useful here.

CodePudding user response:

There are SSE4.2 instructions that can be used to mass-compare bytes and speed up this algorithm. You can compare 16-bytes at a time with one instruction.

https://github.com/WojciechMula/sse4-strstr/blob/master/sse4.2-strstr.cpp

CodePudding user response:

One of the issues here is that the char array seems to be in big endian format. This can be mitigated in some architectures by byte swap and in some other architectures by reading 64 bits anyway but concentrating on the LSB bits instead.

 //     MSB
 // x = 00 00 00 00 08 FE BB AB <-- as stored byte wise in memory
 // y = AB BB FE 08 00 00 00 00 <-- as read from memory in x86

 bool check(uint8_t const *x, int challenge_count) {
     uint64_t y;
     std::memcpy(&y, x, 8);
     if (y & (~0ull << (challenge_count & ~7))) {
     
     }
     ... check the remaining byte (or nibble)
 }

Here if you need to perform the same comparison many times, it makes sense to cache (~0ull << (challenge_count & ~7)). Also here the handling of challenge_count >= 64 is omitted, which can be implemented by memcpy followed by comparison to zero.

  • Related