This is an optimization problem. I want to copy a bitfield of six 5-bit elements to a u8 buffer, naively done like this:
void Expand(u32 x, u8 b[6]) {
b[0] = (x >> 0) & 31;
b[1] = (x >> 5) & 31;
b[2] = (x >> 10) & 31;
b[3] = (x >> 15) & 31;
b[4] = (x >> 20) & 31;
b[5] = (x >> 25) & 31;
}
This is the assembly generated by x86 MSVC v19.latest with flags /O2 /Ot /Gr; gcc and clang give approximately the same thing:
@Expand@8 PROC
mov al, cl
and al, 31
mov BYTE PTR [edx], al
mov eax, ecx
shr eax, 5
and al, 31
mov BYTE PTR [edx+1], al
mov eax, ecx
shr eax, 10
and al, 31
mov BYTE PTR [edx+2], al
mov eax, ecx
shr eax, 15
and al, 31
mov BYTE PTR [edx+3], al
mov eax, ecx
shr eax, 20
shr ecx, 25
and al, 31
and cl, 31
mov BYTE PTR [edx+4], al
mov BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP
But I just don't like it; I know it does exactly what it should be doing, it just seems to me that it could be a lot more efficient.
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.
11111 11111 11111 11111 11111 11111
↓
00011111 00011111 00011111 00011111 00011111 00011111
I've been trying SHIFTing, ORing, and only ANDing at the end with a u64 mask (0x1F1F1F1F1F1F), but I remain unsuccessful in my optimization efforts. I'm confident this should be doable in fewer than 10 instructions; any guidance would be appreciated.
EDIT
I scratched my head a little bit more and this is the best I could come up with so far:
void Expand(u32 x, u8 b[6]) {
memset(b, 31, 6);
b[0] &= x;
b[1] &= x >>= 5;
b[2] &= x >>= 5;
b[3] &= x >>= 5;
b[4] &= x >>= 5;
b[5] &= x >>= 5;
}
Compiles to:
@Expand@8 PROC
mov eax, 0x1f1f1f1f
mov DWORD PTR [edx], eax
mov WORD PTR [edx+4], ax
and BYTE PTR [edx], cl
shr ecx, 5
and BYTE PTR [edx+1], cl
shr ecx, 5
and BYTE PTR [edx+2], cl
shr ecx, 5
and BYTE PTR [edx+3], cl
shr ecx, 5
and BYTE PTR [edx+4], cl
shr ecx, 5
and BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP
CodePudding user response:
Here's a cross-platform solution that only needs a fast multiplier, which is available on almost all desktop architectures:
void Expand(uint32_t x, uint8_t b[6]) {
uint32_t x024 = x & 0b00'00000'11111'00000'11111'00000'11111;
uint32_t x135 = x & 0b00'11111'00000'11111'00000'11111'00000;
uint64_t r024 = x024 * 0x0100'0000'4000'0010ULL & 0x1F001F001F000000;
    uint64_t r135 = x135 * 0x0040'0000'1000'0004ULL & 0xF800F800F8000000;
uint64_t result = r024 | (r135 >> 11);
#if !BIG_ENDIAN
result = htonll(result);
#endif
memcpy(b, &result, 6);
}
See below for the detailed math. It needs ~8-9 operations and runs in 2 parallel chains. You can improve that by passing an 8-byte array instead of 6 and restoring the last 2 elements b[6]/b[7] later if necessary.
But you should really use #ifdef and provide efficient implementations for each supported platform, plus a generic fallback like the above for the rest. The fastest way on x86 is SIMD or PDEP, depending on whether you do this for a big array or just sporadically. All other platforms also have their own SIMD that can be used to accelerate this. Alternatively, you can use a platform-agnostic SIMD library to automatically emit efficient SIMD code for any architecture.
Note that instruction count is not a measure of performance; not all instructions are equal. Your "best" version is actually worse than the first one, because it has a single long dependency chain, whereas in the first the CPU can start 5 independent computations at the same time and run them in parallel.
Remember that some single instructions are slow, so several simpler equivalent instructions can be faster. Multiple instructions that can execute in parallel also beat a shorter sequence with dependencies between them. And a short loop is worse than a straight-line run.
The math behind the algorithm
Let the input be 32 bits of 00aaaaabbbbbcccccdddddeeeeefffff. The multiplication will produce the bits in the correct positions, which the masking then extracts:
                                  0000000bbbbb00000ddddd00000fffff (x024)
× 0000000100000000000000000000000001000000000000000000000000010000 (0x0100'0000'4000'0010)
────────────────────────────────────────────────────────────────
                              0000000bbbbb00000ddddd00000fffff     (x024 << 4)
    0000000bbbbb00000ddddd00000fffff                               (x024 << 30)
  000fffff                                                         (x024 << 56, high bits truncated)
────────────────────────────────────────────────────────────────
& 0001111100000000000111110000000000011111000000000000000000000000 (0x1F001F001F000000)
────────────────────────────────────────────────────────────────
= 000fffff00000000000ddddd00000000000bbbbb000000000000000000000000
                                  00aaaaa00000ccccc00000eeeee00000 (x135)
× 0000000001000000000000000000000000010000000000000000000000000100 (0x0040'0000'1000'0004)
────────────────────────────────────────────────────────────────
                                00aaaaa00000ccccc00000eeeee00000   (x135 << 2)
      00aaaaa00000ccccc00000eeeee00000                             (x135 << 28)
  eeeee00000                                                       (x135 << 54, high bits truncated)
────────────────────────────────────────────────────────────────
& 1111100000000000111110000000000011111000000000000000000000000000 (0xF800F800F8000000)
────────────────────────────────────────────────────────────────
= eeeee00000000000ccccc00000000000aaaaa000000000000000000000000000
Merging the above two results we have 000fffff000eeeee000ddddd000ccccc000bbbbb000aaaaa0000000000000000, which contains the expected bytes in the correct order when stored big-endian in memory.
Output assembly for comparison
For more details about the algorithm see How to create a byte out of 8 bool values (and vice versa)?
CodePudding user response:
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.
I can see why you say that, but inasmuch as you also say you want to be architecture agnostic, it's not quite right. There is an important but somewhat subtle distinction between your bitfield representation and your byte-array representation: bitfield identity / index within the packed number is a function of value whereas byte identity / index within the array is a function of storage order. Thus, it would be closer to say that you want to convert a 30-bit native-format number to a 48-bit little-endian number.
One way to accomplish that is simply by reading out the number into the destination array, which is exactly what the two alternatives presented in the question do. In that sense, you're already doing it. But if you imagine arithmetically expanding the number as a separate step, then you have to acknowledge that you afterwards need to store it in the array. I suppose you have in mind to memcpy() it into place, but note well that for that arithmetic + memcpy() to make sense as an alternative to direct read-out, the arithmetic has to be arch-sensitive.
But let's explore the arithmetic a bit anyway. After all, maybe you don't care about non-little-endian architectures. Given your comment that ...
Ideally I'd like this to just be "C code that compiles well on every arch", so no weird instructions or hidden intrinsics
..., I'll consider only the operations provided by standard C. The objective is to compute a 64-bit integer containing the wanted expansion in its least-significant 48 bits.
A key limitation here is that each bitfield of the input has to be shifted a different distance. The most straightforward approach is to do it field by field, maybe something like this:
uint64_t expanded =
      (((uint64_t) x      ) & 0x1f)
    | (((uint64_t) x <<  3) & 0x1f00)
    | (((uint64_t) x <<  6) & 0x1f0000)
    | (((uint64_t) x <<  9) & 0x1f000000)
    | (((uint64_t) x << 12) & 0x1f00000000)
    | (((uint64_t) x << 15) & 0x1f0000000000);
Or perhaps a compiler would treat this variation more kindly:
uint64_t temp = x;
uint64_t expanded = temp & 0x1f;
temp <<= 3;
expanded |= temp & 0x1f00;
temp <<= 3;
expanded |= temp & 0x1f0000;
temp <<= 3;
expanded |= temp & 0x1f000000;
temp <<= 3;
expanded |= temp & 0x1f00000000;
temp <<= 3;
expanded |= temp & 0x1f0000000000;
But of course, both of those perform more arithmetic operations than your alternatives do, so there is little reason to expect simpler assembly code. Nevertheless, you might see an overall performance improvement arising from fewer accesses to memory to store the result (assuming use of memcpy(), not shown, and assumed to be optimized to avoid an actual function call).
Perhaps you were looking for a bit-twiddling hack to get more work out of fewer arithmetic operations. The scope for that is small, because you have only six fields to operate upon in the first place. The only way to factor that other than 6 (× 1) is 3 × 2, and with the latter you must expect at best 3 + 2 - 1 = 4 sets of operations instead of 6 sets. Something like this:
uint64_t temp = (((uint64_t) x << 9) & 0xffffff000000) | (x & 0x7fff);
uint64_t expanded = temp & 0x1f00001f;
temp <<= 3;
expanded |= temp & 0x1f00001f00;
temp <<= 3;
expanded |= temp & 0x1f00001f0000;
That yields a moderate gain in terms of arithmetic operation count relative to the straightforward versions: three shifts instead of five, five ANDs instead of six, and three ORs instead of five. The compiler might or might not treat this code as kindly as it does the others. And it's still more arithmetic operations than your direct read-out alternatives.
CodePudding user response:
Your version has 6 shifts (5 really) and 6 maskings (and 6 assignments, naturally).
I'd suggested "divide & conquer" in the comments.
This version has 4 shifts and 4 masks. It could probably be tidied up, and I've no idea what the assembly for it looks like. Would be interesting to see!
Anyway...
#include <stdio.h>
#include <string.h>
#include <stdint.h>

void expand( uint32_t x, uint8_t b[ 6 ] ) {
union {
uint32_t val32;
struct {
uint16_t lo;
uint16_t hi;
} b15;
} v, valt;
v.val32 = x << 1; // to shove b15 into b16
v.b15.lo = (uint16_t)(x & 0xFFFF);
valt.val32 = (v.val32 >> 0) & 0x001F001F;
b[0] = (uint8_t)valt.b15.lo;
b[3] = (uint8_t)valt.b15.hi;
valt.val32 = (v.val32 >> 5) & 0x001F001F;
b[1] = (uint8_t)valt.b15.lo;
b[4] = (uint8_t)valt.b15.hi;
valt.val32 = (v.val32 >>10) & 0x001F001F;
b[2] = (uint8_t)valt.b15.lo;
b[5] = (uint8_t)valt.b15.hi;
}
int main() {
uint8_t b[ 6 ] = { 7, 24, 31, 0, 6, 1, }; // 0 - 31 only!!
uint32_t x = (b[5]<<25) | (b[4]<<20) | (b[3]<<15) | (b[2]<<10) | (b[1]<<5) | (b[0]<<0);
memset( b, 0, sizeof b );
expand( x, b );
    for( int i = 0; i < 6; i++ )
printf( "[%d] %u ", i, b[i] );
puts( "" );
return 0;
}
Output
[0] 7 [1] 24 [2] 31 [3] 0 [4] 6 [5] 1
EDIT
Some people don't know how to express appreciation for offered help. To be clear, it is not my project, and I wrote that I didn't bother looking at the assembly output. The criticism in the comment below seems excessive, imo.
So, having nothing else on, here's a "readable" revision that works on 32-bit architectures. "Memory accesses" are an unavoidable result of not having a 64-bit register to work with.
void expand( uint32_t x, uint8_t b[ 6 ] ) {
union {
uint32_t x;
struct { uint8_t l, lx, h, hx; } q;
} u;
x = ((x << 1)&0xffff0000)|(x & 0x0000ffff);
u.x = (x >> 0) & 0x001f001f;
b[0] = u.q.l;
b[3] = u.q.h;
u.x = (x >> 5) & 0x001f001f;
b[1] = u.q.l;
b[4] = u.q.h;
u.x = (x >>10) & 0x001f001f;
b[2] = u.q.l;
b[5] = u.q.h;
}