This is an optimization problem. I want to copy a bitfield of six 5-bit elements to a u8 buffer, naively done like this:
void Expand(u32 x, u8 b[6]) {
b[0] = (x >> 0) & 31;
b[1] = (x >> 5) & 31;
b[2] = (x >> 10) & 31;
b[3] = (x >> 15) & 31;
b[4] = (x >> 20) & 31;
b[5] = (x >> 25) & 31;
}
This is the assembly generated by x86 MSVC v19.latest with flags /O2 /Ot /Gr; gcc and clang give approximately the same thing:
@Expand@8 PROC
mov al, cl
and al, 31
mov BYTE PTR [edx], al
mov eax, ecx
shr eax, 5
and al, 31
mov BYTE PTR [edx+1], al
mov eax, ecx
shr eax, 10
and al, 31
mov BYTE PTR [edx+2], al
mov eax, ecx
shr eax, 15
and al, 31
mov BYTE PTR [edx+3], al
mov eax, ecx
shr eax, 20
shr ecx, 25
and al, 31
and cl, 31
mov BYTE PTR [edx+4], al
mov BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP
But I just don't like it; I know it does exactly what it should be doing, it just seems to me that it could be a lot more efficient.
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.
11111 11111 11111 11111 11111 11111
↓
00011111 00011111 00011111 00011111 00011111 00011111
I've been trying SHIFTing, ORing, and only ANDing at the end with a u64 mask (0x1F1F1F1F1F1F), but I remain unsuccessful in my optimization efforts. I'm confident this should be doable in fewer than 10 instructions; any guidance would be appreciated.
EDIT
I scratched my head a little bit more and this is the best I could come up with so far:
void Expand(u32 x, u8 b[6]) {
memset(b, 31, 6);
b[0] &= x;
b[1] &= x >>= 5;
b[2] &= x >>= 5;
b[3] &= x >>= 5;
b[4] &= x >>= 5;
b[5] &= x >>= 5;
}
Compiles to:
@Expand@8 PROC
mov eax, 0x1f1f1f1f
mov DWORD PTR [edx], eax
mov WORD PTR [edx+4], ax
and BYTE PTR [edx], cl
shr ecx, 5
and BYTE PTR [edx+1], cl
shr ecx, 5
and BYTE PTR [edx+2], cl
shr ecx, 5
and BYTE PTR [edx+3], cl
shr ecx, 5
and BYTE PTR [edx+4], cl
shr ecx, 5
and BYTE PTR [edx+5], cl
ret 0
@Expand@8 ENDP
CodePudding user response:
Here's a cross-platform solution that only needs a fast multiplier, which is available on almost all desktop architectures:
void Expand(uint32_t x, uint8_t b[6]) {
uint32_t x024 = x & 0b00'00000'11111'00000'11111'00000'11111;
uint32_t x135 = x & 0b00'11111'00000'11111'00000'11111'00000;
uint64_t r024 = x024 * 0x0100'0000'4000'0010ULL & 0x1F001F001F000000;
    uint64_t r135 = x135 * 0x0040'0000'1000'0004ULL & 0xF800F800F8000000;
uint64_t result = r024 | (r135 >> 11);
#if !BIG_ENDIAN
result = htonll(result);
#endif
memcpy(b, &result, 6);
}
See below for the detailed math. It needs ~8-9 operations and runs in 2 parallel chains. You can improve that by passing an 8-byte array instead of 6 and restoring the last 2 elements b[6]/b[7] later if necessary.
But you should really use #ifdef and provide efficient implementations for each supported platform, plus a generic fallback like the above for the rest. The fastest way on x86 is SIMD or PDEP, depending on whether you do this for a big array or just sporadically. All other platforms also have their own SIMD that can be used to accelerate this. Alternatively, you can use a platform-agnostic SIMD library to automatically emit efficient SIMD code for any architecture.
Note that instruction count is not a measure of performance; not all instructions are equal. Your "best" version is actually worse than the first one, because it has a single long dependency chain, whereas in the first the CPU can start 5 independent computations at the same time and run them in parallel.
Remember that some single instructions are slow, so several simpler equivalent instructions can be faster. Multiple instructions that can execute in parallel also beat a shorter sequence with dependencies between them. And a short loop is worse than a straight-line run.
The math behind the algorithm
Let the input be 32 bits of 00aaaaabbbbbcccccdddddeeeeefffff. The multiplication will produce the bits in the correct positions, which the masking then extracts:
                                  0000000bbbbb00000ddddd00000fffff (x024)
× 0000000100000000000000000000000001000000000000000000000000010000 (0x0100'0000'4000'0010)
────────────────────────────────────────────────────────────────
                              0000000bbbbb00000ddddd00000fffff     (x024 << 4)
    0000000bbbbb00000ddddd00000fffff                               (x024 << 30)
  000fffff                                                         (x024 << 56, high bits truncated)
────────────────────────────────────────────────────────────────
& 0001111100000000000111110000000000011111000000000000000000000000 (0x1F001F001F000000)
────────────────────────────────────────────────────────────────
= 000fffff00000000000ddddd00000000000bbbbb000000000000000000000000
                                  00aaaaa00000ccccc00000eeeee00000 (x135)
× 0000000001000000000000000000000000010000000000000000000000000100 (0x0040'0000'1000'0004)
────────────────────────────────────────────────────────────────
                                00aaaaa00000ccccc00000eeeee00000   (x135 << 2)
      00aaaaa00000ccccc00000eeeee00000                             (x135 << 28)
  eeeee00000                                                       (x135 << 54, high bits truncated)
────────────────────────────────────────────────────────────────
& 1111100000000000111110000000000011111000000000000000000000000000 (0xF800F800F8000000)
────────────────────────────────────────────────────────────────
= eeeee00000000000ccccc00000000000aaaaa000000000000000000000000000
Merging the above two results we have 000fffff000eeeee000ddddd000ccccc000bbbbb000aaaaa0000000000000000, which contains the expected bytes in the correct order when stored big-endian in memory.
Output assembly for comparison
For more details about the algorithm see How to create a byte out of 8 bool values (and vice versa)?
CodePudding user response:
To me it looks like a 30-bit number that needs to be scaled up to a 48-bit number while inserting zeroes.
I can see why you say that, but inasmuch as you also say you want to be architecture agnostic, it's not quite right. There is an important but somewhat subtle distinction between your bitfield representation and your byte-array representation: bitfield identity / index within the packed number is a function of value whereas byte identity / index within the array is a function of storage order. Thus, it would be closer to say that you want to convert a 30-bit native-format number to a 48-bit little-endian number.
One way to accomplish that is simply by reading out the number into the destination array, which is exactly what the two alternatives presented in the question do. In that sense, you're already doing it. But if you imagine arithmetically expanding the number as a separate step, then you have to acknowledge that you afterwards need to store it in the array. I suppose you have in mind to memcpy() it into place, but note well that for that arithmetic + memcpy() to make sense as an alternative to direct read-out, the arithmetic has to be arch-sensitive.
But let's explore the arithmetic a bit anyway. After all, maybe you don't care about non-little-endian architectures. Given your comment that ...
Ideally I'd like this to just be "C code that compiles well on every arch", so no weird instructions or hidden intrinsics
..., I'll consider only the operations provided by standard C. The objective is to compute a 64-bit integer containing the wanted expansion in its least-significant 48 bits.
A key limitation here is that each bitfield of the input has to be shifted a different distance. The most straightforward approach is to do it field by field, maybe something like this:
uint64_t expanded =
      (((uint64_t) x      ) & 0x1f)
    | (((uint64_t) x <<  3) & 0x1f00)
    | (((uint64_t) x <<  6) & 0x1f0000)
    | (((uint64_t) x <<  9) & 0x1f000000)
    | (((uint64_t) x << 12) & 0x1f00000000)
    | (((uint64_t) x << 15) & 0x1f0000000000);
Or perhaps a compiler would treat this variation more kindly:
uint64_t temp = x;
uint64_t expanded = temp & 0x1f;
temp <<= 3;
expanded |= temp & 0x1f00;
temp <<= 3;
expanded |= temp & 0x1f0000;
temp <<= 3;
expanded |= temp & 0x1f000000;
temp <<= 3;
expanded |= temp & 0x1f00000000;
temp <<= 3;
expanded |= temp & 0x1f0000000000;
But of course, both of those perform more arithmetic operations than your alternatives do, so there is little reason to expect simpler assembly code. Nevertheless, you might see an overall performance improvement arising from fewer accesses to memory to store the result (assuming use of memcpy(), not shown, and assumed to be optimized to avoid an actual function call).
Perhaps you were looking for a bit-twiddling hack to get more work out of fewer arithmetic operations. The scope for that is small, because you have only six fields to operate upon in the first place. The only way to factor that other than 6 (× 1) is 3 × 2, and with the latter you must expect at best 3 + 2 - 1 = 4 sets of operations instead of 6 sets. Something like this:
uint64_t temp = (((uint64_t) x << 9) & 0xffffff000000) | (x & 0x7fff);
uint64_t expanded = temp & 0x1f00001f;
temp <<= 3;
expanded |= temp & 0x1f00001f00;
temp <<= 3;
expanded |= temp & 0x1f00001f0000;
That yields a moderate gain in terms of arithmetic operation count relative to the straightforward versions: three shifts instead of five, five ANDs instead of six, and three ORs instead of five. The compiler might or might not treat this code as kindly as it does the others. And it's still more arithmetic operations than your direct read-out alternatives.
CodePudding user response:
Your version has 6 shifts (5 really) and 6 maskings (and 6 assignments, naturally).
I'd suggested "divide & conquer" in the comments.
This version has 4 shifts and 4 masks. It could probably be tidied up, and I've no idea what the assembly for it looks like. Would be interesting to see!
Anyway...
#include <stdio.h>
#include <string.h>
#include <stdint.h>

void expand( uint32_t x, uint8_t b[ 6 ] ) {
union {
uint32_t val32;
struct {
uint16_t lo;
uint16_t hi;
} b15;
} v, valt;
v.val32 = x << 1; // to shove b15 into b16
v.b15.lo = (uint16_t)(x & 0xFFFF);
valt.val32 = (v.val32 >> 0) & 0x001F001F;
b[0] = (uint8_t)valt.b15.lo;
b[3] = (uint8_t)valt.b15.hi;
valt.val32 = (v.val32 >> 5) & 0x001F001F;
b[1] = (uint8_t)valt.b15.lo;
b[4] = (uint8_t)valt.b15.hi;
valt.val32 = (v.val32 >>10) & 0x001F001F;
b[2] = (uint8_t)valt.b15.lo;
b[5] = (uint8_t)valt.b15.hi;
}
int main() {
uint8_t b[ 6 ] = { 7, 24, 31, 0, 6, 1, }; // 0 - 31 only!!
uint32_t x = (b[5]<<25) | (b[4]<<20) | (b[3]<<15) | (b[2]<<10) | (b[1]<<5) | (b[0]<<0);
memset( b, 0, sizeof b );
expand( x, b );
    for( int i = 0; i < 6; i++ )
printf( "[%d] %u ", i, b[i] );
puts( "" );
return 0;
}
Output
[0] 7 [1] 24 [2] 31 [3] 0 [4] 6 [5] 1
EDIT
Some people don't know how to express appreciation for offered help. To be clear, it is not my project, and I wrote that I didn't bother looking at the assembly output. The criticism in the comment below seems excessive, imo.
So, having nothing else on, here's a "readable" revision that works on 32-bit architectures. "Memory accesses" are an unavoidable result of not having a 64-bit register to work with.
void expand( uint32_t x, uint8_t b[ 6 ] ) {
union {
uint32_t x;
struct { uint8_t l, lx, h, hx; } q;
} u;
x = ((x << 1)&0xffff0000)|(x & 0x0000ffff);
u.x = (x >> 0) & 0x001f001f;
b[0] = u.q.l;
b[3] = u.q.h;
u.x = (x >> 5) & 0x001f001f;
b[1] = u.q.l;
b[4] = u.q.h;
u.x = (x >>10) & 0x001f001f;
b[2] = u.q.l;
b[5] = u.q.h;
}