General algorithm for number/bit transformations


Is there a general strategy for creating an efficient number/bit transformation algorithm? The goal is to create a fast, branch-less and, if possible, LUT-less algorithm. I'll give an example:

A 13-bit code is to be transformed into another 13-bit code according to the following rule table:

BIT   INPUT (DEC)   INPUT (BIN)     OUTPUT (BIN)    OUTPUT (DEC)
 0          1       0000000000001   0000100000000        256
 1          2       0000000000010   0010000000000       1024
 2          4       0000000000100   0100000000000       2048
 3          8       0000000001000   1000000000000       4096
 4         16       0000000010000   0000001000000         64
 5         32       0000000100000   0000000100000         32
 6         64       0000001000000   0001000000000        512
 7        128       0000010000000   0000000010000         16
 8        256       0000100000000   0000000001000          8
 9        512       0001000000000   0000000000010          2
10       1024       0010000000000   0000000000001          1
11       2048       0100000000000   0000000000100          4
12       4096       1000000000000   0000010000000        128

Example: If the input code is 1 + 2 + 4096 = 4099, the resulting output would be 256 + 1024 + 128 = 1408.

A naive approach would be:

OUTPUT = ((INPUT AND 0000000000001) << 8) OR ((INPUT AND 0000000000010) << 9) OR ((INPUT AND 0000000000100) << 9) OR ((INPUT AND 0000000001000) << 9) OR ...

It means we have 3 instructions per bit (AND, SHIFT, OR), i.e. 3 × 13 - 1 = 38 instructions for the above example (the last OR can be omitted). We could also use a combination of left and right shifts to potentially reduce code size (depending on the target platform), but this will not decrease the number of instructions.

When inspecting the example table, you will of course notice a few obvious optimization opportunities, for example rows 2/3/4 (bits 1-3), which can be combined as ((INPUT AND 0000000001110) << 9). But beyond that it becomes a difficult, tedious task.
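For illustration only, here is how that naive scheme (with mixed left/right shifts and the bits-1-to-3 grouping applied) could look in C; the function name is mine:

#include <cstdint>

// Direct translation of the rule table, one AND/SHIFT per bit (or per
// group of bits that move by the same distance), combined with OR.
uint16_t transform_naive(uint16_t in)
{
    return (uint16_t)(
          ((in & 0x0001) << 8)     // bit 0    -> bit 8
        | ((in & 0x000E) << 9)     // bits 1-3 -> bits 10-12 (combined)
        | ((in & 0x0010) << 2)     // bit 4    -> bit 6
        | ((in & 0x0020)     )     // bit 5    -> bit 5 (unchanged)
        | ((in & 0x0040) << 3)     // bit 6    -> bit 9
        | ((in & 0x0080) >> 3)     // bit 7    -> bit 4
        | ((in & 0x0100) >> 5)     // bit 8    -> bit 3
        | ((in & 0x0200) >> 8)     // bit 9    -> bit 1
        | ((in & 0x0400) >> 10)    // bit 10   -> bit 0
        | ((in & 0x0800) >> 9)     // bit 11   -> bit 2
        | ((in & 0x1000) >> 5));   // bit 12   -> bit 7
}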

Are there general strategies? I think using Karnaugh-Veitch maps to simplify the expression could be one approach, but that is pretty difficult with 13 input variables. Also, the resulting expression would only be a combination of ORs and ANDs.

CodePudding user response:

A pretty fast way to permute the bits is to locate the position of the set bit in the input number, use a very small lookup table to find the position of the bit to set in the output number, and then use a shift to actually generate the output value.

The location of the set bit in the input value can be found efficiently using the intrinsic __builtin_ctz(value) (or __builtin_ffs(value)-1) in GCC/Clang/ICC. On MSVC, you can use the intrinsic _BitScanForward (see this post). These intrinsics map to the fast BSF instruction on x86-64 processors and the rbit + clz instruction pair on ARM processors.

The LUT should not be much of a problem because it fits in a single cache line (64 bytes on x86-64 processors). As a result, there is no overhead to pay once that cache line has been fetched. In fact, fetching a LUT value should not be more expensive than fetching any other constant from memory.

Here is an implementation:

#include <cstdint>

int32_t compute(int32_t value)
{
    // table[i] = output bit position for input bit position i (from the rule table)
    static const int8_t table[] = {8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7};
    int32_t result = int32_t(1) << table[__builtin_ctz(value)];
    return result;
}

Generated assembly code

GCC generates the following assembly code:

        movabs  rax, 290769174472100360     ; Cheap
        rep bsf edi, edi
        mov     QWORD PTR [rsp-13], rax
        movsx   rdi, edi                    ; Cheap
        movabs  rax, 504966112564480261     ; Cheap
        mov     QWORD PTR [rsp-8], rax
        movsx   ecx, BYTE PTR [rsp-13+rdi]
        mov     eax, 1                      ; Very cheap if not even free
        sal     eax, cl

Clang generates better assembly code:

        bsf     eax, edi
        mov     cl, byte ptr [rax + .L__const.test(int).table]
        mov     eax, 1                      ; Very cheap if not even free
        shl     eax, cl

Note that GCC is able to remove the overhead of a cold LUT fetch if the static keyword is removed (Clang does not):

        movabs  rax, 290769174472100360
        rep bsf edi, edi
        mov     QWORD PTR [rsp-13], rax
        movsx   rdi, edi
        movabs  rax, 504966112564480261
        mov     QWORD PTR [rsp-8], rax
        movsx   ecx, BYTE PTR [rsp-13+rdi]
        mov     eax, 1
        sal     eax, cl

This is less efficient than the previously shown assembly code when the cache is hot, but not when it is cold, since all of the loads/stores target the stack, which is generally already in the cache.
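If the input can have several bits set at once (as in the question's example 4099), the same per-bit lookup can simply be applied to each set bit in turn. A possible sketch of that outer loop (my addition, not part of the answer), reusing the compute() function above:

#include <cstdint>

// Translate every set bit of the input: look up the lowest set bit,
// accumulate its output bit, then clear it and repeat.
int32_t compute_all(int32_t value)
{
    int32_t result = 0;
    while (value != 0) {
        result |= compute(value);   // maps the lowest set bit of value
        value &= value - 1;         // clear the lowest set bit
    }
    return result;
}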

CodePudding user response:

If one big lookup table is too much, you can try to use two smaller ones.

  1. Take 7 lower bits of the input, look up a 16-bit value in table A.

  2. Take 6 higher bits of the input, look up a 16-bit value in table B.

  3. OR the values from 1. and 2. to produce the result.

Table A needs 128*2 bytes, table B needs 64*2 bytes, that's 384 bytes for the lookup tables.
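A possible sketch of this scheme (the table names and the init-time construction are my own; the answer only describes the idea):

#include <cstdint>

// Output bit position for each input bit position (from the question's table).
static const uint8_t out_pos[13] = {8, 10, 11, 12, 6, 5, 9, 4, 3, 1, 0, 2, 7};

static uint16_t tableA[128];  // outputs contributed by the 7 low input bits
static uint16_t tableB[64];   // outputs contributed by the 6 high input bits

// Fill both tables once at start-up.
static void init_tables()
{
    for (int i = 0; i < 128; i++) {
        uint16_t out = 0;
        for (int b = 0; b < 7; b++)
            if (i & (1 << b)) out |= uint16_t(1) << out_pos[b];
        tableA[i] = out;
    }
    for (int i = 0; i < 64; i++) {
        uint16_t out = 0;
        for (int b = 0; b < 6; b++)
            if (i & (1 << b)) out |= uint16_t(1) << out_pos[b + 7];
        tableB[i] = out;
    }
}

// Transform: one lookup per half, then OR the two partial results.
static uint16_t permute(uint16_t x)
{
    return tableA[x & 0x7F] | tableB[(x >> 7) & 0x3F];
}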

CodePudding user response:

For bit permutations, several strategies are known that work in certain cases. There's a code generator at https://programming.sirrida.de/calcperm.php which implements most of them. However, in this case it seems to find basically only the strategy you suggested, which indicates that there is no easy pattern to exploit in this permutation.

CodePudding user response:

This is a hand-optimised multiple LUT solution, which doesn't really prove anything other than that I had some time to burn.

Multiple small lookup tables can occasionally save time and/or space, but I don't know of a strategy to find the optimal combination. In this case, the best division seems to be three LUTs of three bits each (bits 4-6, 7-9 and 10-12), totalling 24 bytes (each table has 8 one-byte entries), plus a simple shift to cover bits 1 through 3, and another simple shift for the remaining bit 0. Bit 5, which is untransformed, was also a tempting target, but I don't see a good way to divide the bit ranges around it.

The three look-up tables have single-byte entries because the transformed bits for each group span no more than one byte. In fact, the transformations for two of the bit groups fit entirely in the low-order byte, avoiding a shift.

Here's the code:

unsigned short permute_bits(unsigned short x) {
#define LUT3(BIT0, BIT1, BIT2)                                  \
{ 0,      (BIT0),        (BIT1),        (BIT1)+(BIT0),          \
  (BIT2), (BIT2)+(BIT0), (BIT2)+(BIT1), (BIT2)+(BIT1)+(BIT0)}
    static const unsigned char t4[] =  LUT3(1<<(6-3), 1<<(5-3), 1<<(9-3));
    static const unsigned char t7[] =  LUT3(1<<4, 1<<3, 1<<1);
    static const unsigned char t10[] = LUT3(1<<0, 1<<2, 1<<7);
#undef LUT3
    return   (   (x&1)       << 8)    // Bit 0
           + (   (x&14)      << 9)    // Bits 1-3, simple shift
           + (t4[(x>>4)&7]   << 3)    // Bits 4-6, see below
           + (t7[(x>>7)&7]       )    // Bits 7-9, three-bit lookup for LOB
           + (t10[(x>>10)&7]     );   // Bits 10-12, ditto
}

Note on bits 4-6

Bit 6 is transformed to position 9, which is outside of the low-order byte. However, bits 4 and 5 are moved to positions 6 and 5, respectively, and the total range of the transformed bits is only 5 bit positions. Several different final shifts are possible, but keeping the shift relatively small provides a tiny improvement on x86, because it allows the use of LEA to do a simultaneous shift and add. (See the second-to-last instruction in the assembly below.)
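As a worked example, t4 expands to {0, 8, 4, 12, 64, 72, 68, 76}; looking up index 1 (only input bit 4 set) yields 8, and the final << 3 moves it to output bit 6, matching the rule table.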

The intermediate results are added instead of combined with a bitwise OR for the same reason: since the sets of bits in each intermediate result are disjoint, ADD and OR produce the same result, and using ADD lets the compiler take advantage of chip features like LEA.

Here's the compilation of that function, taken from http://gcc.godbolt.org using gcc 12.1 with -O3:

permute_bits(unsigned short):
        mov     edx, edi
        mov     ecx, edi
        movzx   eax, di
        shr     di, 4
        shr     dx, 7
        shr     cx, 10
        and     edi, 7
        and     edx, 7
        and     ecx, 7
        movzx   ecx, BYTE PTR permute_bits(unsigned short)::t10[rcx]
        movzx   edx, BYTE PTR permute_bits(unsigned short)::t7[rdx]
        add     edx, ecx
        mov     ecx, eax
        sal     eax, 9
        sal     ecx, 8
        and     ax, 7168
        and     cx, 256
        or      eax, ecx
        add     edx, eax
        movzx   eax, BYTE PTR permute_bits(unsigned short)::t4[rdi]
        lea     eax, [rdx+rax*8]
        ret

I left out the lookup tables themselves because the assembly produced by GCC isn't very helpful.

I don't know if this is any quicker than @trincot's solution (in a comment); a quick benchmark was inconclusive, but it looked to be a few percent quicker. But it's quite a bit shorter, possibly enough to compensate for the 24 bytes of lookup data.
