When combining four 1-byte into one 4-byte word, which is much faster?-CodePudding

So I'm currently studying bit-wise operators and bit-manipulation, and I have come across two different ways to combine four 1-byte words into one 4-byte wide word.

the two ways are given below

After finding out this two methods I compare the disassembly code generated by the two (compiled using gcc 11 with -O2 flag), I don't have the basic knowledge with disassembly and with the code it generates, and what I only know is the shorter the code, the faster the function is (most of the time I guess... maybe there are some exceptions), now for the two methods it seems that they have the same numbers/counts of lines in the generated disassembly code, so I guess they have the same performance?

I also got curious about the order of the instructions, the first method seems to alternate other instructions sal>or>sal>or>sal>or, while the second one is more uniform sal>sal>sal>or>or>mov>or does this have some significant effect in the performance say for example if we are dealing with a larger word?

Two methods

int method1(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0;
    combine = byte4;
    combine <<=8;
    combine |= byte3;
    combine <<=8;
    combine |= byte2;
    combine <<=8;
    combine |= byte1;
    return combine;
}

int method2(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = 0, temp;
    temp = byte4;
    temp <<= 24;
    combine |= temp;
    temp = byte3;
    temp <<= 16;
    combine |= temp;
    temp = byte2;
    temp <<= 8;
    combine |= temp;
    temp = byte1;
    combine |= temp;
    return combine;
}

Disassembly

// method1(unsigned char, unsigned char, unsigned char, unsigned char):
        movzx   edi, dil
        movzx   esi, sil
        movzx   edx, dl
        movzx   eax, cl
        sal     edi, 8
        or      esi, edi
        sal     esi, 8
        or      edx, esi
        sal     edx, 8
        or      eax, edx
        ret
// method2(unsigned char, unsigned char, unsigned char, unsigned char):
        movzx   edx, dl
        movzx   ecx, cl
        movzx   esi, sil
        sal     edi, 24
        sal     edx, 8
        sal     esi, 16
        or      edx, ecx
        or      edx, esi
        mov     eax, edx
        or      eax, edi
        ret

This might be "premature optimization", but I just want to know if there is a difference.

CodePudding user response：

Without optimizations, method 2 seems to be a tiny bit faster.

This is an online benchmark of the code provided: https://quick-bench.com/q/eyiiXkxYVyoogefHZZMH_IoeJss

However, it is difficult to get accurate metrics from tiny operations like this. It also depends on the speed of the instructions on a given CPU (different instructions can take more or less clock cycles).

Furthermore, bitwise operators tend to be pretty fast since they are "basic" instructions.

So the real answer, would be to benchmark it on your target machine to see which one is faster.

CodePudding user response：

I completely agree with Émerick Poulin:

So the real answer, would be to benchmark it on your target machine to see which one is faster.

Nevertheless, I created a "method3()", and disassembled all three with gcc version 10.3.0, with -O0 and -O2, Here's a summary of the -O2 results:

Method3:

int method3(unsigned char byte4, unsigned char byte3, unsigned char byte2, unsigned char byte1)
{
    int combine = (byte4 << 24)|(byte3<<16)|(byte2<<8)|byte1;
    return combine;
}

gcc -O2 -S:

;method1:
    sall    $8,