Here is an example piece of code:
#include <stdint.h>
#include <iostream>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int main() {
//simply to make the answer unknowable at compile time
uint16_t input;
std::cin >> input;
A a = {15,input};
B b = 0x000f0000 + input;
//a equals b
int resultA = a.low-a.high;
int resultB = b&0xffff - (b>>16)&0xffff;
//use the variables so the optimiser doesn't get rid of everything
return resultA + resultB;
}
Both resultA and resultB calculate the exact same thing - but which is faster (assuming you don't know the answer at compile time)?
I tried using Compiler Explorer to look at the output, and I got something - but with any optimisation flag it outsmarted me and optimised the whole calculation away, no matter what I tried. At first it optimised everything away because the results weren't used, so I used cin to make the answer unknowable at compile time - but then I couldn't figure out how it was arriving at the answer at all (I think it still managed to work some of it out at compile time?).
Here is the output of Compiler Explorer with no optimisation flag:
push rbp
mov rbp, rsp
sub rsp, 32
mov dword ptr [rbp - 4], 0
movabs rdi, offset std::cin
lea rsi, [rbp - 6]
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
mov word ptr [rbp - 16], 15
mov ax, word ptr [rbp - 6]
mov word ptr [rbp - 14], ax
movzx eax, word ptr [rbp - 6]
add eax, 983040
mov dword ptr [rbp - 20], eax
# Begin calculating result A
movzx eax, word ptr [rbp - 16]
movzx ecx, word ptr [rbp - 14]
sub eax, ecx
mov dword ptr [rbp - 24], eax
# End of calculation
# Begin calculating result B
mov eax, dword ptr [rbp - 20]
mov edx, dword ptr [rbp - 20]
shr edx, 16
mov ecx, 65535
sub ecx, edx
and eax, ecx
and eax, 65535
mov dword ptr [rbp - 28], eax
# End of calculation
mov eax, dword ptr [rbp - 24]
add eax, dword ptr [rbp - 28]
add rsp, 32
pop rbp
ret
I will also post the -O1 output, but I can't make any sense of it (I'm quite new to low level assembly stuff).
main: # @main
push rax
lea rsi, [rsp + 6]
mov edi, offset std::cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
movzx ecx, word ptr [rsp + 6]
mov eax, ecx
and eax, -16
sub eax, ecx
add eax, 15
pop rcx
ret
Something to consider: while doing operations with the integer is slightly harder, simply accessing it as an integer is easier compared to the struct (which you'd have to convert with bitshifts, I think?). Does this make a difference?
This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?
[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]
CodePudding user response:
Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.
Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.
Which of those two would result in a faster result is not going to be knowable without actually testing or a full description of the actual exact problem. So it is a toss up there.
Which means the only real concern is the ABI one. Without whole program optimization, a function taking a struct and a function taking an int with the same binary layout will have different assumptions about where the data is.
This only matters for by-value single arguments, however.
The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.
CodePudding user response:
When trying to answer questions of performance, examining unoptimized code is largely irrelevant.
As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.
Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know that a and b will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.
As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where, generally, there is absolutely no performance difference between the two scenarios you are considering, or in any of the special cases you are pondering.
CodePudding user response:
Using GCC on Compiler Explorer, the version with the struct produces fewer instructions in -O3 mode.
Code:
#include <stdint.h>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int f1(A a)
{
return a.low - a.high;
}
int f2(B b)
{
return b&0xffff - (b>>16)&0xffff;
}
Assembly:
_Z2f11A:
movzwl %di,