Here is an example piece of code:
#include <stdint.h>
#include <iostream>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int main() {
//simply to make the answer unknowable at compile time
uint16_t input;
std::cin >> input;
A a = {15,input};
B b = 0x000f0000 + input;
//a equals b
int resultA = a.low-a.high;
int resultB = b&0xffff - (b>>16)&0xffff;
//use the variables so the optimiser doesn't get rid of everything
return resultA + resultB;
}
Both resultA and resultB calculate the exact same thing - but which is faster (assuming you don't know the answer at compile time)?
I tried using Compiler Explorer to look at the output, and I got something - but with any optimisation flag it outsmarted me and optimised the whole calculation away, no matter what I tried. At first it optimised everything away because the results weren't used, so I used cin to make the answer unknowable at compile time - but then I couldn't figure out how it was arriving at the answer at all (I think it still managed to work some of it out at compile time?).
Here is the output of Compiler Explorer with no optimisation flag:
push rbp
mov rbp, rsp
sub rsp, 32
mov dword ptr [rbp - 4], 0
movabs rdi, offset std::cin
lea rsi, [rbp - 6]
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
mov word ptr [rbp - 16], 15
mov ax, word ptr [rbp - 6]
mov word ptr [rbp - 14], ax
movzx eax, word ptr [rbp - 6]
add eax, 983040
mov dword ptr [rbp - 20], eax
# Begin calculating result A
movzx eax, word ptr [rbp - 16]
movzx ecx, word ptr [rbp - 14]
sub eax, ecx
mov dword ptr [rbp - 24], eax
# End of calculation
# Begin calculating result B
mov eax, dword ptr [rbp - 20]
mov edx, dword ptr [rbp - 20]
shr edx, 16
mov ecx, 65535
sub ecx, edx
and eax, ecx
and eax, 65535
mov dword ptr [rbp - 28], eax
# End of calculation
mov eax, dword ptr [rbp - 24]
add eax, dword ptr [rbp - 28]
add rsp, 32
pop rbp
ret
I will also post the -O1 output, but I can't make any sense of it (I'm quite new to low level assembly stuff).
main: # @main
push rax
lea rsi, [rsp + 6]
mov edi, offset std::cin
call std::basic_istream<char, std::char_traits<char> >::operator>>(unsigned short&)
movzx ecx, word ptr [rsp + 6]
mov eax, ecx
and eax, -16
sub eax, ecx
add eax, 15
pop rcx
ret
Something to consider: while doing operations with the integer is slightly harder, simply accessing it as an integer is easier compared to the struct (which you'd have to convert with bitshifts, I think?). Does this make a difference?
This originally came up in the context of memory, where I saw someone map a memory address to a struct with a field for the low bits and the high bits. I thought this couldn't possibly be faster than simply using an integer of the right size and bitshifting if you need the low or high bits. In this specific situation - which is faster?
[Why did I add C to the tag list? While the example code I used is in C++, the concept of struct vs variable is very applicable to C too]
CodePudding user response:
Other than the fact that some ABIs require that structs be passed differently than integers, there won't be a difference.
Now, there are important semantic differences between two 16 bit ints and one 32 bit int. If you add to the lower 16 bit int, it will not "overflow" into the higher one, while if you add to the lower 16 bits of a 32 bit int, it will. This difference in possible behavior (even if you, yourself, "know" it could not happen in your code) could change what assembly code is generated by your compiler, and impact performance.
Which of those two would result in a faster result is not going to be knowable without actually testing or a full description of the actual exact problem. So it is a toss up there.
Which means the only real concern is the ABI one. Without whole program optimization, a function taking a struct and a function taking an int with the same binary layout will have different assumptions about where the data is.
This only matters for by-value single arguments, however.
The 90/10 rule applies; 90% of your code runs for less than 10% of the time. The odds are this will have no impact on your critical path.
CodePudding user response:
When trying to answer questions of performance, examining unoptimized code is largely irrelevant.
As a matter of fact, even examining the results of -O1 optimization is not particularly useful, because it does not give you the best that the compiler can achieve. You should try at least -O2.
Regardless of the above, the sample code you provided is unsuitable for examination, because you should be making sure that the values of a and b are separately unknowable by the compiler. As the code stands, the compiler does not know what the value of input is, but it does know that a and b will have the same value, so it optimizes the code in ways that make it impossible to derive any useful conclusions from it.
As a general rule, compilers tend to do an exceptionally good job when dealing with structs that fit within machine words, to the point where, generally, there is absolutely no performance difference between the two scenarios you are considering, or in any of the special cases you are pondering.
CodePudding user response:
Using GCC on Compiler Explorer, the version with the struct produces fewer instructions in -O3 mode.
Code:
#include <stdint.h>
typedef struct {
uint16_t low;
uint16_t high;
} __attribute__((packed)) A;
typedef uint32_t B;
int f1(A a)
{
return a.low - a.high;
}
int f2(B b)
{
return b&0xffff - (b>>16)&0xffff;
}
Assembly:
_Z2f11A:
movzwl %di,