I'm using godbolt to get assembly of the following program:
#include <stdio.h>

volatile int a = 5;
volatile int res = 0;

int main() {
    res = a * 36;
    return 1;
}
If I use -Os optimization, the generated code is natural:
mov eax, DWORD PTR a[rip]
imul eax, eax, 36
mov DWORD PTR res[rip], eax
But if I use -O2, the generated code is this:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8]
sal eax, 2
mov DWORD PTR res[rip], eax
So instead of multiplying 5*36 directly, it computes 5 -> 5 + 5*8 = 45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.
But the lea instruction needs to calculate rax + rax*8, which contains 1 addition + 1 multiplication. So why is it still faster than just 1 imul? Is it because memory addressing inside lea is free?
Edit 1: also, how does [rax+rax*8] get translated into machine code? Does it get compiled down to 2 additional instructions (shl rbx, rax, 3; add rax, rax, rbx;), or something else?
Edit 2: Surprising results below. I make a loop, generate code using -O2, then copy the file and replace the segment above with the code from -Os. So the 2 assembly files are identical everywhere except for the instructions we're benchmarking. Running on Windows, the commands are:
gcc mul.c -O2 -S -masm=intel -o mulo2.s
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"
#include <stdio.h>

volatile int a = 5;
volatile int res = 0;

int main() {
    size_t LOOP = 1000 * 1000 * 1000;
    LOOP = LOOP * 10;
    size_t i = 0;
    while (i < LOOP) {
        i++;
        res = a * 36;
    }
    return 0;
}
; mulo2.s
.file "mul.c"
.intel_syntax noprefix
.text
.def __main; .scl 2; .type 32; .endef
.section .text.startup,"x"
.p2align 4
.globl main
.def main; .scl 2; .type 32; .endef
.seh_proc main
main:
sub rsp, 40
.seh_stackalloc 40
.seh_endprologue
call __main
movabs rdx, 10000000000
.p2align 4,,10
.p2align 3
.L2:
mov eax, DWORD PTR a[rip]
lea eax, [rax+rax*8] ; replaces these 2 lines with
sal eax, 2           ; imul eax, eax, 36
mov DWORD PTR res[rip], eax
sub rdx, 1
jne .L2
xor eax, eax
add rsp, 40
ret
.seh_endproc
.globl res
.bss
.align 4
res:
.space 4
.globl a
.data
.align 4
a:
.long 5
.ident "GCC: (GNU) 9.3.0"
Surprisingly, the result is that the -Os version is consistently faster than the -O2 version (4.1s vs 5s average, Intel 8750H CPU; each .exe file is run several times). So in this case the compiler has optimized wrongly. Could someone provide a new explanation given this benchmark?
CodePudding user response:
You can see the cost of instructions on most mainstream architectures here and there. Based on that, and assuming you use for example an Intel Skylake processor, you can see that one 32-bit imul instruction can be computed per cycle, but with a latency of 3 cycles. In the optimized code, 2 lea instructions (which are very cheap) can be executed per cycle, with a 1-cycle latency. The same applies to the sal instruction (2 per cycle, 1 cycle of latency).
This means the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles of latency (not taking into account the load/store instructions, which are the same). Moreover, the second version can be better pipelined, since the two instructions can be executed for two different input data in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle. This means the execution is bounded by the throughput of store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like new AMD Ryzen processors. The second version is expected to be as fast as, or possibly faster than, the first on the chosen target (Intel Skylake processors). It should be significantly faster on very recent x86-64 processors.
Note that the lea instruction is very fast because the multiply-add is done by a dedicated (hard-wired) CPU unit, and it only supports some specific constants for the multiplication (supported scale factors are 1, 2, 4 and 8, which means lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
UPDATE:
I can reproduce the slower execution with -O2 using GCC 10.3.
This slower execution may be due to the alignment of the loop instructions. Indeed, the loop instructions of the -O2 version cross a cache-line boundary (they are stored in 2 cache lines), while this is not the case with -Os (only 1 cache line is used). This introduces an additional cost, as many processors can load & decode no more than one cache line of code per cycle.
The generated assembly for the two versions can be found here. With -Os, we can see this loop:
401029: (16 bytes)
mov edx,DWORD PTR [rip+0x2ff9] # 404028 <a>
imul edx,edx,0x24
mov DWORD PTR [rip+0x2ff8],edx # 404030 <res>
dec rax
jne 401029 <main+0x9>
With -O2, we can see this loop:
401030: (20 bytes)
mov eax,DWORD PTR [rip+0x2ff2] # 404028 <a>
lea eax,[rax+rax*8]
shl eax,0x2
mov DWORD PTR [rip+0x2fee],eax # 404030 <res>
sub rdx,0x1
jne 401030 <main+0x10>
Another probable additional source of slowdown with -O2 is that the loop is bigger and needs more time to be decoded. This can have a huge impact on performance in this case, since Skylake can only decode 16 bytes per cycle. Thus, the loop is likely bound by the speed of the decoders with -O2.
Related:
- What's the purpose of the LEA instruction?
- Why is my loop much faster when it is contained in one cache line?
- How many ways-superscalar are modern Intel processors?
- https://en.wikipedia.org/wiki/Superscalar_processor
CodePudding user response:
tl;dr: Because LEA doesn't do full-fledged multiplication.
While @JeromeRichard's answer is correct, the underlying kernel of truth is hidden in its last sentence: with LEA, you can only scale an operand by a specific constant, which is a power of two. Thus, instead of needing a large dedicated circuit for multiplication, it only needs a small sub-circuit for shifting one of its operands by a fixed amount.
Other fixed multipliers, such as 5 or 10, can also be realized cheaply, but those will still not be general multiplication: they can be built as 5x = (x << 2) + x or 10x = (x << 3) + (x << 1), which is still much smaller than general multiplication.