Assembly why is "lea eax, [eax+eax*const]" faster than "imul eax, eax, const"


I'm using godbolt to get assembly of the following program:

#include <stdio.h>
volatile int a = 5;
volatile int res = 0;
int main() {
    res = a * 36;
    return 1;
}

If I use -Os optimization, the generated code is natural:

mov     eax, DWORD PTR a[rip]
imul    eax, eax, 36
mov     DWORD PTR res[rip], eax

But if I use -O2, the generated code is this:

mov     eax, DWORD PTR a[rip]
lea     eax, [rax+rax*8]
sal     eax, 2
mov     DWORD PTR res[rip], eax

So instead of multiplying 5*36, it does 5 -> 5+5*8 = 45 -> 45*4 = 180. I assume this is because 1 imul is slower than 1 lea + 1 shift left.

But in the lea instruction, it needs to calculate rax+rax*8, which contains 1 addition + 1 multiplication. So why is it still faster than just 1 imul? Is it because memory addressing inside lea is free?

Edit 1: also, how does [rax+rax*8] get translated into machine code? Does it get compiled down to 2 additional instructions (in three-operand pseudocode: shl rbx, rax, 3; add rax, rax, rbx), or something else?

Edit 2: Surprising results below. I made a loop, then generated code using -O2, then copied the file and replaced the segment above with the code from -Os. So the 2 assembly files are identical everywhere except for the instructions we're benchmarking. Running on Windows, the commands are

gcc mul.c -O2 -S -masm=intel -o mulo2.s 
gcc mulo2.s -o mulo2
// replace line of code in mulo2.s, save as muls.s
gcc muls.s -o muls
cmd /v:on /c "echo !time! & START "TestAgente" /W mulo2 & echo !time!"
cmd /v:on /c "echo !time! & START "TestAgente" /W muls & echo !time!"

#include <stdio.h>

volatile int a = 5;
volatile int res = 0;

int main() {
    size_t LOOP = 1000 * 1000 * 1000;
    LOOP = LOOP * 10;
    size_t i = 0;
    while (i < LOOP) {
      i++;
      res = a * 36;
    }

    return 0;
}

; mulo2.s
    .file   "mul.c"
    .intel_syntax noprefix
    .text
    .def    __main; .scl    2;  .type   32; .endef
    .section    .text.startup,"x"
    .p2align 4
    .globl  main
    .def    main;   .scl    2;  .type   32; .endef
    .seh_proc   main
main:
    sub rsp, 40
    .seh_stackalloc 40
    .seh_endprologue
    call    __main
    movabs  rdx, 10000000000
    .p2align 4,,10
    .p2align 3
.L2:
    mov eax, DWORD PTR a[rip]
    lea eax, [rax+rax*8] ; replace these 2 lines with
    sal eax, 2           ; imul eax, eax, 36
    mov DWORD PTR res[rip], eax
    sub rdx, 1
    jne .L2
    xor eax, eax
    add rsp, 40
    ret
    .seh_endproc
    .globl  res
    .bss
    .align 4
res:
    .space 4
    .globl  a
    .data
    .align 4
a:
    .long   5
    .ident  "GCC: (GNU) 9.3.0"

Surprisingly, the result is that the -Os version is consistently faster than -O2 (4.1s vs 5s on average, Intel i7-8750H CPU, each .exe file run several times). So in this case, the compiler has optimized wrongly. Could someone provide a new explanation given this benchmark?

CodePudding user response:

You can look up the cost of instructions on most mainstream architectures in published instruction tables (for example, Agner Fog's tables or uops.info). Based on those, and assuming for example an Intel Skylake processor, one 32-bit imul instruction can be executed per cycle, with a latency of 3 cycles. In the optimized code, two lea instructions (which are very cheap) can be executed per cycle, each with a 1-cycle latency. The same applies to the sal instruction (2 per cycle, 1 cycle of latency).

This means that the optimized version can be executed with only 2 cycles of latency, while the first one takes 3 cycles (not counting the load/store instructions, which are the same in both). Moreover, the second version can be pipelined better, since its two instructions can be executed for two different inputs in parallel thanks to superscalar out-of-order execution. Note that two loads can be executed in parallel too, although only one store can be executed per cycle on Skylake. This means the execution is bounded by the throughput of the store instructions: overall, only one value can be computed per cycle. AFAIK, recent Intel Ice Lake processors can do two stores in parallel, like the newer AMD Ryzen processors. On the chosen target (Intel Skylake), the second version is expected to be as fast as, or possibly faster than, the first; it should be significantly faster on very recent x86-64 processors.
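
To summarize the relevant numbers in one place (a sketch for Skylake, restating the figures above):

; instruction              latency    throughput
; imul eax, eax, 36        3 cycles   1 per cycle
; lea  eax, [rax+rax*8]    1 cycle    2 per cycle
; sal  eax, 2              1 cycle    2 per cycle
;
; dependency chain with -O2: lea -> sal = 2 cycles
; dependency chain with -Os: imul       = 3 cycles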

Note that the lea instruction is very fast because the multiply-add is done by a dedicated, hard-wired CPU unit, and it only supports specific constants for the multiplication (the supported scale factors are 1, 2, 4 and 8, which means lea can be used to multiply an integer by the constants 2, 3, 4, 5, 8 and 9). This is why lea is faster than imul/mul.
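
For illustration (Intel syntax; these examples are mine, not part of the original answer), here are the multiplications a single lea can express, followed by the decomposition GCC actually picked for 36 = 9 * 4:

lea eax, [rax+rax]      ; x*2
lea eax, [rax+rax*2]    ; x*3
lea eax, [rax*4]        ; x*4
lea eax, [rax+rax*4]    ; x*5
lea eax, [rax*8]        ; x*8
lea eax, [rax+rax*8]    ; x*9

lea eax, [rax+rax*8]    ; x*9
sal eax, 2              ; (x*9)*4 = x*36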


UPDATE:

I can reproduce the slower execution with -O2 using GCC 10.3.

This slower execution may be due to the alignment of the loop instructions. Indeed, the loop body of the -O2 version crosses a cache-line boundary (it is spread across 2 cache lines), while this is not the case with -Os (only 1 cache line is used). This introduces an additional cost, as many processors can fetch and decode no more than one cache line of code per cycle.

The generated assembly for the two loops is shown below. With -Os, we can see this loop:

401029: (16 bytes)
 mov    edx,DWORD PTR [rip+0x2ff9]        # 404028 <a>
 imul   edx,edx,0x24
 mov    DWORD PTR [rip+0x2ff8],edx        # 404030 <res>
 dec    rax
 jne    401029 <main+0x9>

With -O2, we can see this loop:

401030: (20 bytes)
 mov    eax,DWORD PTR [rip+0x2ff2]        # 404028 <a>
 lea    eax,[rax+rax*8]
 shl    eax,0x2
 mov    DWORD PTR [rip+0x2fee],eax        # 404030 <res>
 sub    rdx,0x1
 jne    401030 <main+0x10>
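
Assuming the usual 64-byte cache lines (boundaries at 0x401000, 0x401040, ...), the arithmetic checks out: the -Os loop occupies bytes 0x401029-0x401038 and fits entirely within the line ending at 0x40103F, while the -O2 loop occupies bytes 0x401030-0x401043 and straddles the boundary at 0x401040.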

Another probable source of slowdown with -O2 is that the loop is bigger and needs more time to be decoded. This can have a huge impact on performance in this case, since Skylake can only decode 16 bytes of code per cycle. Thus, the loop is likely bound by the speed of the decoders with -O2.



CodePudding user response:

tl;dr: Because LEA doesn't do full-fledged multiplication.

While @JeromeRichard's answer is correct, the underlying kernel of truth is hidden in its last sentence: with LEA, the only "multiplication" available is scaling one operand by a constant power of two. Thus, instead of needing a large dedicated circuit for multiplication, it only needs a small sub-circuit for shifting one of its operands by a fixed amount.
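
As a concrete illustration of how small that sub-circuit is (the encoding below is my own addition; any disassembler will confirm it), the scale factor occupies just two bits of the instruction's SIB byte, so x1, x2, x4 and x8 are the only encodable options. This also answers Edit 1 of the question: the whole address computation is one 3-byte instruction, not a shl/add pair.

lea eax, [rax+rax*8]    ; encodes as: 8D 04 C0
                        ; 8D = lea opcode
                        ; 04 = ModRM: destination eax, a SIB byte follows
                        ; C0 = SIB: scale = 11b (x8), index = rax, base = rax
                        ; the 2-bit scale field only encodes shifts of 0..3, i.e. x1/x2/x4/x8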

A LEA-like instruction could support other fixed multipliers, such as 5 or 10, but those would still not be general multiplication: they can be realized as 5x = (x << 2) + x or 10x = (x << 3) + (x << 1), which is still much smaller circuitry than a general multiplier.
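
In practice, compilers synthesize such multipliers from the pieces that do exist; a sketch of x*10 (my own example):

lea eax, [rax+rax*4]    ; x*5  = (x << 2) + x
add eax, eax            ; x*10 = (x*5) * 2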
