Is the if-branch faster than the else branch?


I came across this very nice infographic which gives a rough estimate of the CPU cycles used for certain operations. While studying it I noticed an entry "Right branch of if", which I assumed is the branch "if" takes when the condition is met (EDIT: as pointed out in the comments, "Right" actually means "a rightly predicted branch"). It made me wonder if there's any (even minor) speed difference between the if-branch and the else branch.

Compare for example the following very condensed code:


#include <cstdio>

volatile int a = 2;

int main()
{
    if (a > 5) {
        printf("a > 5!");
        a = 2;
    } else {
        printf("a <= 5!");
        a = 3;
    }
}

Which generates this assembly for x86-64:

.LC0:
        .string "a > 5!"
.LC1:
        .string "a <= 5!"
main:
        push    rcx
        mov     eax, DWORD PTR a[rip]
        cmp     eax, 5
        jle     .L2
        mov     edi, OFFSET FLAT:.LC0
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 2
        jmp     .L3
.L2:
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    printf
        mov     DWORD PTR a[rip], 3
.L3:
        xor     eax, eax
        pop     rdx
        ret
a:
        .long   2

As you can see, the "right" branch, which calls printf for "a > 5", lives closer to the cmp instruction. A situation might therefore arise where the instructions to call printf are already loaded in a line of the instruction cache, whereas the else branch must first be fetched from the L2 or L3 cache. Could this (in theory) lead to a tiny speed-up of the "Right" branch before the branch prediction pattern is established?

CodePudding user response:

No, there's no real difference between the if and else branch, and that's not what the graphic you linked means by the "right" branch. By "right", it means that the CPU's branch prediction correctly guessed which branch would be taken before the comparison was done.

Modern CPUs are pipelined. They start working on the next instruction before they're done with the current one. This can significantly improve performance, since it means that the CPU can keep all of its different sub-components (the instruction decoder, different parts of the ALU, the memory controller, etc) busy all the time instead of leaving them idle while another component of the CPU does work. I.e. it can be performing an addition, fetching an operand from memory, and decoding an instruction all at the same time.

This pipelining depends on knowing what instruction will be executed next though, and conditional branch instructions throw a wrench in it. In your example, the CPU doesn't know if the next instruction that will need to be executed after jle .L2 is mov edi, OFFSET FLAT:.LC0 or mov edi, OFFSET FLAT:.LC1 until after the cmp instruction is done, so it guesses. It starts working on one of the branches, and when the cmp finally finishes it looks to see if it guessed right. If it did, great, the partial work it's done on the following instructions is valid and it keeps going like normal. If it guessed wrong though, all of the work it did on instructions after the jle has to be thrown away, and it starts working on the other branch, which costs some time. An occasional incorrect guess won't make a noticeable performance difference, but if it guesses wrong very often it can start to add up and make a big difference.

See also the highest rated answer on StackOverflow.


Note that on some older or embedded CPUs the branch prediction algorithm is essentially just "always guess the first one", so on such CPUs the else path would be slower, since branch prediction would never guess it by default. On those CPUs, GCC's __builtin_expect or C++20's [[likely]] attribute can be used to tell the compiler which branch is more likely, so that it generates assembly that makes the more likely path "the first one", even if it's the else.

CodePudding user response:

To put it shortly: no on mainstream processors, but yes on some old/embedded processors.

Modern mainstream processors are very good at predicting conditions (assuming they are not executed for the first time, or the test can be done early). When a branch is correctly predicted, there is almost no cost (besides the use of branching units executed on some specific ports). The processor can speculatively fetch and execute the instructions of the predicted conditional block. When the branch is not correctly predicted, the processor must perform a kind of rollback operation that is pretty expensive (typically 5-15 cycles).

Some embedded and older processors use static prediction algorithms. For example, they can assume that the branch is never taken. On such processors, executing the if block is typically faster than the else one, assuming the compiler does not reorder the blocks (which is quite frequent when optimizations are enabled). Built-in hints can be provided by developers to help the compiler generate code that is executed more efficiently by processors using static prediction. Profile-guided optimization can be used to automatically find conditions that are often true/false and reorder branches accordingly for the sake of performance.

Mainstream (server, desktop and high-end mobile) processors mainly use dynamic prediction algorithms. For example, they can track and memorize whether a branch was taken or not, so as to estimate the probability of it being taken in the future. A processor typically tracks only a limited set of conditional branches. Thus, when code has too many (nested) conditional branch instructions, a static prediction algorithm can be used as a fallback instead.

It is worth mentioning that the prediction information can be reset/flushed in some cases, for example when a process is pre-empted (as pointed out by @HenriqueBucher). This means prediction can be far less effective when there are many context switches. Note that speculation can be partially controlled by specific instructions so as to mitigate vulnerabilities like Spectre.

A jump to a far, unpredicted location can be expensive, since the processor needs to fetch instructions that may not be in the instruction cache. In your case it almost certainly does not matter on mainstream x86-64 processors, since a recent x86-64 processor is expected to load all the instructions of such a small program into the cache pretty quickly. For example, a Skylake processor can fetch 16 bytes/cycle from the instruction cache, while a Zen 2 processor reaches 32 bytes/cycle. Both load 64-byte cache lines into the instruction cache.

There is not a lot of public information about the branch prediction algorithms of recent Intel processors. That of AMD Zen 2 processors is pretty well documented: it uses an efficient TAGE predictor combined with a perceptron predictor to predict the outcome of a condition based on past statistics. You can find detailed information about its behaviour here.
