Why does accessing a global static variable improve performance compared to stack variables?

I was trying to understand the performance of global static variables and came across a very strange scenario. The code below takes about 525 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i + s_Data;
    }

    return 0;
}

The code below, by contrast, takes about 1050 ms on average.

static unsigned long long s_Data = 1;

int main()
{
    unsigned long long x = 0;

    for (int i = 0; i < 1'000'000'000; i++)
    {
        x += i;
    }

    return 0;
}

Based on my other tests, I am aware that reading static variables is fast and writing to them is slow, but I am not sure what I am missing in the scenario above. Note: compiler optimizations were turned off, and the tests were compiled with MSVC.

CodePudding user response：

To address the actual question: with optimizations turned off, we can look at the generated assembly to get an idea of why one runs more quickly than the other.

For the first test, GCC (trunk) produces this assembly (https://godbolt.org/z/GdssT9vME):

s_Data:
        .quad   1
main:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], 0
        mov     DWORD PTR [rbp-12], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-12]
        movsx   rdx, eax
        mov     rax, QWORD PTR s_Data[rip]
        add     rax, rdx
        add     QWORD PTR [rbp-8], rax
        add     DWORD PTR [rbp-12], 1
.L2:
        cmp     DWORD PTR [rbp-12], 999999999
        jle     .L3
        mov     eax, 0
        pop     rbp
        ret

For the second test (https://godbolt.org/z/5ndnEv5Ts) we get:

main:
        push    rbp
        mov     rbp, rsp
        mov     QWORD PTR [rbp-8], 0
        mov     DWORD PTR [rbp-12], 0
        jmp     .L2
.L3:
        mov     eax, DWORD PTR [rbp-12]
        cdqe
        add     QWORD PTR [rbp-8], rax
        add     DWORD PTR [rbp-12], 1
.L2:
        cmp     DWORD PTR [rbp-12], 999999999
        jle     .L3
        mov     eax, 0
        pop     rbp
        ret

Comparing these two programs, the first is sixteen instructions, while the second is only fourteen. Instruction count alone does not determine run time, though, because different instructions have different CPU cycle costs. See the related question "How many CPU cycles are needed for each assembly instruction?"
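
To put the measured times in per-iteration terms, here is a minimal sketch of the arithmetic. The helper names and the clock frequency parameter are assumptions for illustration; the question does not state the CPU or its clock speed, and the timings come from MSVC while the listings above come from GCC.

```cpp
#include <cstdint>

// Average time per loop iteration in nanoseconds, given a total in milliseconds.
// (Hypothetical helper, not part of the original tests.)
constexpr double ns_per_iteration(double total_ms, std::uint64_t iterations) {
    return total_ms * 1'000'000.0 / static_cast<double>(iterations);
}

// Approximate cycles per iteration for an ASSUMED clock frequency in GHz.
// The real frequency of the machine in the question is unknown.
constexpr double cycles_per_iteration(double ns_per_iter, double assumed_ghz) {
    return ns_per_iter * assumed_ghz;
}
```

With the numbers from the question, `ns_per_iteration(525.0, 1'000'000'000)` is about 0.525 ns and `ns_per_iteration(1050.0, 1'000'000'000)` is about 1.05 ns. So the loop with two more instructions per iteration finishes each iteration in roughly half the time, which is why counting instructions is not enough on its own.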

As noted in my comment, optimizations vastly change the generated assembly.
With -O2, both tests compile to this:

main:
        xor     eax, eax
        ret
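
Both loops disappear at -O2 because `x` is never used, so the compiler is free to delete the work. Below is a minimal sketch, assuming you want to compare the two loops with optimizations on: each loop returns its sum so the result is observable, and a small timing helper wraps the call. The names `sum_with_static`, `sum_plain`, and `time_ms` are hypothetical and not part of the original tests.

```cpp
#include <chrono>

static unsigned long long s_Data = 1;

// Same arithmetic as the first test: adds i + s_Data on every iteration.
unsigned long long sum_with_static(int n) {
    unsigned long long x = 0;
    for (int i = 0; i < n; i++) {
        x += i + s_Data;
    }
    return x;
}

// Same arithmetic as the second test: adds only i.
unsigned long long sum_plain(int n) {
    unsigned long long x = 0;
    for (int i = 0; i < n; i++) {
        x += i;
    }
    return x;
}

// Hypothetical helper: calls f(n) once, stores its result, and returns
// the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F f, int n, unsigned long long& result) {
    const auto start = std::chrono::steady_clock::now();
    result = f(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A call such as `time_ms(sum_plain, 1'000'000'000, result)` followed by printing `result` keeps the sum observable. Even then, an optimizer may replace such a simple loop with a closed-form expression, so timings taken at -O2 may still not reflect the loop itself.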