Why is this += loop faster than the equivalent = loop?

I wrote this simple loop in C++ (Visual Studio 2019 in Release mode) that goes through one million 32-bit integers and assigns each to the previous:

    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] = ints32[i];

And here's a loop that's the same except it uses +=:

    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] += ints32[i];

I'm measuring both of these functions by running them 20 times, discarding the 5 slowest and 5 fastest runs, and averaging the rest. (Before each measurement, I fill the array with random integers.) I find that the assign-only loop takes around 460 microseconds, but the loop with addition takes 320 microseconds. These measurements are consistent across multiple runs (they vary a bit, but the addition is always faster), even when I change the order in which I measure the two functions.

Why is the loop with addition faster than the loop without addition? I would think the addition would make it take longer, if anything.

Here is the disassembly, and you can see that the functions are equivalent except that the addition loop does more work (and sets eax to ints[i - 1] instead of ints[i]).

    for ( int i = 1; i < Iters; i++ )
00541670 B8 1C 87 54 00       mov         eax,54871Ch  
        ints32[i - 1] = ints32[i];
00541675 8B 08                mov         ecx,dword ptr [eax]  
00541677 89 48 FC             mov         dword ptr [eax-4],ecx  
0054167A 83 C0 04             add         eax,4  
0054167D 3D 18 90 91 00       cmp         eax,offset floats32 (0919018h)  
00541682 7C F1                jl          IntAssignment32+5h (0541675h)  
}
00541684 C3                   ret  

    for ( int i = 1; i < Iters; i++ )
00541700 B8 18 87 54 00       mov         eax,offset ints32 (0548718h)  
        ints32[i - 1] += ints32[i];
00541705 8B 08                mov         ecx,dword ptr [eax]  
00541707 03 48 04             add         ecx,dword ptr [eax+4]  
0054170A 89 08                mov         dword ptr [eax],ecx  
0054170C 83 C0 04             add         eax,4  
0054170F 3D 14 90 91 00       cmp         eax,919014h  
00541714 7C EF                jl          IntAddition32+5h (0541705h)  
}
00541716 C3                   ret

(The int array is volatile because I didn't want the compiler to optimize it away or vectorize it or anything, and indeed the disassembly it's producing is what I wanted to measure.)

Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?

I'm using the default Visual Studio 2019 Win32 Release compiler options (copied this from the project properties):

/permissive- /ifcOutput "Release\" /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\profile.pch" /diagnostics:column

Compiler version: Microsoft Visual Studio Community 2019 Version 16.11.15

Here's the complete code:

#include <iostream>
#include <chrono>
#include <random>
#include <algorithm>
#include <cstdint> // uint64_t
#include <cstdio>  // printf

const int ValueRange = 100000000;
std::default_random_engine generator;
std::uniform_int_distribution< int > distribution( 1, ValueRange - 1 );

const int Iters = 1000000; // nanoseconds -> milliseconds

volatile int ints32[Iters];

void InitArrayInt32()
{
    for ( int i = 0; i < Iters; i++ )
        ints32[i] = distribution( generator );
}

const int SampleCount = 20;
const int KeepSampleCount = SampleCount - 2 * (SampleCount / 4);

float ProfileFunction( void(*setup)(), void(*func)() )
{
    uint64_t times[SampleCount];

    for ( int i = 0; i < SampleCount; i++ )
    {
        setup();

        auto startTime = std::chrono::steady_clock::now();

        func();

        auto endTime = std::chrono::steady_clock::now();
        times[i] = std::chrono::duration_cast<std::chrono::microseconds>( endTime - startTime ).count();
    }

    std::sort( times, times + SampleCount );
    uint64_t total = 0;
    for ( int i = SampleCount / 4; i < SampleCount - SampleCount / 4; i++ )
        total += times[i];
    return total * (1.0f / KeepSampleCount);
}

void IntAssignment32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] = ints32[i];
}
void IntAddition32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] += ints32[i];
}

int main()
{
    float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
    float addition = ProfileFunction( InitArrayInt32, IntAddition32 );
    printf( "assignment: %g\n", assignment );
    printf( "addition: %g\n", addition );
    return 0;
}

CodePudding user response：

> Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?

The reason one is faster than the other is luck. The variance in speed caused by the alignment of the code and the data has far more influence than the minor difference in the generated code.

Even though you measure multiple times, every run measures the same alignment (only to some degree if address space layout randomization is enabled). Repeating the measurement removes the noise from the scheduler and interrupts, but not the noise from alignment changes.
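
To see which alignment a given build happens to be measuring, one could print where the array and the two functions start relative to a cache line. This is a minimal sketch, assuming 64-byte cache lines (typical on current x86, but an assumption here) and assuming function pointers fit in `std::uintptr_t`, which holds on mainstream x86 toolchains. `OffsetWithin` and `CodeOffsetWithin` are hypothetical helpers, not part of the question's code:

```cpp
#include <cstddef>
#include <cstdint>

// Byte offset of a data address inside a block of `boundary` bytes.
// `boundary` must be a power of two; 64 is the assumed cache-line size.
std::size_t OffsetWithin( const volatile void* p, std::size_t boundary = 64 )
{
    return static_cast<std::size_t>( reinterpret_cast<std::uintptr_t>( p ) & ( boundary - 1 ) );
}

// Same calculation for a function's entry point.
// Assumes a function pointer fits in std::uintptr_t (true on common x86 compilers).
std::size_t CodeOffsetWithin( void (*f)(), std::size_t boundary = 64 )
{
    return static_cast<std::size_t>( reinterpret_cast<std::uintptr_t>( f ) & ( boundary - 1 ) );
}
```

Printing `OffsetWithin( ints32 )`, `CodeOffsetWithin( IntAssignment32 )` and `CodeOffsetWithin( IntAddition32 )` before and after an unrelated change to the program would show whether the layout moved along with the timings.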

But when you change something in the code, you change the alignment: for example, you shift the array by 4 bytes, or shift the code by 1 byte. That can actually have a larger impact on speed than the difference between -O0 and -O2 in gcc. There are lots of papers on this effect and related ones.
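
One way to probe the data half of this is to run the same loop on a view of a larger buffer that starts a few bytes further along each time. This is a rough sketch with hypothetical names (`buffer`, `ShiftLeft`, `TimeAtOffset`, `MaxShift`). It assumes 4-byte `int` and 64-byte cache lines, it only moves the data (not the code), and it takes a single sample per offset, so the numbers will be noisy:

```cpp
#include <chrono>
#include <cstdint>

const int Count = 1000000;  // elements touched per run, as in the question
const int MaxShift = 16;    // spare elements so the start can slide by up to 64 bytes

// Over-allocated and aligned to 64 bytes so that shift 0 is a known starting point.
alignas(64) volatile int buffer[Count + MaxShift];

// Same loop body as the question's IntAssignment32, but on an arbitrary start pointer.
void ShiftLeft( volatile int* a, int n )
{
    for ( int i = 1; i < n; i++ )
        a[i - 1] = a[i];
}

// Time one run with the data starting `shift` ints (4 * shift bytes) into the buffer.
std::int64_t TimeAtOffset( int shift )
{
    for ( int i = 0; i < Count + MaxShift; i++ )
        buffer[i] = i;  // deterministic fill before each run

    auto start = std::chrono::steady_clock::now();
    ShiftLeft( buffer + shift, Count );
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>( end - start ).count();
}
```

Calling `TimeAtOffset( s )` for `s` from 0 to `MaxShift` and comparing the results gives a first impression of how sensitive the loop is to where the data starts. A real measurement would reuse the question's trimmed averaging rather than a single sample per offset.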