I wrote this simple loop in C (Visual Studio 2019 in Release mode) that goes through one million 32-bit integers and assigns each to the previous:
for ( int i = 1; i < Iters; i )
ints32[i - 1] = ints32[i];
And here's a loop that's the same except it uses =
:
for ( int i = 1; i < Iters; i )
ints32[i - 1] = ints32[i];
I'm measuring both of these functions by running them 20 times, discarding the 5 slowest and 5 fastest runs, and averaging the rest. (Before each measurement, I fill the array with random integers.) I find that the assign-only loop takes around 460 microseconds, but the loop with addition takes 320 microseconds. These measurements are consistent across multiple runs (they vary a bit, but the addition is always faster), even when I change the order in which I measure the two functions.
Why is the loop with addition faster than the loop without addition? I would think the addition would make it take longer, if anything.
Here is the disassembly, and you can see that the functions are equivalent except that the addition loop does more work (and sets eax
to ints[i - 1]
instead of ints[i]
).
for ( int i = 1; i < Iters; i )
00541670 B8 1C 87 54 00 mov eax,54871Ch
ints32[i - 1] = ints32[i];
00541675 8B 08 mov ecx,dword ptr [eax]
00541677 89 48 FC mov dword ptr [eax-4],ecx
0054167A 83 C0 04 add eax,4
0054167D 3D 18 90 91 00 cmp eax,offset floats32 (0919018h)
00541682 7C F1 jl IntAssignment32 5h (0541675h)
}
00541684 C3 ret
for ( int i = 1; i < Iters; i )
00541700 B8 18 87 54 00 mov eax,offset ints32 (0548718h)
ints32[i - 1] = ints32[i];
00541705 8B 08 mov ecx,dword ptr [eax]
00541707 03 48 04 add ecx,dword ptr [eax 4]
0054170A 89 08 mov dword ptr [eax],ecx
0054170C 83 C0 04 add eax,4
0054170F 3D 14 90 91 00 cmp eax,919014h
00541714 7C EF jl IntAddition32 5h (0541705h)
}
00541716 C3 ret
(The int array is volatile
because I didn't want the compiler to optimize it away or vectorize it or anything, and indeed the disassembly it's producing is what I wanted to measure.)
Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?
I'm using the default Visual Studio 2019 Win32 Release compiler options (copied this from the project properties):
/permissive- /ifcOutput "Release\" /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\profile.pch" /diagnostics:column
Compiler version: Microsoft Visual Studio Community 2019 Version 16.11.15
Here's the complete code:
#include <iostream>
#include <chrono>
#include <random>
#include <algorithm>
const int ValueRange = 100000000;
std::default_random_engine generator;
std::uniform_int_distribution< int > distribution( 1, ValueRange - 1 );
const int Iters = 1000000; // nanoseconds -> milliseconds
volatile int ints32[Iters];
void InitArrayInt32()
{
for ( int i = 0; i < Iters; i )
ints32[i] = distribution( generator );
}
const int SampleCount = 20;
const int KeepSampleCount = SampleCount - 2 * (SampleCount / 4);
float ProfileFunction( void(*setup)(), void(*func)() )
{
uint64_t times[SampleCount];
for ( int i = 0; i < SampleCount; i )
{
setup();
auto startTime = std::chrono::steady_clock::now();
func();
auto endTime = std::chrono::steady_clock::now();
times[i] = std::chrono::duration_cast<std::chrono::microseconds>( endTime - startTime ).count();
}
std::sort( times, times SampleCount );
uint64_t total = 0;
for ( int i = SampleCount / 4; i < SampleCount - SampleCount / 4; i )
total = times[i];
return total * (1.0f / KeepSampleCount);
}
void IntAssignment32()
{
for ( int i = 1; i < Iters; i )
ints32[i - 1] = ints32[i];
}
void IntAddition32()
{
for ( int i = 1; i < Iters; i )
ints32[i - 1] = ints32[i];
}
int main()
{
float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
float addition = ProfileFunction( InitArrayInt32, IntAddition32 );
printf( "assignment: %g\n", assignment );
printf( "addition: %g\n", addition );
return 0;
}
CodePudding user response:
Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?
The reason why one is faster then the other is luck. The variance in speed caused by the alignment of the code and the data has far more influence than the minor difference in the generated code.
Even though you measure multiple times what you measure is always the same alignment (to some degree if you have address space randomization). This eliminates the noise from the scheduler and interrupts but not the noise from alignment changes.
But when you change something in the code you change the alignment, e.g. shift the array by 4 byte. Of shift the code by 1 byte. And that has actually a larger impact to the speed than the difference between -O0 and -O2 in gcc. There are lots of papers on this effect and more.