How can I measure the speed difference of a for loop?


I am curious about the items below regarding for loops.

  1. for(auto) vs for(auto &)

  2. Separating the for loop

  3. for(auto &) vs for(const auto &)

  4. for(int : list) vs for(auto : list) [list is integer vector]

So I wrote the code below for testing, compiled as C++17.

There seems to be a difference in a CMake debug build (without optimization):

// In debug mode
1. elapsed: 7639 (1663305922550 - 1663305914911)
2. elapsed: 3841 (1663305926391 - 1663305922550)
3. elapsed: 3810 (1663305930201 - 1663305926391)

But in release mode (with gcc -O3) there is no difference between 1–3:

// release mode
1. elapsed: 0 (1663305408984 - 1663305408984)
2. elapsed: 0 (1663305409984 - 1663305409984)
3. elapsed: 0 (1663305410984 - 1663305410984)

I don't know whether my test method is wrong, or whether it is expected that there is no difference once optimization is enabled.

Here is my testing source code.

// create test vector
const uint64_t max_ = 499999999;    // 499,999,999
std::vector<int>   v;
for (int i = 1; i < max_; i++)
    v.push_back(i);


// test 1.
auto start1 = getTick();
for (auto& e : v)
{
    auto t = e + 100;    t += 300;
}
for (auto& e : v)
{
    auto t = e + 200;    t += 300;
}
auto end1 = getTick();


// test 2.
// Omit tick function
for (auto& e : v)
{
    auto t1 = e + 100;    t1 += 300;
    auto t2 = e + 200;    t2 += 300;
}


// test 3.
for (auto e : v)
{
    auto t1 = e + 100;    t1 += 300;
    auto t2 = e + 200;    t2 += 300;
}

...

And getTick() returns a millisecond count obtained via std::chrono:

#include <chrono>
using namespace std::chrono;

uint64_t getTick()
{
    return duration_cast<milliseconds>(system_clock::now().time_since_epoch()).count();
}

Also, this testing was done on an aarch64 Linux system:

  • Jetson Xavier NX (jetpack 4.6, ubuntu 18.04LTS)
  • 8 GB RAM
  • GCC 7.5.0

Please advise if there is anything wrong. Thank you!

CodePudding user response:

An empty loop can be optimized away, so your compiler is correct to do that. But benchmarking with optimization disabled is not meaningful: C++ needs optimization to reach the performance we expect in production (especially with templated library code), and optimization is not a constant-factor speedup. Without it, different ways of expressing the same logic compile to different asm, where a normally optimized build would compile them to the same asm.

You can't infer anything from a debug build about what's faster in a release build, certainly not for small micro-optimization questions like this. See also Idiomatic way of performance evaluation?

With optimization enabled, copying into a local object can have most of its work optimized away if you only use one member of that copy. Get used to thinking about what real work actually has to happen for the code; a compiler will often figure out what that minimum is. For example, auto & isn't actually going to put a pointer in a register and dereference it beyond what the loop was already doing to walk the array; the reference variable doesn't exist anywhere in the asm as a separate value in a register or memory.
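
For instance (a hypothetical sketch, not taken from the question's code): even when a range-for copies each element, the optimizer can drop the parts of the copy that are never read.

#include <cstdint>
#include <vector>

struct Point { int64_t x, y; };

int64_t sum_x(const std::vector<Point>& pts)
{
    int64_t total = 0;
    for (auto p : pts)       // copies each Point by value
        total += p.x;        // only .x is read, so a typical optimizer never loads .y
    return total;
}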

So this isn't something you can isolate in a benchmark without some real work in the loop, e.g. summing the array or modifying every element. You could try something like benchmark::DoNotOptimize or similar inline asm to make the compiler materialize a value in a register without doing anything else, but to be sure you're benchmarking exactly the right thing, you need to understand asm and check the compiler output. (Microbenchmarking is hard!) In that case you can probably already answer the question just by looking at the asm and seeing that it's the same either way in normal cases.
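
For example, a minimal sketch using Google Benchmark (assuming the library is installed and linked with -lbenchmark); benchmark::DoNotOptimize(sum) forces the compiler to materialize the result, so the loop can't be deleted:

#include <benchmark/benchmark.h>
#include <vector>

static void BM_RefLoop(benchmark::State& state)
{
    std::vector<int> v(1000000, 1);
    for (auto _ : state) {
        long sum = 0;
        for (auto& e : v)                  // by-reference variant
            sum += e + 100;
        benchmark::DoNotOptimize(sum);     // keep the result alive
    }
}
BENCHMARK(BM_RefLoop);

static void BM_ValueLoop(benchmark::State& state)
{
    std::vector<int> v(1000000, 1);
    for (auto _ : state) {
        long sum = 0;
        for (auto e : v)                   // by-value variant
            sum += e + 100;
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_ValueLoop);

BENCHMARK_MAIN();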

It's probably easier to just check which variants compile to the same asm with optimization enabled, instead of trying to guess whether small differences in experimental timing are noise or a real difference. (And if there is a difference, whether it's just a coincidence of code alignment and surrounding code on this particular CPU.)
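
For instance, you could paste a pair of small functions like these (hypothetical names) into Compiler Explorer with -O3 and diff the output; for an int vector like yours the two loop styles typically produce identical asm:

#include <vector>

long sum_by_ref(const std::vector<int>& v)
{
    long s = 0;
    for (const auto& e : v) s += e;    // element accessed through a reference
    return s;
}

long sum_by_value(const std::vector<int>& v)
{
    long s = 0;
    for (auto e : v) s += e;           // element copied into a local int
    return s;
}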

CodePudding user response:

-O3 performs compile-time optimization and can remove the code entirely. You can try declaring the variables used in the for loop as globals so the compiler can't optimize them away.
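
For example (a rough sketch of that idea, using volatile so the stores count as observable behaviour and survive -O3):

#include <cstdint>
#include <vector>

volatile int64_t sink;    // global sink: stores to it cannot be removed

void touch_all(const std::vector<int>& v)
{
    for (auto& e : v) {
        int64_t t = e + 100;
        t += 300;
        sink = t;         // keeps the loop body from being optimized away
    }
}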
