I've made an implementation of two scan functions in C, because I wanted to test Windows' QueryPerformance API. According to the benchmark data, lscan finishes in 0.000029 whilst bscan finishes in 0.000016, though I cannot see why that is. I test it with an input source of 60157 characters and no optimization flags.
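For reference, the timing uses a QueryPerformanceCounter harness along these lines (a simplified sketch, not my exact benchmark code):

#include <windows.h>

/* Time one call to a scan function, returning elapsed seconds. */
double time_scan(void (*scan)(char *, size_t), char *src, size_t len)
{
    LARGE_INTEGER freq, start, end;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    QueryPerformanceCounter(&start);
    scan(src, len);
    QueryPerformanceCounter(&end);
    return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
}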
void bscan (char *source, size_t sourceLen)
{
    int i = 0;
    int b = sourceLen-1;
    int m = (int)((double)sourceLen/2.0);
    while(i < m)
    {
        if(source[i] == '.'); /* Do something */
        if(source[b] == '.'); /* Do something */
        i++;
        b--;
    }
    if(i==b)
    {
        if(source[i] == '.'); /* Do something */
    }
}
void lscan (char *source, size_t sourceLen)
{
    int i = 0;
    while(i < sourceLen)
    {
        if(source[i] == '.'); /* Do something */
        i++;
    }
}
Is bscan an actual technique that one can use to achieve such a performance gain, and if so, what makes it that much faster?
Revised code
int bscan (char *source, size_t sourceLen)
{
    int c = 0;
    int i = 0;
    int b = sourceLen-1;
    int m = sourceLen/2.0;
    while(i < m)
    {
        if(source[i] == '.') c++;
        if(source[b] == '.') c++;
        i++;
        b--;
    }
    if(i==b)
    {
        if(source[i] == '.') c++;
    }
    return c;
}
int lscan (char *source, size_t sourceLen)
{
    int i = 0, c = 0;
    while(i < sourceLen)
    {
        if(source[i] == '.') c++;
        i++;
    }
    return c;
}
Remark: The performance difference between the two versions is smaller with this revised code.
CodePudding user response:
Your bscan() will perform about half as many index comparisons (i < m and i == b) as your lscan() will (i < sourceLen) on the same input. Accordingly, bscan() will also perform fewer branches. All other operations appear to be the same, and your routines do very little else, so it is entirely plausible that these loop-control operations are significant enough contributors to the overall execution time to explain the performance difference you see.
That's also consistent with the observation that adding work in the loop body reduces the (relative, I assume) performance difference.
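A back-of-envelope count of the loop-control work for the 60157-character input (my arithmetic, derived from the code above) illustrates the point:

/* lscan: 60158 evaluations of i < sourceLen, 60157 increments of i.
 * bscan: m = 30078, so 30079 evaluations of i < m, 30078 increments
 *        of i plus 30078 decrements of b, and one final i == b test.
 * Roughly the same increment/decrement work per character, but about
 * half as many comparisons and loop-back branches. */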
CodePudding user response:
TL:DR: unrolling with two dependency chains (i++ and b--) gets twice as much work done in the same ~6 cycles of store-forwarding bottleneck that a non-optimized build creates.
Your result is explainable given known compiler and CPU behaviour, but it tells you nothing about how the code would perform for real, with optimization enabled. (Or with non-empty if() bodies that might branch-mispredict if not compiled to branchless increments.)
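For reference, the branchless form of such a counting if() - the kind of thing an optimizing compiler often emits - is a one-line sketch:

c += (source[i] == '.');    /* the compare yields 0 or 1; no branch to mispredict */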
It also has nothing to do with theoretical big-O complexity; these are purely engineering / micro-optimization effects in the anti-optimized asm. Counting the number of operations isn't a useful approach to predicting performance, except sometimes to within an order of magnitude. (Or two orders of magnitude, if a difference in cache misses is expected.)
You compiled without optimization, so it's not a realistic benchmark that tells you anything about what's normally efficient in C. See Idiomatic way of performance evaluation? and C loop optimization help for final assignment (with compiler optimization disabled) re: why this is not a useful way to microbenchmark, and the weird distortions it introduces.
The default (-O0 for GCC/clang, but not ICC) is basically anti-optimization: it doesn't keep variables in registers across C statements, which introduces bottlenecks that wouldn't exist in code built with normal options like -O2 or -O3 -march=native.
Specifically, store-to-load forwarding latency creates long dependency chains through the integer loop counters. (About 6 cycles of latency for load/add/store vs. 1 cycle for incrementing a register, on recent Intel and AMD; see https://uops.info/ and https://agner.org/optimize/. The exceptions are Zen 2 and Ice Lake, whose memory renaming can store-forward with 0 extra latency in cases like this; on those CPUs, I'd expect both loops to have similar performance.) You didn't say what CPU you tested on, but I'm guessing some kind of modern x86. Things would be pretty much the same on modern CPUs of other ISAs, at least ones with out-of-order exec so they can overlap independent work across iterations.
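A rough C model of what -O0 does to a loop counter, using volatile to force the same store/reload pattern (an illustration of the effect, not the actual code-gen):

/* Every access to i goes through memory, like an un-optimized local. */
int count_model(size_t m)
{
    volatile int i = 0;     /* lives on the stack, not in a register */
    while (i < (int)m)      /* each test reloads i from memory */
    {
        i = i + 1;          /* load, add, store: the next load waits
                               ~6 cycles for store forwarding */
    }
    return i;
}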
Optimizing for this way of compiling often complicates your source for no actual benefit, or at best means you use register int i.
Your bidirectional loop effectively unrolls with two separate dependency chains, splitting the length in two. That 6-cycle-per-iteration bottleneck is plenty for modern out-of-order exec CPUs to keep up with the throughput of all the per-iteration work that depends on reading each i++ (and b--) result, even in the loop that does both: 6 cycles is a long time for a CPU that can issue 4 instructions per clock.
So I expect cycles per iteration is about the same for both loops, just bottlenecked on store-forwarding latency. But the bidirectional loop only has to run half the number of iterations.
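The same effect doesn't require scanning backwards; a forward-only variant with two independent index chains and two counters (hypothetical, just to illustrate the unrolling) behaves the same way:

/* Hypothetical forward-only equivalent of bscan: the chains through i
 * and j are independent, so out-of-order exec can overlap them. */
int fscan(const char *source, size_t sourceLen)
{
    int m = (int)(sourceLen / 2);
    int i = 0, j = m;
    int c0 = 0, c1 = 0;
    while (i < m)
    {
        c0 += (source[i] == '.');   /* chain 1 */
        c1 += (source[j] == '.');   /* chain 2 */
        i++;
        j++;
    }
    if ((size_t)j < sourceLen)      /* odd length: one element left over */
        c0 += (source[j] == '.');
    return c0 + c1;
}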
In fact, the loop with more work might actually run faster per iteration, due to some interesting effects on Sandybridge-family CPUs where not trying to reload too soon can actually get store-forwarding to work with less latency. See Adding a redundant assignment speeds up code when compiled without optimization. So it could be more than twice as fast in a debug build.
Of course, if you actually want to count matches, these if() c++ statements hopefully get vectorized at -O3 to SIMD compares. Or do it manually with AVX2 or SSE2 intrinsics: How to count character occurrences using SIMD. Either way it just bottlenecks on memory bandwidth unless the data is already hot in L1d or maybe L2 cache.
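A minimal SSE2 sketch of the manual approach (a simple compare/movemask/popcount formulation; __builtin_popcount assumes GCC or clang, and the linked Q&A covers faster ways to accumulate):

#include <emmintrin.h>      /* SSE2 intrinsics */

/* Count '.' bytes 16 at a time, with a scalar loop for the tail. */
int count_dots_sse2(const char *src, size_t len)
{
    const __m128i dot = _mm_set1_epi8('.');
    int c = 0;
    size_t i = 0;
    for (; i + 16 <= len; i += 16)
    {
        __m128i v  = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i eq = _mm_cmpeq_epi8(v, dot);            /* 0xFF where equal */
        c += __builtin_popcount((unsigned)_mm_movemask_epi8(eq));
    }
    for (; i < len; i++)        /* remaining len % 16 bytes */
        c += (src[i] == '.');
    return c;
}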
In the version of the code you measured, the if bodies are empty, so there's no actual branch in the asm for the CPU to mispredict even if there's a mix of '.' and other characters. In the version with a counter, if the characters are all '.', an unoptimized build of the bidirectional version will slow down to the same speed as the other version, bottlenecked on the latency of incrementing c twice per loop iteration.
I'm assuming that branch mispredictions either weren't significant, hurt both equally, or didn't exist because you measured the code you claimed you did, which has empty if() bodies.
GCC10.4 -O0 -mabi=ms on Godbolt should produce similar code-gen to what you got from MinGW GCC10.4. Note that there are no asm instructions corresponding to the if(source[i] == '.'); source lines, just the loop-counter increment and compare logic. So fortunately that removes all memory access to the char array, removing any possible page-fault overhead in case you didn't read or write it before the timed region.
With optimization enabled, unrolling with multiple accumulators is usually only important for floating-point stuff like summing an array or a dot product, since FP math instructions typically have higher latency, and integer math is associative so compilers can just unroll loop counters for you. But artificially creating latency bottlenecks for integer loop counters creates the same effects as in questions like Latency bounds and throughput bounds for processors for operations that must occur in sequence.
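For contrast, a sketch of the optimized-build case where multiple accumulators genuinely matter: summing an array of doubles, where each extra accumulator hides part of the FP-add latency (illustrative, not from the question):

/* Two FP accumulators: each chain only needs a new add every other
 * element, so the adds can overlap instead of serializing on one sum. */
double sum2(const double *a, size_t n)
{
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 2 <= n; i += 2)
    {
        s0 += a[i];         /* chain 0 */
        s1 += a[i + 1];     /* chain 1 */
    }
    if (i < n)
        s0 += a[i];         /* odd element */
    return s0 + s1;
}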
Further reading:
- What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
- Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)
- Latency bounds and throughput bounds for processors for operations that must occur in sequence
- How many CPU cycles are needed for each assembly instruction? (that's not how performance works; there isn't a single number you can add up across instructions to get total cost. There are 3 dimensions: front-end cost, latency, and back-end port cost. You have to consider each of those 3 as possible bottlenecks.)
- https://agner.org/optimize/ - how CPUs work. Also the x86 tag wiki
- C loop optimization help for final assignment (with compiler optimization disabled) - why it's silly not to enable optimization, and a big distortion. But if you insist on compiling that way, some things that make a difference only for that.
- Idiomatic way of performance evaluation? - benchmarking pitfalls including this one (unoptimized builds), CPU warmup, and avoiding page-faults.