What's the problem
I am benchmarking the following code: for (T& x : v) x = x + x; where T is int.
When compiling with -mavx2, performance fluctuates by about 2x depending on some conditions. This does not reproduce with -msse4.2.
I would like to understand what's happening.
How does the benchmark work
I am using Google Benchmark. It keeps running the loop until it is sufficiently sure about the timing.
The main benchmarking code:
using T = int;
constexpr std::size_t size = 10'000 / sizeof(T);

NOINLINE std::vector<T> const& data()
{
    static std::vector<T> res(size, T{2});
    return res;
}

INLINE void double_elements_bench(benchmark::State& state)
{
    auto v = data();
    for (auto _ : state) {
        for (T& x : v) x = x + x;
        benchmark::DoNotOptimize(v.data());
    }
}
Then I call double_elements_bench from multiple instances of a benchmark driver.
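Roughly, the instances look like this (a simplified sketch, not my exact driver; the wrapper names and registration calls are just illustrative):

#include <benchmark/benchmark.h>

// data() and double_elements_bench() as above.
// Each instance is a separately named function registered with Google Benchmark,
// so the same inlined body ends up at several different code addresses.
void double_elements_0(benchmark::State& state) { double_elements_bench(state); }
void double_elements_1(benchmark::State& state) { double_elements_bench(state); }
void double_elements_2(benchmark::State& state) { double_elements_bench(state); }

BENCHMARK(double_elements_0);
BENCHMARK(double_elements_1);
BENCHMARK(double_elements_2);

BENCHMARK_MAIN();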
Machine, Compiler, Options
- processor: Intel i7-9700K
- compiler: clang ~14, built from trunk.
- options:
-mavx2 --std=c++20 --stdlib=libc++ -DNDEBUG -g -Werror -Wall -Wextra -Wpedantic -Wno-deprecated-copy -O3
I did try aligning all functions to 128 bytes; it had no effect.
Results
When duplicated 2 times I get:
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 105 ns 105 ns 6617708
double_elements_1 105 ns 105 ns 6664185
Vs duplicated 3 times:
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 64.6 ns 64.6 ns 10867663
double_elements_1 64.5 ns 64.5 ns 10855206
double_elements_2 64.5 ns 64.5 ns 10868602
This reproduces on bigger data sizes too.
Perf stats
I looked for counters that I know can be relevant to code alignment: the LSD (which is off on my machine due to some security issue a few years back), the DSB uop cache, and the branch predictor:
LSD.UOPS,idq.dsb_uops,UOPS_ISSUED.ANY,branches,branch-misses
Slow case
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 105 ns 105 ns 6663885
double_elements_1 105 ns 105 ns 6632218
Performance counter stats for './transform_alignment_issue':
0 LSD.UOPS
13,830,353,682 idq.dsb_uops
16,273,127,618 UOPS_ISSUED.ANY
761,742,872 branches
34,107 branch-misses # 0.00% of all branches
1.652348280 seconds time elapsed
1.633691000 seconds user
0.000000000 seconds sys
Fast case
------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------
double_elements_0 64.5 ns 64.5 ns 10861602
double_elements_1 64.5 ns 64.5 ns 10855668
double_elements_2 64.4 ns 64.4 ns 10867987
Performance counter stats for './transform_alignment_issue':
0 LSD.UOPS
32,007,061,910 idq.dsb_uops
37,653,791,549 UOPS_ISSUED.ANY
1,761,491,679 branches
37,165 branch-misses # 0.00% of all branches
2.335982395 seconds time elapsed
2.317019000 seconds user
0.000000000 seconds sys
Both look about the same to me.
UPD
I think this might be the alignment of the data returned from malloc: 0x4f2720 in the fast case and 0x8e9310 in the slow one. So, since clang does not align the accesses, we get unaligned reads/writes. I tested with a transform that aligns its accesses, and it does not seem to show this variation.
Is there a way to confirm it?
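The simplest check I can think of is printing the offset of v.data() within a cache line; a minimal sketch (the 64-byte line size is an assumption about this CPU):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// 0 means the buffer starts on a 64-byte cache-line boundary; 16 or 48 means
// some of the 32-byte AVX loads/stores straddle a line boundary.
std::size_t cache_line_offset(void const* p)
{
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 64);
}

// usage: std::printf("v.data() %% 64 = %zu\n", cache_line_offset(v.data()));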
CodePudding user response:
Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".
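To see where "every other access is a split" comes from, here is a tiny standalone illustration (not part of your benchmark): assume the buffer starts 16 bytes past a 64-byte line boundary, as your slow-case address 0x8e9310 does, and step through it in 32-byte YMM-sized chunks.

#include <cstdio>

int main()
{
    unsigned const misalignment = 16;  // assumed: data % 64 == 16, as in the slow case
    for (unsigned off = 0; off < 8 * 32; off += 32) {
        unsigned first_line = (misalignment + off) / 64;       // line holding the first byte
        unsigned last_line  = (misalignment + off + 31) / 64;  // line holding the last byte
        std::printf("32-byte access at +%3u: %s\n", off,
                    first_line == last_line ? "single line" : "cache-line split");
    }
}

The output alternates "single line" / "cache-line split", i.e. half of the loads and half of the stores pay the split penalty.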
ld_blocks.no_sr counts the number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.
When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)
- https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
- Are load ops deallocated from the RS when they dispatch, complete or some other time? But the load itself can leave the RS earlier.
- How can I accurately benchmark unaligned access speed on x86_64? - general stuff on split load penalties.
The extra latency of split loads, and also the potential replays of uops waiting for those loads' results, is another factor, but those are also fairly direct consequences of misaligned loads. A high count for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.
You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)
On 100 KB I reproduce the issue: 1075 ns vs 1412 ns. On 1 MB I don't think I see it.
Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
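One direct way to confirm it would be to run the same inner loop over a buffer whose starting offset you control, once 64-byte aligned and once deliberately offset by 16 bytes; if only the offset changes the timing, misalignment is the cause. A minimal sketch using C++17 over-aligned allocation (the helper names are mine, not from your code):

#include <cstddef>
#include <cstring>
#include <new>
#include <vector>

using T = int;

// Allocate 64-byte-aligned storage and start the working buffer `offset` bytes
// past that boundary, so the fast/slow layouts can be forced on demand.
T* make_buffer(std::vector<T> const& src, std::size_t offset, void*& raw)
{
    raw = ::operator new(src.size() * sizeof(T) + offset, std::align_val_t{64});
    T* p = reinterpret_cast<T*>(static_cast<unsigned char*>(raw) + offset);
    std::memcpy(p, src.data(), src.size() * sizeof(T));
    return p;
}

void free_buffer(void* raw)
{
    ::operator delete(raw, std::align_val_t{64});
}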
Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
The JCC erratum (How can I mitigate the impact of the Intel jcc erratum on gcc?) can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.