Please consider the following minimal example minimal.cpp
(https://godbolt.org/z/x7dYes91M).
#include <immintrin.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <numeric>
#include <vector>
#define NUMBER_OF_TUPLES 134'217'728UL
void transform(std::vector<int64_t>* input, std::vector<double>* output, size_t batch_size) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += batch_size) {
    size_t endOfBatch = std::min(startOfBatch + batch_size, NUMBER_OF_TUPLES);
    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      if (endOfBatch - idx >= 8) {
        // At least 8 tuples left in this batch: convert them with AVX-512.
        auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);
        _mm512_storeu_epi64(&(*output)[idx], _converted);
        idx += 8;
      } else {
        // Scalar fallback for the tail of the batch.
        (*output)[idx] = static_cast<double>((*input)[idx]);
        idx++;
      }
    }
    // Compiler barrier: force the output of each batch to be materialized.
    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}
void do_benchmark(size_t batch_size) {
std::vector<int64_t> input(NUMBER_OF_TUPLES);
std::vector<double> output(NUMBER_OF_TUPLES);
std::iota(input.begin(), input.end(), 0);
auto t = std::clock();
transform(&input, &output, batch_size);
auto elapsed = std::clock() - t;
std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
}
int main() {
do_benchmark(7UL);
do_benchmark(8UL);
do_benchmark(9UL);
}
It transforms the input array of int64_t to the output array of double in batches of a given batch_size.
We have inserted the following AVX-512 intrinsics for the case that there are still at least 8 tuples left in the input, to process them all at once and thereby increase performance:
auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&(*output)[idx], _converted);
Otherwise, we fall back to the scalar implementation.
To prevent the compiler from collapsing the two loops, we use the asm volatile("" : : "r,m"(output->data()) : "memory") statement to make sure that the output data is flushed after each batch.
It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using
clang -Wall -Wextra -march=cascadelake -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vnni -O3 minimal.cpp -o minimal
Executing the code, however, results in the following surprising output
Elapsed time for a batch size of 7: 204007
Elapsed time for a batch size of 8: 237600
Elapsed time for a batch size of 9: 209838
It shows that, for some reason, the code is significantly slower with a batch_size of 8, while a batch_size of 7 or 9 is significantly faster.
This is surprising to me, since a batch size of 8 should be the perfect configuration: it only ever has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so much slower, though?
Edit: Added perf results for cache misses.
Batch Size 7
Performance counter stats for process id '653468':
6,894,467,363 L1-dcache-loads (44.43%)
1,647,244,371 L1-dcache-load-misses # 23.89% of all L1-dcache accesses (44.43%)
7,548,224,648 L1-dcache-stores (44.43%)
6,726,036 L2-loads (44.43%)
3,766,847 L2-loads-misses # 56.61% of all LL-cache accesses (44.46%)
6,171,407 L2-loads-stores (44.45%)
6,764,242 LLC-loads (44.46%)
4,548,106 LLC-loads-misses # 68.35% of all LL-cache accesses (44.46%)
6,954,088 LLC-loads-stores (44.45%)
Batch Size 8
Performance counter stats for process id '654880':
1,009,889,247 L1-dcache-loads (44.41%)
1,413,152,123 L1-dcache-load-misses # 139.93% of all L1-dcache accesses (44.45%)
1,528,453,525 L1-dcache-stores (44.48%)
158,053,929 L2-loads (44.51%)
155,407,942 L2-loads-misses # 98.18% of all LL-cache accesses (44.50%)
158,335,431 L2-loads-stores (44.46%)
158,349,901 LLC-loads (44.42%)
155,902,630 LLC-loads-misses # 98.49% of all LL-cache accesses (44.39%)
158,447,095 LLC-loads-stores (44.39%)
11.011153400 seconds time elapsed
Batch Size 9
Performance counter stats for process id '656032':
1,766,679,021 L1-dcache-loads (44.38%)
1,600,639,108 L1-dcache-load-misses # 90.60% of all L1-dcache accesses (44.42%)
2,233,035,727 L1-dcache-stores (44.46%)
138,071,488 L2-loads (44.49%)
136,132,162 L2-loads-misses # 98.51% of all LL-cache accesses (44.52%)
138,020,805 L2-loads-stores (44.49%)
138,522,404 LLC-loads (44.45%)
135,902,197 LLC-loads-misses # 98.35% of all LL-cache accesses (44.42%)
138,122,462 LLC-loads-stores (44.38%)
Answer:
Your arrays are large and not aligned by 64, since you let std::vector<> allocate them. Using 64-byte vectors, every misaligned load will span a boundary between two 64-byte cache lines. (And you'll trip over the page-split at the end of every 4k page, although that's rare enough in sequential access to not explain this.) This is unlike 32-byte load/store, where only every other vector will be a cache-line split.
(Glibc's malloc / new for large allocations typically keeps the first 16 bytes for bookkeeping, so the address it returns is 16 bytes past the start of a page, always misaligned by 32 and 64, always creating the worst case.)
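If you want to confirm the misalignment quickly, here is a sketch of a check you could drop into do_benchmark right after the vectors are constructed (the check itself is my addition, not part of the original code):

#include <cstdint>   // std::uintptr_t; add to the existing includes

// A non-zero value means every 64-byte vector access on that array
// spans two cache lines.
std::cout << "input  misalignment mod 64: "
          << (reinterpret_cast<std::uintptr_t>(input.data()) % 64) << '\n';
std::cout << "output misalignment mod 64: "
          << (reinterpret_cast<std::uintptr_t>(output.data()) % 64) << '\n';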
512-bit vectors (on Skylake/Cascade Lake at least) are known to slow down with misaligned 64-byte loads/stores (more than AVX1/2 code with misaligned 32-byte ops). Even when arrays are so large that you'd expect it to just bottleneck on DRAM bandwidth and have time to sort out any misalignment penalties inside the core while waiting for cache lines to arrive.
Single-core DRAM bandwidth on a big Xeon is pretty low vs. a "client" CPU, especially for Skylake-family. (The mesh interconnect was new in that generation, and it's lower than in Broadwell Xeon. Apparently Ice Lake Xeon made a big improvement to max per-core DRAM bandwidth.) So even scalar code is able to saturate memory bandwidth.
(Or perhaps batch=7 was auto-vectorizing with -mprefer-vector-width=256 after fully unrolling the inner loop? No, it wasn't even inlining your loop, and not unswitching that loop into while(full vector left) vector; / while(any left) scalar;, so you have pretty nasty asm that does a lot of branching for each vector and scalar.)
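For reference, the unswitched shape I mean would look roughly like this at the source level (a sketch, not what the compiler actually generated here):

size_t idx = startOfBatch;
// Main loop: only full 8-element vectors, no per-iteration choice
// between vector and scalar.
for (; idx + 8 <= endOfBatch; idx += 8) {
    auto loaded    = _mm512_loadu_epi64(&(*input)[idx]);
    auto converted = _mm512_cvtepu64_pd(loaded);
    _mm512_storeu_epi64(&(*output)[idx], converted);
}
// Cleanup loop: at most 7 scalar iterations per batch.
for (; idx < endOfBatch; ++idx) {
    (*output)[idx] = static_cast<double>((*input)[idx]);
}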
But for some reason code that only ever uses 64-byte loads and stores can't max out one core's bandwidth. But your experiment shows that even a pattern of 1 vector 1 scalar can help (batch=9), assuming that compiled to match the source.
I don't know why; maybe the load execution units run out of split buffers for handling loads that need data from two cache lines (perf event ld_blocks.no_sr). But the scalar loads don't need a split buffer entry because they're always naturally aligned (to 8 bytes). So they can execute if dispatched, maybe triggering fetch of cache lines sooner.
(HW prefetch doesn't work across 4k page boundaries where physical memory might be discontiguous; the L2 streamer only sees physical addresses. So a demand load into the next 4k page can get HW prefetch started early enough to max out DRAM bandwidth to L2, where maybe that wasn't happening if later split vector loads weren't happening. 4k boundaries apply even if using 2M transparent hugepages; the hardware prefetcher doesn't get told that the fetches are part of a contiguous hugepage.)
Batch=9 also makes one of every eight vectors aligned, which might help slightly.
These are wild guesses about microarchitectural causes, not backed up by any performance experiments to test these hypotheses.
Testing with aligned buffers
If you want to at least test whether misalignment is responsible for the whole thing, either look into using a custom allocator for std::vector<int64_t, my_aligned_allocator> and/or std::vector<double, my_aligned_allocator>. (Modern approach to making std::vector allocate aligned memory). This is a good bet for production use, as it then works the same way as std::vector<int64_t>, although the 2nd template parameter makes it not type compatible.
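A minimal sketch of such an allocator, assuming C++17 and using the my_aligned_allocator name from above as a placeholder:

#include <cstddef>
#include <new>

template <class T, std::size_t Alignment = 64>
struct my_aligned_allocator {
    using value_type = T;

    my_aligned_allocator() noexcept = default;
    template <class U>
    my_aligned_allocator(const my_aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // Over-aligned allocation via C++17 aligned operator new.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }

    template <class U>
    bool operator==(const my_aligned_allocator<U, Alignment>&) const noexcept { return true; }
    template <class U>
    bool operator!=(const my_aligned_allocator<U, Alignment>&) const noexcept { return false; }
};

// Usage in do_benchmark:
// std::vector<int64_t, my_aligned_allocator<int64_t>> input(NUMBER_OF_TUPLES);
// std::vector<double,  my_aligned_allocator<double>>  output(NUMBER_OF_TUPLES);

(transform's parameter types would have to change to match, which is exactly the type-compatibility caveat mentioned above.)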
For a quick experiment, make them std::vector<__m512i> and/or std::vector<__m512d> and change the loop code. (And compile with at least C++17 so the standard library respects alignof(T).) (This is useful to see whether source or destination misalignment is the critical factor, or both.) For batch=8 you can directly loop over the vectors. In the general case you'll need to static_cast<char*>(src->data()) and do the appropriate pointer math if you want to test this way. GNU C might define the behaviour of pointing a double* into a __m512d because it happens to be defined in terms of double, but there are examples of pointing an int* at a __m256i not working as hoped. For a performance experiment, you can just check the asm and see if it's sane.
(Also you'd want to check that the compiler unrolled that inner loop, not actually branching inside a loop.)
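A sketch of what that quick experiment could look like for the batch=8 case (assuming NUMBER_OF_TUPLES is a multiple of 8, which it is here, so there is no scalar tail):

std::vector<__m512i> input(NUMBER_OF_TUPLES / 8);   // elements are 64-byte aligned
std::vector<__m512d> output(NUMBER_OF_TUPLES / 8);

for (size_t v = 0; v < input.size(); ++v) {
    // Aligned load/store are safe here because each element is a full vector.
    __m512i loaded    = _mm512_load_epi64(&input[v]);
    __m512d converted = _mm512_cvtepu64_pd(loaded);
    _mm512_store_pd(&output[v], converted);
}
asm volatile("" : : "r,m"(output.data()) : "memory");

(Initializing the input and checking the results takes some extra pointer casting, which is where the caveats above about pointing a plain double* or int64_t* into a vector type come in.)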
Or use aligned_alloc to get raw storage instead of std::vector. But then you'd need to write to both arrays yourself to avoid page faults being part of the timed region for the first test, like std::vector's constructor does. (Idiomatic way of performance evaluation?) (std::vector is annoying when you don't want to write memory before your SIMD loop, since using .emplace_back is a pain with SIMD intrinsics. Not to mention that it sucks at growing, unable to use realloc in most C++ implementations to sometimes avoid having to copy.)
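A sketch of the aligned_alloc variant, assuming C++17's std::aligned_alloc and keeping the same initialization the std::vector version gets:

#include <cstdlib>   // std::aligned_alloc, std::free
#include <cstring>   // std::memset

// Both sizes are multiples of 64, as aligned_alloc traditionally requires.
int64_t* in  = static_cast<int64_t*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t)));
double*  out = static_cast<double*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double)));
std::iota(in, in + NUMBER_OF_TUPLES, int64_t{0});        // init input, like the original
std::memset(out, 0, NUMBER_OF_TUPLES * sizeof(double));  // pre-fault the output pages
// ... time the transform on in/out here ...
std::free(out);
std::free(in);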
Or instead of writing an init loop or memset, do a warm-up pass? Good idea anyway for AVX-512, to make sure the 512-bit execution units are warmed up and the CPU is at a frequency where it's able to run 512-bit FP instructions at the lowish throughput needed. (SIMD instructions lowering CPU frequency)
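That could be as simple as calling the transform once untimed before the timed run, e.g. (a sketch against the question's do_benchmark):

transform(&input, &output, batch_size);   // warm-up: page faults + 512-bit unit/frequency warm-up
auto t = std::clock();
transform(&input, &output, batch_size);   // timed run
auto elapsed = std::clock() - t;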
(Maybe __attribute__((noinline,noipa)) on do_benchmark, although I don't think Clang knows GCC's noipa attribute = no inter-procedural analysis.)
Answer:
Why is this case so significantly slower, though?
You may have run into (L1) cache contention and eviction of cache lines when the size of batches results in a distance in memory between subsequent batches that is a multiple of the critical stride. See Section 9.2 (Cache organization) and Section 9.10 (Cache contentions in large data structures) in Agner Fog's Optimizing Software in C++ [emphasis mine]:
9.2 Cache organization
[...]
Most caches are organized into lines and sets. [...] Reading or writing a variable from address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address 0x2700 to 0x273F into one of the four cache lines from set 0x1C. If the program afterwards reads or writes to any other address in this range then the value is already in the cache so we do not have to wait for another memory access.
Assume that a program reads from address 0x2710 and later reads from addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C. There are only four cache lines in each set. If the cache always chooses the least recently used cache line then the line that covered the address range from 0x2700 to 0x273F will be evicted when we read from 0x4700. Reading again from address 0x2710 will cause a cache miss. But if the program had read from different addresses with different set values then the line containing the address range from 0x2700 to 0x273F would still be in the cache. The problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call this distance the critical stride. Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines. The critical stride can be calculated as
(critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways)
9.10 Cache contentions in large data structures
It is not always possible to access a multidimensional array sequentially. Some applications (e.g. in linear algebra) require other access patterns. This can cause severe delays if the distance between rows in a big matrix happen to be equal to the critical stride [...]
Each cache line has to be reloaded eight times because it is evicted before we need the next element. [...]
Matrix size   Total kilobytes   Time per element
63x63                31              11.6
64x64                32              16.4
65x65                33              11.8
127x127             126              12.2
128x128             128              17.4
129x129             130              14.4
...                 ...               ...
Table 9.1. Time for transposition of different size matrices, clock cycles per element.
The table shows that it takes 40% more time to transpose the matrix when the size of the matrix is a multiple of the level-1 cache size. This is because the critical stride is a multiple of the size of a matrix line. [...]
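For reference, plugging in the 32 KiB, 8-way set-associative L1d cache of a Skylake-family core such as the Gold 5220R, the critical stride would be 32768 B / 8 = 4096 B, i.e. buffers or rows whose distance in memory is a multiple of 4 KiB compete for the same set of L1d cache lines.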