Please consider the following minimal example minimal.cpp
(https://godbolt.org/z/x7dYes91M).
#include <immintrin.h>
#include <algorithm>
#include <ctime>
#include <iostream>
#include <numeric>
#include <vector>
#define NUMBER_OF_TUPLES 134'217'728UL
void transform(std::vector<int64_t>* input, std::vector<double>* output, size_t batch_size) {
  for (size_t startOfBatch = 0; startOfBatch < NUMBER_OF_TUPLES; startOfBatch += batch_size) {
    size_t endOfBatch = std::min(startOfBatch + batch_size, NUMBER_OF_TUPLES);
    for (size_t idx = startOfBatch; idx < endOfBatch;) {
      if (endOfBatch - idx >= 8) {
        // At least 8 tuples left in this batch: convert them with AVX-512.
        auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
        auto _converted = _mm512_cvtepu64_pd(_loaded);
        _mm512_storeu_epi64(&(*output)[idx], _converted);
        idx += 8;
      } else {
        // Scalar fallback for the tail of the batch.
        (*output)[idx] = static_cast<double>((*input)[idx]);
        idx++;
      }
    }
    // Compiler barrier: force the output of each batch to be materialized.
    asm volatile("" : : "r,m"(output->data()) : "memory");
  }
}
void do_benchmark(size_t batch_size) {
std::vector<int64_t> input(NUMBER_OF_TUPLES);
std::vector<double> output(NUMBER_OF_TUPLES);
std::iota(input.begin(), input.end(), 0);
auto t = std::clock();
transform(&input, &output, batch_size);
auto elapsed = std::clock() - t;
std::cout << "Elapsed time for a batch size of " << batch_size << ": " << elapsed << std::endl;
}
int main() {
do_benchmark(7UL);
do_benchmark(8UL);
do_benchmark(9UL);
}
It transforms the input array of int64_t to the output array of double in batches of a given batch_size.
We have inserted the following AVX-512 intrinsics for the case that there are still at least 8 tuples left in the input, to process them all at once and thereby increase performance:
auto _loaded = _mm512_loadu_epi64(&(*input)[idx]);
auto _converted = _mm512_cvtepu64_pd(_loaded);
_mm512_storeu_epi64(&(*output)[idx], _converted);
Otherwise, we fall back to the scalar implementation.
To prevent the compiler from collapsing the two loops, we use the asm volatile("" : : "r,m"(output->data()) : "memory") statement to make sure that the output data is flushed after each batch.
It is compiled and executed on an Intel(R) Xeon(R) Gold 5220R CPU using
clang -Wall -Wextra -march=cascadelake -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mavx512vnni -O3 minimal.cpp -o minimal
Executing the code, however, results in the following surprising output
Elapsed time for a batch size of 7: 204007
Elapsed time for a batch size of 8: 237600
Elapsed time for a batch size of 9: 209838
It shows that, for some reason, the code is significantly slower with a batch_size of 8, while a batch_size of 7 or 9 is significantly faster.
This is surprising to me, since a batch size of 8 should be the perfect configuration: it only ever has to use the AVX-512 instructions and can always process exactly 64 bytes at a time. Why is this case so much slower, though?
Edit: Added perf results for cache misses.
Batch Size 7
Performance counter stats for process id '653468':
6,894,467,363 L1-dcache-loads (44.43%)
1,647,244,371 L1-dcache-load-misses # 23.89% of all L1-dcache accesses (44.43%)
7,548,224,648 L1-dcache-stores (44.43%)
6,726,036 L2-loads (44.43%)
3,766,847 L2-loads-misses # 56.61% of all LL-cache accesses (44.46%)
6,171,407 L2-loads-stores (44.45%)
6,764,242 LLC-loads (44.46%)
4,548,106 LLC-loads-misses # 68.35% of all LL-cache accesses (44.46%)
6,954,088 LLC-loads-stores (44.45%)
Batch Size 8
Performance counter stats for process id '654880':
1,009,889,247 L1-dcache-loads (44.41%)
1,413,152,123 L1-dcache-load-misses # 139.93% of all L1-dcache accesses (44.45%)
1,528,453,525 L1-dcache-stores (44.48%)
158,053,929 L2-loads (44.51%)
155,407,942 L2-loads-misses # 98.18% of all LL-cache accesses (44.50%)
158,335,431 L2-loads-stores (44.46%)
158,349,901 LLC-loads (44.42%)
155,902,630 LLC-loads-misses # 98.49% of all LL-cache accesses (44.39%)
158,447,095 LLC-loads-stores (44.39%)
11.011153400 seconds time elapsed
Batch Size 9
Performance counter stats for process id '656032':
1,766,679,021 L1-dcache-loads (44.38%)
1,600,639,108 L1-dcache-load-misses # 90.60% of all L1-dcache accesses (44.42%)
2,233,035,727 L1-dcache-stores (44.46%)
138,071,488 L2-loads (44.49%)
136,132,162 L2-loads-misses # 98.51% of all LL-cache accesses (44.52%)
138,020,805 L2-loads-stores (44.49%)
138,522,404 LLC-loads (44.45%)
135,902,197 LLC-loads-misses # 98.35% of all LL-cache accesses (44.42%)
138,122,462 LLC-loads-stores (44.38%)
Answer:
Your arrays are large and not aligned by 64, since you let std::vector<> allocate them. Using 64-byte vectors, every misaligned load will span a boundary between two 64-byte cache lines. (And you'll trip over the page-split at the end of every 4k page, although that's rare enough in sequential access to not explain this.) This is unlike 32-byte load/store, where only every other vector will be a cache-line split.
(Glibc's malloc / new for large allocations typically keeps the first 16 bytes for bookkeeping, so the address it returns is 16 bytes past the start of a page, always misaligned by 32 and 64, always creating the worst case.)
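If you want to confirm the misalignment quickly, here is a sketch of a check you could drop into do_benchmark right after the vectors are constructed (the check itself is my addition, not part of the original code):

#include <cstdint>   // std::uintptr_t; add to the existing includes

// A non-zero value means every 64-byte vector access on that array
// spans two cache lines.
std::cout << "input  misalignment mod 64: "
          << (reinterpret_cast<std::uintptr_t>(input.data()) % 64) << '\n';
std::cout << "output misalignment mod 64: "
          << (reinterpret_cast<std::uintptr_t>(output.data()) % 64) << '\n';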
512-bit vectors (on Skylake/Cascade Lake at least) are known to slow down with misaligned 64-byte loads/stores (more than AVX1/2 code with misaligned 32-byte ops). Even when arrays are so large that you'd expect it to just bottleneck on DRAM bandwidth and have time to sort out any misalignment penalties inside the core while waiting for cache lines to arrive.
Single-core DRAM bandwidth on a big Xeon is pretty low vs. a "client" CPU, especially for Skylake-family. (The mesh interconnect was new in that generation, and it's lower than in Broadwell Xeon. Apparently Ice Lake Xeon made a big improvement to max per-core DRAM bandwidth.) So even scalar code is able to saturate memory bandwidth.
(Or perhaps batch=7 was auto-vectorizing with -mprefer-vector-width=256 after fully unrolling the inner loop? No, it wasn't even inlining your loop, and not unswitching that loop into while(full vector left) vector; / while(any left) scalar;, so you have pretty nasty asm that does a lot of branching for each vector and scalar.)
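For reference, the unswitched shape I mean would look roughly like this at the source level (a sketch, not what the compiler actually generated here):

size_t idx = startOfBatch;
// Main loop: only full 8-element vectors, no per-iteration choice
// between vector and scalar.
for (; idx + 8 <= endOfBatch; idx += 8) {
    auto loaded    = _mm512_loadu_epi64(&(*input)[idx]);
    auto converted = _mm512_cvtepu64_pd(loaded);
    _mm512_storeu_epi64(&(*output)[idx], converted);
}
// Cleanup loop: at most 7 scalar iterations per batch.
for (; idx < endOfBatch; ++idx) {
    (*output)[idx] = static_cast<double>((*input)[idx]);
}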
But for some reason code that only ever uses 64-byte loads and stores can't max out one core's bandwidth. But your experiment shows that even a pattern of 1 vector 1 scalar can help (batch=9), assuming that compiled to match the source.
I don't know why; maybe the load execution units run out of split buffers for handling loads that need data from two cache lines (perf event ld_blocks.no_sr). But the scalar loads don't need a split buffer entry because they're always naturally aligned (to 8 bytes). So they can execute if dispatched, maybe triggering fetch of cache lines sooner.
(HW prefetch doesn't work across 4k page boundaries where physical memory might be discontiguous; the L2 streamer only sees physical addresses. So a demand load into the next 4k page can get HW prefetch started early enough to max out DRAM bandwidth to L2, where maybe that wasn't happening if later split vector loads weren't happening. 4k boundaries apply even if using 2M transparent hugepages; the hardware prefetcher doesn't get told that the fetches are part of a contiguous hugepage.)
Batch=9 also makes one of every eight vectors aligned, which might help slightly.
These are wild guesses about microarchitectural causes, not backed up by any performance experiments to test these hypotheses.
Testing with aligned buffers
If you want to at least test whether misalignment is responsible for the whole thing, either look into using a custom allocator for std::vector<int64_t, my_aligned_allocator> and/or std::vector<double, my_aligned_allocator>. (Modern approach to making std::vector allocate aligned memory). This is a good bet for production use, as it then works the same way as std::vector<int64_t>, although the 2nd template parameter makes it not type compatible.
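A minimal sketch of such an allocator, assuming C++17 and using the my_aligned_allocator name from above as a placeholder:

#include <cstddef>
#include <new>

template <class T, std::size_t Alignment = 64>
struct my_aligned_allocator {
    using value_type = T;

    my_aligned_allocator() noexcept = default;
    template <class U>
    my_aligned_allocator(const my_aligned_allocator<U, Alignment>&) noexcept {}

    T* allocate(std::size_t n) {
        // Over-aligned allocation via C++17 aligned operator new.
        return static_cast<T*>(::operator new(n * sizeof(T), std::align_val_t{Alignment}));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, std::align_val_t{Alignment});
    }

    template <class U>
    bool operator==(const my_aligned_allocator<U, Alignment>&) const noexcept { return true; }
    template <class U>
    bool operator!=(const my_aligned_allocator<U, Alignment>&) const noexcept { return false; }
};

// Usage in do_benchmark:
// std::vector<int64_t, my_aligned_allocator<int64_t>> input(NUMBER_OF_TUPLES);
// std::vector<double,  my_aligned_allocator<double>>  output(NUMBER_OF_TUPLES);

(transform's parameter types would have to change to match, which is exactly the type-compatibility caveat mentioned above.)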
For a quick experiment, make them std::vector<__m512i> and/or std::vector<__m512d> and change the loop code. (And compile with at least C++17 so the standard library respects alignof(T).) (This is useful to see whether source or destination misalignment is the critical factor, or both.) For batch=8 you can directly loop over the vectors. In the general case you'll need to static_cast<char*>(src->data()) and do the appropriate pointer math if you want to test this way. GNU C might define the behaviour of pointing a double* into a __m512d because it happens to be defined in terms of double, but there are examples of pointing an int* at a __m256i not working as hoped. For a performance experiment, you can just check the asm and see if it's sane.
(Also you'd want to check that the compiler unrolled that inner loop, not actually branching inside a loop.)
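A sketch of what that quick experiment could look like for the batch=8 case (assuming NUMBER_OF_TUPLES is a multiple of 8, which it is here, so there is no scalar tail):

std::vector<__m512i> input(NUMBER_OF_TUPLES / 8);   // elements are 64-byte aligned
std::vector<__m512d> output(NUMBER_OF_TUPLES / 8);

for (size_t v = 0; v < input.size(); ++v) {
    // Aligned load/store are safe here because each element is a full vector.
    __m512i loaded    = _mm512_load_epi64(&input[v]);
    __m512d converted = _mm512_cvtepu64_pd(loaded);
    _mm512_store_pd(&output[v], converted);
}
asm volatile("" : : "r,m"(output.data()) : "memory");

(Initializing the input and checking the results takes some extra pointer casting, which is where the caveats above about pointing a plain double* or int64_t* into a vector type come in.)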
Or use aligned_alloc to get raw storage instead of std::vector. But then you'd need to write to both arrays yourself to avoid page faults being part of the timed region for the first test, like std::vector's constructor does. (Idiomatic way of performance evaluation?) (std::vector is annoying when you don't want to write memory before your SIMD loop, since using .emplace_back is a pain with SIMD intrinsics. Not to mention that it sucks at growing, unable to use realloc in most C++ implementations to sometimes avoid having to copy.)
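A sketch of the aligned_alloc variant, assuming C++17's std::aligned_alloc and keeping the same initialization the std::vector version gets:

#include <cstdlib>   // std::aligned_alloc, std::free
#include <cstring>   // std::memset

// Both sizes are multiples of 64, as aligned_alloc traditionally requires.
int64_t* in  = static_cast<int64_t*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(int64_t)));
double*  out = static_cast<double*>(std::aligned_alloc(64, NUMBER_OF_TUPLES * sizeof(double)));
std::iota(in, in + NUMBER_OF_TUPLES, int64_t{0});        // init input, like the original
std::memset(out, 0, NUMBER_OF_TUPLES * sizeof(double));  // pre-fault the output pages
// ... time the transform on in/out here ...
std::free(out);
std::free(in);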
Or instead of writing an init loop or memset, do a warm-up pass? Good idea anyway for AVX-512, to make sure the 512-bit execution units are warmed up and the CPU is at a frequency where it's able to run 512-bit FP instructions at the lowish throughput needed. (SIMD instructions lowering CPU frequency)
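That could be as simple as calling the transform once untimed before the timed run, e.g. (a sketch against the question's do_benchmark):

transform(&input, &output, batch_size);   // warm-up: page faults + 512-bit unit/frequency warm-up
auto t = std::clock();
transform(&input, &output, batch_size);   // timed run
auto elapsed = std::clock() - t;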
(Maybe __attribute__((noinline,noipa)) on do_benchmark, although I don't think Clang knows GCC's noipa attribute = no inter-procedural analysis.)
Answer:
Why is this case so significantly slower, though?
You may have run into (L1) cache contention and eviction of cache lines when the size of batches results in a distance in memory between subsequent batches that is a multiple of the critical stride. See Section 9.2 (Cache organization) and Section 9.10 (Cache contentions in large data structures) in Agner Fog's Optimizing Software in C++ [emphasis mine]:
9.2 Cache organization
[...]
Most caches are organized into lines and sets. [...] Reading or writing a variable from address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address 0x2700 to 0x273F into one of the four cache lines from set 0x1C. If the program afterwards reads or writes to any other address in this range then the value is already in the cache so we do not have to wait for another memory access.
Assume that a program reads from address 0x2710 and later reads from addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C. There are only four cache lines in each set. If the cache always chooses the least recently used cache line then the line that covered the address range from 0x2700 to 0x273F will be evicted when we read from 0x4700. Reading again from address 0x2710 will cause a cache miss. But if the program had read from different addresses with different set values then the line containing the address range from 0x2700 to 0x273F would still be in the cache. The problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call this distance the critical stride. Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines. The critical stride can be calculated as
(critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways)
9.10 Cache contentions in large data structures
It is not always possible to access a multidimensional array sequentially. Some applications (e.g. in linear algebra) require other access patterns. This can cause severe delays if the distance between rows in a big matrix happen to be equal to the critical stride [...]
Each cache line has to be reloaded eight times because it is evicted before we need the next element. [...]
Matrix size   Total kilobytes   Time per element
63x63                31              11.6
64x64                32              16.4
65x65                33              11.8
127x127             126              12.2
128x128             128              17.4
129x129             130              14.4
...                 ...               ...
Table 9.1. Time for transposition of different size matrices, clock cycles per element.
The table shows that it takes 40% more time to transpose the matrix when the size of the matrix is a multiple of the level-1 cache size. This is because the critical stride is a multiple of the size of a matrix line. [...]
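For reference, plugging in the 32 KiB, 8-way set-associative L1d cache of a Skylake-family core such as the Gold 5220R, the critical stride would be 32768 B / 8 = 4096 B, i.e. buffers or rows whose distance in memory is a multiple of 4 KiB compete for the same set of L1d cache lines.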