Here it is, my first question 10 years after I started using this website. Most classic questions have been already asked and answered so I never felt the need to create an account until today. So first of all thank you for sharing knowledge and making our lives easier.
Also hello everyone since apparently I can't say hello at the beginning of the message.
My problem
I have 2 adjacent buffers of bytes of identical size (around 20 MB each). I just want to count the differences between them.
My question
How much time this loop should take to run on a 4.8GHz Intel I7 9700K with 3600Mhz RAM ? How do we compute max theoritical speed ?
What I tried
uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
uint64_t diffFound = 0;
for(uint64_t byte = 0; byte < commonSize; byte)
diffFound = static_cast<uint64_t>(buffer[byte] != buffer[byte commonSize]);
return diffFound;
}
It takes 11ms on my PC (9700K 4.8Ghz RAM 3600 Windows 10 Clang 14.0.6 -O3 MinGW ) and I feel it is too slow and that I am missing something.
40MB should take less than 2ms to be read on the CPU (my RAM bandwidth is between 20 and 30GB/s)
I don't know how to count cycles required to execute one iteration (especially because CPUs are superscalar nowadays) so if I assume 1 cycle per op and if I don't mess up my counting, it should be 10 ops per iteration -> 200 million ops -> at 4.8 Ghz with only one execution unit -> 40ms so obviously I am wrong on how to compute the amount of cycles per loop.
Fun fact I tried on linux PopOS GCC 11.2 -O3 and it ran at 4.5ms. Why such a difference ?
Here are the dissassemblies vectorised and scalar produced by clang. I am trying to add GCC but SO keeps crying.
CodePudding user response:
Correct me if I am wrong but the answer seems to be
- -march=native for the win.
- the scalar version of the code was CPU bottlenecked and not RAM bottlenecked
- use uica.uops.info to have an estimate of the cycles per loop
I will try to write my own AVX code to compare.
Details
After an afternoon tikering around with the suggestions. Here is what I found with Clang:
-O1 around 10ms, scalar code
-O3 enables SSE2 and is as slow as O1, maybe poor assembly code
-O3 -march=westmere enables also SSE2 but is faster (7ms)
-O3 -march=native enables AVX -> 2.5ms and we are probably RAM bandwidth limited (close to the theoritical speed)
The scalar 10ms makes sense now because according to that awesome tool uica.uops.info it takes
- 2.35 cycles per loop
- 47millions cycles for the whole comparison (20 million iterations)
- Processor is clocked at 4.8GHz meaning it should take around 9.8ms and it is close to what is measured.
g seems to generate better default code when no flags are added
- O1 11ms
- O2 scalar still but 9ms
- O3 SSE 4.5ms
- O3 -march=westmere 7ms like Clang
- O3 -march=native 3.4ms, slightly slower than clang
CodePudding user response:
Yes, if your data is not hot in cache, even SSE2 should keep up with memory bandwidth. Compare-and-sum of 32 compare results per cycle (from two 32-byte loads) is totally possible if data is hot in L1d cache, or whatever bandwidth outer levels of cache can provide.
If not, the compiler did a bad job. That's unfortunately common for problems like this reducing into a wider variable; compilers don't know good vectorization strategies for summing bytes, especially compare-result bytes that must be 0/-1. They probably widen to 64-bit with pmovsxbq
right away (or even worse if SSE4.1 instructions aren't available).
So even -O3 -march=native
doesn't help much; this is a big missed-optimization; hopefully GCC and clang will learn how to vectorize this kind of loop at some point, summing compare results probably comes up in enough codebases to be worth recognizing that pattern.
The efficient way is to use psadbw
to sum horizontally into qwords. But only after an inner loop does some iterations of vsum -= cmp(p, q)
, subtracting 0 or -1 to increment a counter or not. 8-bit elements can do 255 iterations of that without risk of overflow. And with unrolling for multiple vector accumulators, that's many vectors of 32 bytes each, so you don't have to break out of that inner loop very often.
See How to count character occurrences using SIMD for manually-vectorized AVX2 code. (And one answer has a Godbolt link to an SSE2 version.) Summing the compare results is the same problem as that, but you're loading two vectors to feed pcmpeqb instead of broadcasting one byte outside the loop to find occurrences of a single char.
An answer there has benchmarks that report 28 GB/s for AVX2, 23 GB/s for SSE2, on an i7-6700 Skylake (at only 3.4GHz, maybe they disabled turbo or are just reporting the rated speed. DRAM speed not mentioned.)
I'd expect 2 input streams of data to achieve about the same sustained bandwidth as one.
This is more interesting to optimize if you benchmark repeated passes over smaller arrays that fit in L2 cache, then efficiency of your ALU instructions matters. (The strategy in the answers on that question are pretty good and well tuned for that case.)
Fast counting the number of equal bytes between two arrays is an older Q&A using a worse strategy, not using psadbw
to sum bytes to 64-bit. (But not as bad as GCC/clang, still hsumming as it widens to 32-bit.)
Multiple threads/cores will barely help on a modern desktop, especially at high core clocks like yours. Memory latency is low enough and each core has enough buffers to keep enough requests in flight that it can nearly saturate dual-channel DRAM controllers.
On a big Xeon, that would be very different; you need most of the cores to achieve peak aggregate bandwidth, even for just memcpy or memset so there's zero ALU work, just loads/stores. The higher latency means a single core has much less memory bandwidth available than on a desktop (even in an absolute sense, let alone as a percentage of 6 channels instead of 2). See also Enhanced REP MOVSB for memcpy and Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?