[image: asm diff of the code]
I don't really understand what optimization this is; all I know is that it's really fast, and I've tried many options from the manual to no avail. Can anyone explain in detail what optimization this is, and what option makes GCC generate the same asm as ICC, or better?
CodePudding user response:
Compilers generate code that is functionally equivalent; there is no reason to assume there is one perfect output for a target from one input. If two compilers produced the same output for a reasonably sized function/project, then one is derived from the other, legally or not.
In general there is no reason to assume any two compilers generate the same output, or that any two versions of the same compiler do. Amplify that by command-line options that change the output of the compiler.
In general the expectation is that for your code one compiler may produce "better" code, depending on the definition of better: smaller, or faster on one particular computer, operating system, etc.
gcc is a generic compiler: it does "okay" for each target but is not great for any target. Tools designed from scratch to aim at one target/system CAN (but do not necessarily) do better.
And then there can be some cheating. Take some code like the possibly intentionally horribly written dhrystone and whetstone benchmarks. When compiler X, which you have already paid five figures for or are evaluating, does not produce code for dhrystone that is as fast as the free tool: "Oh, sure, try this command-line option for dhrystone." Hmm, okay, that does work much better (been there, seen this).
gcc has been getting worse since versions 3.x.x/4.x.x, for various reasons I assume. I assume part of it is that the folks who truly worked at this level are dying off and being replaced by folks without the low-level experience and skills; processors are also getting more complicated, and gcc and the others have more targets. But the number of optimizations that older versions got and newer versions miss is increasing, and the size of the binaries from the same source with the same settings is increasing, by a significant amount for a decent-sized project, not just a tiny bit.
This is not a case of: I need to get the wheel on the car, I have a choice of tools to tighten the nut, and the result is the same independent of tool.
There is no reason to expect any two compilers to generate the same output. Even if both tools emit assembly language, and even if they emitted the same sequence of instructions and data, there is no reason to assume the assembly itself would match: different label names, spacing, function ordering, and other syntax differences would make a diff difficult to deal with.
CodePudding user response:
I'm not sure there is an optimization option to make GCC do this optimization. Don't write loops that redo the same work 100k times if you don't want your program to spend time doing that.
Defeating benchmark loops can make compilers look good on benchmarks, but AFAIK it is often not useful in real-world code, where something else happens between runs of the loop you want optimized.
ICC is defeating your benchmark repeat-loop by turning it into this:
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        for (unsigned i = 0; i < 100000; ++i)
            sum += data[c];
}
The first step, swapping the inner and outer loops, is called loop interchange. Making one pass over the array is good for cache locality and enables further optimizations.
Turning for(){ if() } into if(){ for() } else { for() } is called loop unswitching. In this case there is no "else" work to do; the only thing in the loop body was if() sum += ..., so it becomes just an if controlling a repeated-addition loop.
ICC unrolls and vectorizes that sum +=, strangely not just doing it with a multiply. Instead it does 100000 64-bit add operations. ymm0 holds _mm256_set1_epi64x(data[c]) from vpbroadcastq.
This inner loop only runs conditionally; it's worth branching if it's going to save 6250 iterations of this loop. (Only one pass over the array, one branch per element total, not 100k.)
..B1.9: # Preds ..B1.9 ..B1.8
add edx, 16 #20.2
vpaddq ymm4, ymm4, ymm0 #26.5
vpaddq ymm3, ymm3, ymm0 #26.5
vpaddq ymm2, ymm2, ymm0 #26.5
vpaddq ymm1, ymm1, ymm0 #26.5
cmp edx, 100000 #20.2
jb ..B1.9 # Prob 99% #20.2
Every iteration does 16 additions, 4 per instruction, unrolled by 4 into separate accumulators that are reduced to 1 and then hsummed after the loop. Unrolling lets Skylake and later run 3 vpaddq per clock cycle.
By contrast, GCC does multiple passes over the array, vectorizing the inner loop to branchlessly do 8 compares per iteration:
.L85:
vmovdqa ymm4, YMMWORD PTR [rax] # load 8 ints
add rax, 32
vpcmpgtd ymm0, ymm4, ymm3 # signed-compare them against 128
vpand ymm0, ymm0, ymm4 # mask data[c] to 0 or data[c]
vpmovsxdq ymm2, xmm0 # sign-extend to 64-bit
vextracti128 xmm0, ymm0, 0x1
vpaddq ymm1, ymm2, ymm1 # and add
vpmovsxdq ymm0, xmm0 # sign-extend the high half and add it
vpaddq ymm1, ymm0, ymm1 # to the same accumulator
cmp rbx, rax
jne .L85
This is inside a repeat loop that makes multiple passes over the array, and it might bottleneck on 1 shuffle uop per clock (vextracti128 and the two vpmovsxdq are all shuffles), i.e. about 8 elements per 3 clock cycles on Skylake.
So GCC just vectorized the inner if() sum += data[c] like you'd expect, without defeating the repeat loop at all. Clang does something similar, as in Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?