I'm trying to make sure gcc vectorizes my loops. It turns out that with -march=znver1 (or -march=native) gcc skips some loops even though they can be vectorized. Why does this happen?
In this code, the second loop, which multiplies each element by a scalar, is not vectorised:
#include <stdio.h>
#include <inttypes.h>
int main() {
    const size_t N = 1000;
    uint64_t arr[N];
    for (size_t i = 0; i < N; i++)
        arr[i] = 1;
    for (size_t i = 0; i < N; i++)
        arr[i] *= 5;
    for (size_t i = 0; i < N; i++)
        printf("%lu\n", arr[i]); // use the array so that it is not optimized away
}
gcc -O3 -fopt-info-vec-all -mavx2 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: optimized: loop vectorized using 32 byte vectors
main.cpp:7:26: optimized: loop vectorized using 32 byte vectors
main.cpp:4:5: note: vectorized 2 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V4DI
main.cpp:15:1: note: ***** Skipping vector mode V32QI, which would repeat the analysis for V4DI
gcc -O3 -fopt-info-vec-all -march=znver1 main.c:
main.cpp:13:26: missed: couldn't vectorize loop
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:10:26: missed: couldn't vectorize loop
main.cpp:10:26: missed: not vectorized: unsupported data-type
main.cpp:7:26: optimized: loop vectorized using 16 byte vectors
main.cpp:4:5: note: vectorized 1 loops in function.
main.cpp:14:15: missed: statement clobbers memory: printf ("%lu\n", _3);
main.cpp:15:1: note: ***** Analysis failed with vector mode V2DI
main.cpp:15:1: note: ***** Skipping vector mode V16QI, which would repeat the analysis for V2DI
-march=znver1 includes -mavx2, so I think gcc chooses not to vectorise it for some reason:
~ $ gcc -march=znver1 -Q --help=target
The following options are target specific:
-m128bit-long-double [enabled]
-m16 [disabled]
-m32 [disabled]
-m3dnow [disabled]
-m3dnowa [disabled]
-m64 [enabled]
-m80387 [enabled]
-m8bit-idiv [disabled]
-m96bit-long-double [disabled]
-mabi= sysv
-mabm [enabled]
-maccumulate-outgoing-args [disabled]
-maddress-mode= long
-madx [enabled]
-maes [enabled]
-malign-data= compat
-malign-double [disabled]
-malign-functions= 0
-malign-jumps= 0
-malign-loops= 0
-malign-stringops [enabled]
-mamx-bf16 [disabled]
-mamx-int8 [disabled]
-mamx-tile [disabled]
-mandroid [disabled]
-march= znver1
-masm= att
-mavx [enabled]
-mavx2 [enabled]
-mavx256-split-unaligned-load [disabled]
-mavx256-split-unaligned-store [enabled]
-mavx5124fmaps [disabled]
-mavx5124vnniw [disabled]
-mavx512bf16 [disabled]
-mavx512bitalg [disabled]
-mavx512bw [disabled]
-mavx512cd [disabled]
-mavx512dq [disabled]
-mavx512er [disabled]
-mavx512f [disabled]
-mavx512ifma [disabled]
-mavx512pf [disabled]
-mavx512vbmi [disabled]
-mavx512vbmi2 [disabled]
-mavx512vl [disabled]
-mavx512vnni [disabled]
-mavx512vp2intersect [disabled]
-mavx512vpopcntdq [disabled]
-mavxvnni [disabled]
-mbionic [disabled]
-mbmi [enabled]
-mbmi2 [enabled]
-mbranch-cost=<0,5> 3
-mcall-ms2sysv-xlogues [disabled]
-mcet-switch [disabled]
-mcld [disabled]
-mcldemote [disabled]
-mclflushopt [enabled]
-mclwb [disabled]
-mclzero [enabled]
-mcmodel= [default]
-mcpu=
-mcrc32 [disabled]
-mcx16 [enabled]
-mdispatch-scheduler [disabled]
-mdump-tune-features [disabled]
-menqcmd [disabled]
-mf16c [enabled]
-mfancy-math-387 [enabled]
-mfentry [disabled]
-mfentry-name=
-mfentry-section=
-mfma [enabled]
-mfma4 [disabled]
-mforce-drap [disabled]
-mforce-indirect-call [disabled]
-mfp-ret-in-387 [enabled]
-mfpmath= sse
-mfsgsbase [enabled]
-mfunction-return= keep
-mfused-madd -ffp-contract=fast
-mfxsr [enabled]
-mgeneral-regs-only [disabled]
-mgfni [disabled]
-mglibc [enabled]
-mhard-float [enabled]
-mhle [disabled]
-mhreset [disabled]
-miamcu [disabled]
-mieee-fp [enabled]
-mincoming-stack-boundary= 0
-mindirect-branch-register [disabled]
-mindirect-branch= keep
-minline-all-stringops [disabled]
-minline-stringops-dynamically [disabled]
-minstrument-return= none
-mintel-syntax -masm=intel
-mkl [disabled]
-mlarge-data-threshold=<number> 65536
-mlong-double-128 [disabled]
-mlong-double-64 [disabled]
-mlong-double-80 [enabled]
-mlwp [disabled]
-mlzcnt [enabled]
-mmanual-endbr [disabled]
-mmemcpy-strategy=
-mmemset-strategy=
-mmitigate-rop [disabled]
-mmmx [enabled]
-mmovbe [enabled]
-mmovdir64b [disabled]
-mmovdiri [disabled]
-mmpx [disabled]
-mms-bitfields [disabled]
-mmusl [disabled]
-mmwaitx [enabled]
-mneeded [disabled]
-mno-align-stringops [disabled]
-mno-default [disabled]
-mno-fancy-math-387 [disabled]
-mno-push-args [disabled]
-mno-red-zone [disabled]
-mno-sse4 [disabled]
-mnop-mcount [disabled]
-momit-leaf-frame-pointer [disabled]
-mpc32 [disabled]
-mpc64 [disabled]
-mpc80 [disabled]
-mpclmul [enabled]
-mpcommit [disabled]
-mpconfig [disabled]
-mpku [disabled]
-mpopcnt [enabled]
-mprefer-avx128 -mprefer-vector-width=128
-mprefer-vector-width= 128
-mpreferred-stack-boundary= 0
-mprefetchwt1 [disabled]
-mprfchw [enabled]
-mptwrite [disabled]
-mpush-args [enabled]
-mrdpid [disabled]
-mrdrnd [enabled]
-mrdseed [enabled]
-mrecip [disabled]
-mrecip=
-mrecord-mcount [disabled]
-mrecord-return [disabled]
-mred-zone [enabled]
-mregparm= 6
-mrtd [disabled]
-mrtm [disabled]
-msahf [enabled]
-mserialize [disabled]
-msgx [disabled]
-msha [enabled]
-mshstk [disabled]
-mskip-rax-setup [disabled]
-msoft-float [disabled]
-msse [enabled]
-msse2 [enabled]
-msse2avx [disabled]
-msse3 [enabled]
-msse4 [enabled]
-msse4.1 [enabled]
-msse4.2 [enabled]
-msse4a [enabled]
-msse5 -mavx
-msseregparm [disabled]
-mssse3 [enabled]
-mstack-arg-probe [disabled]
-mstack-protector-guard-offset=
-mstack-protector-guard-reg=
-mstack-protector-guard-symbol=
-mstack-protector-guard= tls
-mstackrealign [disabled]
-mstringop-strategy= [default]
-mstv [enabled]
-mtbm [disabled]
-mtls-dialect= gnu
-mtls-direct-seg-refs [enabled]
-mtsxldtrk [disabled]
-mtune-ctrl=
-mtune= znver1
-muclibc [disabled]
-muintr [disabled]
-mvaes [disabled]
-mveclibabi= [default]
-mvect8-ret-in-mem [disabled]
-mvpclmulqdq [disabled]
-mvzeroupper [enabled]
-mwaitpkg [disabled]
-mwbnoinvd [disabled]
-mwidekl [disabled]
-mx32 [disabled]
-mxop [disabled]
-mxsave [enabled]
-mxsavec [enabled]
-mxsaveopt [enabled]
-mxsaves [enabled]
Known assembler dialects (for use with the -masm= option):
att intel
Known ABIs (for use with the -mabi= option):
ms sysv
Known code models (for use with the -mcmodel= option):
32 kernel large medium small
Valid arguments to -mfpmath=:
387 387 sse 387,sse both sse sse 387 sse,387
Known indirect branch choices (for use with the -mindirect-branch=/-mfunction-return= options):
keep thunk thunk-extern thunk-inline
Known choices for return instrumentation with -minstrument-return=:
call none nop5
Known data alignment choices (for use with the -malign-data= option):
abi cacheline compat
Known vectorization library ABIs (for use with the -mveclibabi= option):
acml svml
Known address mode (for use with the -maddress-mode= option):
long short
Known preferred register vector length (to use with the -mprefer-vector-width= option):
128 256 512 none
Known stack protector guard (for use with the -mstack-protector-guard= option):
global tls
Valid arguments to -mstringop-strategy=:
byte_loop libcall loop rep_4byte rep_8byte rep_byte unrolled_loop vector_loop
Known TLS dialects (for use with the -mtls-dialect= option):
gnu gnu2
Known valid arguments for -march= option:
i386 i486 i586 pentium lakemont pentium-mmx winchip-c6 winchip2 c3 samuel-2 c3-2 nehemiah c7 esther i686 pentiumpro pentium2 pentium3 pentium3m pentium-m pentium4 pentium4m prescott nocona core2 nehalem corei7 westmere sandybridge corei7-avx ivybridge core-avx-i haswell core-avx2 broadwell skylake skylake-avx512 cannonlake icelake-client rocketlake icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake bonnell atom silvermont slm goldmont goldmont-plus tremont knl knm intel geode k6 k6-2 k6-3 athlon athlon-tbird athlon-4 athlon-xp athlon-mp x86-64 x86-64-v2 x86-64-v3 x86-64-v4 eden-x2 nano nano-1000 nano-2000 nano-3000 nano-x2 eden-x4 nano-x4 k8 k8-sse3 opteron opteron-sse3 athlon64 athlon64-sse3 athlon-fx amdfam10 barcelona bdver1 bdver2 bdver3 bdver4 znver1 znver2 znver3 btver1 btver2 generic native
Known valid arguments for -mtune= option:
generic i386 i486 pentium lakemont pentiumpro pentium4 nocona core2 nehalem sandybridge haswell bonnell silvermont goldmont goldmont-plus tremont knl knm skylake skylake-avx512 cannonlake icelake-client icelake-server cascadelake tigerlake cooperlake sapphirerapids alderlake rocketlake intel geode k6 athlon k8 amdfam10 bdver1 bdver2 bdver3 bdver4 btver1 btver2 znver1 znver2 znver3
I also tried clang and in both cases the loops are vectorised by, I believe, 32 byte vectors:
remark: vectorized loop (vectorization width: 4, interleaved count: 4)
I'm using gcc 11.2.0
Edit: As requested by Peter Cordes I realised I was actually benchmarking with a multiplication by 4 for some time.
Makefile:
all:
gcc -O3 -mavx2 main.c -o 3
gcc -O3 -march=znver2 main.c -o 32
gcc -O3 -march=znver2 main.c -mprefer-vector-width=128 -o 32128
gcc -O3 -march=znver1 main.c -o 31
gcc -O2 -mavx2 main.c -o 2
gcc -O2 -march=znver2 main.c -o 22
gcc -O2 -march=znver2 main.c -mprefer-vector-width=128 -o 22128
gcc -O2 -march=znver1 main.c -o 21
hyperfine -r5 ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
clean:
rm ./3 ./32 ./32128 ./31 ./2 ./22 ./22128 ./21
Code:
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <time.h>
int main() {
    const size_t N = 500;
    uint64_t arr[N];
    for (size_t i = 0; i < N; i++)
        arr[i] = 1;
    for (int j = 0; j < 20000000; j++)
        for (size_t i = 0; i < N; i++)
            arr[i] *= 4;
    srand(time(0));
    printf("%lu\n", arr[rand() % N]); // use the array so that it is not optimized away
}
N = 500, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.780 s ± 0.011 s [User: 1.778 s, System: 0.000 s]
Range (min … max): 1.763 s … 1.791 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.785 s ± 0.016 s [User: 1.783 s, System: 0.000 s]
Range (min … max): 1.773 s … 1.810 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.740 s ± 0.026 s [User: 1.737 s, System: 0.000 s]
Range (min … max): 1.724 s … 1.785 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.757 s ± 0.022 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.727 s … 1.785 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.467 s ± 0.031 s [User: 3.462 s, System: 0.000 s]
Range (min … max): 3.443 s … 3.519 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.475 s ± 0.028 s [User: 3.469 s, System: 0.001 s]
Range (min … max): 3.447 s … 3.512 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.464 s ± 0.034 s [User: 3.459 s, System: 0.001 s]
Range (min … max): 3.431 s … 3.509 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.465 s ± 0.013 s [User: 3.460 s, System: 0.001 s]
Range (min … max): 3.443 s … 3.475 s 5 runs
N = 500, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.789 s ± 0.004 s [User: 1.786 s, System: 0.001 s]
Range (min … max): 1.783 s … 1.793 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.772 s ± 0.017 s [User: 1.769 s, System: 0.000 s]
Range (min … max): 1.755 s … 1.800 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.911 s ± 0.023 s [User: 2.907 s, System: 0.001 s]
Range (min … max): 2.880 s … 2.943 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 2.924 s ± 0.013 s [User: 2.921 s, System: 0.000 s]
Range (min … max): 2.906 s … 2.934 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.850 s ± 0.029 s [User: 3.846 s, System: 0.000 s]
Range (min … max): 3.823 s … 3.896 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.816 s ± 0.036 s [User: 3.812 s, System: 0.000 s]
Range (min … max): 3.777 s … 3.855 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.813 s ± 0.026 s [User: 3.809 s, System: 0.000 s]
Range (min … max): 3.780 s … 3.834 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.783 s ± 0.010 s [User: 3.779 s, System: 0.000 s]
Range (min … max): 3.773 s … 3.798 s 5 runs
N = 512, arr[i] *= 4:
Benchmark 1: ./3
Time (mean ± σ): 1.849 s ± 0.015 s [User: 1.847 s, System: 0.000 s]
Range (min … max): 1.831 s … 1.873 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.846 s ± 0.013 s [User: 1.844 s, System: 0.001 s]
Range (min … max): 1.832 s … 1.860 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 1.756 s ± 0.012 s [User: 1.754 s, System: 0.000 s]
Range (min … max): 1.744 s … 1.771 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 1.788 s ± 0.012 s [User: 1.785 s, System: 0.001 s]
Range (min … max): 1.774 s … 1.801 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 3.476 s ± 0.015 s [User: 3.472 s, System: 0.001 s]
Range (min … max): 3.458 s … 3.494 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.449 s ± 0.002 s [User: 3.446 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.452 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.456 s ± 0.007 s [User: 3.453 s, System: 0.000 s]
Range (min … max): 3.446 s … 3.462 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.547 s ± 0.044 s [User: 3.542 s, System: 0.001 s]
Range (min … max): 3.482 s … 3.600 s 5 runs
N = 512, arr[i] *= 5:
Benchmark 1: ./3
Time (mean ± σ): 1.847 s ± 0.013 s [User: 1.845 s, System: 0.000 s]
Range (min … max): 1.836 s … 1.863 s 5 runs
Benchmark 2: ./32
Time (mean ± σ): 1.830 s ± 0.007 s [User: 1.827 s, System: 0.001 s]
Range (min … max): 1.820 s … 1.837 s 5 runs
Benchmark 3: ./32128
Time (mean ± σ): 2.983 s ± 0.017 s [User: 2.980 s, System: 0.000 s]
Range (min … max): 2.966 s … 3.012 s 5 runs
Benchmark 4: ./31
Time (mean ± σ): 3.026 s ± 0.039 s [User: 3.021 s, System: 0.001 s]
Range (min … max): 2.989 s … 3.089 s 5 runs
Benchmark 5: ./2
Time (mean ± σ): 4.000 s ± 0.021 s [User: 3.994 s, System: 0.001 s]
Range (min … max): 3.982 s … 4.035 s 5 runs
Benchmark 6: ./22
Time (mean ± σ): 3.940 s ± 0.041 s [User: 3.934 s, System: 0.001 s]
Range (min … max): 3.890 s … 3.981 s 5 runs
Benchmark 7: ./22128
Time (mean ± σ): 3.928 s ± 0.032 s [User: 3.922 s, System: 0.001 s]
Range (min … max): 3.898 s … 3.979 s 5 runs
Benchmark 8: ./21
Time (mean ± σ): 3.908 s ± 0.029 s [User: 3.904 s, System: 0.000 s]
Range (min … max): 3.879 s … 3.954 s 5 runs
I think the run where -O2 -march=znver1 was the same speed as -O3 -march=znver1 was a mistake on my part with the naming of the files. I had not created the makefile yet and was using my shell's history.
Answer:
The default -mtune=generic has -mprefer-vector-width=256, and -mavx2 doesn't change that.
znver1 implies -mprefer-vector-width=128, because that's the full native width of the hardware. An instruction using 32-byte YMM vectors decodes to at least 2 uops, more if it's a lane-crossing shuffle. For simple vertical SIMD like this, 32-byte vectors would be OK; the pipeline handles 2-uop instructions efficiently. (And I think Zen1 is 6 uops wide but only 5 instructions wide, so max front-end throughput isn't available using only 1-uop instructions.) But when vectorization would require shuffling, e.g. with arrays of different element widths, GCC code-gen can get messier with 256-bit or wider vectors.
And vmovdqa ymm0, ymm1 mov-elimination only works on the low 128-bit half on Zen1. Also, normally using 256-bit vectors would imply one should use vzeroupper afterwards, to avoid performance problems on other CPUs (but not Zen1).
I don't know how Zen1 handles misaligned 32-byte loads/stores where each 16-byte half is aligned but in separate cache lines. If that performs well, GCC might want to consider increasing the znver1 -mprefer-vector-width to 256. But wider vectors mean more cleanup code if the size isn't known to be a multiple of the vector width.
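To make the "cleanup code" cost concrete, here is a minimal scalar sketch of the shape an auto-vectorized loop takes when the element count is not known to be a multiple of the vector width. The function name `scale_by_5_chunked` and the `LANES` constant are hypothetical, chosen only for illustration; the inner fixed-count loop stands in for what would be a single vector operation.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumption for illustration: 4 x uint64_t fills one 256-bit vector. */
enum { LANES = 4 };

/* Hypothetical helper: multiply every element by 5 in vector-sized chunks. */
static void scale_by_5_chunked(uint64_t *arr, size_t n) {
    size_t i = 0;

    /* Main body: whole groups of LANES elements. A vectorizing compiler
       would replace the inner loop with one load, one multiply-equivalent,
       and one store. */
    for (; i + LANES <= n; i += LANES)
        for (size_t k = 0; k < LANES; k++)
            arr[i + k] *= 5;

    /* Cleanup: the 0..LANES-1 leftover elements, one at a time.
       A wider vector means up to more leftovers and more code here. */
    for (; i < n; i++)
        arr[i] *= 5;
}
```

With a compile-time constant size such as the question's N = 1000 (a multiple of 4), the cleanup loop would never run, which is why that case is an easy one for wider vectors.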
Ideally GCC would be able to detect easy cases like this and use 256-bit vectors there. (Pure vertical, no mixing of element widths, constant size that's a multiple of 32 bytes.) At least on CPUs where that's fine: znver1, but not bdver2 for example, where 256-bit stores are always slow due to a CPU design bug.
You can see the result of this choice in the way it vectorizes your first loop, the memset-like loop, with a vmovdqu [rdx], xmm0. https://godbolt.org/z/E5Tq7Gfzc
So given that GCC has decided to only use 128-bit vectors, which can only hold two uint64_t elements, it (rightly or wrongly) decides it wouldn't be worth using vpsllq / vpaddq to implement qword *5 as (v<<2) + v, vs. doing it with scalar integer in one LEA instruction.
Almost certainly wrongly in this case, since it still requires a separate load and store for every element or pair of elements. (And loop overhead, since GCC's default is not to unroll except with PGO, -fprofile-use. SIMD is like loop unrolling, especially on a CPU that handles 256-bit vectors as 2 separate uops.)
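The shift-and-add rewrite mentioned above can be sketched in plain C. The helper names `mul5_shift_add` and `scale_by_5` are hypothetical; the point is only that multiplying by 5 equals one shift plus one add, which is the pair of operations a SIMD version would use per vector and which scalar code can fold into a single LEA.

```c
#include <stdint.h>
#include <stddef.h>

/* x * 5 rewritten as (x << 2) + x. Unsigned arithmetic wraps modulo 2^64,
   so both forms agree for every input, including ones that overflow. */
static inline uint64_t mul5_shift_add(uint64_t x) {
    return (x << 2) + x;
}

/* Hypothetical helper: the question's second loop with the multiply
   spelled out as shift + add. */
static void scale_by_5(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] = mul5_shift_add(arr[i]);
}
```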
I'm not sure exactly what GCC means by "not vectorized: unsupported data-type". x86 doesn't have a SIMD uint64_t multiply instruction until AVX-512, so perhaps GCC assigns it a cost based on the general case of having to emulate it with multiple 32x32 => 64-bit pmuludq instructions and a bunch of shuffles. And it's only after it gets over that hump that it realizes that it's actually quite cheap for a constant like 5 with only 2 set bits?
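For a sense of what that general-case emulation involves, here is a scalar sketch of a 64x64 => 64-bit multiply built only from 32x32 => 64-bit products, the operation pmuludq provides per element. The function name `mul64_from_32x32` is hypothetical, and this shows only the arithmetic decomposition, not the shuffles a SIMD version would also need.

```c
#include <stdint.h>

/* Low 64 bits of a * b using three 32x32 -> 64-bit multiplies. */
static uint64_t mul64_from_32x32(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;                /* full 64-bit partial product */
    uint64_t cross = a_lo * b_hi + a_hi * b_lo;  /* only its low 32 bits survive the shift */

    /* a_hi * b_hi would land entirely above bit 63, so it is dropped. */
    return lo_lo + (cross << 32);
}
```

Three multiplies, a shift, and two adds per element is far more work than the one shift and one add that a constant like 5 actually needs.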
That would explain GCC's decision-making process here, but I'm not sure it's exactly the right explanation. Still, these kinds of factors are what happen in a complex piece of machinery like a compiler. A skilled human can easily make smarter choices, but compilers just do sequences of optimization passes that don't always consider the big picture and all the details at the same time.
-mprefer-vector-width=256 doesn't help: not vectorizing uint64_t *= 5 seems to be a GCC9 regression
(The benchmarks in the question confirm that an actual Zen1 CPU gets a nearly 2x speedup, as expected from doing 2x uint64 in 6 uops vs. 1x in 5 uops with scalar. Or 4x uint64_t in 10 uops with 256-bit vectors, including two 128-bit stores which will be the throughput bottleneck along with the front-end.)
Even with -march=znver1 -O3 -mprefer-vector-width=256, we don't get the *= 5 loop vectorized with GCC9, 10, or 11, or current trunk. As you say, we do with -march=znver2. https://godbolt.org/z/dMTh7Wxcq
We do get vectorization with those options for uint32_t (even leaving the vector width at 128-bit). Scalar would cost 4 operations per vector uop (not instruction), regardless of 128 or 256-bit vectorization on Zen1, so this doesn't tell us whether *= is what makes the cost-model decide not to vectorize, or just the 2 vs. 4 elements per 128-bit internal uop.
With uint64_t, changing to arr[i] += arr[i]<<2; still doesn't vectorize, but arr[i] <<= 1; does (https://godbolt.org/z/6PMn93Y5G). Even arr[i] <<= 2; and arr[i] += 123 in the same loop vectorize, to the same instructions that GCC thinks aren't worth it for vectorizing *= 5, just with different operands: a constant instead of the original vector again. (Scalar could still use one LEA.) So clearly the cost-model isn't looking as far as final x86 asm machine instructions, but I don't know why arr[i] += arr[i] would be considered more expensive than arr[i] <<= 1; which is exactly the same thing.
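For anyone who wants to reproduce the comparison, the loop variants discussed above can be written as separate functions and compiled with -fopt-info-vec-all to see which ones the vectorizer accepts. The function names below are hypothetical; the sketch only demonstrates that the variants compute the same values, not what any particular GCC version does with them.

```c
#include <stdint.h>
#include <stddef.h>

/* The question's form. */
static void loop_mul5(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] *= 5;
}

/* Same result as *= 5: arr[i] + (arr[i] << 2). */
static void loop_add_shifted(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] += arr[i] << 2;
}

/* Doubling by shift. */
static void loop_shl1(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] <<= 1;
}

/* Doubling by self-add; same result as loop_shl1. */
static void loop_add_self(uint64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++)
        arr[i] += arr[i];
}
```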
GCC8 does vectorize your loop, even with 128-bit vector width: https://godbolt.org/z/5o6qjc7f6
# GCC8.5 -march=znver1 -O3 (-mprefer-vector-width=128)
.L12: # do{
vmovups xmm1, XMMWORD PTR [rsi] # 16-byte load
add rsi, 16 # ptr += 2 elements
vpsllq xmm0, xmm1, 2 # v << 2
vpaddq xmm0, xmm0, xmm1 # tmp += v
vmovups XMMWORD PTR [rsi-16], xmm0 # store
cmp rax, rsi
jne .L12 # } while(p != endp)
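As a rough manual equivalent of that GCC8 loop, here is a sketch using SSE2 intrinsics from `<emmintrin.h>`. The function name `scale_by_5_sse2` is hypothetical, and the `__SSE2__` guard with a scalar fallback is an addition so the code still compiles on non-x86 targets; this is one way to write it by hand, not a claim about what GCC generates internally.

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: multiply each uint64_t by 5, two elements per step. */
static void scale_by_5_sse2(uint64_t *arr, size_t n) {
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 2 <= n; i += 2) {
        __m128i v = _mm_loadu_si128((const __m128i *)(arr + i)); /* 16-byte unaligned load */
        __m128i t = _mm_slli_epi64(v, 2);                         /* vpsllq: v << 2 */
        t = _mm_add_epi64(t, v);                                  /* vpaddq: (v << 2) + v */
        _mm_storeu_si128((__m128i *)(arr + i), t);                /* 16-byte unaligned store */
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++)
        arr[i] *= 5;
}
```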
With -march=znver1 -mprefer-vector-width=256, doing the store as two 16-byte halves with vmovups xmm / vextracti128 is explained in the Q&A "Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?": znver1 implies -mavx256-split-unaligned-store (which affects every store when GCC doesn't know for sure that it is aligned, so it costs extra instructions even when data does happen to be aligned).
znver1 doesn't imply -mavx256-split-unaligned-load, though, so GCC is willing to fold loads as memory source operands into ALU operations in code where that's useful.