I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 and ubuntu 20.04.
The test code is arranged like below, and it is copied like 8000 times to avoid the influence of LSD and DSB (to use MITE mostly).
ALIGN 32
.loop_1:
dec ecx
jge .loop_2
.loop_2:
dec ecx
jge .loop_3
.loop_3:
dec ecx
jge .loop_4
.loop_4:
.loop_5:
dec ecx
jge .loop_6
The test result tells that only one pair is fused in a single cycle. ( r479 div r1002479 )
Performance counter stats for process id '22597':
120,459,876,711 cycles
35,514,146,968 instructions # 0.29 insn per cycle
17,792,584,278 r479 # r479: Number of uops delivered
# to Instruction Decode Queue (IDQ) from MITE path
50,968,497 r4002479
17,756,894,879 r1002479 # r1002479: Cycles MITE is delivering any Uop
26.444208448 seconds time elapsed
I don't think Agner's conclusion is wrong. Therefore, is there something wrong with my perf usage, or did I fail to find insights in the code?
CodePudding user response:
On Haswell and later, yes. On Ivy Bridge and earlier, no.
On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders which required the pre-decoders to send the right chunks of x86 machine code to decoders accordingly. (And Ice Lake has slightly different restrictions: Instructions with a memory operand cannot fuse, unlike previous CPU models. Instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.
Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)
Your results indicate that uops_issued.any
is about half instructions
, therefore you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused
perf event. BTW, modern perf
has symbolic names for most uarch-specific events: use perf list
to see them.)
The decoders will still produce up-to-four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. You didn't look at how many cycles MITE is active, so you can't see that execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.
You have three other bottlenecks in your loop:
Loop-carried dependency through
dec ecx
: only 1/clock because eachdec
has to wait for the result of the previous to be ready.Only one taken branch can execute per cycle (on port 6), and
dec
/jge
is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec.
The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.Branch prediction: even jumping to the next instruction, which is architecturally a NOP, is not special cased by the CPU. Slow jmp-instruction (Because there's no reason for real code to do this, except for
call 0
/pop
which is special cased at least for the return-address predictor stack.)This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.
Working demo of 2 fusions per clock
Surprisingly to me, MITE didn't go on to decode a separate test
and jcc
in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandybridge / IvyBridge is that if the final uop of a decode-group is potentially fusable, like dec
, decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec
next cycle. That's true at least on SnB/IvB where the decoders can only make 1 fusion per cycle, and will decode separate ALU jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode a separate test
uop (and jcc
and another test
) after making two fusions.)
global _start
_start:
mov ecx, 100000000
ALIGN 32
.loop:
%rep 399 ; the loop branch makes 400 total
test ecx, ecx
jz .exit_loop ; many of these will be 6-byte jcc rel32
%endrep
dec ecx
jnz .loop
.exit_loop:
mov eax, 231
syscall ; exit_group(EDI)
On i7-6700k Skylake, perf counters for user-space only:
$ nasm -felf64 fusion.asm && ld fusion.o -o fusion # static executable
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.all_mite_cycles_any_uops,idq.mite_uops -r2 ./fusion
Performance counter stats for './fusion' (2 runs):
5,165.34 msec task-clock # 1.000 CPUs utilized ( - 0.01% )
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1 page-faults # 0.194 /sec
20,130,230,894 cycles # 3.897 GHz ( - 0.04% )
80,000,001,586 instructions # 3.97 insn per cycle ( - 0.00% )
40,000,677,865 uops_issued.any # 7.744 G/sec ( - 0.00% )
40,000,602,728 uops_executed.thread # 7.744 G/sec ( - 0.00% )
20,100,486,534 idq.all_mite_cycles_any_uops # 3.891 G/sec ( - 0.00% )
40,000,261,852 idq.mite_uops # 7.744 G/sec ( - 0.00% )
5.165605 - 0.000716 seconds time elapsed ( - 0.01% )
Not-taken branches aren't a bottleneck, perhaps because my loop is big enough to defeat the DSB (uop cache), but not too big to defeat branch prediction. (Actually, the JCC erratum mitigation on Skylake will definitely defeat the DSB: if everything is a macro-fused branch, there will be one touching the end of every 32-byte region. Only if we start introducing NOPs or other instructions between branches will the uop cache be able to operate.)
We can see that everything was fused (80G instructions in 40G uops) and executing at 2 test-and-branch uops per clock (20G cycles). Also that MITE is delivering uops every cycle, 20G MITE cycles. And what it does deliver is apparently 2 uops per cycle, at least on average.
A test with alternating groups of NOPs and not-taken branches might be good to see what happens when there's room for the IDQ to accept more uops from MITE, to see if it will send non-fused test and JCC uops to the IDQ.
Further tests:
Backwards jcc rel8
for all the branches made no difference, same perf results:
%assign i 0
%rep 399 ; the loop branch makes 400 total
.dummy% i:
test ecx, ecx
jz .dummy % i
%assign i i 1
%endrep
MITE throughput: alternating groups of NOPs and macro-fused branches
The NOPs still need to get decoded, but the back-end can blaze through them. This makes total MITE throughput the only bottleneck, instead of being limited to 2 uops / clock regardless of how many MITE could produce.
global _start
_start:
mov ecx, 100000000
ALIGN 32
.loop:
%assign i 0
%rep 10
%rep 8
.dummy% i:
test ecx, ecx
jz .dummy % i
%assign i i 1
%endrep
times 24 nop
%endrep
dec ecx
jnz .loop
.exit_loop:
mov eax, 231
syscall ; exit_group(EDI)
Performance counter stats for './fusion':
2,594.14 msec task-clock # 1.000 CPUs utilized
0 context-switches # 0.000 /sec
0 cpu-migrations # 0.000 /sec
1 page-faults # 0.385 /sec
10,112,077,793 cycles # 3.898 GHz
40,200,000,813 instructions # 3.98 insn per cycle
32,100,317,400 uops_issued.any # 12.374 G/sec
8,100,250,120 uops_executed.thread # 3.123 G/sec
10,100,772,325 idq.all_mite_cycles_any_uops # 3.894 G/sec
32,100,146,351 idq.mite_uops # 12.374 G/sec
2.594423202 seconds time elapsed
2.593606000 seconds user
0.000000000 seconds sys
So it seems MITE couldn't keep up with 4-wide issue. The blocks of 8 branches are making the decoders produce significantly less than 5 uops per clock; probably only 2 like we were seeing for longer runs of test/jcc
.
24 nops can decode in
Reducing to groups of 3 test/jcc and 29 nop
gets it down to 8.607 Gcycles for MITE active 8.600 cycles, with 32.100G MITE uops. (3.099 G uops_retired.macro_fused
, with the .1 coming from the loop branch.) Still not saturating the front-end with 4.0 uops per clock, like I was hoping it might with a macro-fusion at the end of one decode group.
It is hitting 4.09 IPC, so at least the decoders and issue bottleneck are ahead of where they'd be with no macro-fusion.
(Best case for macro-fusion is 6.0 IPC, with 2 fusions per cycle and 2 other uops from non-fusing instructions. That's separate from unfused-domain back-end uop throughput limits via micro-fusion, see this test for ~7 uops_executed.thread
per clock.)
Even %rep 2
test/JCC hurts throughput, which seems to indicate that it just stops decoding after making 2 fusions, not even decoding 2 or 3 more NOPs after that. (For some lower NOP counts, we get some uop-cache activity because the outer rep count isn't big enough to totally fill up the uop cache.)
You can test this in a shell loop like for NOPS in {0..20}; do nasm ... -DNOPS=$NOPS ...
with the source using times NOPS nop
.
There are some plateau/step effects in total cycles vs. number of NOPS for %rep 2
, so maybe the two test/JCC uops are decoding at the end of a group, with 1, 2, or 3 NOPs before them. (But it's not super consistent, especially for lower numbers of NOPS. But NOPS=16, 17 and 18 are all right around 5.22 Gcycles, with 14 and 15 both at 4.62 Gcycles.)
There are a lot of possibly-relevant perf counters if we want to really get into what's going on, e.g. idq_uops_not_delivered.cycles_fe_was_ok
(cycles where the issue stage got 4 uops, or where the back-end was stalled so it wasn't the front-end's fault.)