Can two fuseable pairs be decoded in the same clock cycle?-CodePudding

I'm trying to verify the conclusion that two fuseable pairs can be decoded in the same clock cycle, using my Intel i7-10700 and ubuntu 20.04.

The test code is arranged like below, and it is copied like 8000 times to avoid the influence of LSD and DSB (to use MITE mostly).

ALIGN 32
.loop_1:
    dec ecx
    jge .loop_2
.loop_2:
    dec ecx
    jge .loop_3
.loop_3:
    dec ecx
    jge .loop_4
.loop_4:
.loop_5:
    dec ecx
    jge .loop_6

The test result tells that only one pair is fused in a single cycle. ( r479 div r1002479 )

 Performance counter stats for process id '22597':

   120,459,876,711      cycles                                                      
    35,514,146,968      instructions     #    0.29  insn per cycle         
    17,792,584,278      r479             # r479: Number of uops delivered                     
                                         # to Instruction Decode Queue (IDQ) from MITE path                                  
        50,968,497      r4002479        
                                         
                                                  
    17,756,894,879      r1002479         # r1002479: Cycles MITE is delivering any Uop                                              

      26.444208448 seconds time elapsed

I don't think Agner's conclusion is wrong. Therefore, is there something wrong with my perf usage, or did I fail to find insights in the code?

CodePudding user response：

On Haswell and later, yes. On Ivy Bridge and earlier, no.

On Ice Lake and later, Agner Fog says macro-fusion is done right after decode, instead of in the decoders which required the pre-decoders to send the right chunks of x86 machine code to decoders accordingly. (And Ice Lake has slightly different restrictions: Instructions with a memory operand cannot fuse, unlike previous CPU models. Instructions with an immediate operand can fuse.) So on Ice Lake, macro-fusion doesn't let the decoders handle more than 5 instructions per clock.

Wikichip claims that only 1 macro-fusion per clock is possible on Ice Lake, but that's probably incorrect. Harold tested with my microbenchmark on Rocket Lake and found the same results as Skylake. (Rocket Lake uses a Cypress Cove core, a variant of Sunny Cove back-ported to a 14nm process, so it's likely that it's the same as Ice Lake in this respect.)

Your results indicate that uops_issued.any is about half instructions, therefore you are seeing macro-fusion of most pairs. (You could also look at the uops_retired.macro_fused perf event. BTW, modern perf has symbolic names for most uarch-specific events: use perf list to see them.)

The decoders will still produce up-to-four or even five uops per clock on Skylake-derived microarchitectures, though, even if they only make two macro-fusions. You didn't look at how many cycles MITE is active, so you can't see that execution stalls most of the time, until there's room in the ROB / RS for an issue-group of 4 uops. And that opens up space in the IDQ for a decode group from MITE.

You have three other bottlenecks in your loop:

Loop-carried dependency through dec ecx: only 1/clock because each dec has to wait for the result of the previous to be ready.
Only one taken branch can execute per cycle (on port 6), and dec/jge is taken almost every time, except for 1 in 2^32 when ECX was 0 before the dec.
The other branch execution unit on port 0 only handles predicted-not-taken branches. https://www.realworldtech.com/haswell-cpu/4/ shows the layout but doesn't mention that limitation; Agner Fog's microarch guide does.
Branch prediction: even jumping to the next instruction, which is architecturally a NOP, is not special cased by the CPU. Slow jmp-instruction (Because there's no reason for real code to do this, except for call 0 / pop which is special cased at least for the return-address predictor stack.)

This is why you're executing at significantly less than one instruction per clock, let alone one uop per clock.

Working demo of 2 fusions per clock

Surprisingly to me, MITE didn't go on to decode a separate test and jcc in the same cycle as it made two fusions. I guess the decoders are optimized for filling the uop cache. (A similar effect on Sandybridge / IvyBridge is that if the final uop of a decode-group is potentially fusable, like dec, decoders will only produce 3 uops that cycle, in anticipation of maybe fusing the dec next cycle. That's true at least on SnB/IvB where the decoders can only make 1 fusion per cycle, and will decode separate ALU jcc uops if there is another pair in the same decode group. Here, SKL is choosing not to decode a separate test uop (and jcc and another test) after making two fusions.)

global _start
_start:
   mov ecx, 100000000
ALIGN 32
.loop:
%rep 399          ; the loop branch makes 400 total
   test ecx, ecx
   jz  .exit_loop        ; many of these will be 6-byte jcc rel32
%endrep
   dec  ecx
   jnz  .loop

.exit_loop:
   mov eax, 231
   syscall          ; exit_group(EDI)

On i7-6700k Skylake, perf counters for user-space only:

$ nasm -felf64 fusion.asm && ld fusion.o -o fusion       # static executable
$ taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,idq.all_mite_cycles_any_uops,idq.mite_uops -r2 ./fusion

 Performance counter stats for './fusion' (2 runs):

          5,165.34 msec task-clock                #    1.000 CPUs utilized            (  -  0.01% )
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 1      page-faults               #    0.194 /sec                   
    20,130,230,894      cycles                    #    3.897 GHz                      (  -  0.04% )
    80,000,001,586      instructions              #    3.97  insn per cycle           (  -  0.00% )
    40,000,677,865      uops_issued.any           #    7.744 G/sec                    (  -  0.00% )
    40,000,602,728      uops_executed.thread      #    7.744 G/sec                    (  -  0.00% )
    20,100,486,534      idq.all_mite_cycles_any_uops #    3.891 G/sec                    (  -  0.00% )
    40,000,261,852      idq.mite_uops             #    7.744 G/sec                    (  -  0.00% )

          5.165605  - 0.000716 seconds time elapsed  (  -  0.01% )

Not-taken branches aren't a bottleneck, perhaps because my loop is big enough to defeat the DSB (uop cache), but not too big to defeat branch prediction. (Actually, the JCC erratum mitigation on Skylake will definitely defeat the DSB: if everything is a macro-fused branch, there will be one touching the end of every 32-byte region. Only if we start introducing NOPs or other instructions between branches will the uop cache be able to operate.)

We can see that everything was fused (80G instructions in 40G uops) and executing at 2 test-and-branch uops per clock (20G cycles). Also that MITE is delivering uops every cycle, 20G MITE cycles. And what it does deliver is apparently 2 uops per cycle, at least on average.

A test with alternating groups of NOPs and not-taken branches might be good to see what happens when there's room for the IDQ to accept more uops from MITE, to see if it will send non-fused test and JCC uops to the IDQ.

Further tests:

Backwards jcc rel8 for all the branches made no difference, same perf results:

%assign i 0 
%rep 399          ; the loop branch makes 400 total
   .dummy% i:
   test ecx, ecx
   jz  .dummy %  i
   %assign i i 1
%endrep

MITE throughput: alternating groups of NOPs and macro-fused branches

The NOPs still need to get decoded, but the back-end can blaze through them. This makes total MITE throughput the only bottleneck, instead of being limited to 2 uops / clock regardless of how many MITE could produce.

global _start
_start:
   mov ecx, 100000000
ALIGN 32
.loop:
%assign i 0 
%rep 10
 %rep 8
   .dummy% i:
   test ecx, ecx
   jz  .dummy %  i
   %assign i i 1
 %endrep
 times 24 nop
%endrep

   dec  ecx
   jnz  .loop

.exit_loop:
   mov eax, 231
   syscall          ; exit_group(EDI)

 Performance counter stats for './fusion':

          2,594.14 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 /sec                   
                 0      cpu-migrations            #    0.000 /sec                   
                 1      page-faults               #    0.385 /sec                   
    10,112,077,793      cycles                    #    3.898 GHz                    
    40,200,000,813      instructions              #    3.98  insn per cycle         
    32,100,317,400      uops_issued.any           #   12.374 G/sec                  
     8,100,250,120      uops_executed.thread      #    3.123 G/sec                  
    10,100,772,325      idq.all_mite_cycles_any_uops #    3.894 G/sec                  
    32,100,146,351      idq.mite_uops             #   12.374 G/sec                  

       2.594423202 seconds time elapsed

       2.593606000 seconds user
       0.000000000 seconds sys

So it seems MITE couldn't keep up with 4-wide issue. The blocks of 8 branches are making the decoders produce significantly less than 5 uops per clock; probably only 2 like we were seeing for longer runs of test/jcc.

24 nops can decode in

Reducing to groups of 3 test/jcc and 29 nop gets it down to 8.607 Gcycles for MITE active 8.600 cycles, with 32.100G MITE uops. (3.099 G uops_retired.macro_fused, with the .1 coming from the loop branch.) Still not saturating the front-end with 4.0 uops per clock, like I was hoping it might with a macro-fusion at the end of one decode group.
It is hitting 4.09 IPC, so at least the decoders and issue bottleneck are ahead of where they'd be with no macro-fusion.
(Best case for macro-fusion is 6.0 IPC, with 2 fusions per cycle and 2 other uops from non-fusing instructions. That's separate from unfused-domain back-end uop throughput limits via micro-fusion, see this test for ~7 uops_executed.thread per clock.)

Even %rep 2 test/JCC hurts throughput, which seems to indicate that it just stops decoding after making 2 fusions, not even decoding 2 or 3 more NOPs after that. (For some lower NOP counts, we get some uop-cache activity because the outer rep count isn't big enough to totally fill up the uop cache.)

You can test this in a shell loop like for NOPS in {0..20}; do nasm ... -DNOPS=$NOPS ... with the source using times NOPS nop.

There are some plateau/step effects in total cycles vs. number of NOPS for %rep 2, so maybe the two test/JCC uops are decoding at the end of a group, with 1, 2, or 3 NOPs before them. (But it's not super consistent, especially for lower numbers of NOPS. But NOPS=16, 17 and 18 are all right around 5.22 Gcycles, with 14 and 15 both at 4.62 Gcycles.)

There are a lot of possibly-relevant perf counters if we want to really get into what's going on, e.g. idq_uops_not_delivered.cycles_fe_was_ok (cycles where the issue stage got 4 uops, or where the back-end was stalled so it wasn't the front-end's fault.)