Why does MOVD/MOVQ between GP and SIMD registers have quite high latency?

Moves between registers are usually very cheap operations, so I wonder why movd/movq between GP and SIMD registers has quite high latency.

Looking at the latency of movd r32, xmm on recent CPUs, from uops.info:

CPU          | Latency (cycles)
Alder Lake-P | ≤3
Alder Lake-E | ≤9
Zen 3        | ≤5

For Pentium 4, Agner Fog's instruction tables show that the latency of movd r32, xmm is 10. That is even higher than the latency of pextrw r32, xmm, i on the same processor, which is 9.

Answer:

It's probably not as bad as 9 or 5 cycles on Alder Lake-E or Zen 3. But yes, on Intel CPUs, one direction of a movd XMM<->GP round trip is probably 3 cycles, and the other probably 1.


Physically and logically separate areas of the CPU

GP-integer and SIMD/FP registers live in separate register files, with separate forwarding networks. These are physically distinct areas of the CPU, so part of this latency is just propagation / wire delay, not gate delays from computation. And unlike the extra bypass latency between the SIMD-integer and SIMD-FP domains, the data has to go through an actual execution unit to get from one domain to the other, and that execution unit has to be physically somewhere.

Despite that, mainstream Intel cores do remarkably well at it: a total round-trip latency of 4 cycles for movd r8d, xmm0 / movd xmm0, r8d on Haswell and later, and on Nehalem (perhaps a 3:1 split; source: Experiment 5 of uops.info's Alder Lake-P latency testing). Sandy/Ivy Bridge experimented with bringing the round-trip latency down to 2 cycles total, i.e. 1 in each direction, but it seems that was relaxed again for Haswell.
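That 4-cycle figure comes from timing a loop-carried dependency chain. Here's a minimal NASM-style sketch of the kind of loop such a test runs (not uops.info's actual harness; register choices and iteration count are arbitrary):

    mov     ecx, 100000000      ; iteration count: arbitrary, just large enough to dominate overhead
top:
    movd    xmm0, eax           ; GP -> XMM: depends on the eax written last iteration
    movd    eax, xmm0           ; XMM -> GP: completes the round trip
    dec     ecx
    jnz     top                 ; total core cycles / iterations ~= round-trip latency (4 on Haswell)

The dec/jnz overhead runs in parallel with the chain, so the loop bottlenecks on the movd round trip, not on the loop machinery.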

AMD CPUs have separate scheduling domains for FP vs. integer, while Intel shares execution ports between integer and FP execution units. (E.g. see a block diagram of Zen 2 vs. Skylake, or David Kanter's Haswell writeup.) This might contribute some delay to GP<->XMM transfers, like maybe not being able to wake up a dependent uop in the same cycle that an execution unit in the other domain puts the result on the bypass forwarding network (or that not being practical, in terms of power and of keeping the schedulers separate).

Bulldozer-family was even worse, with a cluster of 2 weak integer cores sharing one SIMD/FP unit, leading to quite high latencies for XMM<->integer transfers, especially before Steamroller.

Intel's low-power cores, like the Alder Lake E-cores (Gracemont), might be different. WikiChip doesn't have a block diagram for Gracemont, but Tremont, the previous generation, is likely similar. A unified ROB is pretty much necessary to support precise exceptions via program-order retirement (like Zen 2's retire queue), but there are separate RSes for each of the integer ports and for the two FP ports. (Gracemont significantly beefed up the SIMD part, but the key point is that the domains are probably not as tightly coupled as they are on microarchitectures derived from Sandybridge, which itself descended from P6 with some Netburst influence.)

The actual movd round-trip latency on Alder Lake-E is 10 cycles total.


Re: regular mov being cheap:
Obviously mov-elimination at rename can't work here, because GP and XMM registers don't live in the same register file. But even before mov-elimination existed, a mov, movdqa, or movaps between two registers of the same type could run on any execution port: there's no forwarding between domains, just a trivial execution unit.
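For contrast, a quick illustration (register choices are arbitrary):

    movaps  xmm1, xmm0          ; XMM -> XMM: same register file; mov-eliminated at rename on many recent CPUs
    mov     edx, eax            ; GP  -> GP:  same register file; also a candidate for mov-elimination
    movd    xmm2, eax           ; GP  -> XMM: crosses domains, so it needs a real execution unit plus wire delay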

Measurement challenges: dividing up round-trip latency

You can only easily measure the latency of a sequence that forms a loop-carried dependency chain. (You put it in a loop, like the sketch above, to create a bottleneck with that critical-path length.) A one-way transfer between different register types makes this a lot harder.

https://uops.info/ is very conservative, only assuming that every other instruction in the dependency chain has at least 1 cycle of latency. That's why they list the latency as ≤9, not 9: as mentioned earlier, Alder Lake-E's round-trip latency is 10 cycles total, and subtracting at least 1 cycle for the other direction leaves at most 9. The real split could be 5 and 5.

If you could make some estimate of the load-use latency of movq xmm0, [rax] (address to data), you could use that to build a round trip involving a load of somewhat-known latency in one direction, narrowing down the range of uncertainty for movq rax, xmm0.
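A minimal sketch of that idea, assuming NASM syntax and a self-pointing qword in memory so the load's address input comes from the previous iteration; the load-use latency figure in the comment is an assumed typical L1d-hit value, not a measurement:

    default rel
    section .bss
buf:    resq    1
    section .text
measure:
    lea     rax, [buf]
    mov     [rax], rax          ; buf holds its own address: a self-pointing qword
    mov     ecx, 100000000
top:
    movq    xmm0, [rax]         ; load-use latency, address -> data (assume e.g. ~6 cycles for an L1d hit)
    movq    rax, xmm0           ; XMM -> GP: the latency we want to isolate
    dec     ecx
    jnz     top
    ; cycles per iteration, minus the assumed load-use latency, ~= movq rax, xmm0 latency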

InstLatx64 / AIDA64 just gives up in these cases, e.g. listing MOVD r32, xmm as "L: [diff. reg. set]" with only a throughput, "T: 0.50ns = 1.00c", on an Ice Lake.

Agner Fog measures a round trip and somewhat arbitrarily divides the latency between the two instructions so they add up to the total. I think it's unlikely that pextrw r,x,i is actually lower latency than movd r32, xmm on P4, although weird things are always possible on P4.

But notice that Agner Fog's latency numbers for mov reg, mem aren't the load-use latency (e.g. of mov rax, [rax]); they're an arbitrary fraction of store/reload latency, chosen so that the load latency and the store latency add up to the total store-forwarding round trip. This is one of the biggest problems with Agner Fog's data, separate from the occasional human error (copy/pasting a wrong number or its reciprocal, or missing testing of some instructions a CPU supports).

Another weakness of Agner's data is that it reports only one latency number per instruction, not the latency from each input to the output separately. It was a very valuable effort, and vastly better than nothing, but automating the testing like https://uops.info/ does has proven useful in shining more light on things, as well as eliminating typos. Agner's tables are still good for historical info on older CPUs that uops.info didn't test, but I don't normally look at them for CPUs that uops.info has tested.

(Agner's microarch guide, on the other hand, is still a gold mine, investigating various effects beyond per-uop performance.)

