According to uops.info, the reciprocal throughput of vinserti128 is 0.5 if the xmm argument comes from memory, and 1 if the xmm argument is a register. What's the underlying reason behind this? Is it a mistake?
CodePudding user response:
Uops.info is correct. Memory-source is different.
As chtz commented, the register-source version works like a shuffle, but the memory-source version is probably decoding as a broadcast-load blend.
We know that Intel load ports have been able to broadcast to 256-bit for free since Haswell (4, 8, or 16-byte chunks). vbroadcastss/d / vpbroadcastd/q are single-uop, and so are vbroadcasti128 / vbroadcastf128. Even vmovddup ymm, [mem] is a single load-port uop on Intel (port 2 or 3), with no ALU required for duplicating the low half of each 16-byte lane. (Filter on vbroadcast ymm in https://uops.info/)
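To make the data movement concrete, here is a minimal scalar sketch in C of what those broadcast loads produce. It only models the architectural result, not how the load port implements it; the function names broadcast_chunk and movddup256 are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill a 32-byte (256-bit) destination with repeated copies of one chunk.
   chunk = 4 models vbroadcastss / vpbroadcastd,
   chunk = 8 models vbroadcastsd / vpbroadcastq,
   chunk = 16 models vbroadcastf128 / vbroadcasti128. */
void broadcast_chunk(uint8_t dst[32], const uint8_t *src, size_t chunk)
{
    assert(chunk == 4 || chunk == 8 || chunk == 16);
    for (size_t off = 0; off < 32; off += chunk)
        memcpy(dst + off, src, chunk);
}

/* Model of vmovddup ymm, [mem]: reads 32 bytes and duplicates the
   low 8 bytes of each 16-byte lane into the high 8 bytes of that lane. */
void movddup256(uint8_t dst[32], const uint8_t src[32])
{
    for (size_t lane = 0; lane < 32; lane += 16) {
        memcpy(dst + lane, src + lane, 8);      /* low half stays in place */
        memcpy(dst + lane + 8, src + lane, 8);  /* and is copied upward    */
    }
}
```

Note that vmovddup differs from the true broadcasts in that it reads a full 32 bytes and duplicates within each lane, which is why it is interesting that it still needs no ALU uop.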
For the blend part, Haswell and later could reuse the vpblendd execution unit by constructing a control operand for it from the vinserti128 immediate. That's a single uop for any vector ALU port (p015). To avoid bypass latency, vinsertf128 could similarly use a vblendps or vblendpd execution unit.
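As a sanity check on the idea, the following C sketch models a 128-bit insert as a broadcast followed by a dword blend, and compares it against the straightforward definition of the insert. This is only a model of the guess above, not a description of the actual hardware; the 0x0F / 0xF0 control values are what a vpblendd-style immediate would have to be, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 256-bit vector modeled as eight 32-bit elements (d[0] is the lowest). */
typedef struct { uint32_t d[8]; } vec256;

/* Reference semantics of vinserti128 dst, src1, m128, imm8:
   copy src1, then overwrite the 128-bit lane selected by imm8 bit 0. */
vec256 insert128_ref(vec256 src1, const uint32_t mem[4], unsigned imm8)
{
    vec256 r = src1;
    memcpy(&r.d[(imm8 & 1) * 4], mem, 16);
    return r;
}

/* Model of a broadcast load such as vbroadcasti128 ymm, [mem]. */
vec256 broadcast128(const uint32_t mem[4])
{
    vec256 r;
    memcpy(&r.d[0], mem, 16);
    memcpy(&r.d[4], mem, 16);
    return r;
}

/* Model of vpblendd: bit i of ctrl takes dword i from b, otherwise from a. */
vec256 blendd(vec256 a, vec256 b, uint8_t ctrl)
{
    vec256 r;
    for (int i = 0; i < 8; i++)
        r.d[i] = ((ctrl >> i) & 1) ? b.d[i] : a.d[i];
    return r;
}

/* Assumed mapping from the insert immediate to a blend control:
   imm8 bit 0 = 0 replaces the low four dwords, 1 replaces the high four. */
uint8_t blend_ctrl_from_imm(unsigned imm8)
{
    return (imm8 & 1) ? 0xF0 : 0x0F;
}

/* The conjectured decomposition: broadcast load, then blend. */
vec256 insert128_as_broadcast_blend(vec256 src1, const uint32_t mem[4],
                                    unsigned imm8)
{
    return blendd(src1, broadcast128(mem), blend_ctrl_from_imm(imm8));
}

/* Asserts that both models agree for both possible lane selections. */
void check_models(void)
{
    vec256 src1 = {{10, 11, 12, 13, 14, 15, 16, 17}};
    uint32_t mem[4] = {90, 91, 92, 93};
    for (unsigned imm = 0; imm < 2; imm++) {
        vec256 a = insert128_ref(src1, mem, imm);
        vec256 b = insert128_as_broadcast_blend(src1, mem, imm);
        assert(memcmp(a.d, b.d, sizeof a.d) == 0);
    }
}
```

Calling check_models() from a test harness exercises both immediates. The point is just that the blend control depends only on the immediate, so it could be produced at decode time.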
Sandy/Ivy Bridge could only broadcast for free to 128-bit XMM; it needed a port 5 shuffle for vbroadcastss/sd/f128 ymm, mem (and didn't have register-source broadcasts; those were new in AVX2). But it ran memory-source vinsertf128 as 1*p05 + 1*p23, vs. p5 for register-source vinsertf128 or vperm2f128. (vinserti128 is AVX2, so it doesn't exist on these CPUs.)
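The port assignments quoted so far are enough to reproduce the uops.info numbers with simple arithmetic. The sketch below is a deliberately simplified model: it assumes each uop's port group is disjoint from the others (true for the cases here) and ignores every other bottleneck such as the front end. The uop struct and recip_throughput function are hypothetical names for illustration.

```c
#include <assert.h>

/* One uop and the number of execution ports that can run it,
   e.g. {"p015", 3} or {"p23", 2}. */
typedef struct { const char *ports; int num_ports; } uop;

/* Best-case cycles per instruction when every uop uses its own disjoint
   port group: a group with N ports can start N such uops per cycle, so
   each uop costs 1/N cycles and the slowest group sets the pace. */
double recip_throughput(const uop *uops, int n)
{
    double worst = 0.0;
    for (int i = 0; i < n; i++) {
        assert(uops[i].num_ports > 0);
        double cycles = 1.0 / uops[i].num_ports;
        if (cycles > worst)
            worst = cycles;
    }
    return worst;
}
```

With a single p5 uop (register source) this gives 1 cycle per instruction. With 1*p015 + 1*p23 (Haswell memory source) or 1*p05 + 1*p23 (Sandy Bridge memory source), the two load ports are the limit and the result is 0.5, matching the measurements in the question.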
SnB/IvB CPUs ran vblendps/pd on p05, so presumably vinsertf128 from memory is using the FP blend units on those ports. (It wasn't until Haswell that port 1 could also run vblendps.)
So Sandy Bridge presumably had some special support for forwarding a 128-bit load result to either the low or high lane of an execution unit that could do the merging. (Perhaps the blend units were designed to accept that?)
If it could forward to both lanes, I'd have assumed it could do that in general, and wouldn't have needed a port 5 uop for vbroadcastf128 ymm, mem on top of the load uop. Unless it's relevant that the intermediate broadcast result never needs to be architecturally visible (written to a hard YMM register), where it could persist indefinitely if the register isn't written for a long time. When it's combined with the blend control input generated from the immediate, the input can be a bit funky; only the output needs to be fully normal and potentially part of persistent state.