Using uiCA I produced a trace table for the following code.
cvtsi2ss xmm0, eax
addss xmm0, xmm0
https://uica.uops.info/tmp/780bce9e56ee4a718d5369deb1326215_trace.html
You can see that each cvtsi2ss has to wait for the previous iteration to finish because it depends on some bits (32:127) of xmm0.
However, changing cvtsi2ss to rsqrtss makes a big difference.
rsqrtss xmm0, xmm1
addss xmm0, xmm0
https://uica.uops.info/tmp/8897a7d45c8348e68279aea4d0b18e15_trace.html
Each rsqrtss executes in parallel with the previous iteration. I don't understand this, because rsqrtss also leaves bits 32:127 of its output unchanged, just like cvtsi2ss, so I think it should wait for any pending operation on the output register to finish, just as cvtsi2ss did.
After reading the answer, I ran a simple test, and it seems clear that uiCA has a bug.
IACA also fails to catch the output dependency.
Please correct me if the test code is wrong.
/* The asm below is Intel syntax; with GCC, build with -masm=intel. */
__asm__ (
R"(.section .text
.balign 16
noXor:
mov eax, 0x3f800000
movd xmm1, eax
rdtscp
shl rdx, 32
or rax, rdx
mov rdi, rax
mov ecx, 1 << 30
jmp noXor_loop
.balign 16
noXor_loop:
rsqrtss xmm0, xmm1
addss xmm0, xmm0
dec ecx
jnz noXor_loop
rdtscp
shl rdx, 32
or rax, rdx
sub rax, rdi
ret
.balign 16
yesXor:
mov eax, 0x3f800000
movd xmm1, eax
rdtscp
shl rdx, 32
or rax, rdx
mov rdi, rax
mov ecx, 1 << 30
jmp yesXor_loop
.balign 16
yesXor_loop:
xorps xmm0, xmm0
rsqrtss xmm0, xmm1
addss xmm0, xmm0
dec ecx
jnz yesXor_loop
rdtscp
shl rdx, 32
or rax, rdx
sub rax, rdi
ret)"
);
unsigned long long noXor(void);
unsigned long long yesXor(void);
#include <stdio.h>
int main() {
for (int i = 0; i < 4; i++) {
printf("noXor: %llu yesXor: %llu\n", noXor(), yesXor());
}
return 0;
}
noXor: 4978836501 yesXor: 696810039
noXor: 4971780086 yesXor: 690780109
noXor: 4977293771 yesXor: 687404710
noXor: 5499602729 yesXor: 687954399
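To make the totals above easier to compare, they can be divided by the loop count (1 << 30). The following is only a minimal sketch; ticks_per_iteration is a hypothetical helper name, not part of the test program. Note that rdtscp counts TSC reference ticks, which are not necessarily the same as core clock cycles.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: average TSC ticks per loop iteration.
 * total_ticks is the value returned by noXor() or yesXor();
 * iterations is the loop count (1 << 30 in the test above). */
double ticks_per_iteration(uint64_t total_ticks, uint64_t iterations)
{
    assert(iterations > 0);
    return (double)total_ticks / (double)iterations;
}
```

For the first run, ticks_per_iteration(4978836501ULL, 1ULL << 30) comes out at roughly 4.6 ticks per iteration for noXor, versus roughly 0.65 for yesXor with 696810039 ticks, a ratio of about 7.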
CodePudding user response:
Test on real hardware and you'll see the expected result: rsqrtss xmm0, xmm1 has 4 cycle latency as part of the xmm0 -> xmm0 dependency chain (on my Skylake).
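As a rough back-of-the-envelope model of that dependency chain (a sketch, not how UICA computes anything): if the merge into xmm0 is a real dependency, every iteration has to wait for both rsqrtss and addss from the previous one. The function name and parameters below are hypothetical, and the addss latency is an assumption you have to supply yourself.

```c
#include <assert.h>

/* Hypothetical model: length of the loop-carried chain through xmm0,
 * in core cycles per iteration.
 * has_output_dep: non-zero if rsqrtss has to wait for the old xmm0.
 * rsqrtss_latency, addss_latency: assumed instruction latencies. */
int loop_carried_cycles(int has_output_dep, int rsqrtss_latency, int addss_latency)
{
    assert(rsqrtss_latency >= 0 && addss_latency >= 0);
    if (!has_output_dep)
        return 0; /* chain is cut every iteration; throughput is the limit instead */
    return rsqrtss_latency + addss_latency;
}
```

With the 4-cycle rsqrtss latency from above and an assumed 4-cycle addss latency (the usual Skylake figure, not something measured in this post), loop_carried_cycles(1, 4, 4) gives 8 cycles per iteration, while a version that breaks the dependency has no loop-carried chain at all.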
That's a bug in UICA. Or actually in the https://uops.info/ data it uses: your trace includes a link to the rsqrtss page on uops.info, where we can see they only measured latency for the operand 2 → 1 case, with no entry for 1 → 1. When that test was written, the author maybe copy/pasted the rsqrtps test and forgot to add a test for an output dependency.
Without testing yourself on a real CPU, you should only be surprised if there was an actual test that measured zero latency for 1 → 1 of rsqrtss. Without such a test, the correct assumption is that the test is missing (and thus the UICA result is wrong), not that the latency is actually zero.
Many instructions don't have output dependencies on any CPUs, so it makes sense that https://uops.info/ doesn't test them all. We'd rather have rsqrtps latency listed as 4 than [0:4], unless there are some CPUs where it's non-zero.
It makes sense that UICA will assume no output dependency when there's no test data; that's normal for instructions without output dependencies.
But of course the rsqrtss test should measure latency from each operand separately to the destination. It's non-zero, but it's at least possible in theory for a CPU to allow late forwarding for the merge target, so it's wise to actually test instead of assuming it's the same as for the proper source. (I don't know of any x86 CPUs where a single uop allows late forwarding, so different latencies from different inputs usually only happen with multi-uop instructions. Unlike some ARM CPUs where their FMA and/or integer MAC units allow late forwarding for the addend.)
BTW, your reasoning is correct: rsqrtss does have a dependency unless the hardware does partial-register renaming for XMM regs. But no real-world hardware does that.
PIII and Pentium-M had to write each 64-bit half of an XMM separately, and maybe could write one half without the other, but rsqrtss leaves half of that low half unmodified, thanks to Intel's short-sighted design choices. (Now I'm curious whether Pentium-M cvtsi2sd xmm0, eax or sqrtsd xmm0, xmm1 has a false output dependency.) But current CPUs write a whole XMM register at once.
The AVX version vrsqrtss even takes an extra source operand to merge with, which can be separate from the destination the result is written to.
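To illustrate why the merge creates a dependency, here is a small model in plain C. It is only a sketch of the architectural merge behavior, not of any real microarchitecture; xmm_model and merge_scalar are hypothetical names introduced for illustration.

```c
#include <assert.h>

/* Hypothetical model of a 128-bit XMM register as four float lanes. */
typedef struct { float lane[4]; } xmm_model;

/* A scalar-single write: lane 0 takes the new result, lanes 1..3
 * (bits 32:127) are copied from upper_src. The returned value
 * therefore depends on upper_src as well as on low_result.
 *
 * Legacy SSE:  rsqrtss xmm0, xmm1        ~ xmm0 = merge_scalar(xmm0, r)
 * AVX:         vrsqrtss xmm0, xmm2, xmm1 ~ xmm0 = merge_scalar(xmm2, r)
 * where r stands for the approximate 1/sqrt of lane 0 of xmm1. */
xmm_model merge_scalar(xmm_model upper_src, float low_result)
{
    upper_src.lane[0] = low_result; /* only the low lane is replaced */
    return upper_src;
}
```

In the legacy SSE form the merge source is always the old destination, so the instruction has to read it. In the AVX form the merge source can be a register that is not part of the loop-carried chain.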