Does RSQRTSS break the dependency on the destination register?

Using uiCA I produced a trace table for the following code.

cvtsi2ss xmm0, eax
addss xmm0, xmm0

https://uica.uops.info/tmp/780bce9e56ee4a718d5369deb1326215_trace.html

You can see that each cvtsi2ss has to wait for the previous iteration to finish, because it leaves bits 32:127 of xmm0 unmodified and therefore depends on the old value of xmm0.

However, changing cvtsi2ss to rsqrtss makes a big difference.

rsqrtss xmm0, xmm1
addss xmm0, xmm0

https://uica.uops.info/tmp/8897a7d45c8348e68279aea4d0b18e15_trace.html

Each rsqrtss executes in parallel with the previous iteration. I don't understand this: rsqrtss also leaves bits 32:127 of its output unchanged, just like cvtsi2ss, so I would expect it to wait for any pending operation on the output register to finish, just like cvtsi2ss did.
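The merge behaviour this reasoning relies on can be checked directly. This is only a minimal sketch, assuming GCC or Clang on x86-64 with the default AT&T inline-asm syntax; the names rsqrtss_into and upper_lanes_preserved are made up for this illustration and are not part of the original test.

```c
#include <xmmintrin.h>  /* __m128, _mm_set_ps, _mm_storeu_ps */

/* Hypothetical helper: runs `rsqrtss dst, src` (Intel order); the asm
   string below uses AT&T order, so the source comes first. */
__m128 rsqrtss_into(__m128 dst, __m128 src) {
    /* "+x" marks dst as read AND written: the old destination value is
       an input, which is the dependency the question is about. */
    __asm__("rsqrtss %1, %0" : "+x"(dst) : "x"(src));
    return dst;
}

/* Returns 1 if lanes 1..3 (bits 32:127) of the destination survive. */
int upper_lanes_preserved(void) {
    __m128 dst = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);  /* lanes 3,2,1,0 */
    __m128 src = _mm_set_ps(9.0f, 9.0f, 9.0f, 4.0f);  /* lane 0 = 4.0 */
    float out[4];
    _mm_storeu_ps(out, rsqrtss_into(dst, src));
    /* Lane 0 is only an approximation of 1/sqrt(4) = 0.5. */
    return out[1] == 2.0f && out[2] == 3.0f && out[3] == 4.0f
        && out[0] > 0.499f && out[0] < 0.501f;
}
```

If upper_lanes_preserved() returns 1, the upper three floats came from the old xmm destination, so the instruction architecturally has to read that register.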

After reading the answer, I ran a simple test, and it seems clear that uiCA has a bug.

IACA also fails to catch the output dependency.

Please correct me if the test code is wrong.

__asm__ (
    R"(.section .text
    .balign 16
noXor:
    mov eax, 0x3f800000
    movd xmm1, eax
    rdtscp
    shl rdx, 32
    or rax, rdx
    mov rdi, rax
    mov ecx, 1 << 30
    jmp noXor_loop
    .balign 16
noXor_loop:
    rsqrtss xmm0, xmm1
    addss xmm0, xmm0
    dec ecx
    jnz noXor_loop
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, rdi
    ret
    .balign 16
yesXor:
    mov eax, 0x3f800000
    movd xmm1, eax
    rdtscp
    shl rdx, 32
    or rax, rdx
    mov rdi, rax
    mov ecx, 1 << 30
    jmp yesXor_loop
    .balign 16
yesXor_loop:
    xorps xmm0, xmm0
    rsqrtss xmm0, xmm1
    addss xmm0, xmm0
    dec ecx
    jnz yesXor_loop
    rdtscp
    shl rdx, 32
    or rax, rdx
    sub rax, rdi
    ret)"
);

unsigned long long noXor(void);
unsigned long long yesXor(void);

#include <stdio.h>

int main() {
    for (int i = 0; i < 4; ++i) {
        printf("noXor: %llu yesXor: %llu\n", noXor(), yesXor());
    }
    return 0;
}

noXor: 4978836501 yesXor: 696810039
noXor: 4971780086 yesXor: 690780109
noXor: 4977293771 yesXor: 687404710
noXor: 5499602729 yesXor: 687954399
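To make these totals easier to read, they can be divided by the loop count. A small sketch, assuming the 1 << 30 iterations from the test above; ticks_per_iter and print_first_run are names invented here, and the inputs are simply the first line of the output.

```c
#include <stdio.h>

#define ITERS (1ULL << 30)  /* loop count used by noXor and yesXor */

/* Average TSC ticks spent per loop iteration. */
double ticks_per_iter(unsigned long long total_ticks) {
    return (double)total_ticks / (double)ITERS;
}

/* Prints the per-iteration cost for the first measured run. */
void print_first_run(void) {
    double slow = ticks_per_iter(4978836501ULL);  /* noXor  */
    double fast = ticks_per_iter(696810039ULL);   /* yesXor */
    printf("noXor: %.2f ticks/iter, yesXor: %.2f ticks/iter, ratio %.1f\n",
           slow, fast, slow / fast);
}
```

That works out to roughly 4.6 ticks per iteration without the xorps and about 0.65 with it, a factor of about 7. Note that rdtscp counts reference ticks rather than core clock cycles, so these numbers are not directly comparable to instruction latencies quoted in cycles.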

Answer:

Test on real hardware and you'll see the expected result: on my Skylake, rsqrtss xmm0, xmm1 has 4-cycle latency as part of the xmm0 -> xmm0 dependency chain.

That's a bug in uiCA. Or actually in the https://uops.info/ data it uses: your trace includes a link to the rsqrtss page on uops.info, where we can see they only measured latency for the operand 2 → 1 case, with no entry for 1 → 1. When that test was written, the author may have copy/pasted the rsqrtps test and forgotten to add a test for an output dependency.

Without testing yourself on a real CPU, you should only be surprised if there were an actual test that measured zero latency for 1 → 1 of rsqrtss. Without such a test, the correct assumption is that the test is missing (and thus the uiCA result is wrong), not that the latency is actually zero.

Many instructions don't have output dependencies on any CPUs, so it makes sense https://uops.info/ doesn't test them all. We'd rather have rsqrtps latency listed as 4 than [0:4], unless there are some CPUs where it's non-zero.

It makes sense that uiCA will assume no output dependency when there's no test data; that's normal for instructions without output dependencies.

But of course rsqrtss should have its latency tested from each operand separately to the destination. It's non-zero, but it's at least possible in theory for a CPU to allow late forwarding for the merge target, so it's wise to actually test instead of assuming it's the same as for the proper source. (I don't know of any x86 CPUs where a single uop allows late forwarding, so different latencies from different inputs usually only happen with multi-uop instructions. That's unlike some ARM CPUs, where the FMA and/or integer MAC units allow late forwarding for the addend.)

BTW, your reasoning is correct: rsqrtss does have a dependency unless the hardware does partial-register renaming for XMM regs. But no real-world hardware does that.

PIII and Pentium-M had to write each 64-bit half of an XMM separately, and maybe could write one half without the other, but rsqrtss leaves half of that low half unmodified, thanks to Intel's short-sighted design choices. (Now I'm curious whether Pentium-M cvtsi2sd xmm0, eax or sqrtsd xmm0, xmm1 has a false output dependency.) But current CPUs write a whole XMM register at once.

The AVX version vrsqrtss even takes an extra source operand to merge with, which can be separate from the destination the result is written to.
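The separate merge operand can be seen in a small experiment. This is a sketch under assumptions: GCC or Clang on x86-64, AT&T inline-asm syntax, and a CPU/OS with AVX support (checked at run time with the GCC/Clang builtin __builtin_cpu_supports). The function names are invented for this illustration.

```c
#include <xmmintrin.h>  /* __m128, _mm_set_ps, _mm_storeu_ps */

/* Hypothetical helper: `vrsqrtss dst, merge, src` in Intel operand order
   (AT&T order in the asm string is src, merge, dst). */
__m128 vrsqrtss_from(__m128 merge, __m128 src) {
    __m128 dst;
    /* "=x": dst is write-only here, so its old value is not an input. */
    __asm__("vrsqrtss %2, %1, %0" : "=x"(dst) : "x"(merge), "x"(src));
    return dst;
}

/* Returns 1 if the upper lanes come from `merge`, 0 if they do not,
   and -1 if the CPU does not report AVX support. */
int vex_merge_source_is_separate(void) {
    if (!__builtin_cpu_supports("avx"))
        return -1;
    __m128 merge = _mm_set_ps(7.0f, 6.0f, 5.0f, 0.0f);  /* lanes 3,2,1,0 */
    __m128 src   = _mm_set_ps(9.0f, 9.0f, 9.0f, 4.0f);  /* lane 0 = 4.0 */
    float out[4];
    _mm_storeu_ps(out, vrsqrtss_from(merge, src));
    /* Lane 0 approximates 1/sqrt(4) = 0.5; lanes 1..3 copy `merge`. */
    return out[1] == 5.0f && out[2] == 6.0f && out[3] == 7.0f
        && out[0] > 0.499f && out[0] < 0.501f;
}
```

The dependency does not disappear with the VEX form; it moves to whichever register is passed as the merge operand, so choosing one that is not part of a loop-carried chain avoids the stall without an extra xorps.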