Home > Enterprise >  In assembly, should branchless code use complementary CMOVs?
In assembly, should branchless code use complementary CMOVs?

Time:11-24

It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2, should I prefer

CMOVE rax, 1    #1a
CMOVNE rax, 2   #1b

or

MOV rax, 1      #2a
CMOVNE rax, 2   #2b

In theory the first one can execute pretty much in parallel while the second one is slower because of a data dependency. But I'm not sure how it is in reality.

CodePudding user response:

The second one seems better.

First, note that CMOVcc does not have an immediate form; the second operand must also be a register (or memory). And so CMOVcc rax, rbx actually has three input dependencies; rax, rbx, and flags. rax is an input dependency because if cc is false, then the output value in rax must equal the value it had before. So the instruction will always stall until all three of the inputs are ready.

You might imagine a "conditional dependency" where the instruction need not stall on rax if cc is true, but I don't believe any currently existing machine can actually do that. Generally, a conditional move is treated as an arithmetic/logic instruction, kind of similar to adc rax, rbx: it computes some function of rax, rbx and the flags, and leaves the result in rax. You could think of that function as something like rax = (~mask & rax) | (mask & rbx).

So the first example would look more completely like

mov rbx, 1
mov rcx, 2
cmp whatever
cmove rax, rbx
cmovne rax, rcx

(I know we should be doing it all with 32-bit registers to save the REX prefixes, but it's just an example.)

And we can now see several problems.

  • The cmovs must wait for rbx and rcx to be ready, but that's probably not an issue; the immediate movs have no input dependency at all, so they could have been executed out-of-order long ago.

  • More seriously, the second cmov has an input dependency on the first one's output, via rax. So the two cmovs must actually execute serially and not in parallel.

  • And worse, the first cmov has a false dependency on whatever the previous value of rax was. If some earlier code is doing something slow with rax, this code will stall until that other code is done, even though the value left in rax by the earlier code should be totally irrelevant to this snippet.

The second alternative would look like:

mov rax, 1
mov rbx, 2
cmp whatever
cmovne rax, rbx

This avoids most of those issues. Now the only real input dependency is the flags produced by the cmp (and thus on whatever the inputs to cmp are). As before, the immediate movs to rax and rbx have no input dependencies and can be done in advance. And there is no false dependency on the previous value of rax because mov rax, 1 definitively wipes it out. As a final bonus, this version is one instruction shorter.

  • Related