It's well known that we can use the CMOV instruction to write branchless code, but I was wondering if I'm writing the equivalent of x = cond ? 1 : 2
, should I prefer
CMOVE rax, 1 #1a
CMOVNE rax, 2 #1b
or
MOV rax, 1 #2a
CMOVNE rax, 2 #2b
In theory the first one can execute pretty much in parallel while the second one is slower because of a data dependency. But I'm not sure how it is in reality.
CodePudding user response:
The second one seems better.
First, note that CMOVcc
does not have an immediate form; the second operand must also be a register (or memory). And so CMOVcc rax, rbx
actually has three input dependencies; rax
, rbx
, and flags
. rax
is an input dependency because if cc
is false, then the output value in rax
must equal the value it had before. So the instruction will always stall until all three of the inputs are ready.
You might imagine a "conditional dependency" where the instruction need not stall on rax
if cc
is true, but I don't believe any currently existing machine can actually do that. Generally, a conditional move is treated as an arithmetic/logic instruction, kind of similar to adc rax, rbx
: it computes some function of rax
, rbx
and the flags, and leaves the result in rax
. You could think of that function as something like rax = (~mask & rax) | (mask & rbx)
.
So the first example would look more completely like
mov rbx, 1
mov rcx, 2
cmp whatever
cmove rax, rbx
cmovne rax, rcx
(I know we should be doing it all with 32-bit registers to save the REX prefixes, but it's just an example.)
And we can now see several problems.
The
cmov
s must wait forrbx
andrcx
to be ready, but that's probably not an issue; the immediatemov
s have no input dependency at all, so they could have been executed out-of-order long ago.More seriously, the second
cmov
has an input dependency on the first one's output, viarax
. So the twocmov
s must actually execute serially and not in parallel.And worse, the first
cmov
has a false dependency on whatever the previous value ofrax
was. If some earlier code is doing something slow withrax
, this code will stall until that other code is done, even though the value left inrax
by the earlier code should be totally irrelevant to this snippet.
The second alternative would look like:
mov rax, 1
mov rbx, 2
cmp whatever
cmovne rax, rbx
This avoids most of those issues. Now the only real input dependency is the flags produced by the cmp
(and thus on whatever the inputs to cmp
are). As before, the immediate mov
s to rax
and rbx
have no input dependencies and can be done in advance. And there is no false dependency on the previous value of rax
because mov rax, 1
definitively wipes it out. As a final bonus, this version is one instruction shorter.