For example,
Let's say a variable x
,
x
could be anything include 0
.
Then we got code like:
if(x==0) {
y = 1;
}
else {
y = x;
}
Could I do this without producing branches in C/C ?
I'm trying to optimize a piece of code. I want to remove branches as much as possible. There are similar judgments, so I want to convert them into statements without branches to make the code as efficient as possible.
CodePudding user response:
Some general notes:
- As mentioned in other comments, some compilers can optimize and eliminate the branch. You can check the assembly output (e.g. in Godbolt) to make sure.
- Beware of premature optimizations.
- Always measure and make sure your speculation about what's taking up time is correct.
Having said that you can try the following "trick":
y = !x x;
Assuming x
,y
are integer types:
- If
x==0
,!x
will be1
andy
will be assigned the value1
. - If
x!=0
,!x
will be0
andy
will be assigned the valuex
.
Note: see @CostantinoGrana's comment below about the guarantee in the standard. You can also verify it in your specific environment (compiler etc.).
CodePudding user response:
You should check the assembler output of your compiler. For example,the X86 architecture has an instruction called cmov (https://www.felixcloutier.com/x86/cmovcc) that is designed to handle this kind of thing without branches. The Arm architecture allows pretty much all instructions to be executed depending on CPU flags: https://developer.arm.com/documentation/dui0473/m/condition-codes/conditional-execution-in-arm-state.
Thus, when optimizations are enabled your compiler will likely produce no branches.
CodePudding user response:
If you need this non-zero guarantee for an input to __builtin_clz
(which gives an undefined result for an input of zero, otherwise counts leading-zero bits), a useful trick is y = x|1
. That doesn't affect the number of leading zeros for any non-zero input, because you'd setting the last position CLZ will look at, only if all the higher bits are zero. If x=1
, then x|1
is still 1
.
Similarly as setup for __builtin_ctz
(count trailing zeros), y = x | (1UL<<31)
will set the MSB. Or more portably, x | ( 1ULL << (sizeof(x)*CHAR_BIT-1) )
.
Unconditionally setting one of the bits will change some non-zero values; this is only useful in certain cases.
To get exactly the semantics you asked for, y = x ? x : 1
can be expressed that way, or as x !x
as suggested in another answer, which asks the compiler to materialize a boolean integer and add it.
Of course, a smart compiler will hopefully pick a way that's efficient on the target machine. For x86, one might hope that a smart compiler would do something like this if it's ok to overwrite x
:
cmp eax, 1 ; CF=1 only if EAX==0 because only 0 is < 1U
adc eax, 0 ; x = (eax < 1U)
While for AArch64, tst
or cmp
and then cinc
can conditionally copy-and-increment a value depending on any flags conditions.
On the Godbolt compiler explorer -
int make_nonzero_booleanize(int x){
return x !x;
}
int make_nonzero_ternary(int x){
return x ? x : 1;
}
int make_nonzero_if(int x){
// the original. Compilers may compile it to branchless code
// especially gcc -O3 instead of -O2 is more aggressive about if-conversion
int y;
if (x)
y = x;
else
y = 1;
return y;
}
# GCC12 -O3 for x86-64
make_nonzero_booleanize(int): # return x !x
cmp edi, 1
mov eax, edi
adc eax, 0 # Apparently a GCC dev already thought of this peephole and taught GCC10 about it
# I was expecting compilers to naively do xor-zeroing test/setcc add
ret
make_nonzero_ternary(int):
mov eax, edi
test edi, edi
mov edx, 1
cmove eax, edx # standard ternary using conditional-move
ret
make_nonzero_if(int):
# same asm as ternary; GCC did an if-conversion optimization
// ARM64 gcc12 -O2
make_nonzero_booleanize(int):
cmp w0, 0
cinc w0, w0, eq // return x==0 ? x : x 1
ret
make_nonzero_ternary(int):
cmp w0, 0
csinc w0, w0, wzr, ne // return x==0 ? x : 0 1
ret
make_nonzero_if(int):
// identical to ternary, if-conversion even with -O2
Both ways are equally good; the csinc
result still has a data dependency on the input w0
even if the condition is true and the result comes from incrementing the zero register (wzr
).
# RISC-V 32-bit clang 15 -O3
make_nonzero_booleanize(int):
seqz a1, a0 # tmp = (x==0)
add a0, a0, a1 # return x tmp
ret
make_nonzero_ternary(int):
mv a1, a0 # tmp = x
li a0, 1 # retval = 1
beqz a1, .LBB1_2 # if (x != 0) {
mv a0, a1 # retval = tmp
.LBB1_2: # }
ret # return retval
make_nonzero_if(int):
# identical to ternary, same branching.
Clang didn't even do a good job with the branching strategy it picked here; it could have done a bnez
over a retval=1
, avoiding both mv
instructions. Unless there's some tuning thing where a branch over an mv
is handled by some RISC-V microarchitectures as a conditional-move? Hopefully after inlining it wouldn't be this bad, and might put a branch target for the ==0
special case somewhere else, where it didn't have to jump over it.
So x !x
is better with some real-world compilers on x86-64 and RISC-V, and equal on AArch64.
If you care about this in some specific context with a specific compiler for some ISA, look at how it compiles in your code. (If you know anything about what's efficient in asm or not.)
Older GCC versions:
GCC7 and earlier does better with the ternary, avoiding the mov eax, edi
, instead doing mov eax, 1
and then using cmovnz
to copy EDI or not. Hopefully the inefficiency in GCC8 and later only happens when not inlining this tiny function; GCC is known to waste instructions when it has to work around hard register constraints from calling conventions or inline asm.
GCC9 and earlier doesn't know the cmp
/adc
trick for x86-64, instead using a literal implementation of x !x
, which is unfortunately not great because x86 sucks at materializing a boolean condition as a 0/1 integer due to the poor design of setcc
.
# GCC9 and earlier -O3
make_nonzero_booleanize(int):
xor eax, eax # zero the full reg
test edi, edi # set FLAGS according to x
sete al # EAX = !x
add eax, edi # EAX = x
ret
To be fair, adc eax, 0
costs 2 uops on Intel before Sandybridge. (And GCC tuning choices maybe didn't realize that adc
with immediate 0 was special cased; in general adc
costs 2 uops on Intel before Broadwell.)
Branchless might be slower if x == 0
is very rare
Even in the best case, the branchless code for x86-64, AArch64, and RISC-V is lengthening the critical path dependency chain by 2 instructions, both with 1 cycle latency on most CPUs. (For example, csinc
can't run until the flags result of cmp
is ready.)
But a correctly predicted branch is a control dependency, so it's handled separately by the CPU, optimistically assuming that y=x
will be the case, only bailing out if it later detects a mispredict.
But writing an if()
in the C source doesn't guarantee you'll get branchy asm. (That or an ||
makes it more likely to compile branchy than other ways of writing it, though.)
Related:
- gcc optimization flag -O3 makes code slower than -O2 - a case where GCC's branchless code-gen for x86-64 was slower than branchy, when the branch is highly predictable. (Especially because GCC made the loop-carried dependency chain longer than necessary.)
-O3
is more aggressive about if-conversion to branchless, but it didn't auto-vectorize. - Why is a conditional move not vulnerable to Branch Prediction Failure?
- http://yarchive.net/comp/linux/cmov.html - Linus Torvalds on
cmov
, back in the Pentium 4 days when it was slower.
CodePudding user response:
Maybe you can do this just like you are doing it in perl: (void)((y = x) || (y = x 1));
CodePudding user response:
well you could always do (x * x) 1 a number squared can never be negative (unless imaginary) and if both equal zero the result would be 1