How can I guarantee that a variable will never be zero without using a conditional statement in C?-CodePudding

For example, Let's say a variable x, x could be anything include 0.
Then we got code like:

if(x==0) {
    y = 1;
}
else {
    y = x;
}

Could I do this without producing branches in C/C ?

I'm trying to optimize a piece of code. I want to remove branches as much as possible. There are similar judgments, so I want to convert them into statements without branches to make the code as efficient as possible.

CodePudding user response：

Some general notes:

As mentioned in other comments, some compilers can optimize and eliminate the branch. You can check the assembly output (e.g. in Godbolt) to make sure.
Beware of premature optimizations.
Always measure and make sure your speculation about what's taking up time is correct.

Having said that you can try the following "trick":

y = !x   x;

Assuming x,y are integer types:

If x==0, !x will be 1 and y will be assigned the value 1.
If x!=0, !x will be 0 and y will be assigned the value x.

Note: see @CostantinoGrana's comment below about the guarantee in the standard. You can also verify it in your specific environment (compiler etc.).

CodePudding user response：

You should check the assembler output of your compiler. For example,the X86 architecture has an instruction called cmov (https://www.felixcloutier.com/x86/cmovcc) that is designed to handle this kind of thing without branches. The Arm architecture allows pretty much all instructions to be executed depending on CPU flags: https://developer.arm.com/documentation/dui0473/m/condition-codes/conditional-execution-in-arm-state.

Thus, when optimizations are enabled your compiler will likely produce no branches.

CodePudding user response：

If you need this non-zero guarantee for an input to __builtin_clz (which gives an undefined result for an input of zero, otherwise counts leading-zero bits), a useful trick is y = x|1. That doesn't affect the number of leading zeros for any non-zero input, because you'd setting the last position CLZ will look at, only if all the higher bits are zero. If x=1, then x|1 is still 1.

Similarly as setup for __builtin_ctz (count trailing zeros), y = x | (1UL<<31) will set the MSB. Or more portably, x | ( 1ULL << (sizeof(x)*CHAR_BIT-1) ).

Unconditionally setting one of the bits will change some non-zero values; this is only useful in certain cases.

To get exactly the semantics you asked for, y = x ? x : 1 can be expressed that way, or as x !x as suggested in another answer, which asks the compiler to materialize a boolean integer and add it.

Of course, a smart compiler will hopefully pick a way that's efficient on the target machine. For x86, one might hope that a smart compiler would do something like this if it's ok to overwrite x:

    cmp  eax, 1          ; CF=1 only if EAX==0 because only 0 is < 1U
    adc  eax, 0          ; x  = (eax < 1U)

While for AArch64, tst or cmp and then cinc can conditionally copy-and-increment a value depending on any flags conditions.

On the Godbolt compiler explorer -

int make_nonzero_booleanize(int x){
    return x   !x;
}

int make_nonzero_ternary(int x){
    return x ? x : 1;
}

int make_nonzero_if(int x){
  // the original.  Compilers may compile it to branchless code
  // especially gcc -O3 instead of -O2 is more aggressive about if-conversion
    int y;
    if (x)
        y = x;
    else
        y = 1;
    return y;
}

# GCC12 -O3 for x86-64
make_nonzero_booleanize(int):           # return x   !x
        cmp     edi, 1
        mov     eax, edi
        adc     eax, 0       # Apparently a GCC dev already thought of this peephole and taught GCC10 about it
           # I was expecting compilers to naively do xor-zeroing   test/setcc   add
        ret

make_nonzero_ternary(int):
        mov     eax, edi
        test    edi, edi
        mov     edx, 1
        cmove   eax, edx     # standard ternary using conditional-move
        ret

make_nonzero_if(int):
         # same asm as ternary; GCC did an if-conversion optimization

// ARM64 gcc12 -O2
make_nonzero_booleanize(int):
        cmp     w0, 0
        cinc    w0, w0, eq       // return x==0 ? x : x 1
        ret
make_nonzero_ternary(int):
        cmp     w0, 0
        csinc   w0, w0, wzr, ne  // return x==0 ? x : 0 1
        ret
make_nonzero_if(int):
     // identical to ternary, if-conversion even with -O2

Both ways are equally good; the csinc result still has a data dependency on the input w0 even if the condition is true and the result comes from incrementing the zero register (wzr).

# RISC-V 32-bit clang 15 -O3
make_nonzero_booleanize(int):
        seqz    a1, a0                # tmp = (x==0)
        add     a0, a0, a1            # return x   tmp
        ret

make_nonzero_ternary(int):
        mv      a1, a0                  # tmp = x
        li      a0, 1                   # retval = 1
        beqz    a1, .LBB1_2             # if (x != 0) {
        mv      a0, a1                    # retval = tmp
.LBB1_2:                                # }
        ret                             # return retval

make_nonzero_if(int):
      # identical to ternary, same branching.

Clang didn't even do a good job with the branching strategy it picked here; it could have done a bnez over a retval=1, avoiding both mv instructions. Unless there's some tuning thing where a branch over an mv is handled by some RISC-V microarchitectures as a conditional-move? Hopefully after inlining it wouldn't be this bad, and might put a branch target for the ==0 special case somewhere else, where it didn't have to jump over it.

So x !x is better with some real-world compilers on x86-64 and RISC-V, and equal on AArch64.

If you care about this in some specific context with a specific compiler for some ISA, look at how it compiles in your code. (If you know anything about what's efficient in asm or not.)

Older GCC versions:

GCC7 and earlier does better with the ternary, avoiding the mov eax, edi, instead doing mov eax, 1 and then using cmovnz to copy EDI or not. Hopefully the inefficiency in GCC8 and later only happens when not inlining this tiny function; GCC is known to waste instructions when it has to work around hard register constraints from calling conventions or inline asm.

GCC9 and earlier doesn't know the cmp/adc trick for x86-64, instead using a literal implementation of x !x, which is unfortunately not great because x86 sucks at materializing a boolean condition as a 0/1 integer due to the poor design of setcc.

# GCC9 and earlier  -O3
make_nonzero_booleanize(int):
        xor     eax, eax       # zero the full reg
        test    edi, edi       # set FLAGS according to x
        sete    al             # EAX = !x
        add     eax, edi       # EAX  = x
        ret

To be fair, adc eax, 0 costs 2 uops on Intel before Sandybridge. (And GCC tuning choices maybe didn't realize that adc with immediate 0 was special cased; in general adc costs 2 uops on Intel before Broadwell.)

Branchless might be slower if `x == 0` is very rare

Even in the best case, the branchless code for x86-64, AArch64, and RISC-V is lengthening the critical path dependency chain by 2 instructions, both with 1 cycle latency on most CPUs. (For example, csinc can't run until the flags result of cmp is ready.)

But a correctly predicted branch is a control dependency, so it's handled separately by the CPU, optimistically assuming that y=x will be the case, only bailing out if it later detects a mispredict.

But writing an if() in the C source doesn't guarantee you'll get branchy asm. (That or an || makes it more likely to compile branchy than other ways of writing it, though.)

gcc optimization flag -O3 makes code slower than -O2 - a case where GCC's branchless code-gen for x86-64 was slower than branchy, when the branch is highly predictable. (Especially because GCC made the loop-carried dependency chain longer than necessary.) -O3 is more aggressive about if-conversion to branchless, but it didn't auto-vectorize.
Why is a conditional move not vulnerable to Branch Prediction Failure?
http://yarchive.net/comp/linux/cmov.html - Linus Torvalds on cmov, back in the Pentium 4 days when it was slower.

CodePudding user response：

Maybe you can do this just like you are doing it in perl: (void)((y = x) || (y = x 1));

CodePudding user response：

well you could always do (x * x) 1 a number squared can never be negative (unless imaginary) and if both equal zero the result would be 1

Older GCC versions:

Branchless might be slower if x == 0 is very rare

Branchless might be slower if `x == 0` is very rare