Better performance when function placed within conditional statement?

I ran into this "anomaly" while benchmarking one implementation of a copysign() function used inside a round() implementation:

    float copysign(float x, float y){
        float absx = std::fabs(x);
        /* use atan2 to distinguish -0. from 0. */
        if (y > 0.f || (y == 0.f && std::atan2(y, -1.f) > 0.f)) {
            return absx;
        } else {
            return absx * -1.0f;
        }
    }

According to Quick Bench (and another benchmark utility which measures rdtsc/val), moving atan2() out of the conditional statement results in much lower performance:

    float copysign(float x, float y){
        float absx = std::fabs(x);
        float atan2y = std::atan2(y, -1.f); /* use atan2 to distinguish -0. from 0. */
        if (y > 0.f || (y == 0.f && atan2y > 0.f)) {
            return absx;
        } else {
            return absx *= -1.0f;
        }
    }

Functions in question in Compiler Explorer GCC / Clang

Quick C++ Benchmarks:

GCC 11.2: (-O3 -ffast-math / -O3)

CLANG 13.0 (-Ofast / -O3)

Another Benchmark utility results (rdtsc/val) for copysign() and round() implementation:

copysign():

~0.60 for "strd"  (std::copysign(), same with copysign())
~3.12 for "cs1"   (atan2() within condition statement)
~17.5 for "cs2"   (atan2() as variable in condition statement)

and when used in the linked round() implementation:

~6.67 for case "strd"
~2.56 for case "cs1" 
~20.5 for case "cs2"

Q: What happens to atan2() when it is used within a conditional statement? Is it just inlined?

GCC has a warning regarding -ffast-math usage:

This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

Q: Is this the reason for the lower performance when -ffast-math is enabled (see the linked Quick Bench for GCC)?

CodePudding user response：

Two major things here separate from the title question:

Your benchmark on quick-bench looks flawed, allowing a lot of constant-propagation into your copysign implementations. Check the asm. Maybe use one global array; the compiler won't know whether something has modified it or not. And maybe generate a random FP value (once outside the loop) to apply the sign to.
    x=rand(); for(...) something = copysign(x, arr[i]);
It's weird using copysign(x,x) - a smart compiler could optimize that down to just x without doing any work, if the function correctly produces that output.
-ffast-math implies -fno-signed-zeros, breaking your copysign implementation by not caring about signed-zero semantics when optimizing. (y == 0.0f being true lets the compiler assume it actually is 0.0f if it wants, ignoring the possibility of -0.0f.)
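
A minimal sketch of the benchmark input suggested in the first point, assuming a plain loop rather than the Quick Bench harness. The names kCount, g_signs, fill_signs, copysign_pass and check_copysign_pass are hypothetical, and std::copysign stands in for whichever implementation is being measured.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>

// Hypothetical input buffer. It is a non-const global with external linkage,
// so inside copysign_pass() the compiler cannot assume it knows the contents.
constexpr std::size_t kCount = 1024;
float g_signs[kCount];

// Fill the buffer with a random mix of positive and negative sign sources.
void fill_signs() {
    for (std::size_t i = 0; i < kCount; ++i) {
        g_signs[i] = (std::rand() % 2 == 0) ? 1.0f : -1.0f;
    }
}

// One pass of the loop to be timed: apply the sign of each element to x.
// Summing and returning the results keeps the calls from being discarded.
float copysign_pass(float x) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < kCount; ++i) {
        sum += std::copysign(x, g_signs[i]);
    }
    return sum;
}

// Sanity check, not a timing harness: every term has magnitude x,
// so the magnitude of the sum cannot exceed kCount * x.
void check_copysign_pass() {
    fill_signs();
    // The magnitude is picked once, outside the loop.
    const float x = 1.0f + static_cast<float>(std::rand() % 100);
    const float sum = copysign_pass(x);
    assert(std::fabs(sum) <= static_cast<float>(kCount) * x);
}
```

Because the sign source and the magnitude are different values here, the copysign(x, x) shortcut no longer applies.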

Title question: atan2 inside a conditional statement.

Short circuit eval of y == 0.f && std::atan2(y, -1.f) > 0.f only runs atan2 when y==0.f. This is rare, so usually it doesn't happen. (Without -ffast-math, it does still happen, though. But with -ffast-math, the compiler assumes y is actually 0.0f if that ==0 comparison is true, allowing it to do constant-propagation through atan2 and remove the call.)

In the fast-math case, further constant-propagation turns the atan2() > 0.f condition into a constant true, resulting in basically zero asm instructions.
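
To make the short-circuit behaviour visible, here is a small sketch that counts how often atan2 is actually reached. It assumes compilation without -ffast-math; g_atan2_calls, counted_atan2, copysign_counted and check_short_circuit are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical instrumentation: counts how often atan2 is actually reached.
int g_atan2_calls = 0;

float counted_atan2(float y, float x) {
    ++g_atan2_calls;
    return std::atan2(y, x);
}

// Same shape as the first version in the question, with the counting wrapper.
float copysign_counted(float x, float y) {
    const float absx = std::fabs(x);
    // && short-circuits: counted_atan2 runs only when y == 0.f is true.
    if (y > 0.f || (y == 0.f && counted_atan2(y, -1.f) > 0.f)) {
        return absx;
    }
    return -absx;
}

void check_short_circuit() {
    g_atan2_calls = 0;
    copysign_counted(1.5f, 3.0f);   // y > 0: the || is already satisfied
    copysign_counted(1.5f, -3.0f);  // y != 0: the && stops before atan2
    assert(g_atan2_calls == 0);
    copysign_counted(1.5f, 0.0f);   // y == 0: atan2 has to break the tie
    copysign_counted(1.5f, -0.0f);
    assert(g_atan2_calls == 2);
}
```

Only the two zero inputs pay for the library call, which is why the in-condition version is cheap for typical data.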

You already linked the asm on godbolt; there is no call to atan2 anywhere in copysign_1, all it's doing is return (y>0.0f || y == 0.0f) ? fabs(x) : -fabs(x). (Because you compiled with -ffast-math, which implies -fno-signed-zeros, so the atan2 stuff can fully optimize away, except for a strange ja over a jne instead of just a single jae or jnae.)

In copysign_2, GCC with -ffast-math does manage to pull the float atan2y = std::atan2(y, -1.f); into the conditional where it's used. Clang is also able to do that; it's weird that you linked a clang benchmark but GCC asm.

But without -ffast-math (like you used on Quick-bench), clang 13 isn't allowed to do that, so it unconditionally calls atan2 for every input. That's because the default is -fmath-errno, unfortunately.

The call to atan2 with an arbitrary y value is potentially a visible side effect because it will set errno for invalid input values. I think? Maybe not; the atan2 man page says it has no errors (no cases that set errno), and its description only mentions returning NaN if x or y is NaN. Maybe compilers don't realize that? But for whatever reason, using -fno-math-errno lets GCC and clang pull your unconditional atan2 inside the branch, so even though the code will actually call it, that only happens for the rare y == 0.0 case (i.e. y being 0.0f or -0.0f).

(I thought I remembered clang not defaulting to math-errno, only GCC, but maybe they changed or I'm mixing it up with -fno-trapping-math. You should always compile with -fno-math-errno unless your program very strangely wants to check errno for EDOM or ERANGE or similar FP errors after library function calls even including sqrt, instead of using fenv.h to test the FP environment flags.)
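
For reference, testing the FP environment flags instead of errno could look roughly like the sketch below. It assumes a platform where &lt;cfenv&gt; exception flags are supported; sqrt_raises_invalid and check_fenv_flags are hypothetical names, and the volatile variables are there only to keep the compiler from folding the call at compile time.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Hypothetical helper: returns true if taking the square root of `value`
// raised the FE_INVALID floating-point exception flag.
bool sqrt_raises_invalid(float value) {
    volatile float input = value;        // force a run-time load
    std::feclearexcept(FE_ALL_EXCEPT);   // start from a clean set of flags
    volatile float result = std::sqrt(input);
    (void)result;                        // the value itself is not needed
    return std::fetestexcept(FE_INVALID) != 0;
}

void check_fenv_flags() {
    assert(sqrt_raises_invalid(-1.0f));  // sqrt of a negative number is invalid
    assert(!sqrt_raises_invalid(4.0f));  // an exact result raises nothing
}
```

This reads the sticky flag directly and does not depend on errno being set, so it keeps working with -fno-math-errno.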

Also, -ftrapping-math is unfortunately on by default even though it doesn't fully work correctly. With it enabled, the compiler may treat the atan2 call as something that could potentially run SIGFPE signal-handler code, so potentially a visible side effect. Or at least as maybe setting a sticky bit in the FP environment that other code could read with fenv.

BTW, this is how GCC compiled your copysign_1:

# gcc -O3 -ffast-math
copysign_1(float, float):
        comiss  xmm1, DWORD PTR .LC1[rip]            # compare y against 0.0f
        andps   xmm0, XMMWORD PTR .LC0[rip]          # clear the sign bit in x
        ja      .L3                                  # if (y>0)  return fabs(x)
        jne     .L5                                  # if (y!=0) return -fabs(x)
.L3:
        ret
.L5:
        xorps   xmm0, XMMWORD PTR .LC2[rip]          # flip the sign bit of x
        ret

As you can see, it's broken for negative zero, -0.0f, clearing the sign bit of the output in that case. Compilers of course make correct asm without -ffast-math, but slower and actually calling atan2 for -0.0 or 0.0.
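
Written back as C++, the -ffast-math output behaves like the sketch below. copysign_1_as_compiled and check_negative_zero are hypothetical names, and the checks assume this snippet itself is built without -ffast-math so that std::copysign keeps its usual semantics.

```cpp
#include <cassert>
#include <cmath>

// C++ rendering of the -ffast-math assembly above (hypothetical name):
// the atan2 tie-break is gone, so any zero counts as non-negative.
float copysign_1_as_compiled(float x, float y) {
    return (y > 0.0f || y == 0.0f) ? std::fabs(x) : -std::fabs(x);
}

void check_negative_zero() {
    // -0.0f == 0.0f compares true, so the sign of y is lost here...
    assert(!std::signbit(copysign_1_as_compiled(2.0f, -0.0f)));
    // ...while the standard function preserves it.
    assert(std::signbit(std::copysign(2.0f, -0.0f)));
    // For nonzero y the two still agree.
    assert(copysign_1_as_compiled(2.0f, -3.0f) == std::copysign(2.0f, -3.0f));
}
```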

BTW, despite Quick-Bench's lack of compiler option setting, it does have -Ofast which is currently equivalent to -O3 -ffast-math. That does make everything about the same speed except for M_STDNO. Not sure why that's slower; the asm involves a lot of store/reload due to the way you're using benchmark::DoNotOptimize, but still allowing significant constant-propagation through your copysign functions. (So they're just doing either an andps or orps to clear or set the sign bit of something.)

This is vastly over-complicated for IEEE floats

Since you're apparently using C++20, you might as well std::bit_cast<uint32_t> both x and y to integers and use bitwise operations, at least after checking that float has its sign bit at the top or something. (e.g. static_assert that -0.0f has the bit-pattern 0x80000000; you can use std::bit_cast in constexpr stuff like static_assert.)

You don't want any branching or multiplying, you just want (u32x & 0x7FFFFFFFU) | (u32y & 0x80000000U) to merge the sign bit of y with the exponent and mantissa of x.
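
A sketch of that bitwise version, assuming a 32-bit IEEE float. The names to_bits, from_bits, copysign_bits and check_copysign_bits are hypothetical, and the memcpy branch is an added fallback for toolchains where std::bit_cast is not available; with C++20 only the first branch is used.

```cpp
#include <bit>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");

#if defined(__cpp_lib_bit_cast)
// C++20: std::bit_cast is constexpr, so the layout is checked at compile time.
static_assert(std::bit_cast<std::uint32_t>(-0.0f) == 0x80000000U,
              "expects the sign in the top bit");
inline std::uint32_t to_bits(float f) { return std::bit_cast<std::uint32_t>(f); }
inline float from_bits(std::uint32_t u) { return std::bit_cast<float>(u); }
#else
// Fallback (an assumption added for older toolchains): memcpy-based conversion.
inline std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
inline float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}
#endif

// Hypothetical name, chosen to avoid clashing with ::copysign from <cmath>.
inline float copysign_bits(float x, float y) {
    const std::uint32_t magnitude = to_bits(x) & 0x7FFFFFFFU;  // exponent + mantissa of x
    const std::uint32_t sign = to_bits(y) & 0x80000000U;       // sign bit of y
    return from_bits(magnitude | sign);
}

void check_copysign_bits() {
    assert(copysign_bits(2.5f, -1.0f) == -2.5f);
    assert(copysign_bits(-2.5f, 1.0f) == 2.5f);
    assert(std::signbit(copysign_bits(1.0f, -0.0f)));   // negative zero is honoured
    assert(!std::signbit(copysign_bits(-1.0f, 0.0f)));
}
```

There is no branch and no multiply here, and signed zeros are handled by construction because only the sign bit of y is inspected.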

CodePudding user response：

What happens to atan2() when it is used within a conditional statement? Is it just inlined?

There was no call to std::atan2 in assembly, hence we know that it must have been expanded inline (it could have been also eliminated as dead code, but we can deduce that's not the case).

With -ffast-math enabled, y > 0.f || (y == 0.f && std::atan2(y, -1.f) > 0.f) was essentially compiled down to y >= 0 because the compiler didn't care to consider the difference between -0 and 0. This is an example of "can result in incorrect output". If you want -0 to be treated as per IEEE/ISO, then don't use -ffast-math.