I bumped into this "anomaly" while benchmarking one implementation of a copysign() function used within a round() implementation:
float copysign(float x, float y){
    float absx = std::fabs(x);
    /* use atan2 to distinguish -0. from 0. */
    if (y > 0.f || (y == 0.f && std::atan2(y, -1.f) > 0.f)) {
        return absx;
    } else {
        return absx * -1.0f;
    }
}
According to Quick Bench (and another benchmark utility that measures rdtsc/val), moving atan2() out of the condition results in much lower performance:
float copysign(float x, float y){
    float absx = std::fabs(x);
    float atan2y = std::atan2(y, -1.f); /* use atan2 to distinguish -0. from 0. */
    if (y > 0.f || (y == 0.f && atan2y > 0.f)) {
        return absx;
    } else {
        return absx *= -1.0f;
    }
}
Functions in question in Compiler Explorer GCC / Clang
Quick C++ Benchmarks:
GCC 11.2: (-O3 -ffast-math / -O3)
Another Benchmark utility results (rdtsc/val) for copysign() and round() implementation:
copysign():
~0.60 for "strd" (std::copysign(), same with copysign())
~3.12 for "cs1" (atan2() within condition statement)
~17.5 for "cs2" (atan2() as variable in condition statement)
and when used in linked round() implementation:
~6.67 for case "strd"
~2.56 for case "cs1"
~20.5 for case "cs2"
Q: What happens to atan2() when used within condition statement? Is it just inlined?
GCC's documentation has a warning about -ffast-math:
This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.
Q: Is this the reason for the lower performance when -ffast-math is enabled (see the linked Quick Bench for GCC)?
CodePudding user response:
Two major things here separate from the title question:
Your benchmark on quick-bench looks flawed, allowing a lot of constant-propagation into your copysign implementations. Check the asm. Maybe use one global array; the compiler won't know whether something has modified it or not. And maybe generate a random FP value (once outside the loop) to apply the sign to.
x=rand(); for(...) something = copysign(x, arr[i]);
It's weird using copysign(x,x) - a smart compiler could optimize that down to just x without doing any work, if the function correctly produces that output.
-ffast-math implies -fno-signed-zeros, breaking your copysign implementation by not caring about signed-zero semantics when optimizing. (y == 0.0f being true lets the compiler assume it actually is 0.0f if it wants, ignoring the possibility of -0.0f.)
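The benchmark suggestion above (one global array of signs plus a single magnitude chosen outside the loop) could be sketched roughly like this. The names g_signs, fill_inputs and run_once are hypothetical helpers made up for illustration, and std::vector stands in for the "global array"; this is a sketch of the input setup only, not a complete Quick Bench harness.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical global input buffer: it has external linkage and is filled at
// run time, so the optimizer cannot treat its contents as known constants.
std::vector<float> g_signs;

// Fill the buffer with a mix of positive and negative values.
// A fixed seed keeps runs repeatable.
void fill_inputs(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    g_signs.resize(n);
    for (float& v : g_signs) v = dist(gen);
}

// Apply the sign of every array element to one magnitude x that was chosen
// outside the loop. Summing the results keeps the work observable.
template <class CopySign>
float run_once(CopySign cs, float x) {
    float sum = 0.0f;
    for (float y : g_signs) sum += cs(x, y);
    return sum;
}
```

In a real benchmark, run_once would be called inside the timing loop with the copysign variant under test, and its result handed to whatever "do not optimize" sink the framework provides.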
Title question: atan2 inside a conditional statement.
Short-circuit evaluation of y == 0.f && std::atan2(y, -1.f) > 0.f only runs atan2 when y==0.f. This is rare, so usually it doesn't happen. (Without -ffast-math, it does still happen, though. But with -ffast-math, the compiler assumes y is actually 0.0f if that ==0 comparison is true, allowing it to do constant-propagation through atan2 and remove the call.)
In the fast-math case, further constant-propagation turns the atan2() > 0.f condition into a constant true, resulting in basically zero asm instructions.
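To make the short-circuit point concrete, here is a small sketch that counts how often atan2 actually runs in each variant. The wrapper counted_atan2 and the counter g_atan2_calls are hypothetical instrumentation added for illustration, and the sketch assumes compilation without -ffast-math. Because the wrapper has a visible side effect (the counter), the compiler cannot move the hoisted call into the branch, which loosely mirrors the side-effect concern discussed in this answer.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical instrumentation: counts how often atan2 is really evaluated.
int g_atan2_calls = 0;

float counted_atan2(float y, float x) {
    ++g_atan2_calls;
    return std::atan2(y, x);
}

// Variant 1: atan2 sits on the right-hand side of &&, so it is evaluated
// only when y == 0.f (that is, for 0.0f and -0.0f).
float copysign_inline(float x, float y) {
    float absx = std::fabs(x);
    if (y > 0.f || (y == 0.f && counted_atan2(y, -1.f) > 0.f)) return absx;
    return -absx;
}

// Variant 2: atan2 is evaluated up front for every input, zero or not.
float copysign_hoisted(float x, float y) {
    float absx = std::fabs(x);
    float a = counted_atan2(y, -1.f);
    if (y > 0.f || (y == 0.f && a > 0.f)) return absx;
    return -absx;
}
```

Feeding both variants the same five inputs, two of which are zeros, should leave the counter at 2 for the first variant and 5 for the second.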
You already linked the asm on godbolt; there is no call to atan2 anywhere in copysign_1. All it's doing is return (y>0.0f || y == 0.0f) ? fabs(x) : -fabs(x). (That's because you compiled with -ffast-math, which implies -fno-signed-zeros, so the atan2 stuff can fully optimize away, except for a strange ja over a jne instead of just a single jae or jnae.)
In copysign_2, GCC with -ffast-math does manage to pull the float atan2y = std::atan2(y, -1.f); into the conditional where it's used. Clang is also able to do that; it's weird that you linked a clang benchmark but GCC asm.
But without -ffast-math (like you used on Quick-bench), clang 13 isn't allowed to do that, so it unconditionally calls atan2 for every input. That's because the default is -fmath-errno, unfortunately.
The call to atan2 with an arbitrary y value is potentially a visible side-effect because it will set errno for invalid input values. I think? Maybe not; the atan2 man page says it has no errors (no cases that set errno), and in the description only returning NaN if x or y is NaN. Maybe compilers don't realize that? But for whatever reason, using -fno-math-errno lets GCC and clang pull your unconditional atan2 inside the branch, so even though it will actually call it, it only happens for the rare y == 0.0 case (i.e. y being 0.0f or -0.0f).
(I thought I remembered clang not defaulting to math-errno, only GCC, but maybe they changed or I'm mixing it up with -fno-trapping-math. You should always compile with -fno-math-errno unless your program very strangely wants to check errno for EDOM or ERANGE or similar FP errors after library function calls even including sqrt, instead of using fenv.h to test the FP environment flags.)
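For reference, testing the FP environment flags instead of errno might look something like the sketch below. The function name sqrt_raises_invalid is a hypothetical helper, and the sketch assumes a platform that supports FE_INVALID; the volatile accesses are there as a precaution so the compiler neither constant-folds the sqrt nor moves it across the fenv calls.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Returns true if taking the square root of `input` raised FE_INVALID.
// Assumption: the platform supports the FE_INVALID exception flag.
bool sqrt_raises_invalid(double input) {
    volatile double in = input;   // forces a run-time load of the operand
    volatile double out;          // forces the result to be stored
    std::feclearexcept(FE_ALL_EXCEPT);
    out = std::sqrt(in);          // sits between the two volatile accesses
    return std::fetestexcept(FE_INVALID) != 0;
}
```

With this approach nothing depends on errno being set by the math library, so -fno-math-errno does not change the outcome.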
Also -ftrapping-math is unfortunately on by default even though it doesn't fully work correctly. With it, the compiler may treat the atan2 call as something that could potentially run SIGFPE signal-handler code, so potentially a visible side-effect. Or at least as maybe setting a sticky bit in the FP environment that other code could read with fenv.
BTW, this is how GCC compiled your copysign_1:
# gcc -O3 -ffast-math
copysign_1(float, float):
comiss xmm1, DWORD PTR .LC1[rip] # compare y against 0.0f
andps xmm0, XMMWORD PTR .LC0[rip] # clear the sign bit in x
ja .L3 # if (y>0) return fabs(x)
jne .L5 # if (y!=0) return -fabs(x)
.L3:
ret
.L5:
xorps xmm0, XMMWORD PTR .LC2[rip] # flip the sign bit of x
ret
As you can see, it's broken for negative zero, -0.0f, clearing the sign bit of the output in that case. Compilers of course make correct asm without -ffast-math, but slower and actually calling atan2 for -0.0 or 0.0.
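The negative-zero problem can be seen in isolation with a few one-liners. The helper names here are hypothetical, and the sketch assumes IEEE 754 floats compiled without -ffast-math (with it, the compiler is free to ignore the sign of zero).

```cpp
#include <cassert>
#include <cmath>

// == cannot tell the two zeros apart: IEEE 754 defines -0.0f == 0.0f.
bool zeros_compare_equal() { return -0.0f == 0.0f; }

// std::signbit inspects the sign bit itself, so it can.
bool is_negative_zero(float y) { return y == 0.0f && std::signbit(y); }

// The trick from the question: atan2(+0, -1) is +pi and atan2(-0, -1) is -pi,
// so the sign of the result reveals the sign of the zero.
bool atan2_says_negative(float y) { return std::atan2(y, -1.0f) < 0.0f; }
```

This is why a plain y == 0.f check followed by "return fabs(x)" loses the sign for -0.0f, and why the original code needed some extra test in that branch.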
BTW, despite Quick-Bench's lack of compiler option setting, it does have -Ofast which is currently equivalent to -O3 -ffast-math. That does make everything about the same speed except for M_STDNO. Not sure why that's slower; the asm involves a lot of store/reload due to the way you're using benchmark::DoNotOptimize, but still allowing significant constant-propagation through your copysign functions. (So they're just doing either an andps or orps to clear or set the sign bit of something.)
This is vastly over-complicated for IEEE floats
Since you're apparently using C++20, you might as well std::bit_cast<uint32_t>(x) and y to integer and use bitwise operations, at least after checking that float has its sign bit at the top or something. (e.g. static_assert that -0.0f has the bit-pattern 0x80000000; you can use std::bit_cast in constexpr stuff like static_assert.)
You don't want any branching or multiplying, you just want (u32x & 0x7FFFFFFFU) | (u32y & 0x80000000U) to merge the sign bit of y with the exponent and mantissa of x.
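A minimal sketch of that bitwise approach follows. Note one deliberate swap: it uses std::memcpy rather than std::bit_cast so that it also compiles before C++20; with C++20 available, std::bit_cast<std::uint32_t>(x) would be the direct equivalent and would additionally allow the bit-pattern check to live in a static_assert. The names to_bits, from_bits and copysign_bits are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Assumptions checked at compile time: 32-bit IEEE 754 float.
static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
static_assert(std::numeric_limits<float>::is_iec559, "IEEE 754 float assumed");

// Reinterpret the object representation without violating aliasing rules.
std::uint32_t to_bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

float from_bits(std::uint32_t u) {
    float f;
    std::memcpy(&f, &u, sizeof f);
    return f;
}

// Keep the exponent and mantissa of x, take the sign bit from y.
// No branch, no multiply, and signed zeros are handled for free.
float copysign_bits(float x, float y) {
    std::uint32_t ux = to_bits(x);
    std::uint32_t uy = to_bits(y);
    return from_bits((ux & 0x7FFFFFFFu) | (uy & 0x80000000u));
}
```

For example, copysign_bits(3.5f, -0.0f) gives -3.5f, the case the fast-math build of the original function gets wrong.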
CodePudding user response:
What happens to atan2() when used within condition statement? Is it just inlined?
There was no call to std::atan2 in assembly, hence we know that it must have been expanded inline (it could have been also eliminated as dead code, but we can deduce that's not the case).
With -ffast-math enabled, y > 0.f || (y == 0.f && std::atan2(y, -1.f) > 0.f) was essentially compiled down to y >= 0 because the compiler didn't care to consider the difference between -0 and 0. This is an example of "can result in incorrect output". If you want to treat -0 as per IEEE/ISO, then don't use -ffast-math.