I have a few routines here that all do the same thing: they clamp a float to the range [0,65535]. What surprises me is that the compiler (gcc -O3) uses three, count 'em, three different ways to implement float-min and float-max. I'd like to understand why it generates three different implementations. Ok, here's the C code:
#include <algorithm>  // std::min, std::max

float clamp1(float x) {
    x = (x < 0.0f) ? 0.0f : x;
    x = (x > 65535.0f) ? 65535.0f : x;
    return x;
}

float clamp2(float x) {
    x = std::max(0.0f, x);
    x = std::min(65535.0f, x);
    return x;
}

float clamp3(float x) {
    x = std::min(65535.0f, x);
    x = std::max(0.0f, x);
    return x;
}
So here's the generated assembly (with some of the boilerplate removed). Reproducible on https://godbolt.org/z/db775on4j with GCC 10.3 -O3 (also showing Clang 14's choices).
CLAMP1:
movaps %xmm0, %xmm1
pxor %xmm0, %xmm0
comiss %xmm1, %xmm0
ja .L9
movss .LC1(%rip), %xmm0 # 65535.0f
movaps %xmm0, %xmm2
cmpltss %xmm1, %xmm2
andps %xmm2, %xmm0
andnps %xmm1, %xmm2
orps %xmm2, %xmm0
.L9:
ret
CLAMP2:
pxor %xmm1, %xmm1
comiss %xmm1, %xmm0
ja .L20
pxor %xmm0, %xmm0
ret
.L20:
minss .LC1(%rip), %xmm0 # 65535.0f
ret
CLAMP3:
movaps %xmm0, %xmm1
movss .LC1(%rip), %xmm0 # 65535.0f
comiss %xmm1, %xmm0
ja .L28
ret
.L28:
maxss .LC2(%rip), %xmm1 # 0.0f
movaps %xmm1, %xmm0
ret
So there appear to be three different implementations of MIN and MAX here:
- using compare-and-branch
- using minss and maxss
- using compare, andps, andnps, and orps
Can somebody clarify the following:
- Are these all the same speed, or is one of them faster?
- How does the compiler end up choosing all these different implementations?
- What exactly is that thing with the andps, andnps, and so forth?
- Would using both minss and maxss, and no branches, be faster?
CodePudding user response:
Are these all the same speed, or is one of them faster?
No, they are not all the same speed; which one is fastest depends on the machine and on the distribution of inputs.
How does the compiler end up choosing all these different implementations?
Because the compiler is not as smart as you think.
What exactly is that thing with the andps, andnps, and so forth?
It's a branchless select. The compare produces an all-ones or all-zeros bitmask, which then picks one of the two values:
float y = 65535;
float m = std::bit_cast<float>(-(x < y));
return x & m | y & ~m;
You can't do & on floats in C (and C++), but you get the idea: cmpltss builds the mask, andps/andnps apply it to each operand, and orps merges the two halves.
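Here is that same idiom as runnable scalar C++ (a sketch of what cmpltss/andps/andnps/orps compute; the function name is mine, and it assumes the usual IEEE-754 binary32 layout):

```cpp
#include <cstdint>
#include <cstring>

// Branchless min via a compare mask, mirroring cmpltss + andps/andnps/orps.
float min_blend(float x, float y) {
    uint32_t xb, yb;
    std::memcpy(&xb, &x, 4);            // bit-cast float -> uint32_t
    std::memcpy(&yb, &y, 4);
    uint32_t m = -(uint32_t)(x < y);    // all-ones if x < y, else all-zeros (cmpltss)
    uint32_t r = (xb & m) | (yb & ~m);  // andps / andnps / orps
    float out;
    std::memcpy(&out, &r, 4);
    return out;
}
```

With the mask set, the result's bits come from x; with it clear, from y, so this is min(x, y) without any branch.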
Would using both minss and maxss, and no branches, be faster?
Usually, yes. You can look up the latency and throughput of each instruction at https://uops.info/table.html. I'd write the following in assembly:
xorps xmm1, xmm1
minss xmm0, [_ffff]
maxss xmm0, xmm1
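If you'd rather stay in C++ than write assembly, the same two-instruction sequence can be expressed with SSE intrinsics (a sketch; the function name is mine):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Branchless clamp to [0, 65535] compiling to minss + maxss (plus an xorps
// for the zero constant), like the hand-written assembly above.
float clamp_sse(float x) {
    __m128 v = _mm_set_ss(x);
    v = _mm_min_ss(v, _mm_set_ss(65535.0f));  // minss
    v = _mm_max_ss(v, _mm_setzero_ps());      // maxss
    return _mm_cvtss_f32(v);
}
```

Note that minss/maxss are asymmetric for NaN: when the operands are unordered they return the second source operand, so a NaN input comes out of this particular sequence as 65535.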
However, if you know that a certain branch is very likely to go one way, skipping the work with a branch can be faster. Let's see what the compiler does with branching hints.
float clamp_minmax(float x) {
return std::max(std::min(x, 65535.0f), 0.0f);
}
float clamp_hint(float x) {
if (__builtin_expect(x > 65535, 0)) {
return 65535;
}
if (__builtin_expect(x < 0, 0)) {
return 0;
}
return x;
}
GCC basically doesn't care, but Clang interestingly produces different code.
clamp_minmax(float):
movss xmm1, dword ptr [rip + .LCPI0_0]
minss xmm1, xmm0
xorps xmm0, xmm0
maxss xmm0, xmm1
ret
clamp_hint(float):
movaps xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI1_0]
ucomiss xmm1, xmm0
ja .LBB1_2
xorps xmm0, xmm0
maxss xmm0, xmm1
.LBB1_2:
ret
A ucomiss -> ja pair is not faster than a single minss, but if the branch lets you jump over the maxss most of the time, the branched version will probably run faster than the branchless one: you skip the latency of maxss, which is unavoidable in the branchless version due to its dependency on the previous minss.
See also: What is the instruction that gives branchless FP min and max on x86?