I have a few routines here that all do the same thing: they clamp a float to the range [0,65535]. What surprises me is that the compiler (gcc -O3) uses three, count 'em, three different ways to implement float-min and float-max. I'd like to understand why it generates three different implementations. Ok, here's the C code:
#include <algorithm>  // std::min, std::max

float clamp1(float x) {
    x = (x < 0.0f) ? 0.0f : x;
    x = (x > 65535.0f) ? 65535.0f : x;
    return x;
}

float clamp2(float x) {
    x = std::max(0.0f, x);
    x = std::min(65535.0f, x);
    return x;
}

float clamp3(float x) {
    x = std::min(65535.0f, x);
    x = std::max(0.0f, x);
    return x;
}
So here's the generated assembly (with some of the boilerplate removed). Reproducible on https://godbolt.org/z/db775on4j with GCC 10.3 -O3 (also showing Clang 14's choices).
CLAMP1:
movaps %xmm0, %xmm1
pxor %xmm0, %xmm0
comiss %xmm1, %xmm0
ja .L9
movss .LC1(%rip), %xmm0 # 65535.0f
movaps %xmm0, %xmm2
cmpltss %xmm1, %xmm2
andps %xmm2, %xmm0
andnps %xmm1, %xmm2
orps %xmm2, %xmm0
.L9:
ret
CLAMP2:
pxor %xmm1, %xmm1
comiss %xmm1, %xmm0
ja .L20
pxor %xmm0, %xmm0
ret
.L20:
minss .LC1(%rip), %xmm0 # 65535.0f
ret
CLAMP3:
movaps %xmm0, %xmm1
movss .LC1(%rip), %xmm0 # 65535.0f
comiss %xmm1, %xmm0
ja .L28
ret
.L28:
maxss .LC2(%rip), %xmm1 # 0.0f
movaps %xmm1, %xmm0
ret
So there appear to be three different implementations of MIN and MAX here:
- using compare-and-branch
- using minss and maxss
- using compare, andps, andnps, and orps
Can somebody clarify the following:
- Are these all the same speed, or is one of them faster?
- How does the compiler end up choosing all these different implementations?
- What exactly is that thing with the andps, andnps, and so forth?
- Would using both minss and maxss, and no branches, be faster?
CodePudding user response:
Are these all the same speed, or is one of them faster?
No, they are not all the same speed; which one is fastest depends on the machine and on the distribution of inputs.
How does the compiler end up choosing all these different implementations?
Because the compiler is not as smart as you think.
What exactly is that thing with the andps, andnps, and so forth?
It's a branchless select. The compare produces an all-ones or all-zeros bitmask, which then picks one of the two values:
float y = 65535;
float m = std::bit_cast<float>(-(x < y));
return x & m | y & ~m;
You can't do & on floats in C (and C++), but you get the idea: cmpltss builds the mask, andps/andnps apply it to each operand, and orps merges the two halves.
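Here is that same idiom as runnable scalar C++ (a sketch of what cmpltss/andps/andnps/orps compute; the function name is mine, and it assumes the usual IEEE-754 binary32 layout):

```cpp
#include <cstdint>
#include <cstring>

// Branchless min via a compare mask, mirroring cmpltss + andps/andnps/orps.
float min_blend(float x, float y) {
    uint32_t xb, yb;
    std::memcpy(&xb, &x, 4);            // bit-cast float -> uint32_t
    std::memcpy(&yb, &y, 4);
    uint32_t m = -(uint32_t)(x < y);    // all-ones if x < y, else all-zeros (cmpltss)
    uint32_t r = (xb & m) | (yb & ~m);  // andps / andnps / orps
    float out;
    std::memcpy(&out, &r, 4);
    return out;
}
```

With the mask set, the result's bits come from x; with it clear, from y, so this is min(x, y) without any branch.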
Would using both minss and maxss, and no branches, be faster?
Usually, yes. You can look up the latency and throughput of each instruction at https://uops.info/table.html. I'd write the following in assembly:
xorps xmm1, xmm1
minss xmm0, [_ffff]
maxss xmm0, xmm1
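If you'd rather stay in C++ than write assembly, the same two-instruction sequence can be expressed with SSE intrinsics (a sketch; the function name is mine):

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Branchless clamp to [0, 65535] compiling to minss + maxss (plus an xorps
// for the zero constant), like the hand-written assembly above.
float clamp_sse(float x) {
    __m128 v = _mm_set_ss(x);
    v = _mm_min_ss(v, _mm_set_ss(65535.0f));  // minss
    v = _mm_max_ss(v, _mm_setzero_ps());      // maxss
    return _mm_cvtss_f32(v);
}
```

Note that minss/maxss are asymmetric for NaN: when the operands are unordered they return the second source operand, so a NaN input comes out of this particular sequence as 65535.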
However, if you know that a certain branch is very likely to go one way, skipping the work with a branch can be faster. Let's see what the compiler does with branching hints.
float clamp_minmax(float x) {
return std::max(std::min(x, 65535.0f), 0.0f);
}
float clamp_hint(float x) {
if (__builtin_expect(x > 65535, 0)) {
return 65535;
}
if (__builtin_expect(x < 0, 0)) {
return 0;
}
return x;
}
GCC basically doesn't care, but Clang interestingly produces different code.
clamp_minmax(float):
movss xmm1, dword ptr [rip + .LCPI0_0]
minss xmm1, xmm0
xorps xmm0, xmm0
maxss xmm0, xmm1
ret
clamp_hint(float):
movaps xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI1_0]
ucomiss xmm1, xmm0
ja .LBB1_2
xorps xmm0, xmm0
maxss xmm0, xmm1
.LBB1_2:
ret
A ucomiss -> ja pair is not faster than a single minss, but if the branch lets you jump over the maxss most of the time, the branched version will probably run faster than the branchless one: you skip the latency of maxss, which is unavoidable in the branchless version due to its dependency on the previous minss.
See also: What is the instruction that gives branchless FP min and max on x86?