I am curious about the ways to wrap a floating-point value x in a semi-closed interval [0; a[.
For instance, I could have an arbitrary real number, say x = 354638.515, that I wish to fold into [0; 2π[ because I have a good sin approximation for that range.
The fmod standard C function shows up quite high in my benchmarks, and by checking the source code of various libc implementations, I can understand why: the thing is fairly branchy, likely in order to handle a lot of IEEE 754-specific issues:
- glibc: https://github.com/bminor/glibc/blob/master/sysdeps/ieee754/flt-32/e_fmodf.c
- Apple: https://opensource.apple.com/source/Libm/Libm-315/Source/ARM/fmod.c.auto.html
- musl: https://git.musl-libc.org/cgit/musl/tree/src/math/fmod.c
- Running with -ffast-math causes GCC to generate code that goes through the x87 FPU on x86/x86_64, which comes with its own set of problems (such as 80-bit doubles, FP state, and other fun things). I would like the implementation to vectorize at least semi-correctly and, if possible, to go through vector registers rather than the x87 FPU, since the rest of my code ends up vectorized by the compiler, even if not necessarily in an optimal way.
- This one looks much simpler: https://github.com/KnightOS/libc/blob/master/src/fmod.c
In my case, I am only concerned about usual real values, not NaNs, not infinity. My range is also known at compile-time and sane (a common occurrence being π/2), thus the checks for "special cases" such as range == 0 are unnecessary.
Thus, what would be good implementations of fmod for that specific use case?
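For reference, the fold I have in mind is the usual x − a⌊x/a⌋ reduction, which lands in [0; a[ for positive a. A naive scalar version (ignoring vectorization and large-argument precision for now; the helper name is purely illustrative) would look roughly like this:
#include <math.h>

/* Reference fold into [0; a[: assumes a > 0 and finite x.  Due to rounding,
   inputs just below a multiple of a can still return a value equal to a. */
static inline float wrap_range(float x, float a) {
    return x - a * floorf(x / a);
}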
CodePudding user response:
Assuming that the range is constant and positive you can compute its reciprocal to avoid costly division.
void fast_fmod(float * restrict dst, const float * restrict src, size_t n, float divisor) {
    float reciprocal = 1.0f / divisor;
    for (size_t i = 0; i < n; i++)
        /* truncate the scaled quotient toward zero, then subtract the whole multiples */
        dst[i] = src[i] - divisor * (int)(src[i] * reciprocal);
}
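The (int) cast truncates the quotient toward zero, which is exactly the convention fmod uses: for src[i] = -4 and divisor = 3, src[i] * reciprocal ≈ -1.333, the cast gives -1, and dst[i] = -4 - 3 * (-1) = -1, matching fmod(-4, 3). In other words, the result keeps the sign of src[i] rather than being forced into [0; divisor[.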
The final code with a simple demo is:
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
void fast_fmod(float * restrict dst, const float * restrict src, size_t n, float divisor) {
    float reciprocal = 1.0f / divisor;
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] - divisor * (int)(src[i] * reciprocal);
}

int main() {
    float src[9] = {-4, -3, -2, -1, 0, 1, 2, 3, 4};
    float dst[9];
    float div = 3;

    fast_fmod(dst, src, 9, div);

    for (int i = 0; i < 9; i++) {
        printf("fmod(%g, %g) = %g vs %g\n", src[i], div, dst[i], fmod(src[i], div));
    }
}
produces the expected output:
fmod(-4, 3) = -1 vs -1
fmod(-3, 3) = 0 vs -0
fmod(-2, 3) = -2 vs -2
fmod(-1, 3) = -1 vs -1
fmod(0, 3) = 0 vs 0
fmod(1, 3) = 1 vs 1
fmod(2, 3) = 2 vs 2
fmod(3, 3) = 0 vs 0
fmod(4, 3) = 1 vs 1
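Note that negative inputs give negative results, following fmod's sign convention rather than the [0; divisor[ interval asked for in the question. If a nonnegative result is required, a conditional fixup keeps the loop vectorizable; an untested sketch (the function name is mine, and for very large inputs such as 354638.515 a single float multiply by the reciprocal loses accuracy, so a more careful argument reduction would be needed before feeding a sin approximation):
void fast_fmod_positive(float * restrict dst, const float * restrict src, size_t n, float divisor) {
    float reciprocal = 1.0f / divisor;
    for (size_t i = 0; i < n; i++) {
        float r = src[i] - divisor * (int)(src[i] * reciprocal);
        /* Fold negative remainders back into [0; divisor[; compilers usually turn
           this select into a branchless blend when vectorizing.  Rounding can
           still yield a result equal to divisor for tiny negative inputs. */
        dst[i] = r < 0 ? r + divisor : r;
    }
}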
Compiling the demo above with GCC using the command:
$ gcc prog.c -o prog -O3 -march=haswell -lm -fopt-info-vec
prog.c:8:4: optimized: loop vectorized using 32 byte vectors
prog.c:8:4: optimized: loop vectorized using 32 byte vectors
prog.c:8:30: optimized: basic block part vectorized using 32 byte vectors
Thus the code was nicely vectorized.
EDIT
It looks like Clang does an even better job vectorizing this code:
401170: c5 fc 10 24 8e vmovups (%rsi,%rcx,4),%ymm4
401175: c5 fc 10 6c 8e 20 vmovups 0x20(%rsi,%rcx,4),%ymm5
40117b: c5 fc 10 74 8e 40 vmovups 0x40(%rsi,%rcx,4),%ymm6
401181: c5 fc 10 7c 8e 60 vmovups 0x60(%rsi,%rcx,4),%ymm7
401187: c5 6c 59 c4 vmulps %ymm4,%ymm2,%ymm8
40118b: c5 6c 59 cd vmulps %ymm5,%ymm2,%ymm9
40118f: c5 6c 59 d6 vmulps %ymm6,%ymm2,%ymm10
401193: c5 6c 59 df vmulps %ymm7,%ymm2,%ymm11
401197: c4 41 7e 5b c0 vcvttps2dq %ymm8,%ymm8
40119c: c4 41 7e 5b c9 vcvttps2dq %ymm9,%ymm9
4011a1: c4 41 7e 5b d2 vcvttps2dq %ymm10,%ymm10
4011a6: c4 41 7e 5b db vcvttps2dq %ymm11,%ymm11
4011ab: c4 41 7c 5b c0 vcvtdq2ps %ymm8,%ymm8
4011b0: c4 41 7c 5b c9 vcvtdq2ps %ymm9,%ymm9
4011b5: c4 41 7c 5b d2 vcvtdq2ps %ymm10,%ymm10
4011ba: c4 41 7c 5b db vcvtdq2ps %ymm11,%ymm11
4011bf: c5 3c 59 c3 vmulps %ymm3,%ymm8,%ymm8
4011c3: c5 34 59 cb vmulps %ymm3,%ymm9,%ymm9
4011c7: c5 2c 59 d3 vmulps %ymm3,%ymm10,%ymm10
4011cb: c5 24 59 db vmulps %ymm3,%ymm11,%ymm11
4011cf: c4 c1 5c 5c e0 vsubps %ymm8,%ymm4,%ymm4
4011d4: c4 c1 54 5c e9 vsubps %ymm9,%ymm5,%ymm5
4011d9: c4 c1 4c 5c f2 vsubps %ymm10,%ymm6,%ymm6
4011de: c4 c1 44 5c fb vsubps %ymm11,%ymm7,%ymm7
4011e3: c5 fc 11 24 8f vmovups %ymm4,(%rdi,%rcx,4)
4011e8: c5 fc 11 6c 8f 20 vmovups %ymm5,0x20(%rdi,%rcx,4)
4011ee: c5 fc 11 74 8f 40 vmovups %ymm6,0x40(%rdi,%rcx,4)
4011f4: c5 fc 11 7c 8f 60 vmovups %ymm7,0x60(%rdi,%rcx,4)
4011fa: 48 83 c1 20 add $0x20,%rcx
4011fe: 48 39 c8 cmp %rcx,%rax
401201: 0f 85 69 ff ff ff jne 401170 <fast_fmod+0x40>
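A Clang invocation along these lines should reproduce a similar result (the flags mirror the GCC command above and are an assumption on my part; -Rpass=loop-vectorize additionally makes Clang report which loops it vectorized):
$ clang prog.c -o prog -O3 -march=haswell -lm -Rpass=loop-vectorize
$ objdump -d prog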