Strength reduction on floating point division by hand-CodePudding

In one of our last assignments in Computer Science this term we have to apply strength reduction on some code fragments. Most of them were just straight forward, especially with looking into compiler output. But one of them I wont be able to solve, even with the help of the compiler.

Our profs gave us the following hint:

Hint: Inquire how IEEE 754 single-precision floating-point numbers are represented in memory.

Here is the code snippet: (a is of type double*)

for (int i = 0; i < N;   i) {
    a[i]  = i / 5.3;   
}

At first I tried to look into the compiler output for this snipped on godbolt. I tried to compile it without any optimization: (note: I copied only the relevant part in the for loop)

    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0 rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    movsd   xmm1, QWORD PTR [rax]
    cvtsi2sd        xmm0, DWORD PTR [rbp-4]    //division relevant
    movsd   xmm2, QWORD PTR .LC0[rip]          //division relevant
    divsd   xmm0, xmm2                         //division relevant
    mov     eax, DWORD PTR [rbp-4]
    cdqe
    lea     rdx, [0 rax*8]
    mov     rax, QWORD PTR [rbp-16]
    add     rax, rdx
    addsd   xmm0, xmm1
    movsd   QWORD PTR [rax], xmm0

and with -O3:

.L2:
        pshufd  xmm0, xmm2, 238             //division relevant
        cvtdq2pd        xmm1, xmm2          //division relevant
        movupd  xmm6, XMMWORD PTR [rax]
        add     rax, 32
        cvtdq2pd        xmm0, xmm0          //division relevant
        divpd   xmm1, xmm3                  //division relevant
        movupd  xmm5, XMMWORD PTR [rax-16]
        paddd   xmm2, xmm4
        divpd   xmm0, xmm3                  //division relevant
        addpd   xmm1, xmm6
        movups  XMMWORD PTR [rax-32], xmm1
        addpd   xmm0, xmm5
        movups  XMMWORD PTR [rax-16], xmm0
        cmp     rax, rbp
        jne     .L2

I commented the division part of the assembly code. But this output does not help me understanding how to apply strength reduction on the snippet. (Maybe there are too many optimizations going on to fully understand the output)

Second, I tried to understand the bit representation of the float part 5.3. Which is:

0   |10000001|01010011001100110011010
Sign|Exponent|Mantissa

But this does not help me either.

CodePudding user response：

If we adopt Wikipedia's definition that

strength reduction is a compiler optimization where expensive operations are replaced with equivalent but less expensive operations

then we can apply strength reduction here by converting the expensive floating-point division into a floating-point multiply plus two floating-point multiply-adds (FMAs). Assuming that double is mapped to IEEE-754 binary64, the default rounding mode for floating-point computation is round-to-nearest-or-even, and that int is a 32-bit type, we can prove the transformation correct by simple exhaustive test:

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <math.h>

int main (void)
{
    const double rcp_5p3 = 1.0 / 5.3; // 0x1.826a439f656f2p-3
    int i = INT_MAX;
    do {
        double ref = i / 5.3;
        double res = fma (fma (-5.3, i * rcp_5p3, i), rcp_5p3, i * rcp_5p3);
        if (res != ref) {
            printf ("error: i=-  res=#.13a  ref=#.13a\n", i, res, ref);
            return EXIT_FAILURE;
        }
        i--;
    } while (i >= 0);
    return EXIT_SUCCESS;
}

Most modern instances of common processors architectures like x86-64 and ARM64 have hardware support for FMA, such that fma() can be mapped directly to the appropriate hardware instruction. This should be confirmed by looking at the disassembly of the binary generated. Where hardware support for FMA is lacking the transformation obviously should not be applied, as software implementations of fma() are slow and sometimes functionally incorrect.

The basic idea here is that mathematically, division is equivalent to multiplication with the reciprocal. However, that is not necessarily true for finite-precision floating-point arithmetic. The code above tries to improve the likelihood of bit-accurate computation by determining the error in the naive approach with the help of FMA and applying a correction where necessary. For background including literature references see this earlier question.

To the best of my knowledge, there is not yet a general mathematically proven algorithm to determine for which divisors paired with which dividends the above transformation is safe (that is, delivers bit-accurate results), which is why an exhaustive test is strictly necessary to show that the transformation is valid.

CodePudding user response：

i / 5.3 only needs to be computed for odd values of i. For even values of i, you can simply multiply by 2.0 the value of (i/2)/5.3, which was already computed earlier in the loop.

The remaining difficulty is to reorder the iterations in a way such that each index between 0 and N-1 is handled exactly once and the program does not need to record an arbitrary number of division results.

One way to achieve this is to iterate on all odd numbers o less than N, and after computing o / 5.3 in order to handle index o, to also handle all indexes of the form o * 2**p.

if (N > 0) {
  a[0]  = 0.0; // this is needed for strict IEEE 754 compliance lol
  for (int o = 1; o < N; o  = 2) {
    double d = o / 5.3;
    int i = o;
    do { 
      a[i]  = d;
      i  = i;
      d  = d;
    } while (i < N);
  }
}

Note: this does not use the provided hint “Inquire how IEEE 754 single-precision floating-point numbers are represented in memory”. I think I know pretty well how single-precision floating-point numbers are represented in memory, but I do not see how that is relevant, especially since there are no single-precision values or computations in the code to optimize. I think there is a mistake in the way the problem is expressed, but still the above is technically a partial answer to the question as phrased.

I also ignored overflow problems for values of N that come close to INT_MAX in the code snippet above, since the code is already complicated enough.

As an additional note, the above transformation only replaces one division out of two. It does this by making the code unvectorizable (and also less cache-friendly). In your question, gcc -O3 has already shown that automatic vectorization could be applied to the starting point that your professor suggested, and that is likely to be more beneficial than suppressing half the divisions can be. The only good thing about the transformation in this answer is that it is a sort of strength reduction, which your professor requested.