Fast integer division and modulo with a const runtime divisor


int n_attrs = some_input_from_other_function(); // [2..5000]
vector<int> corr_indexes; // size = n_attrs * n_attrs
vector<bool> selected; // size = n_attrs
vector<pair<int,int>> selectedPairs; // size = n_attrs / 2
// vector::reserve everything here
...
// optimize the code below
const int npairs = n_attrs * n_attrs;
selectedPairs.clear();
for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.emplace_back(x, y);
    if (selectedPairs.size() == n_attrs / 2) break;
}

I have a function that looks like this. The bottleneck is in

    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;

n_attrs is const during the loop, so I wish to find a way to speed up this loop. corr_indexes[i], n_attrs > 0, < max_int32. Edit: please note that n_attrs isn't compile-time const.

How can I optimize this loop? No extra library is allowed. Also, is there any way to parallelize this loop (either CPU or GPU is okay; everything is already in GPU memory before this loop)?

CodePudding user response:

I am restricting my comments to integer division because, to first order, the modulo operation in C++ can be viewed and implemented as an integer division plus a back-multiply and subtraction, although in some cases there are cheaper ways of computing the modulo directly, e.g. when computing modulo 2^n.

Integer division is pretty slow on most platforms, based on either software emulation or an iterative hardware implementation. But it was widely reported last year, based on microbenchmarking, that Apple's M1 has blazingly fast integer division, presumably implemented with dedicated circuitry.

Ever since a seminal paper by Torbjörn Granlund and Peter Montgomery almost thirty years ago, it has been widely known how to replace integer division with a constant divisor by an integer multiply plus possibly a shift and/or other correction steps. This algorithm is often referred to as the magic-multiplier technique. It requires precomputation of some relevant parameters from the integer divisor for use in the multiply-based emulation sequence.

Torbjörn Granlund and Peter L. Montgomery, "Division by invariant integers using multiplication," ACM SIGPLAN Notices, Vol. 29, June 1994, pp. 61-72 (online).

At present, all major toolchains incorporate variants of the Granlund-Montgomery algorithm when dealing with integer divisors that are compile-time constants. The precomputation occurs at compile time inside the compiler, which then emits code using the computed parameters. Some toolchains may also use this algorithm for divisions by run-time constant divisors that are used repeatedly. For run-time constant divisors in loops, this could involve emitting a precomputation block prior to the loop to compute the necessary parameters, then using those in the division-emulation code inside the loop.

If one's toolchain does not optimize divisions with run-time constant divisors, one can use the same approach manually, as demonstrated by the code below. However, this is unlikely to achieve the same efficiency as a compiler-based solution, because not all machine operations used in the desired emulation sequence can be expressed efficiently at the C++ level in a portable manner. This applies in particular to arithmetic right shifts and add-with-carry.

The code below demonstrates the principle of parameter precomputation and integer division emulation via multiplication. It is quite likely that, by investing more design time than I was willing to expend for this answer, more efficient implementations of both the parameter precomputation and the emulation could be identified.

#include <cstdio>
#include <cstdlib>
#include <cstdint>

#define PORTABLE  (1)
#define ADD_FLAG  (1)
#define NEG_FLAG  (2)

uint32_t ilog2 (uint32_t i)
{
    uint32_t t = 0;
    i = i >> 1;
    while (i) {
        i = i >> 1;
        t++;
    }
    return (t);
}

/* Based on: Granlund, T.; Montgomery, P.L.: "Division by Invariant Integers 
   using Multiplication". SIGPLAN Notices, Vol. 29, June 1994, page 61.
*/
void prepare_magic (int32_t divisor, int32_t &multiplier, int32_t &shift, 
                    int32_t &flags)
{
    uint32_t d, i;
    uint64_t m_lower, m_upper, j, k, msb;

    d = (uint32_t)llabs (divisor);
    i = ilog2 (d);
    msb = (((uint64_t)(1)) << (32 + i));
    j = (((uint64_t)(0x80000000)) % ((uint64_t)(d)));
    k = msb / ((uint64_t)(0x80000000 - j));
    m_lower = msb / d;
    m_upper = (msb + k) / d;
    while (((m_lower >> 1) < (m_upper >> 1)) && (i > 0)) {
        m_lower = m_lower >> 1;
        m_upper = m_upper >> 1;
        i--;
    }
    multiplier = (uint32_t)(m_upper);
    shift = i;
    flags = ((m_upper >> 31) ? ADD_FLAG : 0) | ((divisor < 0) ? NEG_FLAG : 0);
}

int32_t arithmetic_right_shift (int32_t a, int32_t s)
{
    uint32_t mask_msb = 0x80000000;
    uint32_t ua = (uint32_t)a;
    ua = ua >> s;
    mask_msb = mask_msb >> s;
    return (int32_t)((ua ^ mask_msb) - mask_msb);
}

int32_t magic_division (int32_t dividend, int32_t multiplier, int32_t shift, 
                        int32_t flags)
{
    int64_t prod = ((int64_t)dividend) * multiplier;
    int32_t quot = (int32_t)(((uint64_t)prod) >> 32);
    if (flags & ADD_FLAG) quot = (uint32_t)quot + (uint32_t)dividend;
#if PORTABLE
    quot = arithmetic_right_shift (quot, shift);
#else // PORTABLE
    quot = quot >> shift;  // must use arithmetic right shift
#endif // PORTABLE
    quot = quot + ((uint32_t)dividend >> 31);
    if (flags & NEG_FLAG) quot = -quot;
    return quot;
}

int main (void)
{
    int32_t multiplier;
    int32_t shift;
    int32_t flags;
    int32_t divisor;
    
    for (divisor = -10; divisor <= 10; divisor++) {
        /* avoid division by zero */
        if (divisor == 0) {
            divisor++;
            continue;
        }
        printf ("divisor=%d\n", divisor);
        prepare_magic (divisor, multiplier, shift, flags);
        printf ("multiplier=%d shift=%d flags=%d\n", 
                multiplier, shift, flags);
        printf ("exhaustive test of dividends ... ");
        uint32_t dividendu = 0;
        do {
            int32_t dividend = (int32_t)dividendu;
            /* avoid overflow in signed integer division */
            if ((divisor == (-1)) && (dividend == ((-2147483647)-1))) {
                dividendu++;
                continue;
            }
            int32_t res = magic_division (dividend, multiplier, shift, flags);
            int32_t ref = dividend / divisor;
            if (res != ref) {
                printf ("\nERR dividend=%d (%x) divisor=%d  res=%d  ref=%d\n",
                        dividend, (uint32_t)dividend, divisor, res, ref);
                return EXIT_FAILURE;
            }
            dividendu++;
        } while (dividendu);
        printf ("PASSED\n");
    }
    return EXIT_SUCCESS;
}

CodePudding user response:

I would optimize the part after // optimize the code below by:

  • taking n_attrs
  • generating a function string like this:
void dynamicFunction(MyType & selectedPairs, Foo & selected)
{
    const int npairs = @@ * @@;
    selectedPairs.clear();
    for (int i = 0; i < npairs; i++) {
        const int x = corr_indexes[i] / @@;
        const int y = corr_indexes[i] % @@;
        if (selected[x] || selected[y]) continue; // fit inside L1 cache
    
        // below lines are called max 2500 times, so they're insignificant
        selected[x] = true;
        selected[y] = true;
        selectedPairs.emplace_back(x, y);
        if (selectedPairs.size() == @@ / 2) 
            break;
    }
}

  • replacing all @@ with value of n_attrs
  • compiling it, generating a DLL
  • linking and calling the function

So that n_attrs is a compile-time constant inside the DLL and the compiler can automatically apply most of its optimizations to the value, such as:

  • computing n & (x-1) instead of n % x when x is a power-of-two value
  • shifting and multiplying instead of dividing
  • maybe other optimizations too, like unrolling the loop with precalculated indices for x and y (since x is known)

Some integer math operations in tight loops are easier for the compiler to SIMDify/vectorize when more of the parts are known at compile time.

If your CPU is AMD, you can even try magic floating-point operations in place of the division by a compile-time-unknown divisor to get vectorization.

By caching all (or a big percentage of) the values of n_attrs, you can get rid of the latencies of:

  • string generation
  • compiling
  • file(DLL) reading (assuming some object-oriented wrapping of DLLs)

If the part to be optimized will run on a GPU, there is a high chance the CUDA/OpenCL implementation already performs the integer division by means of floating point (to keep the SIMD path occupied instead of serializing on integer division), or supports it directly as SIMD integer operations, so you may be able to use the code as-is on the GPU. The exception is std::vector, which is not supported by all C++ CUDA compilers (nor in OpenCL kernels); those host-environment-related parts could be computed after the kernel is executed, with the emplace_back portion excluded or exchanged for a struct that works on the GPU.

CodePudding user response:

As has already been suggested, the compiler (gnu 10.2 in my case) seems to have some heuristics for modulo- or divide-by-constant when the divisor is a compile-time constant. Here's a simple example:

$ cat t12.cpp
#include <vector>
#include <cstdlib>

void test(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected, int n_attrs){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

template <int n_attrs>
void ttest(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

int main(int argc, char *argv[]){

  int n_attrs = 3;
  if (argc > 1) n_attrs = atoi(argv[1]);
  std::vector<int> corr_indexes(n_attrs*n_attrs, 3);
  std::vector<std::pair<int,int> > selectedPairs;
  std::vector<bool> selected(n_attrs, true);
  // could replace following with a switch-case jump-table
  if (n_attrs == 10000) ttest<10000>(selectedPairs, corr_indexes, selected);
  else test(selectedPairs, corr_indexes, selected, n_attrs);
}
$ g++ -O3 t12.cpp -o t12
$ time ./t12 9999

real    0m0.494s
user    0m0.439s
sys     0m0.055s
$ time ./t12 10000

real    0m0.310s
user    0m0.259s
sys     0m0.051s
$

(CPU: i5-4690K)

If you combine that with boost preprocessor magic as suggested here to build a switch-case jump-table, you can probably achieve some benefit/speed-up.

Here is a partial implementation of that idea:

#include <vector>
#include <cstdlib>
#include <boost/preprocessor/repetition/repeat.hpp>
#define MY_PASTE0(rep, n, _)   case n:     ttest<n>    (selectedPairs, corr_indexes, selected); break;
#define MY_PASTE250(rep, n, _) case n+250: ttest<n+250>(selectedPairs, corr_indexes, selected); break;
#define MY_PASTE500(rep, n, _) case n+500: ttest<n+500>(selectedPairs, corr_indexes, selected); break;
#define MY_LIMIT 250

void test(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected, int n_attrs){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

template <int n_attrs>
void ttest(std::vector<std::pair<int,int> > &selectedPairs, std::vector<int> &corr_indexes, std::vector<bool> &selected){

  // optimize the code below
  const int npairs = n_attrs * n_attrs;
  selectedPairs.clear();
  for (int i = 0; i < npairs; i++) {
    const int x = corr_indexes[i] / n_attrs;
    const int y = corr_indexes[i] % n_attrs;
    if (selected[x] || selected[y]) continue; // fit inside L1 cache

    // below lines are called max 2500 times, so they're insignificant
    selected[x] = true;
    selected[y] = true;
    selectedPairs.push_back(std::pair<int,int>(x, y));
    if (selectedPairs.size() == n_attrs / 2) break;
  }
}

int main(int argc, char *argv[]){

  int n_attrs = 3;
  if (argc > 1) n_attrs = atoi(argv[1]);
  std::vector<int> corr_indexes(n_attrs*n_attrs, 3);
  std::vector<std::pair<int,int> > selectedPairs;
  std::vector<bool> selected(n_attrs, true);
  int i = n_attrs;
  switch(i) {
    BOOST_PP_REPEAT(MY_LIMIT, MY_PASTE0, _)
    BOOST_PP_REPEAT(MY_LIMIT, MY_PASTE250, _)
    BOOST_PP_REPEAT(MY_LIMIT, MY_PASTE500, _)
    // ...
    default:
      test(selectedPairs, corr_indexes, selected, i);
    }
}