I'm an R developer who uses C for algorithmic purposes, and I have a question about why a C loop that seems like it would be slow is actually faster than the alternative approach.
In R, our boolean type can actually have 3 values: true, false, and na, and we represent this using an int at the C level.
I'm looking into a vectorized && operation (yes, we have this in R already, but bear with me) that also handles the na case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like && in C, except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
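For concreteness, the scalar rule could be written out like this (an illustrative helper, not code from my package; it assumes the int encoding described above, with na being whatever sentinel value is used at the C level):

int na_and(int x, int y, int na) {
  // false dominates: F && anything == F, anything && F == F
  if (x == 0 || y == 0) return 0;
  // otherwise na propagates
  if (x == na || y == na) return na;
  // both true
  return 1;
}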
Now to the implementation: assume we have two vectors, v_out and v_x, and we'd like to perform the vectorized && on them. We are allowed to overwrite v_out with the result. One option is:
// Option 1
for (int i = 0; i < size; i++) {
  int elt_out = v_out[i];
  int elt_x = v_x[i];

  if (elt_out == 0) {
    // Done
  } else if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}
and another option is:
// Option 2
for (int i = 0; i < size; i++) {
  int elt_out = v_out[i];

  if (elt_out == 0) {
    continue;
  }

  int elt_x = v_x[i];

  if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}
I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required. But in fact it was twice as slow when compiled with -O2!
With the script below, I get the following timing results. Note that I am on a Mac and compile with clang. At -O0 they are about the same, which seems reasonable, but at -O2 Option 1 is about 2x faster!
Option 1, `clang -O0`: 0.110560
Option 2, `clang -O0`: 0.107710
Option 1, `clang -O2`: 0.032223
Option 2, `clang -O2`: 0.070557
Can anyone explain what is going on here? My best guess is that it has something to do with the fact that in Option 1, v_x[i] is always being accessed linearly, which is extremely fast. But in Option 2, v_x[i] is essentially being accessed randomly (sort of), because it might access v_x[10] but then not need another element from v_x until v_x[120], and because that access isn't linear, it is probably much slower.
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>

int main() {
  srand(123);

  int size = 1e7;
  int na = INT_MIN;

  int* v_out = (int*) malloc(size * sizeof(int));
  int* v_x = (int*) malloc(size * sizeof(int));

  // Generate random numbers between 1-3
  // 1 -> false
  // 2 -> true
  // 3 -> na
  for (int i = 0; i < size; i++) {
    int elt_out = rand() % 3 + 1;

    if (elt_out == 1) {
      v_out[i] = 0;
    } else if (elt_out == 2) {
      v_out[i] = 1;
    } else {
      v_out[i] = na;
    }

    int elt_x = rand() % 3 + 1;

    if (elt_x == 1) {
      v_x[i] = 0;
    } else if (elt_x == 2) {
      v_x[i] = 1;
    } else {
      v_x[i] = na;
    }
  }

  clock_t start = clock();

  // Option 1
  for (int i = 0; i < size; i++) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
      // Done
    } else if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }

  // // Option 2
  // for (int i = 0; i < size; i++) {
  //   int elt_out = v_out[i];
  //
  //   if (elt_out == 0) {
  //     continue;
  //   }
  //
  //   int elt_x = v_x[i];
  //
  //   if (elt_x == 0) {
  //     v_out[i] = 0;
  //   } else if (elt_out == na) {
  //     // Done
  //   } else if (elt_x == na) {
  //     v_out[i] = na;
  //   }
  // }

  clock_t end = clock();
  double time = (double) (end - start) / CLOCKS_PER_SEC;

  free(v_out);
  free(v_x);

  printf("%f\n", time);

  return 0;
}
CodePudding user response:
Why is this seemingly slower C loop actually twice as fast as the other way?
At a high level, it is a quirk of the compiler and execution environment you are using. Unless the array v_x is declared volatile, the compiler is free to interpret the two variations on your code exactly the same way.
I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required.
And if the compiler's optimizer judged that to be true, then it could use that judgement to conditionally avoid reading v_x[i] in the first version as well.
But at a lower level, if the compiler generates code that indeed does conditionally avoid reading v_x[i] in Option 2 but not in Option 1, then you are probably observing the effects of branch misprediction in the Option 2 case. It is entirely plausible that it is cheaper on average to read v_x[i] unconditionally than to suffer large numbers of branch-misprediction penalties deciding whether it should be read.
One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when the branch is difficult for the CPU to predict. Where the same computation can be performed via a branchless approach, that may yield a performance win in practice, though usually at a cost in source-code clarity. @KarlKnechtel's answer presents a possible branchless variation on the computation you are trying to perform (branchless except for testing the for loop condition, which is pretty predictable).
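To illustrate the idea (this is only a sketch, not the approach from that answer): the per-element logic can be expressed with selects instead of early exits, using the same size, na, v_out, and v_x as in the question, so the compiler is more likely to emit conditional moves or SIMD blends rather than unpredictable branches:

for (int i = 0; i < size; i++) {
  int a = v_out[i];
  int b = v_x[i];
  // false dominates, then na propagates, otherwise true
  int any_false = (a == 0) | (b == 0);
  int any_na = (a == na) | (b == na);
  v_out[i] = any_false ? 0 : (any_na ? na : 1);
}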
CodePudding user response:
Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
Rather than represent the values as a strict enumeration, allow a numeric value of either 2 or 3 to represent na (you can check for this when displaying, or have a normalization step after all the number-crunching). This way, no conditional logic (and thus no costly branch prediction) is needed: we simply logical-OR the bit in the 2s place (regardless of the operator), and logical-AND (or whichever operator) the bit in the 1s place.
int is_na(int value) {
  return value & 2;
}

void r_and_into(unsigned* v_out, unsigned* v_x, int size) {
  for (int i = 0; i < size; i++) {
    unsigned elt_out = v_out[i];
    unsigned elt_x = v_x[i];
    // this can probably be micro-optimized somehow....
    v_out[i] = (elt_out & elt_x & 1) | ((elt_out | elt_x) & 2);
  }
}
If we are forced to use INT_MIN to represent the N/A value, we can start by observing what that looks like in 2s-complement: it has exactly one bit set (the sign bit, which would be most significant in unsigned values). Thus, we can use that bit instead of 2 with the same kind of unconditional logic, and then correct any (INT_MIN | 1) results to INT_MIN:
const unsigned MSB_FLAG = (unsigned)INT_MIN;

void r_and_into(int* v_out, int* v_x, int size) {
  for (int i = 0; i < size; i++) {
    unsigned elt_out = (unsigned)v_out[i];
    unsigned elt_x = (unsigned)v_x[i];
    elt_out = (elt_out & elt_x & 1) | ((elt_out | elt_x) & MSB_FLAG);
    // if the high bit is set, clear the low bit,
    // i.e. AND the low bit with the negation of the high bit.
    v_out[i] = (int)(elt_out & ~(elt_out >> 31));
  }
}
(All these casts might not be necessary, but I think it is good practice to use unsigned types for bitwise manipulations. They should all get optimized away anyway.)
CodePudding user response:
If you want fast vectorized code, don't branch and don't do short-circuit eval. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements.
And you don't want the compiler to worry that it's not allowed to touch some memory because the C abstract machine doesn't touch it. For example, if all the v_out[i] elements are false, v_x might be a NULL pointer without causing UB! So the compiler can't invent read accesses to objects that the C logic doesn't read at all.
If v_x is a true array, not just a pointer, the compiler would know that it's readable and would be allowed to invent accesses to it by doing if-conversion of the short-circuit logic to branchless code. But if its cost heuristics don't see a really big benefit (like auto-vectorization), it might choose not to. Branchy code will often be slower in practice with a random mix of trues and falses (and NAs).
Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that bitwise AND does your && operation. You can store arrays of it as one per unsigned char, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general CHAR_BIT/2 per char, but on mainstream C implementations for x86 that's 4.)
F = 00
N = 10 (in binary, so 0b10 in C, aka 2)
T = 11

- conversion to bool with val & 1.
- conversion from bool with 0b11 * b or something to broadcast the low bit to both positions.
F & anything = 0 because F is all-zero bits. N & N == N; that's trivially true for any bit pattern. The "clever" part is that N & T = T & N = N, since the set bits in T are a superset of those in N.
This also works for || with bitwise |: F|N == N and F|T == T because 0|x == x. Also, x|x == x for any same input, so we're still fine there. N = 0b10 won't set the low bit when ORing, but will clear it when ANDing.
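A quick standalone check of the whole table under this encoding (purely illustrative, not part of the wrapper below):

#include <stdio.h>

int main(void) {
  // F = 0b00, N = 0b10, T = 0b11: bitwise & and | reproduce the 3-valued tables
  unsigned char vals[3] = {0x0, 0x2, 0x3};
  const char *names = "FNT";
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
      printf("%c & %c = 0x%x    %c | %c = 0x%x\n",
             names[i], names[j], vals[i] & vals[j],
             names[i], names[j], vals[i] | vals[j]);
    }
  }
  return 0;
}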
I forgot you said C instead of C++, so this bare-bones class wrapper (only sufficient to demo a few test callers) might not be relevant, but a loop doing c1[i] &= c2[i]; in plain C for unsigned char *c1, *c2 will auto-vectorize exactly the same way.
struct NBool {  // Nullable bool
  unsigned char val;

  static const unsigned char F = 0b00;
  static const unsigned char T = 0b11;
  static const unsigned char N = 0b10;  // N&T = N;  N&N = N;  N&F = F

  auto operator&=(NBool rhs) {  // define && the same way if you want, as non-short-circuiting
    val &= rhs.val;
    return *this;
  }
  operator bool() { return val & 1; }

  constexpr NBool(unsigned char x) : val(x) {};
  constexpr NBool& operator=(const NBool &) = default;
};
#include <stdint.h>
#include <stdlib.h>

bool test(NBool a){
  return a;
}

bool test2(NBool a){
  NBool b = NBool::F;
  return a &= b;  // return false
}

void foo(size_t len, NBool *a1, NBool *a2)
{
  for (std::size_t i = 0; i < len; i++) {
    a1[i] &= a2[i];
  }
}
This compiles on Godbolt with GCC and clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing vandps. (A bit of a weird choice by clang; on -march=haswell, vpand has better throughput.) But it's still limited by 1 store and 2 loads per clock anyway; this very much bottlenecks on load/store, even if the data is hot in L1d cache. (Intel's optimization manual says that even though Skylake's peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), sustained bandwidth is more like 84 bytes per clock.)
It can still come relatively close to 32 bytes ANDed per clock cycle with AVX. So that's 32 NBool & operations per clock, or 128 per clock if you pack 4 NBools per byte.
Packing NBools down to a packed bitmap of 1-bit bools could be done with pslld xmm, 7 / pmovmskb to extract the low bit of each byte (after shifting it to the high bit).
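With SSE2 intrinsics, that pair of instructions might be sketched like this (hypothetical helper name; assumes one NBool per byte with the value in the low 2 bits):

#include <emmintrin.h>
#include <stdint.h>

// Pack 16 byte-sized NBools into a 16-bit bitmap of plain bools
// (1 = true, 0 = false or NA).
uint16_t pack16_to_bools(const unsigned char *p) {
  __m128i v = _mm_loadu_si128((const __m128i *)p);  // load 16 NBool bytes
  v = _mm_slli_epi32(v, 7);               // move each byte's low bit up to its sign bit
  return (uint16_t)_mm_movemask_epi8(v);  // collect the sign bit of every byte
}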
If stored 4 per byte, some SIMD bit-manipulation is in order to pack them down to bools, perhaps vpshufb as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 pext to extract every other bit from 64 bits, if you're on Zen 3 or Haswell or later, for fast pext.