I'm an R developer who uses C for algorithmic purposes, and I have a question about why a C loop that seems like it would be slow is actually faster than the alternative approach.
In R, our boolean type can actually have 3 values: true, false, and na, and we represent this using an int at the C level.
I'm looking into a vectorized && operation (yes, we have this in R already, but bear with me) that also handles the na case. The scalar results would look like this:
F && F == F
F && T == F
F && N == F
T && F == F
T && T == T
T && N == N
N && F == F
N && T == N
N && N == N
Note that it works like && in C, except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
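For concreteness, the scalar rule could be written out like this (an illustrative helper, not code from my package; it assumes the int encoding described above, with na being whatever sentinel value is used at the C level):

int na_and(int x, int y, int na) {
  // false dominates: F && anything == F, anything && F == F
  if (x == 0 || y == 0) return 0;
  // otherwise na propagates
  if (x == na || y == na) return na;
  // both true
  return 1;
}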
Now to the implementation: assume we have two vectors, v_out and v_x, and we'd like to perform the vectorized && on them. We are allowed to overwrite v_out with the result. One option is:
// Option 1
for (int i = 0; i < size; i++) {
  int elt_out = v_out[i];
  int elt_x = v_x[i];

  if (elt_out == 0) {
    // Done
  } else if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}
and another option is:
// Option 2
for (int i = 0; i < size; i++) {
  int elt_out = v_out[i];

  if (elt_out == 0) {
    continue;
  }

  int elt_x = v_x[i];

  if (elt_x == 0) {
    v_out[i] = 0;
  } else if (elt_out == na) {
    // Done
  } else if (elt_x == na) {
    v_out[i] = na;
  }
}
I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required. But in fact it was twice as slow when compiled with -O2!
With the script below, I get the following timing results. Note that I am on a Mac and compile with clang. At -O0 they are about the same, which seems reasonable, but at -O2 Option 1 is about 2x faster!
Option 1, `clang -O0`: 0.110560
Option 2, `clang -O0`: 0.107710
Option 1, `clang -O2`: 0.032223
Option 2, `clang -O2`: 0.070557
Can anyone explain what is going on here? My best guess is that it has something to do with the fact that in Option 1, v_x[i] is always being accessed linearly, which is extremely fast. But in Option 2, v_x[i] is essentially being accessed randomly (sort of), because it might access v_x[10] but then not need another element from v_x until v_x[120], and because that access isn't linear, it is probably much slower.
Reproducible script:
#include <stdlib.h>
#include <stdio.h>
#include <limits.h>
#include <time.h>

int main() {
  srand(123);

  int size = 1e7;
  int na = INT_MIN;

  int* v_out = (int*) malloc(size * sizeof(int));
  int* v_x = (int*) malloc(size * sizeof(int));

  // Generate random numbers between 1-3
  // 1 -> false
  // 2 -> true
  // 3 -> na
  for (int i = 0; i < size; i++) {
    int elt_out = rand() % 3 + 1;

    if (elt_out == 1) {
      v_out[i] = 0;
    } else if (elt_out == 2) {
      v_out[i] = 1;
    } else {
      v_out[i] = na;
    }

    int elt_x = rand() % 3 + 1;

    if (elt_x == 1) {
      v_x[i] = 0;
    } else if (elt_x == 2) {
      v_x[i] = 1;
    } else {
      v_x[i] = na;
    }
  }

  clock_t start = clock();

  // Option 1
  for (int i = 0; i < size; i++) {
    int elt_out = v_out[i];
    int elt_x = v_x[i];

    if (elt_out == 0) {
      // Done
    } else if (elt_x == 0) {
      v_out[i] = 0;
    } else if (elt_out == na) {
      // Done
    } else if (elt_x == na) {
      v_out[i] = na;
    }
  }

  // // Option 2
  // for (int i = 0; i < size; i++) {
  //   int elt_out = v_out[i];
  //
  //   if (elt_out == 0) {
  //     continue;
  //   }
  //
  //   int elt_x = v_x[i];
  //
  //   if (elt_x == 0) {
  //     v_out[i] = 0;
  //   } else if (elt_out == na) {
  //     // Done
  //   } else if (elt_x == na) {
  //     v_out[i] = na;
  //   }
  // }

  clock_t end = clock();
  double time = (double) (end - start) / CLOCKS_PER_SEC;

  free(v_out);
  free(v_x);

  printf("%f\n", time);

  return 0;
}
CodePudding user response:
Why is this seemingly slower C loop actually twice as fast as the other way?
At a high level, it is a quirk of the compiler and execution environment you are using. Unless the array v_x is declared volatile, the compiler is free to interpret the two variations on your code exactly the same way.
I sort of expected the second option to be faster because it avoids accessing v_x[i] when it isn't required.
And if the compiler's optimizer judged that to be true, then it could use that judgement to conditionally avoid reading v_x[i] in the first version as well.
But at a lower level, if the compiler generates code that indeed does conditionally avoid reading v_x[i] in Option 2 but not in Option 1, then you are probably observing the effects of branch misprediction in the Option 2 case. It is entirely plausible that it is cheaper on average to read v_x[i] unconditionally than to suffer large numbers of branch-misprediction penalties deciding whether it should be read.
One of the takeaways is that on modern hardware, branches can be a lot more expensive than one might expect, especially when the branch is difficult for the CPU to predict. Where the same computation can be performed via a branchless approach, that may yield a performance win in practice, though usually at a cost in source-code clarity. @KarlKnechtel's answer presents a possible branchless variation on the computation you are trying to perform (branchless except for testing the for loop condition, which is pretty predictable).
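To illustrate the idea (this is only a sketch, not the approach from that answer): the per-element logic can be expressed with selects instead of early exits, using the same size, na, v_out, and v_x as in the question, so the compiler is more likely to emit conditional moves or SIMD blends rather than unpredictable branches:

for (int i = 0; i < size; i++) {
  int a = v_out[i];
  int b = v_x[i];
  // false dominates, then na propagates, otherwise true
  int any_false = (a == 0) | (b == 0);
  int any_na = (a == na) | (b == na);
  v_out[i] = any_false ? 0 : (any_na ? na : 1);
}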
CodePudding user response:
Note that it works like && in C except that na values propagate when combined with anything except false, in which case we "know" that && can never be true, so we return false.
Rather than represent the values as a strict enumeration, allow a numeric value of either 2 or 3 to represent na (you can check for this when displaying, or have a normalization step after all the number-crunching). This way, no conditional logic (and thus no costly branch prediction) is needed: we simply logical-OR the bit in the 2s place (regardless of the operator), and logical-AND (or whichever operator) the bit in the 1s place.
int is_na(int value) {
  return value & 2;
}

void r_and_into(unsigned* v_out, unsigned* v_x, int size) {
  for (int i = 0; i < size; i++) {
    unsigned elt_out = v_out[i];
    unsigned elt_x = v_x[i];
    // this can probably be micro-optimized somehow....
    v_out[i] = (elt_out & elt_x & 1) | ((elt_out | elt_x) & 2);
  }
}
If we are forced to use INT_MIN to represent the N/A value, we can start by observing what that looks like in 2s-complement: it has exactly one bit set (the sign bit, which would be most significant in unsigned values). Thus, we can use that bit instead of 2 with the same kind of unconditional logic, and then correct any (INT_MIN | 1) results to INT_MIN:
const unsigned MSB_FLAG = (unsigned)INT_MIN;

void r_and_into(int* v_out, int* v_x, int size) {
  for (int i = 0; i < size; i++) {
    unsigned elt_out = (unsigned)v_out[i];
    unsigned elt_x = (unsigned)v_x[i];
    elt_out = (elt_out & elt_x & 1) | ((elt_out | elt_x) & MSB_FLAG);
    // if the high bit is set, clear the low bit,
    // i.e. AND the low bit with the negation of the high bit.
    v_out[i] = (int)(elt_out & ~(elt_out >> 31));
  }
}
(All these casts might not be necessary, but I think it is good practice to use unsigned types for bitwise manipulations. They should all get optimized away anyway.)
CodePudding user response:
If you want fast vectorized code, don't branch and don't do short-circuit eval. You want the compiler to be able to do 16 or 32 elements at once with SIMD operations, using 8-bit elements.
And you don't want the compiler to worry that it's not allowed to touch some memory because the C abstract machine doesn't touch it. For example, if all the v_out[i] elements are false, v_x might be a NULL pointer without causing UB! So the compiler can't invent read accesses to objects that the C logic doesn't read at all.
If v_x is a true array, not just a pointer, the compiler would know that it's readable and would be allowed to invent accesses to it by doing if-conversion of the short-circuit logic to branchless code. But if its cost heuristics don't see a really big benefit (like auto-vectorization), it might choose not to. Branchy code will often be slower in practice with a random mix of trues and falses (and NAs).
Your 3-state bool that supports an NA state can be implemented with 2 bits, in a way that bitwise AND does your && operation. You can store arrays of it as one per unsigned char, or packed 4 per char to quadruple your throughput for vectorized operations, at the cost of slower access. (Or in general CHAR_BIT/2 per char, but on mainstream C implementations for x86 that's 4.)
F = 00
N = 10 (in binary, so 0b10 in C, aka 2)
T = 11

- conversion to bool with val & 1.
- conversion from bool with 0b11 * b or something to broadcast the low bit to both positions.
F & anything = 0 because F is all-zero bits. N & N == N; that's trivially true for any bit pattern. The "clever" part is that N & T = T & N = N, since the set bits in T are a superset of those in N.
This also works for || with bitwise |: F|N == N and F|T == T because 0|x == x. Also, x|x == x for any same input, so we're still fine there. N = 0b10 won't set the low bit when ORing, but will clear it when ANDing.
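A quick standalone check of the whole table under this encoding (purely illustrative, not part of the wrapper below):

#include <stdio.h>

int main(void) {
  // F = 0b00, N = 0b10, T = 0b11: bitwise & and | reproduce the 3-valued tables
  unsigned char vals[3] = {0x0, 0x2, 0x3};
  const char *names = "FNT";
  for (int i = 0; i < 3; i++) {
    for (int j = 0; j < 3; j++) {
      printf("%c & %c = 0x%x    %c | %c = 0x%x\n",
             names[i], names[j], vals[i] & vals[j],
             names[i], names[j], vals[i] | vals[j]);
    }
  }
  return 0;
}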
I forgot you said C instead of C++, so this bare-bones class wrapper (only sufficient to demo a few test callers) might not be relevant, but a loop doing c1[i] &= c2[i]; in plain C for unsigned char *c1, *c2 will auto-vectorize exactly the same way.
struct NBool {  // Nullable bool
  unsigned char val;

  static const unsigned char F = 0b00;
  static const unsigned char T = 0b11;
  static const unsigned char N = 0b10;  // N&T = N;  N&N = N;  N&F = F

  auto operator&=(NBool rhs) {  // define && the same way if you want, as non-short-circuiting
    val &= rhs.val;
    return *this;
  }
  operator bool() { return val & 1; }

  constexpr NBool(unsigned char x) : val(x) {};
  constexpr NBool& operator=(const NBool &) = default;
};
#include <stdint.h>
#include <stdlib.h>

bool test(NBool a){
  return a;
}

bool test2(NBool a){
  NBool b = NBool::F;
  return a &= b;  // return false
}

void foo(size_t len, NBool *a1, NBool *a2)
{
  for (std::size_t i = 0; i < len; i++) {
    a1[i] &= a2[i];
  }
}
This compiles on Godbolt with GCC and clang. Clang auto-vectorizes fairly nicely with an unrolled loop doing vandps. (A bit of a weird choice by clang; on -march=haswell, vpand has better throughput.) But it's still limited by 1 store and 2 loads per clock anyway; this very much bottlenecks on load/store, even if the data is hot in L1d cache. (Intel's optimization manual says that even though Skylake's peak L1d bandwidth is 2 loads + 1 store per clock (96 bytes with 32-byte vectors), sustained bandwidth is more like 84 bytes per clock.)
It can still come relatively close to 32 bytes ANDed per clock cycle with AVX. So that's 32 NBool & operations per clock, or 128 per clock if you pack 4 NBools per byte.
Packing NBools down to a packed bitmap of 1-bit bools could be done with pslld xmm, 7 / pmovmskb to extract the low bit of each byte (after shifting it to the high bit).
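With SSE2 intrinsics, that pair of instructions might be sketched like this (hypothetical helper name; assumes one NBool per byte with the value in the low 2 bits):

#include <emmintrin.h>
#include <stdint.h>

// Pack 16 byte-sized NBools into a 16-bit bitmap of plain bools
// (1 = true, 0 = false or NA).
uint16_t pack16_to_bools(const unsigned char *p) {
  __m128i v = _mm_loadu_si128((const __m128i *)p);  // load 16 NBool bytes
  v = _mm_slli_epi32(v, 7);               // move each byte's low bit up to its sign bit
  return (uint16_t)_mm_movemask_epi8(v);  // collect the sign bit of every byte
}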
If stored 4 per byte, some SIMD bit-manipulation is in order to pack them down to bools, perhaps vpshufb as a 4-bit LUT to pack pairs of NBools down to a pair of bools in the bottom of a nibble, then combine? Or use scalar BMI2 pext to extract every other bit from 64 bits, if you're on Zen 3 or Haswell or later, for fast pext.