How to find the matching element in a tiny array with unique elements as quickly as possible?


Inside the Erlang runtime system, persistent hashmaps are represented as hash-array-mapped-tries if they are big, and 'flatmaps' if they are small.

I recently was nerdsniped into looking for ways to optimize this. ^_^'

A flatmap has the following characteristics:

  • There are at most 32 keys (and 32 values);
  • They are stored unordered in a C array;
  • There are no duplicate keys;
  • Keys are unboxed: we can directly compare two keys as uint64_t values to check for a match.
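
For illustration, here is a simplified model of the data these functions operate on (a hypothetical C struct for the sake of the example; the real ERTS flatmap layout differs):

#include <stdint.h>

typedef struct {
  uint64_t size;      /* number of used slots, at most 32 */
  uint64_t keys[32];  /* unordered, unique, unboxed keys */
  uint64_t vals[32];  /* vals[i] belongs to keys[i] */
} flatmap_model;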

The current implementation of this is:

uint64_t *original_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
  uint64_t n = max_size;
  uint64_t i;

  for (i = 0; i < n; ++i) {
    if (keys[i] == key) {
      return &vals[i];
    }
  }
  return NULL;
}

(Simplified from the original)

But this does not use the above information at all. I tried to see what would happen if the compiler was made aware that:

  • there are at most 32 elements
  • It's fine to return 'a' match rather than 'the first': since the keys are unique, there will only ever be at most a single match.

This led to the following implementation:

uint64_t *latereturn_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size) {
  uint64_t n = max_size < 32 ? max_size : 32; /* a flatmap never has more than 32 entries */
  uint64_t i;

  uint64_t *res = NULL;
  for (i = 0; i < n; ++i) {
    if (keys[i] == key) {
      res = &vals[i];
    }
  }
  return res;
}

Looking at Compiler Explorer, we can see that Clang and GCC are now able to vectorize and unroll the loop. Benchmarking this shows a 5-15% speedup.
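
For reference, a minimal sketch of the kind of microbenchmark harness one could use to reproduce such a measurement (the key pattern, iteration count, and use of clock_gettime are my choices, not the original benchmark; it assumes latereturn_flatmap_get from above is linked in):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

uint64_t *latereturn_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t max_size);

int main(void) {
  enum { N = 32, ITERS = 10000000 };
  uint64_t keys[N], vals[N];
  for (uint64_t i = 0; i < N; ++i) {
    keys[i] = i * 2654435761u + 1;  /* arbitrary distinct keys */
    vals[i] = i;
  }

  struct timespec t0, t1;
  uint64_t sink = 0;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (uint64_t it = 0; it < ITERS; ++it) {
    uint64_t *p = latereturn_flatmap_get(keys, vals, keys[it % N], N);
    sink += p ? *p : 0;             /* keep the result live so the loop is not optimized away */
  }
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
  printf("%.2f ns per lookup (sink=%llu)\n", ns / ITERS, (unsigned long long)sink);
  return 0;
}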


However, now for the question: Is it possible to go further?

For instance, is it possible to indicate to the compiler somehow that all elements in the array will be unique which might enable even more optimizations?

Or are there maybe ways to manually write some SIMD instructions directly which are even faster?

CodePudding user response:

I’m not sure how much faster it will get, if at all, but here’s a manually vectorized AVX2 version of your function.

#include <immintrin.h>
#include <stdint.h>

uint64_t* flatmap_avx2( const uint64_t* keys, uint64_t* vals, uint64_t key, uint64_t max_size )
{
    const __m256i needle = _mm256_set1_epi64x( (int64_t)key );

    const uint64_t* const keysEnd = keys + max_size;
    const uint64_t* const keysEndAligned = keys + ( max_size / 4 ) * 4;

    // Main loop: compare 4 keys per iteration.
    for( ; keys < keysEndAligned; keys += 4, vals += 4 )
    {
        __m256i src = _mm256_loadu_si256( ( const __m256i* )keys );
        __m256i eq = _mm256_cmpeq_epi64( needle, src );
        uint32_t mask = (uint32_t)_mm256_movemask_epi8( eq );
        if( 0 == mask )
            continue;
        uint32_t byteIndex = _tzcnt_u32( mask );
        // The index is a multiple of 8: the movemask is per byte, and in assembly all
        // addresses are expressed in bytes, yet adding to a pointer in C advances by
        // elements, not bytes; that's why we cast before adding.
        return (uint64_t*)( ( (uint8_t*)vals ) + byteIndex );
    }

    // Scalar remainder for the last 0-3 elements.
    for( ; keys < keysEnd; keys++, vals++ )
        if( *keys == key )
            return vals;

    return nullptr;
}

If you're building this with VC++, ideally add #pragma loop( no_vector ) before the second for loop in the function.

Similarly, if you’re building with gcc or clang, ideally add __attribute__((optimize("no-tree-vectorize"))) before the whole function.

Without these compiler-specific shenanigans, compilers may decide to automatically vectorize the second for loop with the remainder, inflating the code for no good reason.

Another performance-related thing: if you can, align your keys pointer to 32 bytes; the loads will become slightly faster.
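
For what it's worth, a small sketch of how the 32-byte alignment could be arranged in C (aligned_alloc is C11; alloc_keys_aligned and load4_aligned are illustrative helpers, and the aligned _mm256_load_si256 is only safe if the pointer really is 32-byte aligned):

#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate the key array with 32-byte alignment; the size is rounded up to a
   multiple of 32 bytes to satisfy aligned_alloc. */
uint64_t *alloc_keys_aligned(size_t count) {
  size_t bytes = (count * sizeof(uint64_t) + 31) & ~(size_t)31;
  return (uint64_t *)aligned_alloc(32, bytes);
}

/* With a guaranteed-aligned pointer, the unaligned load in the main loop can
   become an aligned one. */
static inline __m256i load4_aligned(const uint64_t *keys) {
  return _mm256_load_si256((const __m256i *)keys);
}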

CodePudding user response:

I have an idea that goes this way:

The idea here is to get rid of the branch itself, which, due to branch misprediction on all modern processors (roughly the last 15 years), can waste a lot of cycles; removing it should bump the performance of this function.

What I want is something that gets executed on all keys, such that the results are all mixed together but still indicate where our match is.

So the idea in pseudo-code is:

res = 0 // Init to some neutral value
res = res <op> f(keys,key) // do an operation to mix the results with something that is a function of the "search"

Elaborating further:

  • I know that when I do a bitwise XOR between a number and itself I get all zeros, while in all other cases I get a non-zero value.
  • I also know that XORing with 1 toggles a bit.
  • Processors also have an instruction to count the number of ones, so I can use that to reduce every non-zero number to a 1 without using any if.
  • The assumption is that running 32 operations every time is faster than branching on each comparison.
  • Another assumption is that bitwise operations are easily converted into SIMD instructions and are fast on any processor.

Something that has these properties is the following:

(1 ^ reduce_or(keys[i]^key)) << i

keys[i]^key -> gives 0 if key matches, a random number otherwise
reduce_or   -> gives 0 if key matches, 1 otherwise
1^          -> gives 1 if key matches, 0 otherwise
<< i        -> moves the 1 to the bit position at which the key was matched
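
For concreteness, here is one branch-free way reduce_or could be written in C (this helper is my illustration, not something from the original post; plain !!x works too and compilers already emit branch-free code for it):

#include <stdint.h>

/* Collapse any non-zero 64-bit value to 1 and leave 0 as 0, without a branch. */
static inline uint64_t reduce_or(uint64_t x) {
  return (x | (0 - x)) >> 63;  /* the top bit of x | -x is set exactly when x != 0 */
}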

So the final idea:

res = 0
res = res | ((1 ^ reduce_or(keys[0]^key)) << 0);
res = res | ((1 ^ reduce_or(keys[1]^key)) << 1);
res = res | ((1 ^ reduce_or(keys[2]^key)) << 2);
res = res | ((1 ^ reduce_or(keys[3]^key)) << 3);
res = res | ((1 ^ reduce_or(keys[4]^key)) << 4);
res = res | ((1 ^ reduce_or(keys[5]^key)) << 5);
..;
res = res | ((1 ^ reduce_or(keys[31]^key)) << 31);

After this pass we should have a number like 000000100000000, and the 1 is at the index at which the key was found.

We still need to:

  • Get an integer from the one-hot encoded number.
  • Get the address.

To go from the one-hot encoded number to the position, we just take the log2 of the result. However, this could be slow, and I don't really have a solution for it. Maybe, instead of shifting by i, we can multiply by i: this gives us 0 when the key does not match and i, that is, the index itself, when it is the right key.

 res = 0
 res = res + (1 ^ reduce_or(keys[0]^key)) * 0;
 res = res + (1 ^ reduce_or(keys[1]^key)) * 1;
 res = res + (1 ^ reduce_or(keys[2]^key)) * 2;
 res = res + (1 ^ reduce_or(keys[3]^key)) * 3;
 res = res + (1 ^ reduce_or(keys[4]^key)) * 4;
 res = res + (1 ^ reduce_or(keys[5]^key)) * 5;
 ..;
 res = res + (1 ^ reduce_or(keys[31]^key)) * 31;

We should test if doing the log is faster than doing sums and multiplications.

Getting the address is then just a matter of pointer arithmetic:

addr = vals + res

Anyway, this should give branchless code :) I am curious to see whether it will be faster or not!
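
To tie it together, here is a minimal C sketch of the one-hot variant, assuming the reduce_or helper sketched above and GCC/Clang's __builtin_ctzll for the log2 step. (One caveat with the multiply-and-add variant: a result of 0 cannot distinguish "no match" from "a match at index 0", whereas a zero one-hot mask cleanly signals "no match".)

#include <stdint.h>

uint64_t *branchless_flatmap_get(uint64_t *keys, uint64_t *vals, uint64_t key, uint64_t n) {
  uint64_t mask = 0;
  for (uint64_t i = 0; i < n; ++i) {   /* n is at most 32 for a flatmap */
    /* (1 ^ reduce_or(...)) is 1 only for the matching key; shift it to bit i. */
    mask |= (1 ^ reduce_or(keys[i] ^ key)) << i;
  }
  if (mask == 0)
    return NULL;                        /* no key matched */
  return &vals[__builtin_ctzll(mask)];  /* position of the single set bit */
}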
