High performance table structure for really small tables (<10 items usually) where once the table is created it does not change


I am searching for a high-performance C++ structure for a table. The table will have void* keys and uint32_t values.

The table itself is very small and will not change after creation. The first idea that came to my mind is using something like ska::flat_hash_map<void*, int32_t> or std::unordered_map<void*, int32_t>. However, those would be overkill and would not give me the performance I want (they are designed to handle large numbers of items as well).

So I thought about using std::vector<std::pair<void*, int32_t>>, sorting it upon creation, and scanning it linearly, as sketched below. The next idea would be to use SIMD instructions, but I am not sure that is possible with the current structure.
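Concretely, the scan I have in mind is just this (a quick sketch; find_linear and not_found are illustrative names):

#include <cstdint>
#include <utility>
#include <vector>

// Linear scan over a small, immutable vector of (key, value) pairs.
int32_t find_linear(const std::vector<std::pair<void*, int32_t>>& table,
                    void* key, int32_t not_found)
{
    for (const auto& [k, v] : table)
        if (k == key)
            return v;
    return not_found;
}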

Another solution which I will shortly evaluate is like this:

struct Group
{
    void*   keys[5];   // search using SIMD
    int32_t values[5];
}; // 60 bytes, fits in a 64-byte cache line

struct Table
{
    Group* groups;
    size_t capacity;
};
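For the SIMD search within a group, I imagine something roughly like this (an untested sketch, assuming AVX2 and GCC/Clang builtins; unused slots would hold a key that can never match, e.g. nullptr):

#include <immintrin.h>
#include <cstdint>

// Compare the first 4 keys of a group in one AVX2 instruction, then check
// the 5th one scalar. The mask has one bit per 64-bit lane.
int32_t group_find(const Group& g, void* key, int32_t not_found)
{
    __m256i needle = _mm256_set1_epi64x(
        static_cast<long long>(reinterpret_cast<uintptr_t>(key)));
    __m256i keys = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(g.keys));
    int mask = _mm256_movemask_pd(
        _mm256_castsi256_pd(_mm256_cmpeq_epi64(keys, needle)));
    if (mask)
        return g.values[__builtin_ctz(mask)];
    return g.keys[4] == key ? g.values[4] : not_found;
}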

Are there any better options? I need only one operation: looking up values by key; no modification, nothing else. Thanks!

EDIT: another thing I think I should mention is the access pattern: suppose I have an array of these tables; each time I will look up in a random one from the array.

CodePudding user response:

You may want to look into perfect hashing -- not too difficult, and can provide simple constant time lookups. It can take technically unbounded time to create the table, though, and it's not as fast as a regular hash table when the regular hash table gets lucky.
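For illustration, the "technically unbounded" construction step can be as simple as retrying random multipliers until the mapping is collision-free (a sketch, not from any particular library; names are made up):

#include <cstdint>
#include <random>
#include <vector>

// Try random odd multipliers until (key * m) >> (64 - log2_size) maps every
// key to a distinct slot. With a table size >= 2n and few keys this
// terminates quickly in practice, but there is no hard upper bound.
uint64_t find_perfect_multiplier(const std::vector<void*>& keys,
                                 unsigned log2_size)
{
    std::mt19937_64 rng(42);
    for (;;) {
        uint64_t m = rng() | 1;  // odd multiplier
        std::vector<bool> used(size_t{1} << log2_size, false);
        bool ok = true;
        for (void* k : keys) {
            // Assumes a 64-bit platform where void* fits in uint64_t.
            uint64_t h = (reinterpret_cast<uint64_t>(k) * m) >> (64 - log2_size);
            if (used[h]) { ok = false; break; }
            used[h] = true;
        }
        if (ok) return m;  // lookup is then one multiply, shift, and index
    }
}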

I think a nice alternative is an optimization of your simple linear probing idea.

Your lookup procedure would look like this:

// Each slot records how far the run of entries for its home slot extends.
struct Slot { void* key; int32_t value; uint8_t max_extent; };

Slot *s = &table[hash(key)];
Slot *e = s + s->max_extent;
for (; s < e; ++s) {
    if (s->key == key) {
        return s->value;
    }
}
return NOT_FOUND;

Of course you want max_extent to be as small as possible. Pick a table size (at least 2n) that makes it <= 1 in most cases, and try a few different hash functions before picking the one that produces the best results by whatever metric you like. Your hash can be as simple as key % P, where trying different hashes means trying different P values. Fill your hash table in hash(key) order to produce the best result, as in the construction sketch below.
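A construction sketch matching this description (using the Slot definition from the lookup snippet above; names are illustrative, and the table is assumed to be pre-sized and zero-initialized):

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Place items in home-slot order so the run for each home slot stays
// contiguous, then record the run length at the home slot.
void build(std::vector<Slot>& table,
           std::vector<std::pair<void*, int32_t>> items,
           size_t (*hash)(void*))
{
    std::sort(items.begin(), items.end(), [&](const auto& a, const auto& b) {
        return hash(a.first) < hash(b.first);
    });
    size_t next_free = 0;
    for (const auto& [key, value] : items) {
        size_t home = hash(key);
        size_t pos  = std::max(home, next_free);  // first free slot at or after home
        table[pos].key   = key;
        table[pos].value = value;
        next_free = pos + 1;
        // Extend the recorded run; small for a good hash, fits in uint8_t here.
        table[home].max_extent = static_cast<uint8_t>(pos + 1 - home);
    }
}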

CodePudding user response:

Linear probing is likely the fastest solution in this case on common mainstream architectures, especially since the number of elements is very small and bounded (i.e. <10). Sorting the items should not speed up the probing with so few items (it would only be useful for a binary search, which is much more expensive in this case).

If you want to use SIMD instructions, then you need to use a structure of arrays instead of an array of structures for the sake of performance. This means you should use std::pair<std::vector<void*>, std::vector<int32_t>> instead of std::vector<std::pair<void*, int32_t>> (which interleaves void* keys and int32_t values in memory, with some padding overhead due to the alignment constraints of void* on 64-bit architectures). Having two std::vectors is not great either, because you pay their overhead twice. As mentioned by @JorgeBellon in the comments, you can simply use a std::array instead of std::vector, assuming the number of items is known or bounded, as in the sketch below.
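For illustration, a minimal structure-of-arrays table with a bounded size could look like this (a sketch; SmallTable is a made-up name, and unused key slots are assumed to be padded with a value that can never match):

#include <array>
#include <cstddef>
#include <cstdint>

// All keys are contiguous, so a SIMD or unrolled scalar scan touches only
// them; the value is fetched once a match is found.
template <std::size_t N>
struct SmallTable {
    std::array<void*, N>   keys{};
    std::array<int32_t, N> values{};

    int32_t find(void* key, int32_t not_found) const {
        for (std::size_t i = 0; i < N; ++i)
            if (keys[i] == key)
                return values[i];
        return not_found;
    }
};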

A possible optimization with SIMD instructions is to compact the key pointers on 64-bit architectures by splitting them into their 32-bit lower and upper parts. Indeed, it is very unlikely that two pointers share the same lower part (least significant bits) while having different upper parts. This trick lets you check twice as many pointers at a time, as sketched below.
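Here is a sketch of that trick with AVX2 (illustrative names; the low 32 bits of up to 8 keys are stored contiguously, and a scalar check confirms the full pointer on the rare low-part hit):

#include <immintrin.h>
#include <cstdint>

// One 256-bit compare screens the low 32 bits of 8 candidate keys at once.
int32_t find_compact(const uint32_t keys32[8], void* const full_keys[8],
                     const int32_t values[8], void* key, int32_t not_found)
{
    uint32_t low = static_cast<uint32_t>(reinterpret_cast<uintptr_t>(key));
    __m256i needle = _mm256_set1_epi32(static_cast<int>(low));
    __m256i lows   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(keys32));
    int mask = _mm256_movemask_ps(
        _mm256_castsi256_ps(_mm256_cmpeq_epi32(lows, needle)));
    while (mask) {                   // usually 0 or 1 bit set
        int i = __builtin_ctz(mask);
        if (full_keys[i] == key)     // confirm the upper 32 bits
            return values[i];
        mask &= mask - 1;
    }
    return not_found;
}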

Note that using SIMD instructions may not be so great in this case in practice. This is especially true if the number of items is smaller than what fits in a SIMD vector. For example, with AVX2 (on x86-64 processors), you can work on 4 64-bit values at a time (or 8 32-bit values), but if you have fewer than 8 values, then you need to mask out the unwanted lanes (or even avoid loading them if the memory buffer does not contain padding). This introduces an additional overhead. This is less of a problem with AVX-512 and SVE (only available on a small fraction of processors yet) since they provide advanced masking operations. Moreover, some processors lower their frequency when they execute SIMD instructions (especially with AVX-512, although the down-clocking is not as strong with integer instructions). SIMD instructions also introduce some additional latency compared to a scalar version (which can be better pipelined), and modern processors tend to be able to execute more scalar instructions in parallel than SIMD ones. For all these reasons, it is certainly a good idea to try writing a scalar branchless implementation (possibly unrolled for better performance if the number of items is known at compile time), as in the sketch below.
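Such a scalar branchless version could look like this (a sketch; with a compile-time N the loop is fully unrolled and the ternary select typically compiles to a conditional move rather than a branch):

#include <cstddef>
#include <cstdint>

// Accumulate the matching value with selects instead of branching, so there
// is no data-dependent branch to mispredict.
template <std::size_t N>
int32_t find_branchless(void* const (&keys)[N], const int32_t (&values)[N],
                        void* key, int32_t not_found)
{
    int32_t result = not_found;
    for (std::size_t i = 0; i < N; ++i)
        result = (keys[i] == key) ? values[i] : result;
    return result;
}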
