Ultra fast lookup in small sized container of 64-bit integer values using dynamic perfect hashing

Time:04-12

I have input keys that can cheaply be converted to a uint64_t (i.e., each input contains at most 64 bits).

Each unique key (not yet in the map) will be assigned some data (a pointer to an object), much like a std::map<uint64_t, Object*>.

Inserting a new key is not time critical because there will only be a small number of keys in total (fewer than 100), and they are never removed.

A typical implementation would use a std::vector (because of the small number of elements) and either just scan over all elements or use a binary search (e.g. boost::flat_map), but this is not fast enough for me. Once all elements have been inserted the map is static, and it should be possible to find the pointer that belongs to a given key in a mere few clock cycles.

What I am thinking of is determining a perfect hash every time a new key is inserted, and then using this hash (function) to look up the pointers.

This requires two things:

  1. An algorithm to find a cheap hash function that converts a smallish list of given 64-bit values to 8, 9, 10 (or however many are required) bits with no collisions (i.e., a perfect hash function), so that the result can be used directly as an index into a lookup table (of size 256, 512, 1024, ...).

  2. A method to dynamically create such a hash function; where I have to admit that I am not willing to run an external compiler and load the new hash function dynamically ;-).

CodePudding user response:

If you have a hashing function that includes a constant, then for each possible value of that constant you have a "new" function. For example you could hash a 64-bit value to a value between 0-1023 that you could use as an index into a lookup table like this:

int HashValue(int64_t key, int mult)
{
    return (int)(key ^ ((key * mult) >> 32)) & 1023;
}

where mult is a multiplier constant. For a given set of <= 100 keys, you can just try random values of mult until you find one that doesn't result in any collisions. I gave this a go and it typically finds a "perfect" hashing function after about 50 attempts when hashing to a range of 0-1023. For 0-511 it takes around 20000 attempts and for 0-255 it failed.

Example C++ implementation below:

#include <stdint.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#include <list>
#include <unordered_set>

int HashValue(int64_t key, int mult)
{
    return (int)(key ^ ((key * mult) >> 32)) & 1023;

    // slower alternative with more thorough mixing
    // key = key ^ (key * mult);
    // int hash = (int)(key ^ (key >> 32));
    // hash ^= key >> 16;
    // return (hash ^ (hash >> 8)) & 1023;
}

int FindHash(std::list<int64_t> keys)
{
    for(int i = 0; i < 10000; i++) // try 10000 times
    {
        std::unordered_set<int> hashset;
        bool collided = false;  
        int mult = rand();
        for (std::list<int64_t>::iterator it = keys.begin(); it != keys.end(); it++)
        {
            int val = HashValue(*it, mult);
            if(hashset.find(val) != hashset.end())
            {
                collided = true;
                break;
            }
            hashset.insert(val);              
        }
        if(!collided)
        {
            std::cout << "Found collision-free function with mult = " << mult << " on attempt " << i;
            return mult;
        }
    }
    
    std::cout << "Failed to find collision-free hashing function";
    return 0;
}

int main()
{
    // test with 100 random keys
    srand (time(NULL));
    std::list<int64_t> keys = {};        
    for(int i = 0; i < 100; i++)
    {
        // 64 bit random number
        keys.push_back(((int64_t)rand() << 32) | rand()); 
    }

    FindHash(keys);
    
    return 0;
}
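
Once a collision-free mult has been found, the lookup itself could look roughly like the sketch below. This is not part of the answer above: `Object`, `Entry`, `PerfectTable` and `HashSlot` are hypothetical names, and `HashSlot` is an unsigned variant of `HashValue` (so the multiplication wraps instead of overflowing a signed type). Each slot also stores its key, because the hash is only collision-free for the keys that were inserted; an unknown key can still land on an occupied slot.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Object { int id; };  // hypothetical payload type

// Unsigned variant of HashValue: same mixing, slot index in 0..1023.
inline unsigned HashSlot(uint64_t key, uint64_t mult)
{
    return static_cast<unsigned>(key ^ ((key * mult) >> 32)) & 1023u;
}

struct Entry
{
    uint64_t key = 0;
    Object* value = nullptr;  // nullptr marks an empty slot
};

// Hypothetical fixed-size table; mult must already be collision-free for keys.
class PerfectTable
{
public:
    PerfectTable(uint64_t mult, const std::vector<uint64_t>& keys,
                 const std::vector<Object*>& values)
        : mult_(mult), slots_(1024)
    {
        assert(keys.size() == values.size());
        for (std::size_t i = 0; i < keys.size(); ++i)
        {
            Entry& e = slots_[HashSlot(keys[i], mult_)];
            assert(e.value == nullptr);  // fires if mult is not collision-free
            e.key = keys[i];
            e.value = values[i];
        }
    }

    // One multiply, shift, xor and mask, then one load and one compare.
    Object* Find(uint64_t key) const
    {
        const Entry& e = slots_[HashSlot(key, mult_)];
        return (e.value != nullptr && e.key == key) ? e.value : nullptr;
    }

private:
    uint64_t mult_;
    std::vector<Entry> slots_;
};
```

With at most 100 keys the table is rebuilt from scratch whenever a new key forces a new mult, which fits the requirement that insertion is not time critical.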

CodePudding user response:

This is going to be a partial answer, as I am still working on it.

The random multiplication factor approach

I tried the same approach as samgak: my hash function was just a multiplication of the uint64_t key by a uint64_t multiplication factor, and I tried random values for the factor, looking for the one where the fewest high bits of the result are enough to tell all keys apart. It turns out that this can easily take up to 0.3 seconds to reach a 9-bit output.

Here I used the following function to find the number of high bits required:

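#include <algorithm>  // std::sort
#include <bit>        // std::countl_zero (C++20)
#include <cstdint>
#include <vector>

// Return the maximal right-shift that is possible on
// the values in `hashes` such that the results are still all different.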
int max_shift(std::vector<uint32_t> hashes)
{
  std::sort(hashes.begin(), hashes.end());
  int sm = 0;
  int sz = hashes.size();
  for (int i = 1; i < sz; ++i)
  {
    uint32_t d = hashes[i - 1] ^ hashes[i];
    int s = std::countl_zero(d);
    if (s > sm)
      sm = s;
  }
  return 31 - sm;
}

The hashes here are, for example, just the most significant 32 bits of the result of the multiplication; the lower 32 bits would be shifted out / ignored anyway.
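
To make the random search concrete, here is a minimal sketch of how such a loop might look. It is an illustration, not the code I actually ran: `TopBitsDistinct` and `FindMultiplier` are hypothetical names, the distinctness check is a plain sort instead of max_shift, and forcing the multiplier to be odd is an extra assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// True if the top `bits` bits of key * mult are different for every key.
bool TopBitsDistinct(const std::vector<uint64_t>& keys, uint64_t mult, int bits)
{
    assert(bits >= 1 && bits <= 32);
    std::vector<uint32_t> slots;
    slots.reserve(keys.size());
    for (uint64_t key : keys)
        slots.push_back(static_cast<uint32_t>((key * mult) >> (64 - bits)));
    std::sort(slots.begin(), slots.end());
    return std::adjacent_find(slots.begin(), slots.end()) == slots.end();
}

// Try up to `attempts` random multipliers.
// Returns true and stores the multiplier in mult_out on success.
bool FindMultiplier(const std::vector<uint64_t>& keys, int bits, int attempts,
                    std::mt19937_64& rng, uint64_t& mult_out)
{
    for (int i = 0; i < attempts; ++i)
    {
        // Assumption, not from the text above: keep the multiplier odd,
        // so that multiplication is invertible modulo 2^64.
        uint64_t mult = rng() | 1;
        if (TopBitsDistinct(keys, mult, bits))
        {
            mult_out = mult;
            return true;
        }
    }
    return false;
}
```

A caller would start with a generous `bits` and lower it until the search no longer succeeds within the time it is willing to wait.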

100 values, being fewer than 128, theoretically fit into 7 bits, yet I never even found an 8-bit solution with the above approach. That isn't surprising once you realize that the chance of a random attempt succeeding corresponds to the Birthday Problem with 100 people and 256 birthdays, which has a 1 in 5807421181 chance of having no collisions. Since waiting up to 0.3 seconds, and theoretically much longer, was a bit annoying, I started to think about a way to calculate the hash function.
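
The birthday-problem figure can be checked with a few lines of code. The sketch below assumes every attempt behaves like a uniformly random assignment of keys to slots, which a real hash function only approximates; `NoCollisionProbability` is a hypothetical helper name.

```cpp
#include <cassert>

// Probability that `keys` uniformly random hashes all land in different
// slots out of `slots`: the product of (slots - i) / slots for i in [0, keys).
double NoCollisionProbability(int keys, int slots)
{
    assert(keys >= 0 && slots > 0);
    double p = 1.0;
    for (int i = 0; i < keys; ++i)
        p *= static_cast<double>(slots - i) / slots;
    return p;
}
```

For 100 keys and 256 slots, 1.0 / NoCollisionProbability(100, 256) should come out on the order of 5.8 billion, in line with the number quoted above. Calling it with 512 or 1024 slots gives a rough idea of how many random attempts to expect for larger tables.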

In order to be able to do any calculations, I decided to use a linear algebra approach.

Using linear algebra

The idea is to use linear algebra (matrices, vectors). And since we are working with arbitrary bits, the most logical thing to do is to work over ℤ₂, the field of integers modulo 2.
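
As a rough illustration of what working over ℤ₂ means in code: addition is XOR and multiplication is AND, so a matrix-vector product reduces to taking the parity of `row & key` for each row. The sketch below only shows how such a linear map could be evaluated; it says nothing about how to choose the matrix, and `Parity`, `LinearHash` and `rows` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum over Z_2 (i.e. parity) of all 64 bits of x.
inline unsigned Parity(uint64_t x)
{
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return static_cast<unsigned>(x & 1);
}

// Multiply the matrix whose rows are `rows` by the 64-bit key vector over Z_2.
// Bit i of the result is the dot product of rows[i] and key.
unsigned LinearHash(const std::vector<uint64_t>& rows, uint64_t key)
{
    assert(rows.size() <= 32);
    unsigned h = 0;
    for (std::size_t i = 0; i < rows.size(); ++i)
        h |= Parity(rows[i] & key) << i;
    return h;
}
```

Because the map is linear, LinearHash(rows, a ^ b) equals LinearHash(rows, a) ^ LinearHash(rows, b), which is the property that makes collisions something one can calculate with.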

Let all the input keys be represented by 64-dimensional column vectors over ℤ₂:
