Home > Blockchain >  Fastest way to determine if a uint64 has been "seen" already
Fastest way to determine if a uint64 has been "seen" already

Time:05-21

I've been interested in optimizing "renumbering" algorithms that can relabel an arbitrary array of integers with duplicates into labels starting from 1. Sets and maps are too slow for what I've been trying to do, as are sorts. Is there a data structure that only remembers if a number has been seen or not reliably? I was considering experimenting with a bloom filter, but I have >12M integers and the target performance is faster than a good hashmap. Is this possible?

Here's a simple example pseudo-c algorithm that would be slow:

// note: all elements guaranteed > 0
std::vector<uint64_t> arr = { 21942198, 91292, 21942198, ... millions more };

std::unordered_map<uint64_t, uint64_t> renumber;
renumber.reserve(arr.size());

uint64_t next_label = 1;
for (uint64_t i = 0; i < arr.size(); i  ) {
    uint64_t elem = arr[i];
    if (renumber[elem]) {
        arr[i] = renumber[elem];
    }
    else {
        renumber[elem] = next_label;
        arr[i] = next_label;
          next_label;
    }
}

Example input/output:

{ 12, 38, 1201, 120, 12, 39, 320, 1, 1  }
->
{ 1, 2, 3, 4, 1, 5, 6, 7, 7 }

CodePudding user response:

Updated with bug fix

First, anytime you experience "slowness" with a std:: collection class like vector or map, just recompile with optimizations (release build). There is usually a 10x speedup.

Now to your problem. I'll show a two-pass solution that runs in O(N) time. I'll leave it as an exercise for you to convert to a one-pass solution. But I'll assert that this should be fast enough, even for vectors with millions of items.

First, declare not one, but two unordered maps:

std::unordered_map<uint64_t, uint64_t> element_to_label;
std::unordered_map<uint64_t, std::pair<uint64_t, std::vector<uint64_t>>> label_to_elements;

The first map, element_to_label maps an integer value found in the original array to it's unique label.

The second map, label_to_elements maps to both the element value and the list of indices that element occurs in the original array.

Now to build these maps:

element_to_label.reserve(arr.size());
label_to_elements.reserve(arr.size());

uint64_t next_label = 1;
for (size_t index = 0; index < arr.size(); index  )
{
    const uint64_t elem = arr[index];
    auto itor = element_to_label.find(elem);
    if (itor == element_to_label.end())
    {
        // new element
        element_to_label[elem] = next_label;

        auto &p = label_to_elements[next_label];
        p.first = elem;
        p.second.push_back(index);
        next_label  ;
    }
    else
    {
        // existing element
        uint64_t label = itor->second;
        label_to_elements[label].second.push_back(index);
    }
}

When the above code runs, it's built up a database all values in the array, their labels, and indices where they occur.

So now to renumber the array such that all elements are replaced with their smaller label value:

for (auto itor = label_to_elements.begin(); itor != label_to_elements.end(); itor  )
{
    uint64_t label = itor->first;
    auto& p = itor->second;
    uint64_t elem = p.first; // technically, this isn't needed. It's just useful to know which element value we are replacing from the original array
    const auto& vec = p.second;

    for (size_t j = 0; j < vec.size(); j  )
    {
        size_t index = vec[j];
        arr[index] = label;
    }
}

Notice where I assign variables by reference with the & operator to avoid making an expensive copy of any value in the maps.

So if your original vector or array was this:

{ 100001, 2000002, 300003, 400004, 400004, 300003, 2000002, 100001 };

Then the application of labels would render the array as this: {1,2,3,4,4,3,2,1}

And what's nice you still have a quick O(1) look operator to map any label in that set back to its original element value using label_to_elements

CodePudding user response:

Your algorithm is not bad, but the appropriate data structure to use for the map is a hash table with open addressing.

As explained in this answer, std::unordered_map can't be implemented that way: https://stackoverflow.com/a/31113618/5483526

So if the STL container is really too slow for you, then you can do better by making your own.

Note, however, that:

  1. 90% of the time, when someone complains about STL containers being too slow, they are running a debug build with optimizations turned off. Make sure you are running a release build compiled with optimizations on. Running your code on 12M integers should take a few milliseconds at most.
  2. You are accessing the map multiple times when only once is required, like this:
uint64_t next_label = 1;
for (size_t i = 0; i < arr.size(); i  ) {
    uint64_t elem = arr[i];
    uint64_t &label = renumber[elem];
    if (!label) {
       label = next_label  ;
    }
    arr[i] = label;
}

Note that the unordered_map operator [] returns a reference to the associated value (creating it if it doesn't exist), so you can test and modify the value without having to search the map again.

  • Related