Why is a C++ implementation using std::unordered_map much slower than the equivalent Python implementation?


For this I am using the coordinate transformation (x,y) -> 1000*x + y for efficiency.

It's not very important to understand what the code is doing, but it is for this problem: https://oeis.org/A337663

This simply adds ones to the board and then removes them as a metric for performance:

###################
#1###1###1###1###1#
###################

It also keeps track of the sums for the empty neighbor cells that are touching a number on the board.

#include <iostream>
#include <vector>
#include <unordered_map>
#include <unordered_set>
#include <ctime>

using namespace std;
// I know this is bad practice, but it's just for readability for now

void add_update_edges_and_used(int spot, unordered_map<int, unordered_set<int> > &edge_sums_to_locations, unordered_map<int, int> &edge_locations_to_sums,
                               unordered_set<int> &used_locations, int current_number) {
    used_locations.insert(spot);

    vector<int> neighbors { spot+1000,spot-1000,
                            spot+1,spot-1,
                            spot+1000+1,spot-1000+1,
                            spot+1000-1,spot-1000-1 };

    for (int neighbor : neighbors) {
        if (used_locations.count(neighbor) == 0) {
            if (edge_locations_to_sums.count(neighbor)) {
                edge_sums_to_locations.at(edge_locations_to_sums.at(neighbor)).erase(neighbor);
                edge_locations_to_sums.at(neighbor) += current_number;
            } else {
                edge_locations_to_sums.insert({neighbor, current_number});
            }

            int new_neighbor_sum = edge_locations_to_sums[neighbor];
            if (edge_sums_to_locations.count(new_neighbor_sum)) {
                edge_sums_to_locations.at(new_neighbor_sum).insert(neighbor);
            } else {
                unordered_set<int> new_edge_sum_locations;
                new_edge_sum_locations.insert(neighbor);
                edge_sums_to_locations.insert({new_neighbor_sum, new_edge_sum_locations});
            }

        }
    }
}

int main() {

    std::clock_t start_time = std::clock();

    unordered_map<int, unordered_set<int> > edge_sums_to_locations;
    unordered_map<int, int> edge_locations_to_sums;
    unordered_set<int> used_locations;


    for (int q=0; q<1000; q++) {
        edge_sums_to_locations.clear();
        edge_locations_to_sums.clear();
        used_locations.clear();

        for (int i=0; i<100; i++) {
            add_update_edges_and_used(i*4, edge_sums_to_locations, edge_locations_to_sums,
                                      used_locations, 1);
        }
    }

    std::clock_t tot_time = std::clock() - start_time;
    std::cout << "Time: "
              << ((double) tot_time) / (double) CLOCKS_PER_SEC
              << " seconds" << std::endl;

    return 0;
}

Takes ~1 second

import time
def add_update_edges_and_used(spot, edge_sums_to_locations, edge_locations_to_sums, 
                              used_locations, current_number):
    used_locations.add(spot)
    
    neighbors = {spot+1000,spot-1000,
                spot+1,spot-1,
                spot+1000+1,spot-1000+1,
                spot+1000-1,spot-1000-1}
    unused_neighbors = neighbors.difference(used_locations)
    
    for neighbor in unused_neighbors:
        if neighbor in edge_locations_to_sums.keys():
            edge_sums_to_locations[edge_locations_to_sums[neighbor]].remove(neighbor)
            edge_locations_to_sums[neighbor] += current_number
        else:
            edge_locations_to_sums[neighbor] = current_number
        new_neighbor_sum = edge_locations_to_sums[neighbor]
        if new_neighbor_sum in edge_sums_to_locations.keys():
            edge_sums_to_locations[new_neighbor_sum].add(neighbor)
        else:
            edge_sums_to_locations[new_neighbor_sum] = {neighbor}
            
start_time = time.time()
start_cpu_time = time.clock()

for q in range(1000):
    edge_sums_to_locations = {} #unordered map of ints to unordered set of ints
    edge_locations_to_sums = {} #unordered map of ints to ints
    used_locations = set() #unordered set of ints

    for i in range(100):
        add_update_edges_and_used(i*4, edge_sums_to_locations, edge_locations_to_sums, 
                                  used_locations, 1)

print(f'CPU time {time.clock() - start_cpu_time}')
print(f'Wall time {time.time() - start_time}')

Takes ~0.4 seconds

This problem persists when scaled up and, based on profiling, it is not related to the erase function but to insert and remove.

I have always heard that C++ is generally quicker, so I was hoping I could improve the speed of this function by converting it from Python to C++.

CodePudding user response:

Most of the execution time should be spent in cache misses, allocations and the overheads of the container implementations (i.e. low-level details). This is why enabling optimizations should have only a small (if not negligible) impact: the compiler cannot easily optimize the internals of hash-table containers. The gap between CPython and the targeted C++ STL implementation comes from the different implementations of the containers.

The C++ unordered_map and unordered_set implementations use separate chaining for the hash table. This is due to several constraints in the C++ standard (as described in the provided answer). Put shortly, the implementation is basically a bucket array of linked lists (i.e. a vector<list<pair<Key,Value>>>), which is known to be pretty inefficient. Moreover, some implementations also use a modulus to transform the hash into a bucket index, and this operation is known to be pretty slow on most platforms.
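
For illustration only, here is a minimal sketch of what such a separate-chaining layout roughly looks like (this is not the actual libstdc++/libc++ code, and the ChainedMap name is made up): every lookup hashes the key, takes a modulus, and then chases pointers through a chain of individually allocated nodes, which is what makes it cache-unfriendly.

#include <cstdio>
#include <forward_list>
#include <functional>
#include <utility>
#include <vector>

// Toy separate-chaining map: a bucket array where each bucket is a linked list
// of key/value nodes, i.e. roughly vector<list<pair<Key,Value>>>. Only a sketch
// of the general layout, not a real STL implementation.
template <class Key, class Value>
struct ChainedMap {
    std::vector<std::forward_list<std::pair<Key, Value>>> buckets;

    ChainedMap() : buckets(64) {}

    std::size_t bucket_of(const Key& key) const {
        // Hash, then a modulus to map the hash to a bucket index.
        return std::hash<Key>{}(key) % buckets.size();
    }

    void insert(const Key& key, const Value& value) {
        // Every insertion allocates a new list node on the heap.
        buckets[bucket_of(key)].push_front({key, value});
    }

    Value* find(const Key& key) {
        // Lookup chases pointers through the chain: each node lives in its own
        // allocation, so this is where most of the cache misses come from.
        for (auto& kv : buckets[bucket_of(key)])
            if (kv.first == key) return &kv.second;
        return nullptr;
    }
};

int main() {
    ChainedMap<int, int> m;
    m.insert(1004, 1);
    if (const int* v = m.find(1004)) std::printf("value: %d\n", *v);  // prints "value: 1"
}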

The CPython 3.6+ implementation uses a growing array of integer indices referencing a growing array of entries. Each entry is basically a tuple containing the hash, the key and the associated value. See here for more information about this. CPython uses an open-addressing method, more specifically a perturbation-based probing scheme, which is pretty efficient as long as the load factor stays small and there are not many collisions (otherwise the table needs to be resized to keep the load factor low).
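
As a rough illustration of that layout, here is a simplified sketch in C++ of an open-addressed table whose small index array points into a dense, insertion-ordered entry array. This is not CPython's actual code: it uses plain linear probing, a fixed 128-slot table and 1-byte indices, with no resizing, and the CompactIntMap name is made up.

#include <cstdint>
#include <cstdio>
#include <functional>
#include <vector>

// Sketch of a CPython-style "compact dict": a small open-addressed index table
// pointing into a packed array of (hash, key, value) entries. Real CPython uses
// perturbed probing, table resizing and variable-width indices; none of that is
// reproduced here.
struct CompactIntMap {
    static constexpr std::int8_t kEmpty = -1;
    struct Entry { std::size_t hash; int key; int value; };

    std::vector<std::int8_t> index;   // tiny per-slot indices (1 byte each here)
    std::vector<Entry> entries;       // packed, insertion-ordered entries

    explicit CompactIntMap(std::size_t slots = 128) : index(slots, kEmpty) {}

    int* find(int key) {
        std::size_t h = std::hash<int>{}(key);
        std::size_t mask = index.size() - 1;                   // power-of-two table
        for (std::size_t i = h & mask;; i = (i + 1) & mask) {  // linear probe
            std::int8_t slot = index[i];
            if (slot == kEmpty) return nullptr;                // empty slot: not present
            if (entries[slot].hash == h && entries[slot].key == key)
                return &entries[slot].value;
        }
    }

    void insert(int key, int value) {
        std::size_t h = std::hash<int>{}(key);
        std::size_t mask = index.size() - 1;
        for (std::size_t i = h & mask;; i = (i + 1) & mask) {
            if (index[i] == kEmpty) {                          // claim the first free slot
                index[i] = static_cast<std::int8_t>(entries.size());
                entries.push_back({h, key, value});
                return;
            }
            if (entries[index[i]].hash == h && entries[index[i]].key == key) {
                entries[index[i]].value = value;               // overwrite existing key
                return;
            }
        }
    }
};

int main() {
    CompactIntMap m;
    m.insert(1004, 1);
    if (const int* v = m.find(1004)) std::printf("value: %d\n", *v);  // prints "value: 1"
}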

Let's compare the two implementations. In the C++ implementation, each bucket can be pretty big since it typically contains a pointer to an entry (i.e. a key-value pair), and a pointer takes 8 bytes on a mainstream 64-bit platform. In the CPython implementation, the index array contains integers of variable size: from 1 byte to 8 bytes depending on the number of items in the dictionary. This is efficient since the table can be much more compact in memory, which also reduces the risk of cache misses (the table can fit in a few cache lines when using 1 byte per item). The array of entries is significantly bigger, but it has the benefit of being packed in memory (and ordered). In the end, the data locality of the (recent) CPython implementation tends to be better than that of the C++ implementations.
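
As a rough back-of-the-envelope illustration (assuming 64-byte cache lines and a 128-slot table), a CPython-style index using 1-byte indices occupies 128 bytes, i.e. two cache lines, while 128 eight-byte bucket pointers occupy 1 KiB, i.e. sixteen cache lines, before even counting the separately allocated chain nodes they point to.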

Please try alternative C++ hash-map implementations. There are very efficient non-standard C++ hash maps using open addressing that you can find on the internet. For example, Tessil's robin-map and hopscotch-map, as well as ankerl::unordered_dense::map, are particularly efficient and pretty well designed (note that the hash function may also need to be tuned to get good performance with some hash-map implementations). You can find some great benchmarks here, here and there. Note that there is no one hash map to rule them all: each of them has benefits and drawbacks (speed and memory footprint vary with the use case). Still, many of them often significantly outperform the STL implementations in most use cases. Regarding the benchmarks and your specific use case, I advise you to try emhash7::HashMap with the mumx hash function.
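
For example, here is a sketch of how the containers in the code above could be swapped out via type aliases, assuming the single-header ankerl::unordered_dense library is available on the include path (the fast_map/fast_set alias names are made up for this example, and other libraries such as tsl::robin_map can be plugged in the same way):

#include <ankerl/unordered_dense.h>  // single-header library: https://github.com/martinus/unordered_dense

// Drop-in aliases: ankerl::unordered_dense::map/set mirror the std::unordered_*
// API closely enough for the code above (insert, count, at, find, erase, clear).
template <class K, class V>
using fast_map = ankerl::unordered_dense::map<K, V>;
template <class K>
using fast_set = ankerl::unordered_dense::set<K>;

// The original function then only needs its parameter types changed, e.g.:
// void add_update_edges_and_used(int spot,
//                                fast_map<int, fast_set<int>>& edge_sums_to_locations,
//                                fast_map<int, int>& edge_locations_to_sums,
//                                fast_set<int>& used_locations,
//                                int current_number);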
