preventing data races in shared hash table


I'm sorry if this is a duplicate, but however much I search I only find solutions that don't apply.

I have a hash table, and I want multiple threads to read and write to it simultaneously. How do I prevent data races when:

one thread writes to the same entry as another
one thread writes to an entry that another is reading

edit: if possible, because this table needs to be extremely fast (it is accessed very frequently), is there a way to lock two racing threads only if they are accessing the same index of the hash table?

CodePudding user response:

I have answered variations of this question before. Please read my previous answer regarding this topic.

Many people have tried to implement thread safe collection classes (lists, hash tables, maps, sets, queues, etc... ) and failed. Or worse, failed, didn't know it, but shipped it anyway.

A naive way to build a thread-safe hash table is to start with an existing hash table implementation and add a mutex to all public methods. You could imagine a hypothetical implementation like this:

// **THIS IS BAD**

#include <map>
#include <mutex>

template<typename K, typename V>
class ThreadSafeMap
{
private:
    std::map<K,V> _map;
    std::mutex _mutex;

public:

    void insert(const K& k, const V& v)
    {
        std::lock_guard lck(_mutex);
        _map[k] = v;
    }

    const V& at(const K& key)
    {
        std::lock_guard lck(_mutex);
        return _map.at(key);
    }

    // other methods not shown - but each is essentially the same:
    // lock the mutex before accessing the underlying data structure

};

In the above example, std::lock_guard locks the mutex when the lck variable is instantiated, and lock_guard's destructor releases the mutex when lck goes out of scope.

And to a certain extent, it is thread safe. But once you start to use the above data structure in more complex ways, it breaks down.

Transactions on hash tables are often multi-step operations. For example, an entire application-level transaction on the table might be to look up a record and, upon successfully retrieving it, change some member of the record it refers to.

So imagine we had used the above class across different threads like the following:

 ThreadSafeMap<std::string, Item> g_map;

 // thread 1
 Item& item = g_map.at(key);
 item.value++;
 

 // thread 2
 Item& item = g_map.at(key);
 item.value--;

 // thread 3
 g_map.erase(key);
 g_map[key] = newItem;

It's easy to think the above operations are thread safe because the hash table itself is thread safe. But they are not. Thread 1 and thread 2 are both trying to access the same item outside of the lock. Thread 3 may even replace the record that the other two threads are referencing. There's a lot of undefined behavior here.

The solution? Stick with a single-threaded hash table implementation and use the mutex at the application/transaction level. Better:

 std::unordered_map<std::string, Item> g_map;
 std::mutex g_mutex;

 // thread 1
 {
   std::lock_guard lck(g_mutex);
   Item& item = g_map.at(key);
   item.value++;
 }
 

 // thread 2
 {
   std::lock_guard lck(g_mutex);
   Item& item = g_map.at(key);
   item.value--;
 }

 // thread 3
 {
   std::lock_guard lck(g_mutex);
   g_map.erase(key);
   g_map[key] = newItem;
 }

Bottom line: don't just stick mutexes and locks on your low-level data structures and proclaim them thread safe. Use mutexes and locks at the level where the caller expects to perform its whole set of operations on the hash table.

CodePudding user response:

The most reliable and appropriate way to avoid data races is to serialize access to the hash table using a mutex; i.e. each thread needs to acquire the mutex before performing any operations (read or write) on the hash table, and release the mutex after it is done.

What you're probably looking for, though, is a lock-free hash table. Ensuring correct multithreaded behavior without locks is extremely difficult to do correctly, and if you were at the technical level required to implement such a thing, you wouldn't need to ask about it on Stack Overflow. So I strongly suggest you either stick with the serialized-access approach (which works fine for 99% of the software out there, and can be implemented correctly without in-depth knowledge of the CPU, cache architecture, RAM, OS, scheduler, optimizer, C++ language spec, etc.), or, if you must use a lock-free data structure, find a premade one from a reputable source rather than trying to roll your own. In fact, even if you want to roll your own, you should start by reading the source code of working examples, to get an idea of what they are doing and why.

CodePudding user response:

So you need basic thread synchronization. Use a mutex, a lock_guard, or some other synchronization mechanism in the read and write functions. cppreference.com has the documentation for the standard library facilities.
