Thread-specific variables when using parallel algorithms

Sometimes, when executing an operation on multiple items, a buffer is necessary, for example to store an intermediate result. In single-threaded code this is straightforward. When the work items are processed in parallel, however, each thread needs its own buffer to write to. I would rather not create this buffer inside the body of my functor, because it would then be allocated on every iteration, which might be slow and is unnecessary.

I was therefore wondering how to achieve this with the parallel algorithms from the C++ standard library. I checked the documentation on cppreference and unfortunately couldn't find a definitive answer. It does state, though, that for the overload taking an execution policy the functor needs to be copy-constructible. Consequently, I assumed that the functor passed to the algorithm gets copied for each thread that is involved. However, I ran the following little test and this doesn't seem to be the case (Windows, Visual Studio):

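#include <algorithm>
#include <execution>
#include <iostream>
#include <sstream>
#include <vector>
#include <windows.h>  // GetCurrentThreadId

struct Functor
{
  auto operator()(int const&) -> void
  {
    // Build the whole line first so output from different threads does not interleave.
    std::ostringstream x;  // replaces the deprecated std::strstream
    x << GetCurrentThreadId() << ": " << buffer.data() << '\n';
    std::cout << x.str();
  }
  // 10 elements; buffer{10} would create a single element with value 10.
  std::vector<int> buffer = std::vector<int>(10);
};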

int main()
{
  std::vector<int> y(10, 5);
  std::for_each(std::execution::par, y.begin(), y.end(), Functor{});

  return 0;
}

Prints:

46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
46324: 0000000000DB76A0
45188: 0000000000DB76A0

So either my assumption is wrong or my test is flawed. I printed the pointer to the data of the functor's vector member along with the thread ID, expecting the pointer to change whenever the thread ID changes. It does not: the same pointer is printed from different threads.

Is my test valid? If so, is there another way to have a variable in my functor that is instantiated once per thread? Of course I could create a thread_local variable in the body of my functor, but I dislike this approach because, as far as I understand, that variable would have thread storage duration, meaning it would only be destroyed when the thread it was created in ends.

CodePudding user response：

You can use a table of buffers that is indexed by thread ID, so that each thread gets its own buffer. A simple demo implementation that uses a mutex-protected hash table might look as follows:

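#include <algorithm>
#include <execution>
#include <iostream>
#include <mutex>
#include <sstream>
#include <thread>
#include <unordered_map>
#include <vector>

std::unordered_map<std::thread::id, std::vector<int>> map;
std::mutex m;

int main()
{
  std::vector<int> y(10, 5);
  std::for_each(std::execution::par, y.begin(), y.end(),
    [](int)
    {
      // Initialized once per thread. References to unordered_map elements
      // stay valid when other threads insert, even across rehashing.
      thread_local std::vector<int>& buffer =
        []() -> auto&
        {
          std::lock_guard<std::mutex> lock(m);
          return map[std::this_thread::get_id()];
        }();

      buffer.resize(10);
      // The original printed `id`, which was only in scope inside the inner lambda.
      std::ostringstream line;
      line << std::this_thread::get_id() << " : " << buffer.data() << '\n';
      std::cout << line.str();
    }
  );
  map.clear();  // frees every buffer; the thread_local references must not be used afterwards
}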

Note that the mutex is locked only once per thread. Subsequent uses of the buffer on that thread go through the thread_local reference and don't touch the hash table.

CodePudding user response：

You can manage a pool of buffers, and then have each invocation take a buffer from the pool, or create a new one if the pool is empty, and return the buffer when it's done. The number of buffers you allocate will then equal the peak number of invocations running at the same time, which is at most the number of concurrent threads used.
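A minimal sketch of such a pool, to make the acquire/release cycle concrete. The names `BufferPool`, `Lease`, and `process` are hypothetical, and the mutex-guarded vector used as the container is an assumption made to keep the example short; any thread-safe container would do.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical pool of reusable buffers, guarded by a plain mutex.
class BufferPool {
public:
    using Buffer = std::vector<int>;

    // Take a buffer from the pool, or create a new one if the pool is empty.
    Buffer acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (free_.empty()) {
            ++created_;
            return Buffer{};
        }
        Buffer b = std::move(free_.back());
        free_.pop_back();
        return b;
    }

    // Hand a buffer back so a later invocation can reuse its storage.
    // Assumes only buffers obtained from acquire() are released.
    void release(Buffer b) {
        std::lock_guard<std::mutex> lock(mutex_);
        free_.push_back(std::move(b));
        assert(free_.size() <= created_);
    }

    // Number of buffers created so far.
    std::size_t created() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return created_;
    }

private:
    mutable std::mutex mutex_;
    std::vector<Buffer> free_;
    std::size_t created_ = 0;
};

// RAII lease: acquires in the constructor, releases in the destructor.
class Lease {
public:
    explicit Lease(BufferPool& pool) : pool_(pool), buffer_(pool.acquire()) {}
    ~Lease() { pool_.release(std::move(buffer_)); }
    Lease(const Lease&) = delete;
    Lease& operator=(const Lease&) = delete;

    BufferPool::Buffer& get() { return buffer_; }

private:
    BufferPool& pool_;
    BufferPool::Buffer buffer_;
};

// Example work item: fill the scratch buffer and sum it.
int process(BufferPool& pool, int value) {
    Lease lease(pool);
    auto& scratch = lease.get();
    scratch.assign(10, value);  // reuses existing capacity when available
    int sum = 0;
    for (int x : scratch) sum += x;
    return sum;
}
```

Inside the functor passed to `std::for_each(std::execution::par, ...)`, a `Lease` would be created at the top of each invocation. Because the pool object lives outside the algorithm call, the buffers are destroyed together with the pool rather than at thread exit.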

A pool of buffers can be a singly-linked stack, which is the simplest possible lock-free data structure.

If the operation of your functor is so small that you'd worry about contention on the pool, then you can make an array of 16 pools indexed by a hash of the thread ID.
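The sharding idea could look roughly like this. It is a sketch under assumptions: `Shard`, `kShards`, `shard_index`, `acquire`, and `release` are hypothetical names, and each shard is a mutex-guarded vector rather than a lock-free stack.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One small pool per shard; threads that hash to different shards never contend.
struct Shard {
    std::mutex mutex;
    std::vector<std::vector<int>> free;
};

constexpr std::size_t kShards = 16;
std::array<Shard, kShards> shards;

// Map the calling thread to a shard using a hash of its thread ID.
std::size_t shard_index() {
    return std::hash<std::thread::id>{}(std::this_thread::get_id()) % kShards;
}

// Take a buffer from shard i, or create a new one if that shard is empty.
std::vector<int> acquire(std::size_t i) {
    assert(i < kShards);
    std::lock_guard<std::mutex> lock(shards[i].mutex);
    if (shards[i].free.empty()) return {};
    std::vector<int> b = std::move(shards[i].free.back());
    shards[i].free.pop_back();
    return b;
}

// Return a buffer to shard i for later reuse.
void release(std::size_t i, std::vector<int> b) {
    assert(i < kShards);
    std::lock_guard<std::mutex> lock(shards[i].mutex);
    shards[i].free.push_back(std::move(b));
}
```

Two threads can still hash to the same shard, so each shard keeps its own lock. The sharding only reduces how often threads meet on the same one.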