How does std::execution::par_unseq handle remove_if?

I have a case where I need to remove elements from a vector if they meet some criteria, which is determined by a function. The layout of my code looks like this:

  auto firstToErase = std::remove_if( myVec.begin(), myVec.end(), [&]( const Obj &obj) {
      return isEntryInvalid( obj );
    } );

  myVec.erase( firstToErase, myVec.end() );

This performs the erasure in two steps: one to identify which entries must be removed, and another to actually remove them. The erasure cannot be done in parallel, but the identification of entries to be removed can. Conveniently, C++17 offers execution policies to do this. That looks like this:

  auto firstToErase = std::remove_if( std::execution::par_unseq, myVec.begin(), myVec.end(), [&]( const Obj &obj) {
      return isEntryInvalid( obj );
    } );

  myVec.erase( firstToErase, myVec.end() );

I have this already written, and it works fantastically. par_unseq provides a roughly 5x performance improvement. However:

How does that remove_if operation work when std::execution::par_unseq is passed?

A naive approach I can think of would be to hand each std::thread its own std::unordered_set<int> holding the indices that need to be erased. Then, once all threads finish, merge the sets into one large set. (Though I'm not sure if that's compatible with what erase() takes.)

But the problem with that approach is that if many removals need to happen or if you're on a system with many threads, that final merge looks pretty ugly and would eat at the performance benefit. (At least in my experience with implementing a similar approach with std::unordered_map)

So clearly, that must not be what remove_if is doing!

How is this handled now? Is there a more clever solution GCC/Clang are using?

To clarify, I don't need a solution to a problem per se. I already have working code. I just want to know how remove_if works under the hood.

Answer:

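What I would do is have each worker thread run a sequential std::remove_if on a subrange of the given range, then std::move the "live" portion of each subrange forward so that it follows the elements kept so far. The destination always lies at or before the source, so a forward move is the right direction. This does mean each element may be moved twice, but that's allowed.

As a sketch:

vector<subrange> subranges = /* divide [first, last) into thread_count sequential subranges */
vector<tuple<iterator, future<iterator>>> futures;
for (auto s : subranges) {
    // one sequential remove_if per subrange; async is just one way to evaluate them in parallel
    futures.emplace_back(s.begin(), async([=] { return remove_if(s.begin(), s.end(), pred); }));
}
for (auto & [s_first, fut] : futures) {
    // append this subrange's surviving elements to the kept prefix
    // (a real implementation would skip the move when s_first == first)
    first = move(s_first, fut.get(), first);
}
return first;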
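
A compilable version of the sketch above might look like the following. This is only an illustration under stated assumptions, not what any standard library actually ships: the name chunked_remove_if, the chunk_count parameter, the restriction to std::vector, and the use of std::async as the parallel executor are all choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

// Hypothetical helper (not a standard function): run remove_if on
// chunk_count disjoint subranges in parallel, then compact serially.
template <typename T, typename Pred>
typename std::vector<T>::iterator
chunked_remove_if(std::vector<T>& v, std::size_t chunk_count, Pred pred) {
    using Iter = typename std::vector<T>::iterator;
    assert(chunk_count > 0);
    const std::size_t n = v.size();

    // Phase 1: each task runs a sequential remove_if on its own subrange.
    // The subranges do not overlap, so the tasks never touch the same element.
    std::vector<std::pair<Iter, std::future<Iter>>> tasks;
    tasks.reserve(chunk_count);
    for (std::size_t i = 0; i < chunk_count; ++i) {
        Iter b = v.begin() + static_cast<std::ptrdiff_t>(n * i / chunk_count);
        Iter e = v.begin() + static_cast<std::ptrdiff_t>(n * (i + 1) / chunk_count);
        tasks.emplace_back(b, std::async(std::launch::async, [b, e, pred] {
            return std::remove_if(b, e, pred);
        }));
    }

    // Phase 2 (serial): slide each chunk's surviving prefix down so that it
    // directly follows the elements kept so far.
    Iter out = v.begin();
    for (auto& [chunk_first, fut] : tasks) {
        Iter live_end = fut.get();
        if (chunk_first == out) {
            out = live_end;  // already in place; never move a range onto itself
        } else {
            out = std::move(chunk_first, live_end, out);
        }
    }
    return out;  // new logical end, to be passed to erase()
}
```

It is used like std::remove_if: call `myVec.erase(chunked_remove_if(myVec, 4, pred), myVec.end());`. The relative order of the kept elements is preserved, because each chunk's remove_if is stable and the chunks are compacted in order.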

Answer:

Removing elements in parallel would produce undefined results. You could separate the search from the removal and guard the removal with a std::mutex, but then the removals themselves would run sequentially.

std::remove_if avoids that issue (and others, such as data races) like this: a parallel std::remove_if implementation splits the array into contiguous chunks, one per thread, and each thread processes its chunk sequentially. Let's assume our vector has 100 elements and we have 10 threads. Thread 1 runs sequentially over elements 0 to 9, thread 2 over elements 10 to 19, and so on. When this is done, all partial results are merged.
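
To make the 100-element, 10-thread example concrete, here is a minimal sketch of one way to compute such chunk boundaries. The function name static_partition and the policy of giving leftover elements to the first chunks are assumptions for illustration; the real MSVC partitioning logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: split [0, count) into `threads` contiguous half-open
// index ranges [begin, end). Any remainder is spread over the first chunks.
std::vector<std::pair<std::size_t, std::size_t>>
static_partition(std::size_t count, std::size_t threads) {
    assert(threads > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    chunks.reserve(threads);
    const std::size_t base = count / threads;   // minimum chunk size
    const std::size_t extra = count % threads;  // chunks that get one more
    std::size_t begin = 0;
    for (std::size_t t = 0; t < threads; ++t) {
        const std::size_t size = base + (t < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```

With count = 100 and threads = 10, the first chunk is [0, 10), the second [10, 20), and the last [90, 100), which matches the element ranges 0 to 9, 10 to 19, and so on described above.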

The important part (MSVC implementation) looks like this:

_TRY_BEGIN
_Static_partitioned_remove_if2 _Operation{_Hw_threads, _Count, _UFirst, _Pass_fn(_Pred)};
_Run_chunked_parallel_work(_Hw_threads, _Operation);
_Seek_wrapped(_First, _Operation._Results);
return _First;
_CATCH(const _Parallelism_resources_exhausted&)
// fall through to serial case below
_CATCH_END

Instead, I recommend performing a parallel sort such as std::sort(std::execution::par_unseq, ...), using your removal condition as the sort key, to move all items to be removed to the end. Then shrink the vector with std::vector::resize to the index of the first element matching your criteria. In parallel scenarios with numeric data, you might want to use radix sort.
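
A minimal sketch of that sort-then-shrink idea follows. The helper name sort_then_truncate is made up for this example, and the execution policy is left out so the snippet compiles without a parallel backend; the parallel form would pass std::execution::par_unseq (from <execution>) as the first argument to std::sort. Note that std::sort is not stable, so unlike remove_if this does not preserve the relative order of the kept elements.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: order kept elements before removed ones, then shrink.
template <typename T, typename Pred>
void sort_then_truncate(std::vector<T>& v, Pred is_invalid) {
    // "a sorts before b" only when a is kept and b is to be removed,
    // i.e. elements are ordered by the boolean key is_invalid(x).
    std::sort(v.begin(), v.end(), [&](const T& a, const T& b) {
        return !is_invalid(a) && is_invalid(b);
    });
    // After sorting, the range is partitioned: find the first invalid element.
    auto first_invalid = std::partition_point(
        v.begin(), v.end(), [&](const T& x) { return !is_invalid(x); });
    // Shrink to the kept prefix (resize needs a default-constructible T).
    v.resize(static_cast<std::size_t>(first_invalid - v.begin()));
}
```

If only the grouping matters, std::partition produces the same kept/removed split in linear time rather than the O(n log n) of a sort, so it may be worth measuring both.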