To call transform() you need to supply an input range (ibegin, iend) of elements of type IT, an output range (obegin, oend) that receives the results of type OT, and a side-effect-free function f.
OT f(IT x);  // side-effect free
std::transform(std::execution::par_unseq, ibegin, iend, obegin, f);
Each core on the machine it runs on has a hierarchy of cache L1/L2/L3.
first stage:
transform() will map each input value to one core, and copy the input values selected for a core to that core.
second stage:
Once this is done, each core can start running the function on its own inputs to produce its own outputs, all of this hopefully happening at the fastest cache level. During this phase, if processing an input requires access to shared memory, for example because the data contains pointers that need to be dereferenced, all cores will be accessing that memory concurrently and might slow each other down.
third stage:
Once this process is over, the outputs produced by each core get copied into transform's output container (pointed to by obegin).
Is this a good description? Or should I expect a different behavior?
CodePudding user response:
Your description is not a permitted implementation of transform. That would require that the input iterator's value type is copyable or moveable, as well as that of the output iterator. Neither of these is required for transform, so that's not a valid implementation.
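To make the copyability point concrete, here is a minimal sketch with an element type that can be neither copied nor moved. `Pinned` and `double_all` are hypothetical names made up for this illustration, and the sketch uses the sequential overload only to avoid depending on a particular parallel backend; the point is that transform reads each input through the iterator and assigns the result to the output element, without ever needing to copy an element somewhere else.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical element type: it can be neither copied nor moved,
// but it can be assigned from an int.
struct Pinned {
    int value = 0;
    Pinned() = default;
    Pinned(const Pinned&) = delete;
    Pinned& operator=(const Pinned&) = delete;
    Pinned& operator=(int v) { value = v; return *this; }
};

// Writes twice each input value into the matching output element.
// The callable receives each input by reference; transform then assigns
// the int result to the corresponding output element.
template <std::size_t N>
void double_all(const std::array<Pinned, N>& in, std::array<Pinned, N>& out) {
    std::transform(in.begin(), in.end(), out.begin(),
                   [](const Pinned& p) { return p.value * 2; });
}
```

An implementation that first staged the inputs in per-core buffers would not compile for a type like this, since there is no way to copy or move a `Pinned`.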
The parallel transform takes forward iterators (like most parallel algorithms), both for the inputs and the output. This means that the objects backing the iterators exist (or can be computed) independently of those iterators.
This permits implementations to assign ranges of inputs and output to each thread. That's typically how the non-vectorized transform is implemented. Each thread gets a range of input and output iterators, and they transform those ranges.
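A rough sketch of that range-per-thread idea, under stated assumptions: `chunked_transform` is a hypothetical helper written for this answer, not how any particular standard library implements the parallel overload, and it uses plain `std::thread` with `std::vector` to keep the example short. Real implementations usually rely on a thread pool and a smarter partitioning scheme.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not a standard library function: splits the index
// range [0, in.size()) into up to `num_threads` contiguous chunks and lets
// each thread transform its own chunk directly into `out`.
template <typename In, typename Out, typename F>
void chunked_transform(const std::vector<In>& in, std::vector<Out>& out,
                       F f, std::size_t num_threads) {
    assert(in.size() == out.size());
    assert(num_threads > 0);
    const std::size_t n = in.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads;  // ceiling

    std::vector<std::thread> workers;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        workers.emplace_back([&in, &out, f, begin, end] {
            // Reads and writes go straight through the iterators;
            // nothing is staged per thread and copied back afterwards.
            std::transform(in.begin() + begin, in.begin() + end,
                           out.begin() + begin, f);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

Note that there is no separate "copy in" or "copy out" stage here: each thread works on the original input and output objects, and whatever ends up in cache is decided by the hardware.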
Any questions of caching are not (directly) specified by the C++ standard. Since each thread is writing to independent output objects, by the rules of the C++ memory model, there can be no data races or other problems. That is, if two different threads are writing to the same cache line, but writing to different memory locations within that cache line, those issues need to be sorted out either by the compiler or CPU. It's not something the C++ program needs to deal with.
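As a small illustration of the "same cache line, different memory locations" case, the sketch below has two threads update neighbouring elements of one array. `write_neighbours` is a hypothetical function made up for this example, and whether the two elements really share a cache line depends on the hardware, although for two adjacent ints it is very likely.

```cpp
#include <array>
#include <cassert>
#include <thread>

// Two threads write to neighbouring elements of one small array. The
// elements very likely share a cache line, yet they are distinct memory
// locations, so there is no data race in the language sense.
std::array<int, 2> write_neighbours(int iterations) {
    assert(iterations >= 0);
    std::array<int, 2> cells{0, 0};
    std::thread a([&cells, iterations] {
        for (int i = 0; i < iterations; ++i) cells[0] += 1;
    });
    std::thread b([&cells, iterations] {
        for (int i = 0; i < iterations; ++i) cells[1] += 2;
    });
    a.join();
    b.join();
    return cells;
}
```

The result is well defined regardless of how the cache lines are laid out. Sharing a line between threads (often called false sharing) may cost performance, but that is a tuning concern rather than a correctness one.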