C# AsParallel/Parallel: how to merge results less frequently?
As an example, consider a nested loop with a large number of iterations that acts as a parallel reduction (or call it map-reduce):
// The dictionary is small, so it is easily contended.
Dictionary<SumDataCategory, SumData> globalResult;
// A loop to be parallelized.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        for (int k = 0; k < N; k++) {
            // Pull inputs depending on i, j, k.
            InputData[] inputs = fetchInputData(i, j, k);
            // Do some independent calculation.
            SumData sum1 = SumData.Sum(inputs);
            // Finally, the reduction operation. <- to optimize
            globalResult[sum1.Category].Sum(sum1);
        }
    }
}
The goal is to pick the low-hanging fruit: improve reduction performance with little effort, without rewriting the whole loop or processing body.
Constraints on the workload characteristics:
- It is not completely regular: static partitioning should not be assumed.
- It is not highly dynamic: dynamic job/workload spawning is unnecessary.
Analysis:
- Per-iteration reduction to global state is wasteful and causes heavy cache contention.
  There are orders of magnitude more iterations/inputs than CPU cores.
- Manual partitioning, and handling the partitions, is non-trivial (compared to vanilla AsParallel/Parallel or OpenMP).
  C# AsParallel/Parallel are libraries without compiler assistance (unlike OpenMP or data-parallel solutions), so most of the dirty work must be done by the user.
- Even if partitioning is applied, per-chunk (partitioned data) reduction to global state is still wasteful, and more complex than a final reduction: a final reduction can be serial or parallel, while per-chunk reduction is concurrent.
I could not think of a good solution:
- If I use thread-local storage, I cannot find a point at which to collect the thread-local values and merge them.
- If I do manual bookkeeping of per-thread results, I need a reliable thread id and must know the id range ahead of time. (OpenMP and data-parallel solutions have both.)
For reference, in OpenMP, I would do it like this:
- Before the parallel loop, set or query the thread count.
- Allocate per-thread reduction data.
- Run the parallel loop, doing a thread-local reduction indexed by thread id.
- After the parallel loop, merge thread local results in serial or another parallel reduction (another parallel loop).
Answer:
You can first flatten your loops into a sequence of 3-tuples (i, j, k) and then run AsParallel() over it. To limit the number of tasks processing items concurrently, use WithDegreeOfParallelism(x). Your data will then be partitioned, and each partition will be handled in parallel. To aggregate the results from these partitions, use the Aggregate function, but take care to use the overloads provided for ParallelQuery and not the ones for regular IEnumerable. Example with a simple sum:
var result = Enumerable.Range(1, 1000).AsParallel()
    .WithDegreeOfParallelism(4)
    .Aggregate(
        0, // initial seed
        (acc, value) => acc + value, // runs within each partition separately, updating that partition's accumulator
        (acc1, acc2) => acc1 + acc2, // combines the partition accumulators into the final one
        (acc) => acc); // converts the final accumulator into the returned result (if necessary)
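
To apply the same idea to the dictionary reduction from the question, a minimal sketch might look like the following. It is only an illustration under several assumptions: Compute(i, j, k) is a hypothetical stand-in for fetchInputData plus SumData.Sum, and int categories with long sums replace SumDataCategory and SumData. It uses the Aggregate overload that takes a seed factory rather than a seed value, so each partition gets its own dictionary instead of sharing one mutable object.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReductionSketch
{
    // Hypothetical stand-in for fetchInputData(i, j, k) + SumData.Sum(inputs):
    // returns a category and a partial sum for one iteration.
    static (int Category, long Value) Compute(int i, int j, int k)
        => ((i + j + k) % 4, (long)i * j + k);

    // Adds value to the entry for category, creating the entry if it is missing.
    static void Add(Dictionary<int, long> target, int category, long value)
    {
        target.TryGetValue(category, out long current); // current is 0 when absent
        target[category] = current + value;
    }

    public static Dictionary<int, long> RunParallel(int n)
    {
        // Flatten the three nested loops into one sequence of (i, j, k) tuples.
        var indices =
            from i in Enumerable.Range(0, n)
            from j in Enumerable.Range(0, n)
            from k in Enumerable.Range(0, n)
            select (i, j, k);

        return indices.AsParallel().Aggregate(
            // Seed factory: creates a fresh, private accumulator per partition.
            () => new Dictionary<int, long>(),
            // Per-partition update: touches only that partition's dictionary, so no locking.
            (local, idx) =>
            {
                var (category, value) = Compute(idx.i, idx.j, idx.k);
                Add(local, category, value);
                return local;
            },
            // Combine step: merges one partition's dictionary into another.
            (a, b) =>
            {
                foreach (var kv in b) Add(a, kv.Key, kv.Value);
                return a;
            },
            // Result selector: the merged dictionary is already the final result.
            merged => merged);
    }

    // Plain serial version, used only to check the parallel result.
    public static Dictionary<int, long> RunSequential(int n)
    {
        var result = new Dictionary<int, long>();
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                {
                    var (category, value) = Compute(i, j, k);
                    Add(result, category, value);
                }
        return result;
    }

    public static void Main()
    {
        var parallel = RunParallel(20);
        var serial = RunSequential(20);
        bool same = parallel.Count == serial.Count
            && serial.All(kv => parallel.TryGetValue(kv.Key, out long v) && v == kv.Value);
        Console.WriteLine(same ? "parallel and serial results match" : "MISMATCH");
    }
}
```

With this shape the global dictionary is touched only in the combine step, which runs once per partition rather than once per iteration. This relies on the merge operation being associative and commutative, as a sum is.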