C# AsParallel/Parallel: how to merge results less frequently?

For example, consider a nested loop with many iterations that acts as a parallel reduction (or call it map-reduce):

// The dictionary is small, so it is easily contended.
Dictionary<SumDataCategory, SumData> globalResult;
// A loop to be parallelized.
for (int i = 0; i < N; i++) {
  for (int j = 0; j < N; j++) {
    for (int k = 0; k < N; k++) {
      // Pull inputs depending on i, j, k.
      InputData[] inputs = fetchInputData(i, j, k);
      // Do some independent calculation.
      SumData sum1 = SumData.Sum(inputs);
      // Finally, the reduction operation. <- to optimize
      globalResult[sum1.Category].Sum(sum1);
    }
  }
}

The goal is to pick the low-hanging fruit: improve reduction performance with little effort (without rewriting the whole loop/processing body).

Constraints on the workload characteristics:

It is not completely regular: static partitioning should not be assumed.
It is not highly dynamic: dynamic job/workload spawning is unnecessary.

Analysis:

Per-iteration reduction into global state is wasteful and causes high cache contention.
There are orders of magnitude more iterations/inputs than CPU cores.
Manual partitioning, and handling the partitions, is non-trivial (compared to vanilla AsParallel/Parallel or OpenMP).
This is because C# AsParallel/Parallel are libraries without compiler assistance (unlike OpenMP or data-parallel solutions), so most of the dirty work must be done by the user.
Even if partitioning is applied, per-chunk (partitioned data) reduction into global state is still wasteful, and more complex than a final reduction: a final reduction can be serial or parallel, while per-chunk reduction is concurrent.

I could not think of a good solution:

If I use thread-local storage, I cannot find a point at which to collect the thread-local results and merge them.
If I do manual bookkeeping for per-thread results, I need a reliable thread id and need to know the id range ahead of time. (OpenMP and data-parallel solutions have them.)

For reference, in OpenMP, I would do it like this:

Before the parallel loop, set or query the thread count.
Allocate per-thread reduction data.
Run the parallel loop, doing thread-local reduction indexed by thread id.
After the parallel loop, merge the thread-local results serially or with another parallel reduction (another parallel loop).

CodePudding user response：

You can first flatten your loops into a sequence of 3-tuples (i, j, k) and then run AsParallel() over it. To limit the number of concurrent tasks processing items in parallel, use WithDegreeOfParallelism(x). Your data will then be partitioned, and the partitions will be handled in parallel. To aggregate the results from these partitions, use the Aggregate function, but take care to use the overloads provided for ParallelQuery and not those for regular IEnumerable. Example with a simple sum:

var result = Enumerable.Range(1, 1000).AsParallel()
    .WithDegreeOfParallelism(4)
    .Aggregate(
        0, // initial seed
        (acc, value) => acc + value, // runs within each partition separately, updating that partition's accumulator
        (acc1, acc2) => acc1 + acc2, // combines the partition accumulators into the final one
        (acc) => acc); // converts the final accumulator into the returned result (if necessary)
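
To connect this back to the dictionary reduction in the question, here is a minimal sketch of the same idea with a mutable accumulator. It is not the asker's actual code: FetchInputs, the int category key, the long totals and the bound N are hypothetical stand-ins for fetchInputData, SumDataCategory and SumData. Because the accumulator is a mutable Dictionary, the sketch uses the Aggregate overload that takes a seed factory rather than a seed value, so each partition gets its own dictionary instead of sharing one instance.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int N = 20; // hypothetical loop bound

// Hypothetical stand-in for fetchInputData(i, j, k) from the question.
static int[] FetchInputs(int i, int j, int k) => new[] { i, j, k };

// Flatten the three nested loops into one sequence of (i, j, k) tuples.
var indices =
    from i in Enumerable.Range(0, N)
    from j in Enumerable.Range(0, N)
    from k in Enumerable.Range(0, N)
    select (i, j, k);

Dictionary<int, long> globalResult = indices
    .AsParallel()
    .Aggregate(
        // Seed factory: creates a fresh, private dictionary per partition.
        () => new Dictionary<int, long>(),
        // Per-item update: touches only the partition's own dictionary,
        // so no locking is needed here.
        (local, idx) =>
        {
            int[] inputs = FetchInputs(idx.i, idx.j, idx.k);
            int category = (idx.i + idx.j + idx.k) % 3; // hypothetical category
            long sum = inputs.Sum();
            local.TryGetValue(category, out long current); // 0 if the key is absent
            local[category] = current + sum;
            return local;
        },
        // Merge step: folds one partition's dictionary into another.
        (left, right) =>
        {
            foreach (var pair in right)
            {
                left.TryGetValue(pair.Key, out long current);
                left[pair.Key] = current + pair.Value;
            }
            return left;
        },
        // Result selector: the merged dictionary is already the final result.
        merged => merged);

foreach (var pair in globalResult.OrderBy(p => p.Key))
    Console.WriteLine($"{pair.Key}: {pair.Value}");
```

In this overload the merge function is applied to partition accumulators rather than to individual elements, so the shared result is only assembled at the end instead of being updated on every iteration. WithDegreeOfParallelism can still be added after AsParallel() if the number of concurrent tasks needs to be capped.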