Why isn't Parallel.For fast with heap-intensive operations?-CodePudding

For some operations Parallel scales well with the number of CPU's, but for other operations it does not.

Consider the code below, function1 gets a 10x improvement while function2 gets a 3x improvement. Is this due to memory allocation, or perhaps GC?

void function1(int v) {
    for (int i = 0; i < 100000000; i  ) {
        var q = Math.Sqrt(v);
    }
}
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i  ) {
        dict.Add(i, v);
    }
}
var sw = new System.Diagnostics.Stopwatch();

var iterations = 100;

sw.Restart();
for (int v = 0; v < iterations; v  ) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: "   sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: "   sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
for (int v = 0; v < iterations; v  ) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: "   sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: "   sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

The output on my machine:

function1   no parallel:  2 059,4 ms
function1 with parallel:    213,7 ms
function2   no parallel: 14 192,8 ms
function2      parallel:  4 491,1 ms

Environment:
Win 11, .Net 6.0, Release build
i9 12th gen, 16 cores, 24 proc, 32 GB DDR5

After testing more it seems the memory allocation does not scale that well with multiple threads. For example, if I change function 2 to:

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}

The result is:

function2   no parallell:   124,0 ms
function2      parallell:   402,4 ms

Is the conclusion that memory allocation does not scale well with multiple threads?...

CodePudding user response：

First func works in registers. More cores = more registers.

Second func works on memory. More cores = only more L1 cache but shared RAM. 10million elements dataset certainly only come from RAM as even L3 is not big enough. This assumes jit of language optimizes allocations as reused buffers. If not, then there is allocation overhead too. So you should re-use dictionary on each new iteration instead of recreating.

Also you are saving data with incremental integer index. Simple array could work here, of course with re-use between iterations. It should have less memory footprint than a dictionary.

CodePudding user response：

tl;dr: Heap allocation contention.

Your first function is embarrassingly parallel. Each thread can do its computation with embarrassingly little interaction with other threads. So it scales up nicely to multiple threads. huseyin tugrul buyukisik correctly pointed out that your first computation makes use of the non-shared, per thread, processor registers.

Your second function, when it preallocates the dictionary, is somewhat less embarrassingly parallel. Each thread's computation is independent of the others' except for the fact that they each use your machine's RAM subsystem. So you see some thread-to-thread contention at the hardware level as thread-level cached data is written to and read from the machine-level RAM.

Your second function that does not preallocate memory is not embarrassingly parallel. Why not? Each .Add() operation must allocate some data in the shared heap. That can't be done in parallel, because all threads share the same heap. Rather they must be synchronized. The dotnet libraries do a good job of parallelizing heap operations as much as possible, but they do not avoid at least some blocking of thread B when thread A allocates heap data. So the threads slow each other down.

Separate processes rather than separate threads are a good way to scale up workloads like your non-preallocating second function. Each process has its own heap.