Why isn't Parallel.For faster?

For some operations Parallel.For scales well with the number of CPUs, but for other operations it does not.

Consider the code below: function1 gets a 10x improvement, while function2 gets only a 3x improvement. Is this due to memory allocation, or perhaps the GC?

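// CPU-bound: no heap allocation inside the loop.
void function1(int v) {
    for (int i = 0; i < 100000000; i++) {
        var q = Math.Sqrt(v);
    }
}
// Allocation-heavy: the dictionary grows to 10 million entries on every call.
void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>();
    for (int i = 0; i < 10000000; i++) {
        dict.Add(i, v);
    }
}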
var sw = new System.Diagnostics.Stopwatch();

var iterations = 100;

sw.Restart();
for (int v = 0; v < iterations; v++) function1(v);
sw.Stop();
Console.WriteLine("function1 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function1);
sw.Stop();
Console.WriteLine("function1 with parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
for (int v = 0; v < iterations; v++) function2(v);
sw.Stop();
Console.WriteLine("function2 no parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

sw.Restart();
Parallel.For(0, iterations, function2);
sw.Stop();
Console.WriteLine("function2 parallel: " + sw.Elapsed.TotalMilliseconds.ToString("### ##0.0ms"));

The output on my machine:

function1 no parallel: 2 059,4ms
function1 with parallel:  213,7ms
function2 no parallel: 14 192,8ms
function2 parallel: 4 491,1ms

Environment: Windows 11, .NET 6.0, Release build, 12th-gen i9 (16 cores, 24 logical processors), 32 GB DDR5

CodePudding user response：

After more testing, it seems that memory allocation does not scale well across multiple threads. For example, if I change function2 so that it only allocates the dictionary:

void function2(int v) {
    Dictionary<int, int> dict = new Dictionary<int, int>(10000000);
}

The result is:

function2 no parallel:  124,0ms
function2 parallel:  402,4ms

So is the conclusion that memory allocation does not scale well across multiple threads?
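
One way to test the allocation/GC hypothesis is to record allocated bytes and collection counts around each run, not only the elapsed time. The following is a minimal sketch, not a definitive benchmark: `Measure` is a hypothetical helper that is not part of the original post, and the element and iteration counts are reduced so it finishes quickly. It also prints whether server GC is enabled, since the GC mode is one setting that may affect how allocation-heavy code behaves on many threads.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Runtime;
using System.Threading.Tasks;

const int Elements = 1_000_000; // reduced from the 10,000,000 in the question
const int Iterations = 16;      // reduced from 100

Console.WriteLine($"Server GC: {GCSettings.IsServerGC}, logical processors: {Environment.ProcessorCount}");

Measure("function2 no parallel", () => { for (int v = 0; v < Iterations; v++) function2(v); });
Measure("function2 parallel", () => Parallel.For(0, Iterations, function2));

// Same shape as function2 in the question, with a smaller element count.
void function2(int v)
{
    var dict = new Dictionary<int, int>();
    for (int i = 0; i < Elements; i++)
    {
        dict.Add(i, v);
    }
}

// Hypothetical helper: runs the action, prints timing plus GC statistics,
// and returns the number of bytes allocated (across all threads) meanwhile.
static long Measure(string label, Action action)
{
    long bytesBefore = GC.GetTotalAllocatedBytes(precise: true);
    int gen0Before = GC.CollectionCount(0);
    int gen2Before = GC.CollectionCount(2);
    var sw = Stopwatch.StartNew();

    action();

    sw.Stop();
    long allocated = GC.GetTotalAllocatedBytes(precise: true) - bytesBefore;
    Console.WriteLine($"{label}: {sw.Elapsed.TotalMilliseconds:F1} ms, " +
        $"{allocated / (1024 * 1024)} MB allocated, " +
        $"gen0 GCs: {GC.CollectionCount(0) - gen0Before}, " +
        $"gen2 GCs: {GC.CollectionCount(2) - gen2Before}");
    return allocated;
}
```

If the parallel run shows many more gen2 collections for roughly the same number of allocated bytes, that would point towards the GC rather than the dictionary inserts themselves.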

CodePudding user response：

The first function works entirely in registers. Every core has its own registers, so more cores means more registers and the work scales.

The second function works on memory. More cores only give you more L1 cache; the RAM is shared between them. A dataset of 10 million elements has to be served from RAM, since even the L3 cache is not big enough to hold it. And that already assumes the runtime reuses the allocated buffers between iterations. If it does not, there is allocation overhead on top. So you should reuse the dictionary on each new iteration instead of recreating it.

Also, you are storing the data under an incrementing integer index. A simple array would work here, again reused between iterations, and it should have a smaller memory footprint than a dictionary.
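
A rough sketch of that reuse, assuming one dictionary per worker is acceptable: the `Parallel.For` overload with thread-local state lets each worker allocate its dictionary once in `localInit` and then `Clear()` it at the start of every iteration. The `dictionariesCreated` counter and the reduced sizes are only there for illustration and are not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

const int Elements = 100_000; // reduced from the 10,000,000 in the question
const int Iterations = 100;

int dictionariesCreated = 0;  // illustration only: counts localInit calls

Parallel.For(0, Iterations,
    // localInit: invoked once per worker task, so each worker allocates one dictionary.
    () =>
    {
        Interlocked.Increment(ref dictionariesCreated);
        return new Dictionary<int, int>(Elements);
    },
    // body: Clear() removes the entries but keeps the internal arrays for reuse.
    (v, loopState, dict) =>
    {
        dict.Clear();
        for (int i = 0; i < Elements; i++)
        {
            dict.Add(i, v);
        }
        return dict; // handed to the next iteration that runs on this worker
    },
    // localFinally: nothing to merge in this sketch.
    dict => { });

Console.WriteLine($"Dictionaries created: {dictionariesCreated} for {Iterations} iterations");
```

This removes the repeated allocation, but every worker still writes through its own large table, so the shared-RAM limit described above still applies.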
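
To get a feel for the footprint difference, here is a small sketch that compares what the current thread allocates for a pre-sized `Dictionary<int, int>` versus an `int[]` of the same length. The element count is reduced from the question, and the exact numbers will depend on the runtime, so treat it as a rough check only.

```csharp
using System;
using System.Collections.Generic;

const int Elements = 1_000_000; // reduced from the 10,000,000 in the question

// Bytes allocated on this thread while creating a pre-sized dictionary.
long before = GC.GetAllocatedBytesForCurrentThread();
var dict = new Dictionary<int, int>(Elements);
long dictBytes = GC.GetAllocatedBytesForCurrentThread() - before;

// Bytes allocated on this thread while creating an array of the same length.
before = GC.GetAllocatedBytesForCurrentThread();
var array = new int[Elements];
long arrayBytes = GC.GetAllocatedBytesForCurrentThread() - before;

Console.WriteLine($"Dictionary<int, int>({Elements}): {dictBytes / 1024} KB");
Console.WriteLine($"int[{Elements}]: {arrayBytes / 1024} KB");

// With keys 0..N-1 the index is the key, so filling the array is a plain store
// per element instead of the hashing and bucket lookup done by dict.Add.
int v = 42; // stand-in for the loop variable passed to function2
for (int i = 0; i < array.Length; i++)
{
    array[i] = v;
}

GC.KeepAlive(dict);
```

The array can be reused between iterations the same way as a dictionary, since overwriting every element makes a separate clear step unnecessary.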