Parallel.ForEach and ConcurrentDictionary are slower than regular Dictionary


Here is the task: an array contains arbitrary strings, and we need to count how many times each string occurs in it. Solve the task both single-threaded and multi-threaded, and compare the execution times.

For some reason, the single-threaded version runs faster than the multi-threaded one: 90 ms versus 300 ms. How can I fix the multi-threaded version so that it runs faster than the single-threaded one?

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.Threading;

namespace ParallelDictionary
{
    class Program
    {
        static void Main(string[] args)
        {
            List<string> strs = new List<string>();
            for (int i=0; i<1000000; i++)
            {
                strs.Add("qqq");
            }
            for (int i=0; i<5000; i++)
            {
                strs.Add("aaa");
            }

            F(strs);
            ParallelF(strs);
        }

        private static void F(List<string> strs)
        {
            Dictionary<string, int> freqs = new Dictionary<string, int>();

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            for (int i=0; i<strs.Count; i++)
            {
                if (!freqs.ContainsKey(strs[i]))
                    freqs[strs[i]] = 1;
                else
                    freqs[strs[i]]++;
            }
            stopwatch.Stop();

            Console.WriteLine("single-threaded {0} ms", stopwatch.ElapsedMilliseconds);
            foreach (var kvp in freqs)
            {
                Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
            }
        }

        private static void ParallelF(List<string> strs)
        {
            ConcurrentDictionary<string, int> freqs = new ConcurrentDictionary<string, int>();

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            Parallel.ForEach(strs, str =>
            {
                freqs.AddOrUpdate(str, 1, (key, value) => value + 1);
            });
            stopwatch.Stop();

            Console.WriteLine("multi-threaded {0} ms", stopwatch.ElapsedMilliseconds);
            foreach (var kvp in freqs)
            {
                Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
            }
        }
    }
}

CodePudding user response:

It's possible to make the multi-threaded version a little faster than the single-threaded version by using a partitioner to split the data up into chunks that you process separately.

Then each chunk can be processed into its own non-concurrent dictionary without any locking. Finally, at the end of each range, the per-chunk counts are merged into a shared non-concurrent results dictionary (which does have to be locked).

Something like this:

private static void ParallelF(List<string> strs)
{
    Dictionary<string, int> result = new Dictionary<string, int>();

    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    object locker = new object();

    Parallel.ForEach(Partitioner.Create(0, strs.Count), range => 
    {
        var freqs = new Dictionary<string, int>();

        for (int i = range.Item1; i < range.Item2; ++i)
        {
            if (!freqs.ContainsKey(strs[i]))
                freqs[strs[i]] = 1;
            else
                freqs[strs[i]]++;
        }

        lock (locker)
        { 
            foreach (var kvp in freqs)
            {
                if (!result.ContainsKey(kvp.Key))
                {
                    result[kvp.Key] = kvp.Value;
                }
                else
                {
                    result[kvp.Key] += kvp.Value;
                }
            }
        }
    });

    stopwatch.Stop();

    Console.WriteLine("multi-threaded {0} ms", stopwatch.ElapsedMilliseconds);
    foreach (var kvp in result)
    {
        Console.WriteLine("{0} {1}", kvp.Key, kvp.Value);
    }
}

On my system that gives the following results (for a release build, .NET 6):

single-threaded 50 ms
qqq 1000000
aaa 5000
multi-threaded 26 ms
qqq 1000000
aaa 5000

It's only a little faster... whether that's worth it is for you to decide.
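For reference, the same counting can also be written with PLINQ's `GroupBy`. This is just a sketch to benchmark against, not a claim that it is faster: the grouping machinery has its own overhead, and with per-item work this cheap it may well lose to the Partitioner version.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class PlinqFrequencies
{
    // Counts string occurrences with PLINQ. Far less code than manual
    // partitioning, but measure before preferring it: GroupBy allocates
    // a group per distinct key and merges them across worker threads.
    public static Dictionary<string, int> GetFrequencies(IEnumerable<string> strs)
    {
        return strs.AsParallel()
                   .GroupBy(s => s)
                   .ToDictionary(g => g.Key, g => g.Count());
    }

    public static void Main()
    {
        var strs = new List<string>();
        for (int i = 0; i < 1000; i++) strs.Add("qqq");
        for (int i = 0; i < 5; i++) strs.Add("aaa");

        var freqs = GetFrequencies(strs);
        Console.WriteLine("qqq {0}", freqs["qqq"]); // qqq 1000
        Console.WriteLine("aaa {0}", freqs["aaa"]); // aaa 5
    }
}
```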

CodePudding user response:

Here is another approach that has similarities with Matthew Watson's Partitioner-based solution, but uses a different API. It uses an advanced Parallel.ForEach overload with the signature shown below:

/// <summary>
/// Executes a foreach (For Each in Visual Basic) operation with thread-local data
/// on an System.Collections.IEnumerable in which iterations may run in parallel,
/// loop options can be configured, and the state of the loop can be monitored and
/// manipulated.
/// </summary>
public static ParallelLoopResult ForEach<TSource, TLocal>(
    IEnumerable<TSource> source,
    ParallelOptions parallelOptions,
    Func<TLocal> localInit,
    Func<TSource, ParallelLoopState, TLocal, TLocal> body,
    Action<TLocal> localFinally);

A local dictionary is used as the TLocal; it holds the partial results calculated by a single worker thread:

static Dictionary<string, int> GetFrequencies(List<string> source)
{
    Dictionary<string, int> frequencies = new();
    ParallelOptions options = new()
    {
        MaxDegreeOfParallelism = Environment.ProcessorCount
    };
    Parallel.ForEach(source, options, () => new Dictionary<string, int>(),
        (item, state, partialFrequencies) =>
    {
        ref int occurences = ref CollectionsMarshal.GetValueRefOrAddDefault(
            partialFrequencies, item, out bool exists);
        occurences++;
        return partialFrequencies;
    }, partialFrequencies =>
    {
        lock (frequencies)
        {
            foreach ((string item, int partialOccurences) in partialFrequencies)
            {
                ref int occurences = ref CollectionsMarshal.GetValueRefOrAddDefault(
                    frequencies, item, out bool exists);
                occurences += partialOccurences;
            }
        }
    });
    return frequencies;
}

The above code also demonstrates the low-level CollectionsMarshal.GetValueRefOrAddDefault API (in the System.Runtime.InteropServices namespace, available since .NET 6), which allows searching and updating a dictionary with a single hashing of the key.
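To make that point concrete, here is a minimal single-threaded sketch: the usual ContainsKey-then-indexer pattern hashes the key twice per item, while GetValueRefOrAddDefault hashes it once and hands back a writable reference into the dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;

public class SingleLookupCount
{
    // GetValueRefOrAddDefault returns a ref to the existing value, or adds
    // a default(int) entry and returns a ref to that, after one hash lookup.
    public static Dictionary<string, int> Count(IEnumerable<string> items)
    {
        var freqs = new Dictionary<string, int>();
        foreach (var item in items)
        {
            ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(
                freqs, item, out bool exists);
            count++; // writes straight into the dictionary's internal slot
        }
        return freqs;
    }

    public static void Main()
    {
        var freqs = Count(new[] { "qqq", "qqq", "aaa" });
        Console.WriteLine("qqq {0}", freqs["qqq"]); // qqq 2
        Console.WriteLine("aaa {0}", freqs["aaa"]); // aaa 1
    }
}
```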

I didn't measure it (nor test it), but I expect it to be slower than Matthew Watson's solution, because the source is enumerated in a synchronized fashion. If you can handle the complexity, you could consider combining both approaches for optimal performance.
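A sketch of that suggested combination, with the caveat that it is untested and unbenchmarked: range partitioning removes the synchronized per-item enumeration, while the localInit/localFinally overload keeps one private dictionary per worker task across all the ranges it processes, so the lock is taken only once per worker.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public class CombinedFrequencies
{
    // Partitioner.Create(0, count) yields index ranges, and the
    // localInit/body/localFinally overload of Parallel.ForEach gives
    // each worker its own dictionary that is merged once at the end.
    public static Dictionary<string, int> GetFrequencies(List<string> strs)
    {
        Dictionary<string, int> result = new();
        Parallel.ForEach(
            Partitioner.Create(0, strs.Count),
            () => new Dictionary<string, int>(),          // per-worker local
            (range, state, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    local.TryGetValue(strs[i], out int count);
                    local[strs[i]] = count + 1;
                }
                return local;
            },
            local =>                                      // merge once per worker
            {
                lock (result)
                {
                    foreach (var kvp in local)
                    {
                        result.TryGetValue(kvp.Key, out int count);
                        result[kvp.Key] = count + kvp.Value;
                    }
                }
            });
        return result;
    }

    public static void Main()
    {
        var strs = new List<string>();
        for (int i = 0; i < 1000; i++) strs.Add("qqq");
        for (int i = 0; i < 5; i++) strs.Add("aaa");

        var freqs = GetFrequencies(strs);
        Console.WriteLine("qqq {0}", freqs["qqq"]); // qqq 1000
        Console.WriteLine("aaa {0}", freqs["aaa"]); // aaa 5
    }
}
```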
