Loop through very large List<string> with over 4 million entries in C#

I'm looking for ideas and/or improvements for looping through a very large `List` in C#.

I have a `List<string>` named `excludes` that contains over 4 million strings.
I have a `string` named `message` that is generated from user input.

I need to check whether the user input (`message`) contains any of the strings in the list (`excludes`) and, if so, remove the matched string(s) from `message`.

Example:

private static List<string> excludes = new List<string>(); // --> 4 million entries
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input

foreach (string exclude in excludes)
{
    if (message.Contains(exclude))
    {
        message = message.Replace(exclude, "");
    }
}

On average, this process takes about 350 ms to complete because of the size of the list.
Is there any way I could improve my code and reduce the time required to complete this task?

CodePudding user response:

I'd look at the Bloom filter concept. Lookups run in roughly O(1) time and it uses very little memory.

For example, there is a code sample in C# and a NuGet package.

But be aware that it is a probabilistic data structure and does not always yield a correct result: it can report false positives, though never false negatives.
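
A minimal, hand-rolled sketch of the idea follows, assuming the exclusions are whole words separated by spaces. The `BloomFilter` class, its sizes, and the splitting of `message` are illustrative assumptions for this sketch, not the NuGet package mentioned above; a real filter would size the bit array and hash count from the number of items and the acceptable false-positive rate.

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int bitCount, int hashCount)
    {
        _bits = new BitArray(bitCount);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    // May report a false positive, but never a false negative.
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false;
        return true;
    }

    // Derive the k bit positions from two base hashes (double hashing).
    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = Fnv1a(item);
        for (int i = 0; i < _hashCount; i++)
            yield return (int)(unchecked((uint)(h1 + i * h2)) % (uint)_bits.Length);
    }

    // Simple FNV-1a hash used as the second hash function.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in s)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)hash;
        }
    }
}

// Usage sketch: build the filter once, then test each word of the message.
// A "might contain" hit can be a false positive, so confirm it against the
// real list (or a HashSet built from it) if exact behaviour matters.
var filter = new BloomFilter(bitCount: 64_000_000, hashCount: 4);
foreach (string exclude in excludes)
    filter.Add(exclude);

message = string.Join(" ",
    message.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
           .Where(word => !filter.MightContain(word)));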

CodePudding user response:

You could load the strings into a Trie. Looking up a word to see whether it is an excluded word takes at most N comparisons, where N is the length of the word; see the sketch after the links below.

Here are some resources:

https://www.geeksforgeeks.org/trie-insert-and-search/

https://en.wikipedia.org/wiki/Trie
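
Below is a minimal sketch of that idea in C#, assuming the exclusions are whole words matched case-sensitively; the `Trie`/`TrieNode` classes and the space-splitting of `message` are illustrative, not taken from a particular library.

using System;
using System.Collections.Generic;
using System.Linq;

class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new Dictionary<char, TrieNode>();
    public bool IsWord { get; set; }
}

class Trie
{
    private readonly TrieNode _root = new TrieNode();

    public void Insert(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
            {
                next = new TrieNode();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsWord = true;
    }

    // At most word.Length dictionary lookups per query.
    public bool Contains(string word)
    {
        var node = _root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out node))
                return false;
        }
        return node.IsWord;
    }
}

// Usage sketch: build the trie once from the exclude list, then filter the
// words of the message instead of scanning all 4 million entries per word.
var trie = new Trie();
foreach (string exclude in excludes)
    trie.Insert(exclude);

message = string.Join(" ",
    message.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
           .Where(word => !trie.Contains(word)));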

CodePudding user response:

Here are a few performance boosts, for which I'll give timings on my machine.

  • Just try the replacement, without first checking whether the string is contained:
foreach (var exclude in excludes)
{
    message = message.Replace(exclude, "");
}
  • You could parallelize it.
    This will hopefully speed it up by a factor of the number of cores you have, although a high number of replacements will slow it down due to locking:
var lockObject = new object();
Parallel.ForEach(excludes, exclude => {
    if (message.Contains(exclude))
    {
        lock(lockObject)
        {
            message = message.Replace(exclude, "");
        }
    }
});
  • You could parallelize it and just replace. This is a bit more complicated, as it requires a compare-and-swap retry loop when a replacement happens:
Parallel.ForEach(excludes, exclude => {
    string message2;
    string message3;
    do
    {
        // Snapshot the current message and apply the replacement to the snapshot.
        message2 = message;
        message3 = message2.Replace(exclude, "");
        // String.Replace returns the same instance when nothing was replaced.
        if (object.ReferenceEquals(message2, message3))
            return;
        // Publish the result only if message has not changed in the meantime;
        // otherwise retry against the new value.
    } while (!object.ReferenceEquals(Interlocked.CompareExchange(ref message, message3, message2), message2));
});
Method                      Timings (sec)
ContainReplace (original)   0.43
JustReplace                 0.3
ParallelContainReplace      0.12
ParallelJustReplace         0.1

CodePudding user response:

Disclaimer: I haven't looked at the word list; it didn't load on my cellphone. I imagine, from the comments, that it's something like a CSV list of 4 million "words": the sort of "words" you might see if you had a million monkeys bashing on a million typewriters, probably because that's some corollary to how they were actually generated.
