I'm looking for ideas and/or improvements of how to loop through a very large `List` in C#.
I have a List<string>
named excludes
that contains over 4 million strings.
I have a string named message
which is a string generated by user input.
I need to check if the user input (message) contains any of the strings in the list (excludes) and if so, remove the matched string(s) found in the string.
Example:
private static List<string> excludes = new List<string>(); // --> 4 million entries
string message = "Hello, how are you this fine day?\r\nMy name is SO."; // User input
foreach (string exclude in excludes)
{
if (message.Contains(exclude))
{
message = message.Replace(exclude, "");
}
}
On avarage this process takes about 350ms to complete because of the size of the list.
Is there anyway I could improve my code and reduce the time required to complete this task?
CodePudding user response:
I'd look at a Bloom filters concept. It works almost at O(1)
and uses very few memory.
For example, there's is a code sample in C# and a NuGet package.
But pay attention it's a probabilistic data structure and not always yields to a correct result.
CodePudding user response:
You could load the strings into a Trie. Looking up a word to see if it is an excluded word would take at most N comparisons where N is the length of the word.
Here are some resources:
https://www.geeksforgeeks.org/trie-insert-and-search/
https://en.wikipedia.org/wiki/Trie
CodePudding user response:
Here are a few performance boosts, for which I'll give timings on my machine.
- Just try a replacement, without checking if it's contained
foreach(var exclude in _list)
{
message = message.Replace(exclude, "");
}
- You could parallelize it.
This will hopefully speed it up by a factor the number of cores you have, although high numbers of replacements will slow it down, due to locking.
var lockObject = new object();
Parallel.ForEach(excludes, exclude => {
if (message.Contains(exclude))
{
lock(lockObject)
{
message = message.Replace(exclude, "");
}
}
});
- You could parallelize it and just replace. Bit more complicated, as it requires a spin-lock in case of replacement
Parallel.ForEach(_list, exclude => {
string message2;
string message3;
do
{
message2 = message;
message3 = message2.Replace(exclude, "");
if(object.ReferenceEquals(message2, message3))
return;
}while(Interlocked.CompareExchange(ref message, message3, message2) != message2);
});
Method | Timings (sec) |
---|---|
ContainReplace original | 0.43 |
JustReplace | 0.3 |
ParallelContainReplace | 0.12 |
ParallelJustReplace | 0.1 |
CodePudding user response:
Disclaimer: I haven't looked at the word list-it didn't load on my cellphone. I imagine, from the comments, it's like a csv list of 4 million "words"; the sort of "words" you might see if you had a million monkeys bashing on a million typewriters, probably because that's some corollary to how they were actually generated