Why is my Regex for removing special characters adding more words to my text?

I encountered the problem when I tried to run my regex function on my text, which can be found here.

I fetch the text from the link above with an HttpRequest, then run my regex to clean it up before counting the occurrences of a given word.

After cleaning up the text, I split the string on whitespace into a string array and noticed a huge difference in the number of elements.

Does anyone know why this happens? In the raw data, the word " the " occurs 6806 times, which is the correct answer.

With my regex applied, I get 8073 hits.

The regex I'm using is here in the sandbox with the text, and below in the code.

```
// Application storage.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit
var words = CleanByRegex(rawSource);

string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);

string CleanByRegex(string rawSource)
{
    Regex r = RemoveSpecialChars();
    return r.Replace(rawSource, " ");
}

//  arr {string[220980]} - with regex
//  arr {string[157594]} - without regex

foreach (var word in arr)
{
    // some logic
}
```


```
partial class Program
{
    [GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
    private static partial Regex RemoveSpecialChars();
}
```

I have tried debugging it and suspect that I'm adding trailing whitespace, but I don't know how to handle it.

I have also tried adding a whitespace-removing regex that replaces runs of multiple spaces with a single space.

The regex would look something like this: [ ]{2,}

```
partial class Program
{
    [GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
    private static partial Regex RemoveWhiteSpaceTrails();
}
```

CodePudding user response：

It would be helpful if you described what you're trying to clean up. However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely lead to occurrences of " the " that weren't there before:

```
The quick brown fox jumps over the
lazy dog
// " the " does not match here: "the" is followed by a newline, not a space

// after regex:
The quick brown fox jumps over the lazy dog
// now there is one more occurrence of " the "
```

If you change your search to something not so common, for example Seward, then you should see the same results before and after the regex.
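
The effect can be reproduced on a tiny input. The following is a minimal sketch, not the asker's actual program: the sample text, the simplified pattern `[^a-zA-Z0-9]`, and the helper `CountOccurrences` are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: the first lowercase "the" is followed by a newline, not a space.
string raw = "The quick brown fox jumps over the\nlazy dog. Is the dog's bed warm?";

// Simplified stand-in for the question's pattern: every non-alphanumeric
// character (newlines and punctuation included) becomes a space.
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

// Illustrative helper: counts non-overlapping occurrences of a substring.
static int CountOccurrences(string text, string needle)
{
    int count = 0;
    int index = 0;
    while ((index = text.IndexOf(needle, index, StringComparison.Ordinal)) >= 0)
    {
        count++;
        index += needle.Length;
    }
    return count;
}

int hitsBefore = CountOccurrences(raw, " the ");
int hitsAfter = CountOccurrences(cleaned, " the ");

// Splitting on a single space: newlines do not separate tokens in the raw text.
int tokensBefore = raw.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;
int tokensAfter = cleaned.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;

Console.WriteLine($"\" the \" hits: {hitsBefore} -> {hitsAfter}"); // 1 -> 2
Console.WriteLine($"tokens: {tokensBefore} -> {tokensAfter}");     // 13 -> 15
```

The token count grows for two reasons: "the\nlazy" is a single token until the newline becomes a space, and "dog's" turns into "dog" and "s" once the apostrophe is replaced. Extra spaces alone do not add tokens, because `StringSplitOptions.RemoveEmptyEntries` drops the empty entries.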

CodePudding user response：

The reason I believed the regex created more text when I replaced matches with string.Empty or " " is that I assumed Chrome's Ctrl+F search would find every occurrence of a word, and this isn't necessarily true.

I tried my code on a subset of the lorem ipsum text instead, because I questioned whether the Chrome search really gives the correct answer.

The short answer is NO. If I search for " the ", I won't get the occurrences of "the" followed by Environment.NewLine, as @simmetric showed.

Another scenario is sentences that begin with the word "The ". Since I am curious about the words in the text, I used the regex \w+ to get the words. It returns a MatchCollection (an IList<Match>), which I later loop through to add each value to my dictionary.

Code Demonstration

```
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
var rawSource = "Some text";
var words = CleanByRegex(rawSource);

// Returns every run of word characters (letters, digits, underscore) as a match.
IList<Match> CleanByRegex(string rawSource)
{
    IList<Match> r = Regex.Matches(rawSource, "\\w+");
    return r;
}

foreach (var word in words)
{
    if (word.Value.Length >= 1) // skip empty matches
    {
        if (dictionary.ContainsKey(word.Value)) // already in the dictionary
            dictionary[word.Value] = dictionary[word.Value] + 1; // increment the count
        else
            dictionary[word.Value] = 1; // first occurrence: add with a count of 1
    }
}
```
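
For counting a single word, a whole-word match sidesteps the whitespace problem entirely. This is a minimal sketch, assuming the goal is to count "the" regardless of case and of what surrounds it; the sample text and the `\bthe\b` pattern are illustrative and not part of the code above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample covering the cases discussed above: sentence-initial
// "The", "the" before a newline, and "the" between two spaces.
string text = "The fox jumps over the\nlazy dog. Then the dog sleeps.";

// \b is a word boundary, so "Then" is not counted as "the".
int count = Regex.Matches(text, @"\bthe\b", RegexOptions.IgnoreCase).Count;

Console.WriteLine(count); // 3
```

Because the match does not depend on surrounding spaces, the count is the same whether or not newlines and punctuation have been replaced beforehand.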