I ran into this problem when I tried to run my regex function on my text, which can be found here.
With an HttpRequest I fetch the text from the link above. Then I run my regex to clean up the text before counting the occurrences of a certain word.
After cleaning up the text I split the string by whitespace into a string array and noticed a huge difference in the number of elements.
Does anyone know why this happens? The correct number of occurrences of " the " in the raw data is 6806 hits.
With my regex I get 8073 hits.
The regex I'm using is here in the sandbox with the text, and below in the code.
// Application storing.
var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);

// Cleaning up a bit
var words = CleanByRegex(rawSource);
string[] arr = words.Split(" ", StringSplitOptions.RemoveEmptyEntries);

string CleanByRegex(string rawSource)
{
    Regex r = RemoveSpecialChars();
    return r.Replace(rawSource, " ");
}

// arr {string[220980]} - with regex
// arr {string[157594]} - without regex
foreach (var word in arr)
{
    // some logic
}
```
partial class Program
{
    [GeneratedRegex("(?:[^a-zA-Z0-9]|(?<=['\\\"]\\s))", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-SE")]
    private static partial Regex RemoveSpecialChars();
}
```
I have tried debugging it, and I suspect that I'm adding trailing whitespace, but I don't know how to handle it.
I have tried adding a second regex that replaces runs of multiple spaces with a single space.
The regex would look something like this: [ ]{2,}
partial class Program
{
    [GeneratedRegex("[ ]{2,}", RegexOptions.Compiled)]
    private static partial Regex RemoveWhiteSpaceTrails();
}
CodePudding user response:
It would be helpful if you described what you're trying to clean up.
However, your specific question is answerable: from the sandbox I see that you're removing newlines and punctuation. This can definitely lead to occurrences of " the " that weren't there before:
The quick brown fox jumps over the
lazy dog
//" the " does not match here, because "the" is followed by a newline rather than a space
//after regex:
The quick brown fox jumps over the lazy dog
//now there's one more occurrence of " the "
If you change your search to something less common, for example Seward, then you should see the same results before and after the regex.
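To illustrate, here is a minimal sketch. It assumes a simplified pattern, `[^a-zA-Z0-9]`, rather than the full regex from the question, and `CountOccurrences` is a hypothetical helper written only for this example:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: counts non-overlapping occurrences of needle in text.
static int CountOccurrences(string text, string needle)
{
    int count = 0;
    int index = 0;
    while ((index = text.IndexOf(needle, index, StringComparison.Ordinal)) >= 0)
    {
        count++;
        index += needle.Length;
    }
    return count;
}

string raw = "The quick brown fox jumps over the\nlazy dog";
// Simplified stand-in for the question's regex (an assumption for this sketch).
string cleaned = Regex.Replace(raw, "[^a-zA-Z0-9]", " ");

int hitsBefore = CountOccurrences(raw, " the ");
int hitsAfter = CountOccurrences(cleaned, " the ");
int tokensBefore = raw.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;
int tokensAfter = cleaned.Split(" ", StringSplitOptions.RemoveEmptyEntries).Length;

Console.WriteLine($"hits: {hitsBefore} -> {hitsAfter}");       // hits: 0 -> 1
Console.WriteLine($"tokens: {tokensBefore} -> {tokensAfter}"); // tokens: 8 -> 9
```

The same effect accounts for the larger array: "the\nlazy" is a single token when the raw text is split on spaces, but becomes two tokens once the newline has been replaced with a space.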
CodePudding user response:
The reason I believed the regex created more text when I replaced matches with string.Empty or " " is that I assumed Chrome's Ctrl+F search would find every occurrence of a word, which is not necessarily true.
I tried my code on a subset of the Lorem Ipsum text instead, because I doubted that Chrome's search really gave the correct answer.
The short answer is no.
Searching for " the " does not find a "the" that is followed by Environment.NewLine, as @simmetric showed.
Another case is a sentence that begins with the word "The ". Since I'm interested in the words in the text, I used the regex \w+ to match the words, which returns a MatchCollection (usable as IList<Match>).
I later loop through the matches to add each value to my dictionary.
Code Demonstration
using System.Text.RegularExpressions;

var dictionary = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
var rawSource = "Some text";
var words = CleanByRegex(rawSource);

// \w+ matches each run of word characters, so every match is one word.
IList<Match> CleanByRegex(string rawSource)
{
    return Regex.Matches(rawSource, "\\w+");
}

foreach (var word in words)
{
    if (word.Value.Length >= 1) // always true for \w+ matches; raise the threshold to skip short words
    {
        if (dictionary.ContainsKey(word.Value)) // if it's in the dictionary
            dictionary[word.Value] = dictionary[word.Value] + 1; // increment the count
        else
            dictionary[word.Value] = 1; // put it in the dictionary with a count of 1
    }
}
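As a quick check of this approach, here is a sketch that runs a \w+ word count over the sample sentence from the other answer. `CountWords` is a hypothetical helper name, and `TryGetValue` is used in place of the `ContainsKey` check purely as a stylistic choice:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: tallies every \w+ match, ignoring case.
static Dictionary<string, long> CountWords(string text)
{
    var tally = new Dictionary<string, long>(StringComparer.OrdinalIgnoreCase);
    foreach (Match match in Regex.Matches(text, @"\w+"))
    {
        // current is 0 when the word has not been seen yet
        tally.TryGetValue(match.Value, out long current);
        tally[match.Value] = current + 1;
    }
    return tally;
}

var counts = CountWords("The quick brown fox jumps over the\nlazy dog");
Console.WriteLine(counts["the"]); // 2
```

Both the "The" at the start of the sentence and the "the" before the newline are counted here, whereas a plain search for " the " finds neither of them.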