I'm trying to write a basic function that takes an input text, creates regex for this input and returns all output as a collection.
I wrote this:
string pattern = @"(\wh*al*re)"; // take this pattern from outside
Regex rg = new Regex(pattern, RegexOptions.IgnoreCase);
MatchCollection matchedAuthors = rg.Matches(authors);
for (int count = 0; count < matchedAuthors.Count; count++)
{
Console.WriteLine(count);
Console.WriteLine(matchedAuthors[count].Value);
}
my text --> "asdad healthcare basdasd"
But if I'm given the pattern h*al*re, my regex pattern looks like this --> (\wh*al*re)
and output is --> "are"
Expected behaviour
Input: h*al*re Output: healthcare
What is the problem in my regex ?
The solution is
(\bh\w*al\w*re)
thanks to @anubhava
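For anyone who needs to build that pattern from a wildcard that arrives "from outside", here is a minimal sketch. It assumes * is the only wildcard character and that a match should stay inside one word. The names WildcardSearch, ToRegexPattern and FindAll are made up for illustration. The capture group from the pattern above is left out because only the whole match is used.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WildcardSearch
{
    // Turns a wildcard such as "h*al*re" into the regex \bh\w*al\w*re.
    // Regex.Escape makes every character literal (so "*" becomes "\*"),
    // then each escaped "*" is swapped for \w* ("zero or more word characters").
    public static string ToRegexPattern(string wildcard) =>
        @"\b" + Regex.Escape(wildcard).Replace(@"\*", @"\w*");

    // Returns every match of the converted wildcard in the input text.
    public static string[] FindAll(string input, string wildcard) =>
        Regex.Matches(input, ToRegexPattern(wildcard), RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        Console.WriteLine(ToRegexPattern("h*al*re"));            // \bh\w*al\w*re
        foreach (string hit in FindAll("asdad healthcare basdasd", "h*al*re"))
            Console.WriteLine(hit);                              // healthcare
    }
}
```

Escaping first means other regex metacharacters in the incoming text (such as . or +) are treated literally rather than as regex syntax.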
CodePudding user response:
what is problem in my regex ?
Regex is not like DOS filename wildcards.
In DOS, h*al*re really would match "healthcare", because * in DOS means "zero or more of any character".
In Regex the meaning is subtly different: * means "zero or more of the thing to the left of the asterisk".
h* - means zero or more h characters in a row
l* - means zero or more l characters in a row
This means that h*al*re will match something like "hhhhhhhhhallllllllre" or "hhalllllllllllllllllllllllllllllllre" or (as you have found) "are", which is zero "h", then "a", then zero "l", then "re". It fully complies with a pattern that asks for zero or more "h".
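You can check this quickly with a small sketch like the one below. The class and method names (StarQuantifierDemo, FindAll) are just for illustration. It runs the bare h*al*re pattern, without the \w prefix from the question, against the question's text.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class StarQuantifierDemo
{
    // Returns every match of `pattern` in `input` (case-insensitive, as in the question).
    public static string[] FindAll(string input, string pattern) =>
        Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
             .Cast<Match>()
             .Select(m => m.Value)
             .ToArray();

    public static void Main()
    {
        // "h*" and "l*" are both allowed to match nothing, so "are" qualifies.
        foreach (string hit in FindAll("asdad healthcare basdasd", "h*al*re"))
            Console.WriteLine(hit);                              // are

        // Runs of repeated "h" and "l" are accepted too.
        foreach (string hit in FindAll("hhalllre", "h*al*re"))
            Console.WriteLine(hit);                              // hhalllre
    }
}
```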
What you need to do is combine * with another Regex construct such as . which means "any single character".
When you put .* it means "match zero or more of: any single character".
Thus your Regex to match "healthcare" is h.*al.*re
Note that it would also match heealthcare, hzzzzzzalzzzzzzre etc.
The next thing you have to contend with is the concept of greedy vs pessimistic (also called lazy) matching.
.* is greedy; it tries to match as much as possible. This means it consumes the entire input then spits it back out a char at a time (backtracking) trying to make the match succeed.
If you had a sentence of "the biggest issue in healthcare is that healthcare providers are overloaded everywhere" and you ran a regex like h.*a.*re on it, the match ends up being "he biggest issue in healthcare is that healthcare providers are overloaded everywhere".
The fixed characters in the regex line up with the "h" in "the", the "a" in "overloaded" and the "re" at the end of "everywhere"; everything in between is what the two .* are matching. This is what you get when you try to match as much as possible.
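You probably want pessimistic matching, where the matcher tries to match as little as possible rather than as much as possible. For that you need another modifier to change the behavior of the *, which is done by putting a ? after the *
.*? will modify the * so that rather than consuming the entire input and then working backwards, it works forwards looking for a match, so on your "asdad healthcare basdasd" text h.*?a.*?re matches just "healthcare". In the longer sentence above it would still start at the first "h" it can find (the one in "the") and give "he biggest issue in healthcare"; a \b before the h and \w in place of . keep the match inside one word. It also matches "hare".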
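Here is a small sketch that puts the greedy and lazy forms side by side. GreedyVsLazyDemo and FirstMatch are illustrative names, not anything from the question.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Returns the first match of `pattern` in `input`, or an empty string when there is none.
    public static string FirstMatch(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : "";
    }

    public static void Main()
    {
        const string sentence =
            "the biggest issue in healthcare is that healthcare providers are overloaded everywhere";

        // Greedy: runs from the "h" in "the" to the final "re" in "everywhere".
        Console.WriteLine(FirstMatch(sentence, "h.*a.*re"));

        // Lazy: still starts at the "h" in "the", but stops at the first "re" that follows an "a".
        Console.WriteLine(FirstMatch(sentence, "h.*?a.*?re"));   // he biggest issue in healthcare

        // In the question's text the first "h" is in "healthcare", so the lazy match is just that word.
        Console.WriteLine(FirstMatch("asdad healthcare basdasd", "h.*?a.*?re")); // healthcare
    }
}
```

Laziness only changes how far each .* stretches. The engine still takes the leftmost place where a match can begin.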
To this end you might want to consider not using * at all but instead using something more specific, like:
h.+?al.+?re // + means "one or more of the thing to the left"
h.{1}al.{4}re // {n} means "exactly n of the thing to the left"; "healthcare" has 1 character between "h" and "al" and 4 between "al" and "re"
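A rough sketch of how the stricter quantifiers behave is below; SpecificQuantifierDemo and Matches are illustrative names. Note that "healthcare" has one character between "h" and "al" and four between "al" and "re", so those are the counts used with {n}.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpecificQuantifierDemo
{
    // True when `pattern` matches somewhere inside `input`.
    public static bool Matches(string input, string pattern) =>
        Regex.IsMatch(input, pattern);

    public static void Main()
    {
        // "*?" allows zero characters in each gap, so "halre" gets through.
        Console.WriteLine(Matches("halre", "h.*?al.*?re"));          // True
        // "+?" requires at least one character in each gap.
        Console.WriteLine(Matches("halre", "h.+?al.+?re"));          // False
        Console.WriteLine(Matches("healthcare", "h.+?al.+?re"));     // True
        // {n} pins the gap sizes: one character, then four.
        Console.WriteLine(Matches("healthcare", "h.{1}al.{4}re"));   // True
        Console.WriteLine(Matches("heealthcare", "h.{1}al.{4}re"));  // False
    }
}
```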
But the main takeaway: ditch everything you know about wildcards from DOS etc. if you're getting into learning Regex.