So basically, my input string is some kind of text containing keywords that I want to match, provided that:
- each keyword may have whitespace/non-word chars pre/appended, or none: (|\s\W)
- there must be exactly one non-word/whitespace char separating multiple keywords, or the keyword is at the beginning/end of the line
- A keyword simply occurring as a substring does not count, e.g. bar does not match foobarbaz
E.g.:
input: "#foo barbazboo tree car"
keywords: {"foo", "bar", "baz", "boo", "tree", "car"}
I am dynamically generating a Regex in C# using an enumerable of keywords and a StringBuilder:
StringBuilder sb = new();
foreach (var kwd in keywords)
{
sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
}
sb.Remove(sb.Length - 1, 1); // last '|'
_regex = new Regex(sb.ToString(), RegexOptions.Compiled | RegexOptions.IgnoreCase);
Testing this pattern on regexr.com, the given input matches all keywords. However, I do not want {bar, baz, boo} included, since there is no whitespace between each keyword.
Ideally, I'd want my regex to only match {foo, tree, car}.
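For reference, the behaviour described above can be reproduced with a small sketch like the one below. The class and method names are made up for illustration, and string.Join stands in for the StringBuilder loop; the generated pattern should be the same.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class OriginalPatternRepro
{
    public static List<string> Matches(string input, IEnumerable<string> keywords)
    {
        // One alternative per keyword, each with an optional
        // whitespace/non-word character on either side.
        var pattern = string.Join("|",
            keywords.Select(kwd => $@"((|[\s\W]){kwd}([\s\W]|))"));
        var regex = new Regex(pattern, RegexOptions.IgnoreCase);

        // Return the raw matched text, including any surrounding character.
        return regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

Because both the leading and the trailing character are optional, nothing stops bar, baz and boo from matching back to back inside barbazboo.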
Modifying my pattern to (( |[\s\W])kwd([\s\W]| )) causes {bar, baz, boo} not to be included, but produces bogus results on {tree, car}, since in that case there must be at least two spaces between keywords.
How do I specify "there may be only one whitespace separating two keywords", or, to put it differently, "half a whitespace is ok", while preserving the ability to create the regex dynamically?
CodePudding user response:
In your case, you need to build the pattern like this:
var pattern = $@"\b(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})\b";
_regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);
Here, you are getting the longer keywords before the shorter ones, so, if you have foo, bar and foo bar, the pattern will look like \b(?:foo\ bar|foo|bar)\b and will match foo bar, and not foo and bar, once there is such a match.
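As a rough sketch of how this plays out on the input from the question, the helper below wraps the same pattern in a method. The KeywordMatcher class and FindKeywords method are hypothetical names used only for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class KeywordMatcher
{
    public static List<string> FindKeywords(string input, IEnumerable<string> keywords)
    {
        // Longest keywords first, each escaped so it is matched literally.
        var alternation = string.Join("|",
            keywords.OrderByDescending(x => x.Length).Select(Regex.Escape));

        // \b on both sides rejects keywords that are glued to other word chars.
        var regex = new Regex($@"\b(?:{alternation})\b", RegexOptions.IgnoreCase);

        return regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

With the input "#foo barbazboo tree car" and the keywords from the question, this should return foo, tree and car: bar starts at a word boundary inside barbazboo but is followed by another word character, so the closing \b fails, and baz and boo never start at a boundary.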
In case your keywords can look like keywords: {"$foo", "^bar^", "[baz]", "(boo)", "tree ", " car"}, i.e. they can have special chars at the start/end of the keyword, you can use
_regex = new Regex($@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)", RegexOptions.Compiled | RegexOptions.IgnoreCase);
The $@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)" is an interpolated verbatim string literal that contains
- (?!\B\w) - left-hand adaptive dynamic word boundary
- (?: - start of a non-capturing group
- {string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))} - sorts the keywords by length in descending order, escapes them and joins them with |
- ) - end of the group
- (?<!\w\B) - right-hand adaptive dynamic word boundary.
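To illustrate the adaptive boundaries, here is a minimal sketch using keywords that start or end with non-word characters. The AdaptiveKeywordMatcher class and its FindKeywords method are hypothetical names introduced just for this example.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AdaptiveKeywordMatcher
{
    public static List<string> FindKeywords(string input, IEnumerable<string> keywords)
    {
        // Longest keywords first, each escaped so it is matched literally.
        var alternation = string.Join("|",
            keywords.OrderByDescending(x => x.Length).Select(Regex.Escape));

        // (?!\B\w)  : if the keyword starts with a word char, require a word boundary before it.
        // (?<!\w\B) : if the keyword ends with a word char, require a word boundary after it.
        var regex = new Regex($@"(?!\B\w)(?:{alternation})(?<!\w\B)", RegexOptions.IgnoreCase);

        return regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();
    }
}
```

For the keywords $foo and [baz] and the input "pay $foo and $foobar, see [baz]", this should return $foo and [baz]. The $foo inside $foobar is rejected because it ends in a word character that is directly followed by another one. A plain \b on each side would not work here, since \b before $ or [ requires a word character on the other side.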