C# Regex whitespace between capturing groups

Time:02-20

So basically, my input string is some kind of text containing keywords that I want to match, provided that:

  1. each keyword may have whitespace/non-word chars pre/appended, or none (|\s\W)
  2. there must be exactly one non-word/whitespace char separating multiple keywords, or the keyword is at the beginning/end of the line
  3. A keyword simply occurring as a substring does not count, e.g. bar does not match foobarbaz

E.g.:

input:    "#foo barbazboo tree car"
keywords: {"foo", "bar", "baz", "boo", "tree", "car"}

I am dynamically generating a Regex in C# from an enumerable of keywords, using a StringBuilder:

StringBuilder sb = new();
foreach (var kwd in keywords)
{
   sb.Append($"((|[\\s\\W]){kwd}([\\s\\W]|))|");
}
sb.Remove(sb.Length - 1, 1); // last '|'
_regex = new Regex(sb.ToString(), RegexOptions.Compiled | RegexOptions.IgnoreCase);

Testing this pattern on regexr.com, the given input matches all keywords. However, I do not want {bar, baz, boo} included, since there is no whitespace between those keywords. Ideally, I'd want my regex to only match {foo, tree, car}.

Modifying my pattern to (( |[\s\W])kwd([\s\W]| )) causes {bar, baz, boo} not to be included, but it gives wrong results for {tree, car}: each keyword now consumes a separator on both sides, so two adjacent keywords would need at least two spaces between them.
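To make that failure concrete, here is a minimal sketch of the modified variant. It is simplified to [\s\W] on both sides, since \s already covers the literal space, and the class and method names are only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class ConsumedSeparatorDemo
{
    // Builds the variant with a mandatory separator on BOTH sides of every keyword
    // and returns the raw match values (separators included).
    public static List<string> Find(string input, IEnumerable<string> keywords)
    {
        var pattern = string.Join("|",
            keywords.Select(k => $@"[\s\W]{Regex.Escape(k)}[\s\W]"));
        return Regex.Matches(input, pattern, RegexOptions.IgnoreCase)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }

    static void Main()
    {
        var keywords = new[] { "foo", "bar", "baz", "boo", "tree", "car" };
        foreach (var hit in Find("#foo barbazboo tree car", keywords))
            Console.WriteLine($"[{hit}]");
        // The match for "tree" swallows the space that "car" would need in front
        // of it, and "car" has no separator after it, so "car" is never reported.
    }
}
```

The two matches it finds are "#foo " and " tree ", each including its separators, which is exactly the "two spaces needed" problem described above.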

How do I specify "there may be only one whitespace separating two keywords", or, to put it differently, "half a whitespace is ok", while preserving the ability to create the regex dynamically?

CodePudding user response:

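In your case, you need to build the pattern with word boundaries (\b), which are zero-width and therefore do not consume the separator, around an alternation of the escaped keywords: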

var pattern = $@"\b(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})\b";
_regex = new Regex(pattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

Here, the keywords are sorted so that longer ones come before shorter ones. So, if you have foo, bar and foo bar, the pattern will look like \b(?:foo\ bar|foo|bar)\b and will match foo bar as a whole wherever it occurs, rather than foo and bar separately.
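As a quick check, the following sketch runs that pattern against the input from the question and against the foo bar case. The helper names Build and Find are assumptions made for this example, not part of any API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class WordBoundaryDemo
{
    // Builds \b(?:kw1|kw2|...)\b with longer keywords placed first.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return new Regex($@"\b(?:{alternation})\b", RegexOptions.IgnoreCase);
    }

    // Collects the matched text of every match, in order of appearance.
    public static List<string> Find(Regex regex, string input) =>
        regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

    static void Main()
    {
        var regex = Build(new[] { "foo", "bar", "baz", "boo", "tree", "car" });
        Console.WriteLine(string.Join(", ", Find(regex, "#foo barbazboo tree car")));
        // prints: foo, tree, car

        var phrase = Build(new[] { "foo", "bar", "foo bar" });
        Console.WriteLine(string.Join(", ", Find(phrase, "foo bar")));
        // prints: foo bar
    }
}
```

Because \b is zero-width, the single space between tree and car serves as the boundary for both keywords, which is the "half a whitespace" behaviour asked for.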

In case your keywords can look like keywords: {"$foo", "^bar^", "[baz]", "(boo)", "tree ", " car"}, i.e. they can have special chars at the start/end of the keyword, you can use

_regex = new Regex($@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)", RegexOptions.Compiled | RegexOptions.IgnoreCase);

The $@"(?!\B\w)(?:{string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))})(?<!\w\B)" is an interpolated verbatim string literal that contains

  • (?!\B\w) - left-hand adaptive dynamic word boundary: a word boundary is required only if the keyword starts with a word char
  • (?: - start of a non-capturing group:
    • {string.Join("|", keywords.OrderByDescending(x => x.Length).Select(Regex.Escape))} - sorts the keywords by length in descending order, escapes them and joins them with |
  • ) - end of the group
  • (?<!\w\B) - right-hand adaptive dynamic word boundary: a word boundary is required only if the keyword ends with a word char.
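A small sketch of how the adaptive boundaries behave. The shortened keyword list, the sample input and the helper names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

static class AdaptiveBoundaryDemo
{
    // Wraps the escaped, longest-first alternation in the adaptive boundaries.
    public static Regex Build(IEnumerable<string> keywords)
    {
        var alternation = string.Join("|",
            keywords.OrderByDescending(k => k.Length).Select(Regex.Escape));
        return new Regex($@"(?!\B\w)(?:{alternation})(?<!\w\B)", RegexOptions.IgnoreCase);
    }

    public static List<string> Find(Regex regex, string input) =>
        regex.Matches(input).Cast<Match>().Select(m => m.Value).ToList();

    static void Main()
    {
        // Hypothetical keywords: two with special chars at the edges, one plain word.
        var regex = Build(new[] { "$foo", "[baz]", "tree" });
        var input = "cost: $foo, [baz]x, $foobar, street tree";

        Console.WriteLine(string.Join(" | ", Find(regex, input)));
        // prints: $foo | [baz] | tree
        // "$foobar" is rejected: "$foo" ends with a word char followed by another one.
        // "[baz]x" is accepted: "]" is not a word char, so no right boundary is required.
        // "street" is rejected: "tree" starts with a word char preceded by another one.
    }
}
```

With a plain \b on both sides instead, "$foo" would only match when preceded by a word char and "[baz]" only when surrounded by word chars, which is usually not what is wanted for such keywords.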