With the .NET Regex class, is there any way to match a regular expression inside a string only if the match starts exactly at a specific character index?
Let's look at an example:
- regular expression: ab
- input string: ababab
Now, I can search for matches of the regular expression (named expr in the following) in the input string, for instance starting at character index 2:
var match = expr.Match("ababab", 2);
// match ------------->XXab
This will be successful and return a match at index 2.
If I pass index 1, this will also be successful, pointing to the same occurrence as above:
var match = expr.Match("ababab", 1);
// match ------------->X ab
Is there any efficient way to have the second test fail, because the match does not start exactly at the specified index?
Obviously, there are some work-arounds to this. As the string being tested might be long (possibly thousands of characters), I would prefer to avoid the overhead that each of the following three work-arounds would presumably incur one way or another:
# | Work-Around | Drawback |
---|---|---|
1 | I could check the resulting match to see whether its Index property matches the supplied index. | Matching throughout the entire string would still take place, at least until the first match is found (or the end of the string is reached). |
2 | I could prepend the start anchor ^ to my regular expression and always test just the substring starting at the specified index. | As the string may be very long and I might be testing the same regex on multiple starting positions (but, again, only exactly on these), I am concerned about performance drawbacks from the frequent partial copying of the long string. (Ranges might be a way out here, but unfortunately, the Regex class cannot (yet?) be used to scan them.) |
3 | I could prepend "^.{#}" (with # being replaced with the character index to test) to each expression and match from the beginning, then fish out the actually interesting match with a capturing group. | I need to test the same regex on multiple possible start positions throughout my input string. As the number of skipped characters changes each time, that would mean compiling a new regex every time rather than re-using the one that I have, which again feels somewhat unclean. |
Lastly, the Match overload that accepts a maximum length to check in addition to the start index does not seem useful: in my case, the regular expression is not fixed and may well include variable-length portions, so I have no idea in advance how long a match will be.
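For reference, work-around 1 from the table can be sketched roughly as follows. The helper name StartsAt is hypothetical and only serves as an illustration. It gives the desired result, but the engine still scans forward from the start index until it finds a match or reaches the end of the string.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true only if the first match found from `index`
// onward starts exactly at `index`.
static bool StartsAt(Regex regex, string input, int index)
{
    Match match = regex.Match(input, index);
    return match.Success && match.Index == index;
}

var expr = new Regex("ab");
Console.WriteLine(StartsAt(expr, "ababab", 2)); // True
Console.WriteLine(StartsAt(expr, "ababab", 1)); // False, but only after scanning ahead to index 2
```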
CodePudding user response:
It appears you can use the \G anchor: the pattern \Gab matches when the search starts at index 2 and fails when it starts at index 1. See this C# demo:
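using System;
using System.Text.RegularExpressions;

// One Regex instance can be reused for every start index.
var expr = new Regex(@"\Gab");
// Match never returns null, so no null-conditional operator is needed.
Console.WriteLine(expr.Match("ababab", 1).Success); // => False
Console.WriteLine(expr.Match("ababab", 2).Success); // => True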
As per the documentation, the \G anchor matches like this:
"The match must occur at the point where the previous match ended, or if there was no previous match, at the position in the string where matching started."
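Since the question mentions that the pattern is not fixed and may contain variable-length parts, here is a minimal sketch of how an arbitrary pattern might be anchored once and then tested at several start positions. The helper name CreateAnchored, the pattern ab+|c and the input abbbcab are made up for illustration. The sketch assumes the default left-to-right matching and a pattern that can safely be wrapped in a group.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: builds a regex that can only match at the start position.
// The non-capturing group keeps \G bound to every alternative of the pattern.
static Regex CreateAnchored(string pattern, RegexOptions options = RegexOptions.None)
    => new Regex(@"\G(?:" + pattern + ")", options);

Regex expr = CreateAnchored("ab+|c"); // variable-length pattern, built once
string input = "abbbcab";

foreach (int index in new[] { 0, 1, 4, 5 })
{
    // No substring copy and no new Regex per position.
    Match match = expr.Match(input, index);
    Console.WriteLine($"{index}: {match.Success} '{match.Value}'");
}
// 0: True 'abbb'
// 1: False ''
// 4: True 'c'
// 5: True 'ab'
```

The group matters for alternations: in \Gab+|c without it, the anchor would apply only to the first alternative, so c could still match anywhere after the start index.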