I am trying to match comments, which begin with a semicolon, unless the semicolon is surrounded by unescaped double quotes (dquotes).
Note that dquotes can be escaped by doubling them up ("").
Such escaped dquotes behave as completely different characters, i.e. they cannot surround the semicolon and disable its comment-starting function.
With Bubble's help, I have gotten as far as the regex below, which fails to correctly treat a trailing escaped dquote in the last test vector line.
^(?>(?:""[^""\n]*""|[^;""\n]+)*)""?[^"";\n]*(;.*)
See it run here
Test vectors:
Peekaboo ; A comment starts with a semicolon and continues till the EOL
Unless the semicolon is surrounded by dquotes "Don't do it ; here" ;but match me; once
Im not surrounded "so pay attention to me" ; "peekaboo"
Im not surrounded "so pay attention" to;me" ; "peekaboo"
Im not surrounded "so pay attention to me ; peekaboo
Dquote escapes a dquote so "dont pay attention to ""me;here"" buster" do it ; here
Don't pay attention to """me;here""" but do ""it;here""
and "dont do ""it;here""" either ;peekaboo
but "pay attention to "it;here"" ;not here though
Simon said "I like goats" then he added "and sheep;" ;a good comment is "here
Simon said "I like goats" then he added "and sheep;" dont do it here
Simon said ""I like goats;"peekaboo
Simon said "I like goats;""peekaboo
CodePudding user response:
The task is to find comments starting with a ; semicolon outside quotes, taking into account "" escaped quotes and a possible unclosed quote before the semicolon. This approach works for the test cases provided so far.
^((?>(?:(?:[^;"\n]*"(?>(?:""|[^"\n]+)*)")+)?)[^";\n]*"?[^";\n]*);.*
See this demo at regex101. The first capturing group $1 contains the part up to the desired ; comment start. To remove the comment, just replace the full match with the captured substring.
If replacements are done on single lines, all the \n newlines can be dropped from the pattern.
regex-part | matches |
---|---|
(?> ...) | denotes an atomic group, used to prevent any further backtracking |
[^ ...] | a negated character class matches a single character not in the listed set |
( ...) and (?: ...) | capturing and non-capturing groups (latter for repetition or alternation) |
CodePudding user response:
C# Windows Console App - .NET Framework 4.8
using System;
using System.Text.RegularExpressions;
namespace RegExTest {
internal class Program {
private static string _Text = @"{1}Peekaboo ; A comment starts with a semicolon and continues till the EOL{1}
{2}Unless the semicolon is surrounded by dquotes ""Don't do it ; here"" ;but match me; once{2}
{3}Im not surrounded ""so pay attention to me"" ; ""peekaboo""{3}
{4}Im not surrounded ""so pay attention"" to;me"" ; ""peekaboo""{4}
{5}Im not surrounded ""so pay attention to me ; peekaboo{5}
{6}Dquote escapes a dquote so ""dont pay attention to """"me;here"""" buster"" do it ; here{6}
{7}Don't pay attention to """"""me;here"""""" but do """"it;here""""{7}
{8}and ""dont do """"it;here"""""" either ;peekaboo{8}
{9}but ""pay attention to ""it;here"""" ;not here though{9}
{10}Simon said ""I like goats"" then he added ""and sheep;"" ;a good comment is ""here{10}
{11}Simon said ""I like goats"" then he added ""and sheep;"" dont do it here{11}
{12}Simon said """"I like goats;""peekaboo{12}
{13}Simon said ""I like goats;""""peekaboo{13}
";
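        private static void Main(string[] args) {
            // Split on both CRLF and LF so the result does not depend on the source file's line endings.
            foreach(var Line in _Text.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)) {
                // Find the first semicolon that starts a comment and match through the end of the line.
                Match Result = Regex.Match(Line, @"((?<=^([^""]*""[^""]*"")*[^""]*);|;(?!(.*[^""])?""("""")*([^""].*)?$)).*$");
                if(Result.Success) {
                    Console.WriteLine(Result.Value);
                }
            }
            Console.ReadLine(); // Keep the console window open.
        }
    }
}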
Output:
; A comment starts with a semicolon and continues till the EOL{1}
;but match me; once{2}
; "peekaboo"{3}
;me" ; "peekaboo"{4}
; peekaboo{5}
; here{6}
;here""{7}
;peekaboo{8}
;here"" ;not here though{9}
;a good comment is "here{10}
;"peekaboo{12}
;""peekaboo{13}
RegEx:
((?<=^([^""]*""[^""]*"")*[^""]*);|;(?!(.*[^""])?""("""")*([^""].*)?$)).*$
( // Keep | section separated from capture to line end
(?<= // Start look behind
^ // Must match starting at line start
( // Start .*".*" search
[^""]* // Look for non double quotes
"" // Look double quote
[^""]* // Look for non double quotes
"" // Look double quote
)* // Repeat for any number of double-quote pairs
[^""]* // Look for non double quotes
) // End look behind
; // Match this semicolon if preceded by an even number of double quotes.
| // OR divider
; // Match this semicolon if no odd-length run of double quotes follows it.
(?! // Negative look ahead: odd number of consecutive double quotes
( // Optional: Match anything or nothing followed by non double quote
.* // Anything or nothing
[^""] // Non double quote
)? // Optional
"" // Double quote
("""")* // Any number of double quote pairs, or nothing
( // Optional: Match non double quote followed by anything or nothing
[^""] // Non double quote
.* // Anything or nothing
)? // Optional
$ // Match line end
) // End of negative look ahead
) // End of OR division
.*$ // Capture everything to line end
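The two branches of this regex can also be written as ordinary code, which may make the logic easier to check against the output above. The sketch below is a hand-written Python re-implementation, not the C# program itself. The name find_comment is made up for the example. It reads branch 1 as "an even number of double quotes precedes the semicolon" and branch 2 as "no odd-length run of consecutive double quotes follows it".

```python
from itertools import groupby
from typing import Optional

def find_comment(line: str) -> Optional[str]:
    """Return the text from the first comment-starting semicolon to end of line, or None."""
    for i, ch in enumerate(line):
        if ch != ";":
            continue
        # Branch 1 (look behind): an even number of double quotes precedes the semicolon.
        even_before = line[:i].count('"') % 2 == 0
        # Branch 2 (negative look ahead): is there a run of consecutive
        # double quotes with odd length anywhere after the semicolon?
        odd_run_after = any(
            key == '"' and len(list(run)) % 2 == 1
            for key, run in groupby(line[i + 1:])
        )
        if even_before or not odd_run_after:
            return line[i:]
    return None

if __name__ == "__main__":
    print(find_comment('and "dont do ""it;here""" either ;peekaboo'))  # ;peekaboo
```

Run on the test lines with straight double quotes, this should return the same comments as the output listed above, including no result for line 11.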