Match comments unless the initiating char is surrounded by unescaped quotes

Time:12-03

I am trying to match comments that begin with a semicolon, unless the semicolon is surrounded by unescaped double quotes (see the test vectors below).

Note that the dquotes can be escaped by doubling them (""). Such escaped dquotes behave as completely different characters, i.e. they cannot surround the semicolon and disable its comment-starting function.

With Bubble's help, I have gotten as far as the regex below, which fails to correctly treat a trailing escaped dquote in the last test vector line.

^(?>(?:""[^""\n]*""|[^;""\n]+)*)""?[^"";\n]*(;.*)

Test vectors:

Peekaboo ; A comment starts with a semicolon and continues till the EOL
Unless the semicolon is surrounded by dquotes "Don't do it ; here" ;but match me; once
Im not surrounded "so pay attention to me" ; "peekaboo"
Im not surrounded "so pay attention" to;me" ; "peekaboo"
Im not surrounded "so pay attention to me ; peekaboo
Dquote escapes a dquote so "dont pay attention to ""me;here"" buster" do it ; here
Don't pay attention to  """me;here""" but do ""it;here""
and "dont do ""it;here"""  either ;peekaboo
but "pay attention to "it;here"" ;not here though
Simon said "I like goats" then he added "and sheep;" ;a good comment is "here
Simon said "I like goats" then he added "and sheep;" dont do it here
Simon said ""I like goats;"peekaboo
Simon said "I like goats;""peekaboo
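
For reference, the rule stated in the question can also be written as a small hand-rolled scanner instead of a regex. The Python sketch below is only one possible reading of that rule, not code from the question or the answers; the function names are hypothetical. It assumes straight double quotes, treats "" as ordinary text, skips any quoted span that is properly closed, and treats an unclosed quote as an ordinary character.

```python
from typing import Optional


def find_closing_quote(line: str, start: int) -> int:
    """Return the index of the first unescaped double quote at or after start, or -1."""
    i = start
    while i < len(line):
        if line[i] == '"':
            if line.startswith('""', i):
                i += 2          # doubled quote: an escaped quote, keep scanning
                continue
            return i            # lone quote: closes the quoted span
        i += 1
    return -1


def find_comment_start(line: str) -> int:
    """Return the index of the comment-starting semicolon, or -1 if there is none."""
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == ';':
            return i            # semicolon outside any closed quoted span
        if ch == '"':
            if line.startswith('""', i):
                i += 2          # escaped quote: treated as ordinary text
                continue
            close = find_closing_quote(line, i + 1)
            if close != -1:
                i = close + 1   # skip the quoted span, semicolons included
                continue
            # Unclosed quote: it cannot surround anything, so fall through.
        i += 1
    return -1


def extract_comment(line: str) -> Optional[str]:
    """Return the comment (from the semicolon to the end of line), or None."""
    start = find_comment_start(line)
    return None if start == -1 else line[start:]
```

Calling extract_comment on each test vector (typed with straight quotes) returns the text from the comment-starting semicolon onward, or None for a line such as the "dont do it here" vector, where the only semicolon sits inside a closed quoted span.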

CodePudding user response:

The task is to find comments starting with a ; semicolon outside quotes, taking into account "" escaped quotes and a possible unclosed quote before the semicolon. This approach works for the test cases provided so far.

^((?>(?:(?:[^;"\n]*"(?>(?:""|[^"\n]+)*)")+)?)[^";\n]*"?[^";\n]*);.*

See this demo at regex101. The first capturing group $1 contains the part up to the desired ; comment start. To remove the comment, replace the full match with the captured substring.

If replacements are done on single lines, all the \n newlines can be dropped from the pattern.
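
The replacement step might look like this in Python. This is a sketch under stated assumptions rather than the answer's exact pattern: Python's re module only supports atomic groups from version 3.11 on, so the outer atomic group is emulated here with a lookahead plus a backreference, and the inner one is replaced by requiring that the closing quote is not followed by another quote. The \n exclusions are dropped because the function is meant for single lines. The names COMMENT and strip_comment are hypothetical.

```python
import re

# Adapted pattern (an assumption of this sketch): (?>X) is emulated as (?=(X))\2.
COMMENT = re.compile(
    r'^('                                    # group 1: the text to keep
    r'(?=((?:[^;"]*"(?:""|[^"])*"(?!"))*))'  # group 2: all closed quoted spans
    r'\2'                                    # consume them with no backtracking
    r'[^";]*"?[^";]*'                        # plain text, at most one stray quote
    r');.*'                                  # the comment itself
)


def strip_comment(line: str) -> str:
    """Return the line without its comment; unchanged if no comment matches."""
    # Replacing the full match with group 1 drops everything from the ';' on.
    return COMMENT.sub(r'\1', line)
```

Any whitespace before the semicolon is part of the captured group, so strip_comment('Peekaboo ; A comment') keeps the trailing space; callers that do not want it can apply rstrip() to the result.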

regex part            matches
(?>...)               an atomic group, used to prevent any further backtracking
[^...]                a negated character class: a single character not in the listed set
(...) and (?:...)     capturing and non-capturing groups (the latter for repetition or alternation)

CodePudding user response:

C# Windows Console App - .NET Framework 4.8

using System;
using System.Text.RegularExpressions;

namespace RegExTest {

    internal class Program {

        private static string _Text = @"{1}Peekaboo ; A comment starts with a semicolon and continues till the EOL{1}
{2}Unless the semicolon is surrounded by dquotes ""Don't do it ; here"" ;but match me; once{2}
{3}Im not surrounded ""so pay attention to me"" ; ""peekaboo""{3}
{4}Im not surrounded ""so pay attention"" to;me"" ; ""peekaboo""{4}
{5}Im not surrounded ""so pay attention to me ; peekaboo{5}
{6}Dquote escapes a dquote so ""dont pay attention to """"me;here"""" buster"" do it ; here{6}
{7}Don't pay attention to  """"""me;here"""""" but do """"it;here""""{7}
{8}and ""dont do """"it;here""""""  either ;peekaboo{8}
{9}but ""pay attention to ""it;here"""" ;not here though{9}
{10}Simon said ""I like goats"" then he added ""and sheep;"" ;a good comment is ""here{10}
{11}Simon said ""I like goats"" then he added ""and sheep;"" dont do it here{11}
{12}Simon said """"I like goats;""peekaboo{12}
{13}Simon said ""I like goats;""""peekaboo{13}
";

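        private static void Main(string[] args) {
            // Split on both CRLF and LF so the result does not depend on how the source file was saved.
            foreach(var Line in _Text.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.RemoveEmptyEntries)) {
                // Match from the first comment-starting semicolon to the end of the line.
                Match Result = Regex.Match(Line, @"((?<=^([^""]*""[^""]*"")*[^""]*);|;(?!(.*[^""])?""("""")*([^""].*)?$)).*$");
                if(Result.Success) {
                    Console.WriteLine(Result.Value);
                }
            }
            Console.ReadLine(); // Keep the console window open.
        }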
    }
}

Output:

; A comment starts with a semicolon and continues till the EOL{1}
;but match me; once{2}
; "peekaboo"{3}
;me" ; "peekaboo"{4}
; peekaboo{5}
; here{6}
;here""{7}
;peekaboo{8}
;here"" ;not here though{9}
;a good comment is "here{10}
;"peekaboo{12}
;""peekaboo{13}

RegEx:

((?<=^([^""]*""[^""]*"")*[^""]*);|;(?!(.*[^""])?""("""")*([^""].*)?$)).*$

(                   // Keep | section separated from capture to line end
    (?<=            // Start look behind
        ^           // Must match starting at line start
        (           // Start of one pair of double quotes
            [^""]*  // Any run of non double quotes
            ""      // A double quote
            [^""]*  // Any run of non double quotes
            ""      // A double quote
        )*          // Any number of such pairs, i.e. an even number of double quotes
        [^""]*      // Trailing run of non double quotes
    )               // End look behind
    ;               // Match this semicolon if preceded by an even number of double quotes.
    |               // OR divider
    ;               // Otherwise match this semicolon only if no odd-length run of double quotes follows.
    (?!             // Negative look ahead: odd number of consecutive double quotes
        (           // Optional: Match anything or nothing followed by non double quote
            .*      // Anything or nothing
            [^""]   // Non double quote
        )?          // Optional
        ""          // Double quote
        ("""")*     // Any number of double quote pairs, or nothing
        (           // Optional: Match non double quote followed by anything or nothing
            [^""]   // Non double quote
            .*      // Anything or nothing
        )?          // Optional
        $           // Match line end
    )               // End of negative look ahead
)                   // End of OR division
.*$                 // Capture everything to line end
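
The two alternatives can also be restated without lookarounds. The Python sketch below is a port of the logic only, not of the regex itself, since Python's re module does not accept the variable-length look behind used above. The function name is hypothetical. It returns the first semicolon that has an even number of double quotes before it or that is not followed by any run of double quotes of odd length.

```python
import re


def find_comment_start_by_parity(line: str) -> int:
    """Return the index of the first semicolon accepted by either alternative, or -1."""
    for i, ch in enumerate(line):
        if ch != ';':
            continue
        # First alternative: an even number of double quotes precedes the semicolon.
        if line[:i].count('"') % 2 == 0:
            return i
        # Second alternative: no run of double quotes of odd length follows it.
        runs = re.findall(r'"+', line[i + 1:])
        if all(len(run) % 2 == 0 for run in runs):
            return i
    return -1
```

Slicing a line from the returned index gives the same text that Result.Value holds in the C# program; a return value of -1 corresponds to the line that produces no output there.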