RegEx: find all comments, but not within quotation marks

I am learning RegEx and have come up with an example to practice on: I designed a kind of "script language" (held in a string) that I want to parse and interpret.

PS: Yes, it is a bit reminiscent of LINQ, but that's just a coincidence.

The first thing I want to do is remove all the comments, because they shouldn't be interpreted.

I only look for comments like: /*...*/ and //...\n

However, these should of course not be matched when they appear inside quotation marks: "..." and '...'

But how can I use RegEx to find comments that are not inside quotation marks?

String:

  //get means only read, but not to mutate data
  Get(BooksWithAuthors)
    //default queries via mycel
    .Query()
      //junction table to pair books and authors
      .From(BookAuthor.As(BA))
      //main table for books
      .Join(left: Books.As(B) => B.Id == BA.BookId)
      //main table for authors
      .Join(left: Authors.As(A) => A.Id == BA.AuthorId)
      //groups by column, body allows to restore data (restructuring)
      .GroupBy(B.Id, => B.Authors.Add(A))
      //ignore still registerd data objects for the response
      .SelectIgnore(BA)
      //or select only that fields or objects you want to response
      .Select(B)
      .Foo("//wrong-comment-inside-quotes")
      .Foo('//wrong-comment-inside-single-quotes')
      .Foo('some /*wrong-comment*/ inside')
  ;

  //get means only read, but not to mutate data
  Get(BooksWithAuthorsByMethod)
    //using individual backside methods (created by own)
    .GetBooksWithAuthors(id:6, filter:{key:'minAuthorAge', value:17})
  ;

  /*
    comments
    "over"
    'multiply
    lines' //with wrong comments inside
  *\

RegEx:

.*[^'"].*([\/]{2}.*[\r\n|\r|\n]).*[^'"].*

(https://regex101.com/r/zPzBFj/1)

So far I have only tried it with //, but it does not find every occurrence, and it also matches comments inside quotation marks. Maybe ?! is not the right approach. But how can I do that?

I'm sure I'll have one or two more questions about this example. But as I said, I'm still learning RegEx, so step by step...

CodePudding user response：

This returns what you're looking for in the example; let me know if you find any edge cases. It matches quoted strings as well as comments, so you'll have to post-process the matches based on whether each one is a comment or a quoted string.

(?:(?:(\/)(\*)|(["'])).*?(?:\2\1|\3))|(?:\/\/[^\n]+)

https://regex101.com/r/uqx1cJ/1
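
The post-processing step could look roughly like the following Python sketch. It is only an illustration under a few assumptions: the line-comment branch is taken to end in [^\n]+, re.DOTALL is added so that /*...*/ comments may span several lines, and the helper name classify is made up for this example. It tells the two kinds of match apart by checking whether group 3 (the opening quote) took part in the match.

```python
import re

# The answer's pattern; "[^\n]+" in the line-comment branch is an assumption.
# re.DOTALL is also an assumption, added so block comments can span lines.
PATTERN = re.compile(
    r"""(?:(?:(\/)(\*)|(["'])).*?(?:\2\1|\3))|(?:\/\/[^\n]+)""",
    re.DOTALL,
)


def classify(source):
    """Split all matches into (comments, quoted_strings)."""
    comments, strings = [], []
    for m in PATTERN.finditer(source):
        # Group 3 holds the opening quote, so it is set only for quoted strings.
        (strings if m.group(3) is not None else comments).append(m.group(0))
    return comments, strings


comments, strings = classify('Foo("//no") /* yes */ // also\nBar(\'/*no*/\')')
print(comments)
print(strings)
```

Only the entries in comments would then be removed from the script; the quoted strings are matched solely so that their contents are skipped.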

CodePudding user response：

If you match the string with the regular expression

/'.*?'|".*?"|(\/\/[^\r\n]*|\/\*.*?\*\/)/gs

comments will be saved to capture group 1. The idea is to match but not capture what you don't want, and to match and capture what you do want; matches that are not captured can simply be ignored. Without the DOTALL flag (/s), periods match all characters other than line terminators; with the flag set, periods match all characters, including line terminators, which lets /*...*/ comments span several lines.
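
As a rough illustration of the capture-group idea, here is a minimal Python sketch. Python has no /.../gs literal, so the \/ escapes are dropped, re.DOTALL stands in for the s flag, and finditer plays the role of the g flag. The function name find_comments is an assumption made for this example.

```python
import re

# Same alternation as above: quoted strings are matched but not captured,
# comments are matched and captured in group 1.
PATTERN = re.compile(r"""'.*?'|".*?"|(//[^\r\n]*|/\*.*?\*/)""", re.DOTALL)


def find_comments(source):
    """Return only the captured matches, i.e. the real comments."""
    return [m.group(1) for m in PATTERN.finditer(source) if m.group(1) is not None]


sample = "x = \"// not\" // yes\ny = '/* no */' /* yes\n too */"
print(find_comments(sample))
```

Matches where group 1 is None are the quoted strings; they are skipped, which is what "pay no attention to matches that are not captured" amounts to in code.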

Demo

At the demo link matches that are not captured (not comments, so disregard) are shown in blue, whereas matches that are captured (comments) are shown in green.

The regular expression can be broken down as follows.

'.*?'       # match a single-quote followed by >= 0 chars, lazily,
            # followed by a single-quote
|           # or
".*?"       # match a double-quote followed by >= 0 chars, lazily,
            # followed by a double-quote
|           # or
(           # begin capture group 1
  \/\/      # match '//'
  [^\r\n]*  # match >= 0 chars other than line terminators
  |         # or
  \/\*      # match '/*'
    .*?     # match >= 0 chars, lazily
    \*\/    # match '*/'
)           # end capture group 1
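
The breakdown above maps almost directly onto a verbose-mode pattern. The sketch below is one possible way to use it in Python for the original goal of removing comments; re.VERBOSE and re.DOTALL are the Python counterparts assumed here, and strip_comments is a hypothetical helper name.

```python
import re

COMMENT_OR_STRING = re.compile(
    r"""
      '.*?'            # single-quoted string: matched, not captured
    | ".*?"            # double-quoted string: matched, not captured
    | (                # begin capture group 1 (a comment)
        //[^\r\n]*     #   line comment, up to the line terminator
      | /\*.*?\*/      #   block comment, lazily up to the closing marker
      )                # end capture group 1
    """,
    re.VERBOSE | re.DOTALL,
)


def strip_comments(source):
    """Drop captured comments; put every quoted string back unchanged."""
    return COMMENT_OR_STRING.sub(
        lambda m: "" if m.group(1) is not None else m.group(0), source
    )


script = ".Foo(\"//keep\") //drop\n.Bar('/*keep*/') /* drop\nme */;"
print(strip_comments(script))
```

The replacement callback is what keeps the quoted strings intact: an uncaptured match is replaced by itself, so only comments disappear.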

Here is an example of how this works. Suppose the string were as follows.

A dog "is // a\nman's" /* best */ 'friend /* so it */ is' // said

The regex engine performs the following steps.

Fail to match A.
Fail to match the space after A, then fail to match d, o, g and the following space.
Match but do not capture "is // a\nman's".¹
Fail to match the space.
Match and capture the comment /* best */.
Fail to match the space.
Match but do not capture 'friend /* so it */ is'.
Fail to match the space.
Match and capture the comment // said.

¹ After this match, the regex engine's string pointer is between the (last) double-quote just matched and the following space.
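
The walkthrough can be reproduced with a short Python sketch. It assumes that the \n in the example string stands for a real line feed and uses re.DOTALL in place of the s flag; the name trace is made up for this illustration. Positions where the engine fails to match are simply skipped by finditer, so only the four successful matches are listed.

```python
import re

PATTERN = re.compile(r"""'.*?'|".*?"|(//[^\r\n]*|/\*.*?\*/)""", re.DOTALL)

# The example string, with \n taken to be a real line feed (an assumption).
TEXT = "A dog \"is // a\nman's\" /* best */ 'friend /* so it */ is' // said"


def trace(source):
    """Return (matched_text, is_comment) for each match, in order."""
    return [(m.group(0), m.group(1) is not None) for m in PATTERN.finditer(source)]


for matched, is_comment in trace(TEXT):
    print("captured    " if is_comment else "not captured", repr(matched))
```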