Regex for rule "cannot have this character before and after"

Time:10-05

Question

I'm trying to match PowerShell hash comments (# ...) but not inline comments (<# ... #>) with the same regex. How can I achieve this?

Goal

Match

I'd like to match PowerShell comments that use the hash comment syntax, where everything after # is commented out. I use /#(.*$)/gm for it.

Test cases, with the expected regex match written inside brackets [..]:

  • Write-Host "Hello world" [# comment here]
  • [# A line with only comment]
  • Comment without whitespace[#before]
  • Comment with whitespace [#after ]

Do not match

However, I'd like to make an exception for the inline comment syntax. Inline comments in PowerShell look like lorem <# inline comment #> ipsus.

So here I'm looking for exclusions for:

  • Write-Host "Hello world" <# inline comment here #>
  • <# A line with only inline comment #>
  • Comment without whitespace<#no whitespace#>around
  • Inline comment <# in middle #> of line
  • Comment with whitespace #comment with >
  • Comment with whitespace #comment with <
  • Comment with whitespace #comment with <# test #>
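To make the problem concrete, here is a minimal TypeScript sketch of how the starting pattern behaves on the first case of each list. The helper name naiveHashMatches is made up for illustration; the pattern is the #(.*$) expression with the g and m flags mentioned above.

```typescript
// Hypothetical helper: returns every match of the starting pattern #(.*$)
// applied with the g and m flags.
function naiveHashMatches(source: string): string[] {
  const naive = new RegExp("#(.*$)", "gm");
  return source.match(naive) || [];
}

// Wanted: the trailing hash comment is found.
console.log(naiveHashMatches('Write-Host "Hello world" # comment here'));

// Unwanted: the pattern also fires on the # that belongs to <# ... #>.
console.log(naiveHashMatches('Write-Host "Hello world" <# inline comment here #>'));
```

The second call still returns a match that starts at the # of <#, which is the case the exclusions are meant to rule out.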

What I tried

I tried to use [^<>] in something like #[^<>](.*[^<>]$), but it did not work for all of the cases given above.

My progress on regex101 until I got stuck.

Why

I'm parsing PowerShell in a JavaScript/TypeScript runtime so that scripts can be inlined and run from batch (cmd) for a community-driven open-source project. I know there will be exceptions to this (like strings with hashes inside), but I'm trading robustness for simple regex parsing.

Thank you!

CodePudding user response:

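I suggest checking for < before the # char and converting all negated character classes into negative lookarounds. A negated character class such as [^<>] consumes a char and also matches line breaks, so it can cross line boundaries, while a lookaround only checks the neighboring char: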

#(?<!<#)(?![<>])(.*)$(?<![<>])
// Or, to also check for #> after <# use
#(?<!<#(?=.*#>))(?![<>])(.*)$(?<![<>])

See the regex demo. Remove the trailing (?<![<>]) negative lookbehind if you do not want the match to fail when the line ends with < or >.
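As a rough check, the first pattern can be run against the cases from the question in TypeScript. This is only a sketch: the helper findHashComments and its allowAngleAtEnd parameter are names invented here, and it assumes a runtime whose regex engine supports lookbehind (for example a recent Node.js) and the g and m flags from the question.

```typescript
// Hypothetical helper built around the first pattern from the answer.
// allowAngleAtEnd = true drops the trailing (?<![<>]) lookbehind.
function findHashComments(source: string, allowAngleAtEnd = false): string[] {
  const tail = allowAngleAtEnd ? "" : "(?<![<>])";
  // Built from a string; the pattern has no backslashes, so no escaping is needed.
  const re = new RegExp("#(?<!<#)(?![<>])(.*)$" + tail, "gm");
  return source.match(re) || [];
}

const shouldMatch = [
  'Write-Host "Hello world" # comment here',
  "# A line with only comment",
  "Comment without whitespace#before",
];
const shouldNotMatch = [
  'Write-Host "Hello world" <# inline comment here #>',
  "Comment without whitespace<#no whitespace#>around",
  "Comment with whitespace #comment with <# test #>",
];

for (const line of shouldMatch) {
  console.log("match:", findHashComments(line));
}
for (const line of shouldNotMatch) {
  console.log("no match:", findHashComments(line));
}
```

Calling findHashComments("Comment with whitespace #comment with >", true) shows the effect of removing the trailing lookbehind: the line then matches even though it ends with >.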

Details:

  • # - a # char
  • (?<!<#) - no <# is allowed immediately to the left of the current location. The check is placed after #, so the regex engine only runs it at positions right after a #, not at every position in the string. In the second pattern, the (?<!<#(?=.*#>)) lookbehind with a nested lookahead makes sure the # matched is not the second char of a <#...#> substring
  • (?![<>]) - immediately on the right, there must be no < or > char
  • (.*) - Group 1: any zero or more chars other than line break chars, as many as possible
  • $ - end of a line when the m flag is used, as in the question (end of string otherwise)
  • (?<![<>]) - the line must not end with a < or > char.
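The difference between the two patterns only shows up when a <# is never closed on the same line. The TypeScript sketch below compares them; the constant names PLAIN and WITH_CLOSE_CHECK and the helper matchesOf are invented for illustration, and a regex engine with lookbehind support plus the g and m flags are assumed.

```typescript
// The two patterns from the answer, kept as strings (no backslashes to escape).
const PLAIN = "#(?<!<#)(?![<>])(.*)$(?<![<>])";
const WITH_CLOSE_CHECK = "#(?<!<#(?=.*#>))(?![<>])(.*)$(?<![<>])";

// Hypothetical helper: all matches of a pattern, using the g and m flags.
function matchesOf(pattern: string, source: string): string[] {
  return source.match(new RegExp(pattern, "gm")) || [];
}

const closed = "a <# b #> c";
const unclosed = 'Write-Host "x" <# never closed';

// Both patterns skip a properly closed inline comment.
console.log(matchesOf(PLAIN, closed), matchesOf(WITH_CLOSE_CHECK, closed));

// PLAIN rejects any # preceded by <, closed or not.
console.log(matchesOf(PLAIN, unclosed));

// WITH_CLOSE_CHECK only rejects it when #> follows later on the line,
// so here the text from # onward is reported as a match.
console.log(matchesOf(WITH_CLOSE_CHECK, unclosed));
```

Which behavior is preferable depends on whether a lone <# should be treated as the start of a multi-line inline comment or as a hash comment that happens to follow a < char.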