How do I extract a string spanning multiple lines using regexp in a non greedy way?

Time:09-26

I'm exploring the possibilities of tagging Markdown files with JSON data structures.

The JSON data can be kept out of the rendered output by putting it in "comments"; see StackOverflow - Comments in Markdown

[json]:# (
[
    "json goes here"
]
)

To extract the JSON tags, I've been playing around with regular expressions and have come up with

\[json]:#.\(([^*]*)\)

However, this only works if there is only one [json] tag in the md file.
(See One tag)

With more than one tag, the regexp gets greedy and includes everything between the first and the last tag :/
(See Multiple tags)

This is sample code for reproducing the issue

$md = @'
[json]:# (
[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
)

# Title

## 1. Conclusion
Jada, jada

## 2. Recomendation
blah, blah

[json]:# (
[
    {"sensitivity": "internal"}
]
)

More data

[json]:# (
[
    {"uid": "abc002334"}
]
)
[json]:# (
[
    {"mode": "hallow"}
]
)
and this
'@

($md | Select-String '\[json]:#.\(([^*]*)\)').Matches.Value

No output is shown here, since it would be the complete md except for the last line.
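
The cause is the character class: [^*] matches every character except an asterisk, newlines and ) included, and * is greedy, so the pattern runs to the last ) in the text. A minimal check, assuming a trimmed two-tag version of the sample above (the variable name $greedy is only illustrative):

```powershell
# Trimmed sample: two [json] tags with a heading in between
$md = @'
[json]:# (
[
    {"jira": "proj-4753"}
]
)

# Title

[json]:# (
[
    {"sensitivity": "internal"}
]
)
and this
'@

# Even with -AllMatches, the greedy class yields one match covering both tags
$greedy = ($md | Select-String '\[json]:#.\(([^*]*)\)' -AllMatches).Matches

$greedy.Count                          # 1
$greedy[0].Value.Contains('# Title')   # True: the heading between the tags is included
```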

...

With multiple tags, selecting tag 2 should produce output like this:

($md | Select-String '<a working regexp>' -AllMatches).Matches[1].Value

[json]:# (
[
    {"sensitivity": "internal"}
]
)

($md | Select-String '<a working regexp>' -AllMatches).Matches.Value

[json]:# (
[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
)
[json]:# (
[
    {"sensitivity": "internal"}
]
)
[json]:# (
[
    {"uid": "abc002334"}
]
)
[json]:# (
[
    {"mode": "hallow"}
]
)

I could of course opt for allowing only one [json] tag per md file.
Another option is to keep each tag on a single line, but that would hamper readability.
Neither would make for very robust code, since two very likely scenarios (multiple tags and multi-line tags) would break it.

CodePudding user response:

I suggest using

(?sm)\[json]:#\s*\((.*?)^\)

See the regex demo. Details:

  • (?sm) - turns on two inline options: s (RegexOptions.Singleline, which lets . match line break chars) and m (RegexOptions.Multiline, which makes ^ match at any line start position and $ at any line end position)
  • \[json]:# - a literal [json]:# substring
  • \s* - zero or more whitespaces
  • \( - a ( char
  • (.*?) - Group 1: any zero or more chars as few as possible
  • ^\) - a ) char at the start of a line.
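
As a rough sketch of how this pattern could be applied to the question's data, the snippet below runs it with Select-String -AllMatches and reads Group 1 of the second tag. The trimmed $md sample and the names $pattern and $tags are illustrative assumptions, not part of the original answer:

```powershell
# Trimmed sample with two [json] tags
$md = @'
[json]:# (
[
    {"jira": "proj-4753"}
]
)

# Title

[json]:# (
[
    {"sensitivity": "internal"}
]
)
'@

# Lazy .*? stops at the first ")" that sits at the start of a line
$pattern = '(?sm)\[json]:#\s*\((.*?)^\)'
$tags = ($md | Select-String $pattern -AllMatches).Matches

$tags.Count                        # one match per tag
$tags[1].Value                     # the complete second tag
$tags[1].Groups[1].Value.Trim()    # only the JSON payload of the second tag
```

Because the closing ) is anchored to a line start, a ) inside the JSON itself does not end the match early, as long as it is not the first character on its line.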

CodePudding user response:

Since the JSON payload of a valid comment will always be followed by a newline \n and then a closing parenthesis ), use that as your end-of-pattern anchor:

if($md -match '(?s)\[json]:#.\(\s*(.*?)\s*\n\)'){
  $Matches[1]
}

(?s) is the regex engine option for "single-line mode"; it makes . match newline characters, allowing us to capture across multiple lines with .*?.

$Matches is an automatic variable that gets populated with all capture group values when a -match operation succeeds in scalar mode.

Result:

[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
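
Note that -match stops at the first match, so the snippet above returns only the first tag. One possible way to collect every payload with the same pattern is the static [regex]::Matches method; the sketch below assumes a trimmed sample, and the names $pattern and $payloads are illustrative:

```powershell
# Trimmed sample with two [json] tags
$md = @'
[json]:# (
[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
)

# Title

[json]:# (
[
    {"sensitivity": "internal"}
]
)
'@

$pattern = '(?s)\[json]:#.\(\s*(.*?)\s*\n\)'

# [regex]::Matches returns every match; keep only capture group 1 of each
$payloads = [regex]::Matches($md, $pattern) | ForEach-Object { $_.Groups[1].Value }

$payloads.Count                                # 2
($payloads[0] | ConvertFrom-Json)[0].jira      # proj-4753
```

Since the captured text is plain JSON, it can be handed directly to ConvertFrom-Json, as the last line does.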

CodePudding user response:

(?s) lets . match line endings too. The ? after * makes it a lazy match. In PowerShell 7, Select-String highlights only the first 3 lines as a match. I grouped what's in between, except for the line endings. The other way to go is a positive lookbehind and a positive lookahead.

'one
two
three
one
four
three' | select-string '(?s)one.(.*?).three'

This is highlighted:

one
two
three

$one = [regex]::escape('[json]:# (')
$three = [regex]::escape(')')
$md | select-string "(?s)$one.(.*?).$three"

Only this is highlighted:

[json]:# (
[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
)

Showing the group match:

$md | select-string "(?s)$one.(.*?).$three" | % matches | % groups | 
  % value | select -last 1

[
    {"jira": "proj-4753"},
    {"creation": "2021-09-25"}
]
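
The lookbehind/lookahead variant mentioned at the start of this answer might look like the sketch below. The exact pattern is an assumption, not taken from the answer: it expects a single whitespace character between # and (, and it tolerates both LF and CRLF line endings before the closing ).

```powershell
# Trimmed sample with two [json] tags
$md = @'
[json]:# (
[
    {"jira": "proj-4753"}
]
)

# Title

[json]:# (
[
    {"sensitivity": "internal"}
]
)
'@

# (?<=...) requires the opening marker before the match, (?=...) requires a
# line break plus ")" after it; neither is included in the match value
$pattern = '(?s)(?<=\[json]:#\s\().*?(?=\r?\n\))'
$found = ($md | Select-String $pattern -AllMatches).Matches

$found.Count              # 2
$found[1].Value.Trim()    # JSON payload of the second tag, no capture group needed
```

With lookarounds the match value is the payload itself, so there is no need to walk the groups collection.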