How to select lines between { and } inclusive in powershell using regex

Time:12-11

I am trying to remove text from a file between { and } inclusive but only if it contains the string "RG " - the third and fourth group in the following. (Note that if it is the last in the list it will not have the trailing comma.)

     "morphs" : [
        {
           "uid" : "AQ_Eun-Ju-Body-2022-31-7--15_02_30.592.vmi",
           "name" : "AQ_Eun-Ju-Body",
           "value" : "1"
        },
        {
           "uid" : "AQ_Eun-Ju-Head-2022-31-7--15_02_30.592.vmi",
           "name" : "AQ_Eun-Ju-Head",
           "value" : "1"
        },
        {
           "uid" : "RG side2side.vmi",
           "name" : "RG side2side",
           "value" : "-0.3332869"
        },
        {
           "uid" : "RG UpDown2.vmi",
           "name" : "RG UpDown2",
           "value" : "-0.3332869"
        }
     ]

I can get it to work with -replace '.*{\n.\*RG .\*\n.\*\n.\*\n.\*}.\*\n','' however if the group does not have three lines it fails because of the explicit linefeeds. I can create a replace for each number of lines but that seems clunky. I tried 'RG.\*?},\*\n' which gives me the last part, but I'm struggling with the first part.

This is what I have so far:


Get-ChildItem $VAMfixDir -recurse -include *.json,*.vap,*.vaj | Where-Object { $timestamp -lt $_.CreationTime } |
Foreach-Object {  
    $originalContent = $_ | Get-Content -Raw
    # *Potentially* perform replacements, depending on whether the search patterns are found.
    $potentiallyModifiedContent = $originalContent -Replace ".*{\n.*RG .*\n.*\n.*\n.*}.*\n|.*{\n.*RG .*\n.*\n.*}.*\n",""
    Set-Content -NoNewLine -Encoding Ascii -LiteralPath $_.FullName -Value $potentiallyModifiedContent
    }
    

EDIT: The file in the example above that I'm trying to edit IS a json file, but I'm trying to create a POWERSHELL script to remove groups of lines from it using REGEX. I have shown the regex that works but it has its limitations, as stated. I was hoping for a more elegant solution than a massive OR'd -Replace statement.

CodePudding user response:

  • As noted, it's generally preferable to use a dedicated parser and serializer for JSON data, namely ConvertFrom-Json and ConvertTo-Json.

  • However, regex-based transformations may be an option if you're looking to preserve the exact formatting of the input file and/or the desired transformations are syntactically limited in a way that allows them to be based on regexes reliably, which does appear to be the case here.
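For comparison, the parser-based route from the first bullet might look roughly like the sketch below. The sample data is a hypothetical, shortened stand-in for the file in the question; only the property names (morphs, uid) are taken from it, and the -Depth value is an arbitrary assumption.

```powershell
# Hypothetical sample mirroring the structure shown in the question.
$json = @'
{
  "morphs": [
    { "uid": "AQ_Body.vmi", "name": "AQ_Body", "value": "1" },
    { "uid": "RG side2side.vmi", "name": "RG side2side", "value": "-0.3332869" }
  ]
}
'@

$data = $json | ConvertFrom-Json

# Keep only the entries whose uid does not contain the word "RG".
# @(...) keeps the result an array even if only one entry remains.
$data.morphs = @($data.morphs | Where-Object { $_.uid -notmatch '\bRG\b' })

# Serialize again; -Depth guards against truncation of nested objects.
$result = $data | ConvertTo-Json -Depth 10
$result
```

Note that ConvertTo-Json applies its own indentation and spacing, so the output will not match the input file's layout. If preserving the exact formatting matters, the regex-based route from the second bullet applies instead: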

$potentiallyModifiedContent =
  $originalContent -Replace '(?:,\s*)?\{[^}]+\bRG\b[^}]+\}'

For a detailed explanation of the regex and the ability to experiment with it, see this regex101.com page.
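To try the approach locally, here is a minimal, self-contained sketch. It assumes the quantifier after each [^}] in the pattern is +, and it uses a shortened, hypothetical version of the question's data.

```powershell
# Sample input modeled on the question's data (values shortened).
$originalContent = @'
"morphs" : [
   {
      "uid" : "AQ_Body.vmi",
      "name" : "AQ_Body"
   },
   {
      "uid" : "RG side2side.vmi",
      "name" : "RG side2side",
      "value" : "-0.3332869"
   },
   {
      "uid" : "RG UpDown2.vmi",
      "name" : "RG UpDown2"
   }
]
'@

# (?:,\s*)?  optionally consumes the comma (and whitespace) before the block
# \{[^}]+    opening brace, then characters that are not a closing brace
# \bRG\b     the word "RG" somewhere inside that same block
# [^}]+\}    the rest of the block up to and including its closing brace
# Omitting the replacement operand replaces each match with the empty string.
$potentiallyModifiedContent =
  $originalContent -replace '(?:,\s*)?\{[^}]+\bRG\b[^}]+\}'

$potentiallyModifiedContent
```

Because the pattern contains no explicit \n, it should work regardless of how many lines a block has and of whether the file uses LF or CRLF line endings. Keep in mind that -replace is case-insensitive by default; use -creplace if only uppercase RG should match.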


As for what you tried:

.*{\n.\*RG .\*\n.\*\n.\*\n.\*}.\*\n

  • While not always necessary (it isn't in this case), it's best to routinely escape { and } characters meant to be taken literally, given that these characters are metacharacters used for quantifiers, with the proper syntax between them (e.g., {2} matches the previous subexpression exactly 2 times)

  • Conversely, if you do want * to be treated as a metacharacter (a quantifier matching the preceding subexpression zero or more times), do not escape it (as \*).

  • In general, you can use the SingleLine regex option to make . match newlines too, so that .* would match across lines. The simplest way to activate this option is to place (?s) at the start of the regex.

  • {\n.\*RG - if corrected to \{\n.*RG or even to the non-greedy \{\n.*?RG - is too permissive, as it will start matching at the first {, even if that block does not contain RG and keep matching across the end of that and potentially later ones until RG is found in a block.

  • Ultimately, it's best to use [^{]+ and [^}]+, as shown above, to match the characters after the opening { and before the closing }, which implicitly matches across lines too.
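The difference between the last two points can be seen in a small experiment. This is only an illustrative sketch: the one-line input string is made up, and [regex]::Match is used here simply to display what each pattern captures.

```powershell
# Made-up input: one block without RG, followed by one block with RG.
$text = '{ "name" : "AQ head" }, { "name" : "RG side" }'

# (?s) lets . match newlines too. Even the non-greedy .*? starts at the
# FIRST { and keeps going until it finds RG, swallowing the AQ block.
$tooPermissive = [regex]::Match($text, '(?s)\{.*?RG.*?\}').Value

# [^}]+ cannot cross a closing brace, so the match stays inside one block.
$confined = [regex]::Match($text, '\{[^}]+\bRG\b[^}]+\}').Value

$tooPermissive   # spans both blocks
$confined        # only the RG block
```

With the first pattern, a replacement would also delete the unrelated AQ block, which is why the negated character class is the safer choice here.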
