Remove xml comments based on xml tags inside the comments with Powershell

Time:02-01

I want to remove comments in xml files, based on the xml tags inside the comment, using PowerShell.
Constraints:

  • Multi line comments should be supported
  • Keep xml formatting (e.g. do not write everything into a single line or remove indents)
  • Keep file encoding

My function UncommentXmlNode should remove the <!-- ... --> delimiters and keep the <InnerXml>. My function UncommentMyTwoNodes should uncomment two different xml tags. There are two tests:

  1. it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml" passes
  2. it "uncomments both if both are in same comment" fails unless you insert (`n)?.* into the first pattern. In that case, test 1 breaks.

The tests are fairly easy to understand if you look at [xml]$expected and the two respective [xml]$inputXml values. The code below is a fully functional Pester test suite that reproduces my issue. You might have to create C:\temp or install Pester v5.

Import-Module Pester

Describe "Remove comments"{
    BeforeAll {
        function UncommentXmlNode {
            param (
                [String] $filePath,
                [String] $innerXmlToUncomment
            )
            $content = Get-Content $filePath -Raw
            $content -replace "<!--(?<InnerXml>$innerXmlToUncomment)-->", '${InnerXml}' | Set-Content -Path $filePath -Encoding utf8
        }

        function UncommentMyTwoNodes {
            param (
                [xml]$inputXml,
                [string]$inputXmlPath
            )
            UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<myFirstOutcommentedXml.*" # Add this to make second test work: (`n)?.*
            UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<mySecondOutcommentedXml.*"
        }

[xml]$expected = @"
<myXml>
  <!-- comment I want to keep -->
  <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
  }
    it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml"{
          [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
  <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@

      $tempPath = "C:\temp\test.xml"
      $inputXml.Save($tempPath)
      UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
      [xml]$result = Get-Content $tempPath
      $result.OuterXml | Should -be $expected.OuterXml
    }
  
    it "uncomments both if both are in same comment"{
        [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
      $tempPath = "C:\temp\test.xml"
      $inputXml.Save($tempPath)
      UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
      [xml]$result = Get-Content $tempPath
      $result.OuterXml | Should -be $expected.OuterXml
    }
  }

Answer:

I made some changes to your code to make it easier to test:

  • the functions work on plain strings instead of converting to [xml] and comparing .OuterXml
  • nothing is read from or written to disk
  • I've removed all the Pester testing code for the sake of clarity

So, here's some test data to work with:

$expected = @"
<myXml>
  <!-- comment I want to keep -->
  <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@

# two tags inside separate xml comments
$inputXml1 = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
  <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@

# two tags inside a single xml comment
$inputXml2 = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@

Here are the updated functions:

function UncommentXmlNode
{
    param
    (
        [string] $xml,
        [string] $uncomment
    )
    return $xml -replace "(?s)<!--(?<InnerXml><$uncomment.*?)-->", '${InnerXml}'
    #                     ^^^^                           ^^^
    #                     single-line (eats `n)          lazy / non-greedy
}

function UncommentMyTwoNodes
{
    param (
      [string] $xml
    )    
    $xml = UncommentXmlNode -xml $xml -uncomment "myFirstOutcommentedXml"
    $xml = UncommentXmlNode -xml $xml -uncomment "mySecondOutcommentedXml"
    return $xml
}

And here's some example usage:

(UncommentMyTwoNodes -xml $inputXml1) -eq $expected
# True

(UncommentMyTwoNodes -xml $inputXml2) -eq $expected
# True

The differences are:

  • enabling the single-line option in the regex - (?s) - "so that it matches every character, instead of matching every character except for the newline character \n". This lets .* run across line breaks, which the multi-line comment needs.

  • turning the greedy .* into a lazy .*? by appending ?. This is needed because otherwise, with (?s) enabled, the greedy .* runs to the end of the input and your --> matches the last instance in the input string. Making it lazy makes it match the first --> after the opening <!--.
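
To see the greedy / lazy difference in isolation, here is a minimal sketch. The $sample string and the tag names a and b are made up for illustration; they are not from your files.

```powershell
# Hypothetical sample: two separate comments on consecutive lines
$sample = "<!--<a />-->`n<!--<b />-->"

# Greedy: with (?s), .* runs to the end of the input and backtracks to the
# LAST -->, so the capture swallows the end of the first comment and all of
# the second one
$greedyInner = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*)-->').Groups['InnerXml'].Value

# Lazy: .*? stops at the FIRST --> after the opening <!--
$lazyInner = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*?)-->').Groups['InnerXml'].Value

$lazyInner    # <a />
```

$greedyInner spans both comments (it still contains the first --> and the second <!--), which is why the greedy version breaks your first test as soon as . is allowed to match newlines.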

This works for both your test cases now, but you might find other edge cases that still fail (for example if $uncomment contains regex metacharacters such as . or +)...
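
One way to guard against the metacharacter case is to pass the tag name through [regex]::Escape before building the pattern. This is only a sketch: the function name UncommentXmlNodeEscaped and the sample tag names are hypothetical.

```powershell
function UncommentXmlNodeEscaped
{
    param
    (
        [string] $xml,
        [string] $uncomment
    )
    # [regex]::Escape turns characters such as . or + into literals
    $name = [regex]::Escape($uncomment)
    return $xml -replace "(?s)<!--(?<InnerXml><$name.*?)-->", '${InnerXml}'
}

# Hypothetical tag name containing a dot (dots are legal in xml names)
$sample = '<!--<my.Tag a="1" />--><!--<myXTag a="1" />-->'
$result = UncommentXmlNodeEscaped -xml $sample -uncomment 'my.Tag'
$result   # <my.Tag a="1" /><!--<myXTag a="1" />-->
```

Without the escape, the . in my.Tag would match any character, so the myXTag comment would be uncommented as well.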


Epilogue

Treating xml as plain text isn't always the best plan. The above function will fail on simple pathological cases, for example:

  • Whitespace in the element text - e.g.:
<!--<   myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
     ^^^

A more robust approach would be to parse the xml and then process all the comment nodes to check their contents:

$comments = ([xml] "...").SelectNodes("//comment()")
foreach( $comment in $comments )
{
    ...
}
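
A fuller sketch of that idea follows. It is one possible way to fill in the loop, not a tested drop-in replacement: the function name UncommentByTagName and its parameters are hypothetical, it works on strings only (reading and writing files, and therefore the file encoding, is left out), and it assumes that PreserveWhitespace = $true is enough to keep your indentation.

```powershell
function UncommentByTagName
{
    param
    (
        [string]   $xml,
        [string[]] $tagNames
    )
    $doc = New-Object System.Xml.XmlDocument
    $doc.PreserveWhitespace = $true   # keep indentation and line breaks
    $doc.LoadXml($xml)

    # Copy the node list into an array, because the loop modifies the tree
    $comments = @($doc.SelectNodes('//comment()'))
    foreach ($comment in $comments)
    {
        # Try to parse the comment text as xml; comments whose text is not
        # well-formed xml are left untouched
        $fragment = $doc.CreateDocumentFragment()
        try { $fragment.InnerXml = $comment.InnerText } catch { continue }

        # Uncomment only if the first element inside is one of the wanted tags
        $first = $fragment.ChildNodes |
            Where-Object { $_ -is [System.Xml.XmlElement] } |
            Select-Object -First 1
        if (($null -ne $first) -and ($tagNames -contains $first.LocalName))
        {
            # Replacing with a fragment inserts all of the fragment's children
            $null = $comment.ParentNode.ReplaceChild($fragment, $comment)
        }
    }
    return $doc.OuterXml
}

UncommentByTagName -xml $inputXml2 -tagNames 'myFirstOutcommentedXml', 'mySecondOutcommentedXml'
```

Plain-text comments such as "comment I want to keep" parse as text with no element inside, so they stay as comments. Because the xml is re-serialized, details like attribute quoting may be normalised, so compare the output against your real files before relying on it.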