I want to remove comments in xml files based on the xml tags inside the comment with Powershell.
Constraints:
- Multi line comments should be supported
- Keep xml formatting (e.g. do not write everything into a single line or remove indents)
- Keep file encoding
My function UncommentXmlNode should remove the <!-- ... --> and keep the <InnerXml>.
My function UncommentMyTwoNodes should remove comments from two different xml tags.
You find two tests:
1. it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml" runs smoothly.
2. it "uncomments both if both are in same comment" fails unless you insert (`n)?.*. In that case, test 1 breaks.
The tests are fairly easy to understand if you look at [xml]$expected and the two respective [xml]$inputXml values. The code here is a fully functional Pester test suite to reproduce my issue. You might have to create C:\temp or install Pester v5.
Import-Module Pester
Describe "Remove comments"{
    BeforeAll {
        function UncommentXmlNode {
            param (
                [String] $filePath,
                [String] $innerXmlToUncomment
            )
            $content = Get-Content $filePath -Raw
            $content -replace "<!--(?<InnerXml>$innerXmlToUncomment)-->", '${InnerXml}' | Set-Content -Path $filePath -Encoding utf8
        }
        function UncommentMyTwoNodes {
            param (
                [xml]$inputXml,
                [string]$inputXmlPath
            )
            UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<myFirstOutcommentedXml.*" #Add this to make second test work (`n)?.*
            UncommentXmlNode -filePath $inputXmlPath -innerXmlToUncomment "<mySecondOutcommentedXml.*"
        }
        [xml]$expected = @"
<myXml>
  <!-- comment I want to keep -->
  <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
    }
    it "uncomments myFirstOutcommentedXml and mySecondOutcommentedXml"{
        [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
  <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
        $tempPath = "C:\temp\test.xml"
        $inputXml.Save($tempPath)
        UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
        [xml]$result = Get-Content $tempPath
        $result.OuterXml | Should -be $expected.OuterXml
    }
    it "uncomments both if both are in same comment"{
        [xml]$inputXml = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
        $tempPath = "C:\temp\test.xml"
        $inputXml.Save($tempPath)
        UncommentMyTwoNodes -inputXml $inputXml -inputXmlPath $tempPath
        [xml]$result = Get-Content $tempPath
        $result.OuterXml | Should -be $expected.OuterXml
    }
}
CodePudding user response:
I made some changes to your code to make it easier to test:
- first of all, just working with plain strings without converting to [xml] and calling .OuterXml
- second, just working with plain strings and not reading / writing to disk
- I've also removed all the Pester testing code for the sake of clarity
So, here's some test data to work with:
$expected = @"
<myXml>
  <!-- comment I want to keep -->
  <myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
# two tags inside separate xml comments
$inputXml1 = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
  <!--<mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
# two tags inside a single xml comment
$inputXml2 = @"
<myXml>
  <!-- comment I want to keep -->
  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />
  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->
  <myOtherXmlTag attributeC="value" />
  <!-- comment I want to keep -->
</myXml>
"@
Here are the updated functions:
function UncommentXmlNode
{
    param
    (
        [string] $xml,
        [string] $uncomment
    )
    return $xml -replace "(?s)<!--(?<InnerXml><$uncomment.*?)-->", '${InnerXml}'
    #                     ^^^^                           ^^^
    #                     single-line (eats `n)          lazy / non-greedy
}
function UncommentMyTwoNodes
{
    param (
        [string] $xml
    )
    $xml = UncommentXmlNode -xml $xml -uncomment "myFirstOutcommentedXml"
    $xml = UncommentXmlNode -xml $xml -uncomment "mySecondOutcommentedXml"
    return $xml
}
And here's some example usage:
(UncommentMyTwoNodes -xml $inputXml1) -eq $expected
# True
(UncommentMyTwoNodes -xml $inputXml2) -eq $expected
# True
The differences are:
- enabling the single-line option in the regex - (?s) - "so that it matches every character, instead of matching every character except for the newline character \n"
- turning the greedy .* into a lazy .*? by adding a lazy quantifier. This is needed because otherwise (?s) above causes your --> to match the last instance in the input string. Changing it to lazy makes it match the first --> after the opening <!--.
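To make the greedy / lazy difference concrete, here is a minimal sketch on a made-up sample string ($sample, $greedy and $lazy are names invented for this illustration, not part of the code above). It only shows what the InnerXml group captures in each case:

```powershell
# Hypothetical sample: two separate comments with an element in between.
$sample = "<!--<a />-->`n<b />`n<!--<c />-->"

# Greedy: with (?s), .* runs to the end of the string and backtracks to the LAST -->,
# so one match spans both comments and everything between them.
$greedy = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*)-->').Groups['InnerXml'].Value

# Lazy: .*? stops at the FIRST --> after the opening <!--.
$lazy = [regex]::Match($sample, '(?s)<!--(?<InnerXml><a.*?)-->').Groups['InnerXml'].Value

$greedy   # three lines: <a />-->  /  <b />  /  <!--<c />
$lazy     # <a />
```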
This works for both your test cases now, but you might find other edge-cases that still fail (including if $uncomment contains regex escape chars)...
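One way to guard against the regex-metacharacter case is to pass the tag name through [regex]::Escape before building the pattern. The following is only a sketch: UncommentXmlNodeEscaped is a hypothetical name for a variant of the function above, and the tag name my.Tag is made up to show the effect.

```powershell
function UncommentXmlNodeEscaped   # hypothetical variant, not the original function
{
    param
    (
        [string] $xml,
        [string] $uncomment
    )
    # Escape characters such as . or + so they are matched literally.
    $escaped = [regex]::Escape($uncomment)
    return $xml -replace "(?s)<!--(?<InnerXml><$escaped.*?)-->", '${InnerXml}'
}

UncommentXmlNodeEscaped -xml '<!--<my.Tag a="1" />-->' -uncomment 'my.Tag'
# <my.Tag a="1" />

# Without escaping, the . would also match any other character, e.g. myXTag.
UncommentXmlNodeEscaped -xml '<!--<myXTag a="1" />-->' -uncomment 'my.Tag'
# <!--<myXTag a="1" />-->
```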
Epilogue
Treating xml as plain text isn't always the best plan. For example, the above function fails on simple pathological cases, such as:
- Whitespace in the element text - e.g.:
<!--< myFirstOutcommentedXml attributeA="xy" attributeB="true" />-->
    ^^^
A more robust approach would be to parse the xml and then process all the comment nodes to check their contents:
$comments = ([xml] "...").SelectNodes("//comment()")
foreach( $comment in $comments )
{
    ...
}
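As a rough sketch of what that loop could look like (not a definitive implementation): the function name UncommentXmlNodesByName and its parameters are hypothetical, and it assumes every matching comment contains well-formed xml. It works on a string, so reading and writing the file and preserving its encoding are left out. PreserveWhitespace is set so that indentation and line breaks survive.

```powershell
function UncommentXmlNodesByName   # hypothetical helper name
{
    param
    (
        [string]   $xml,
        [string[]] $names
    )
    $doc = New-Object System.Xml.XmlDocument
    $doc.PreserveWhitespace = $true   # keep indentation and line breaks
    $doc.LoadXml($xml)

    # Copy the node list into an array first, because the loop modifies the tree.
    foreach ($comment in @($doc.SelectNodes('//comment()')))
    {
        # Only touch comments whose text starts with one of the requested tags.
        $text = $comment.Value.TrimStart()
        $hit = $names | Where-Object { $text -match ('^<' + [regex]::Escape($_) + '[\s/>]') }
        if (-not $hit) { continue }

        # Parse the comment text as xml and swap it in for the comment node.
        $fragment = $doc.CreateDocumentFragment()
        $fragment.InnerXml = $comment.Value   # throws if the text is not well-formed xml
        [void] $comment.ParentNode.ReplaceChild($fragment, $comment)
    }
    return $doc.OuterXml
}

# Made-up sample with both tags inside a single comment.
$sample = @(
    '<myXml>'
    '  <!-- comment I want to keep -->'
    '  <!--<myFirstOutcommentedXml attributeA="xy" attributeB="true" />'
    '  <mySecondOutcommentedXml attributeA="xy" attributeB="true" />-->'
    '</myXml>'
) -join "`n"

UncommentXmlNodesByName -xml $sample -names 'myFirstOutcommentedXml', 'mySecondOutcommentedXml'
```

Because the comment text is parsed rather than pattern-matched, a malformed comment raises an error instead of silently producing broken xml.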