Home > Mobile >  Delete duplicates from XML but keep lines when attribute value is blank
Delete duplicates from XML but keep lines when attribute value is blank

Time:06-10

I have an XML file of below structure and want to delete all lines that contain a duplicate VATRegistrationNumber but I still need to keep all lines that contain an empty VATRegistrationNumber.

<Suppliers xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns="urn:supplier:erp-integration">
  <Supplier SupplierNumber="300008755" CorporateGroupID="1000" Name1="PETER GOW" City="LISMORE" VATRegistrationNumber="60185579027" />
  <Supplier SupplierNumber="300008754" CorporateGroupID="1000" Name1="DANIEL PETRUSHNKO" City="SYDNEY" VATRegistrationNumber="52216736429" />
  <Supplier SupplierNumber="300008753" CorporateGroupID="1000" Name1="PEOPLE2PEOPLE" City="SYDNEY" VATRegistrationNumber="" />
  <Supplier SupplierNumber="300008752" CorporateGroupID="1000" Name1="PEARL CLEANING" City="WOONONA" VATRegistrationNumber="29169648828" />
  <Supplier SupplierNumber="300008751" CorporateGroupID="1000" Name1="PATTO'S PAINTLESS" City="TAMWORTH" VATRegistrationNumber="30482156828" />
  <Supplier SupplierNumber="300008750" CorporateGroupID="1000" Name1="PATRICK ST CLINIC" City="PENGUIN" VATRegistrationNumber="53220182389" />
  <Supplier SupplierNumber="300008749" CorporateGroupID="1000" Name1="MITCHELL PAISH" City="SYDNEY" VATRegistrationNumber="42091590585" />
  <Supplier SupplierNumber="300008748" CorporateGroupID="1000" Name1="PACNET INTERNET" City="MELBOURNE" VATRegistrationNumber="69085213690" />
  <Supplier SupplierNumber="300008747" CorporateGroupID="1000" Name1="NICHOLAS J OWENS" City="SYDNEY" VATRegistrationNumber="32168537004" />
  <Supplier SupplierNumber="300008746" CorporateGroupID="1000" Name1="JOHN O'SULLIVAN" City="SYDNEY" VATRegistrationNumber="51136629180" />
  <Supplier SupplierNumber="300008745" CorporateGroupID="1000" Name1="OSTAR ELECTRICAL" City="BANGALOW" VATRegistrationNumber="47196051022" />
  <Supplier SupplierNumber="300008756" CorporateGroupID="1000" Name1="PKL COURIERS" City="PALM BEACH" VATRegistrationNumber="90690407812" />
  <Supplier SupplierNumber="9008788" CorporateGroupID="2000" Name1="MICHAEL JOHN MCMANUS" City="COBAR" VATRegistrationNumber="51136629180" />
  <Supplier SupplierNumber="9008787" CorporateGroupID="2000" Name1="MOREE PROPERTY" City="MOREE" VATRegistrationNumber="52216736429" />
  <Supplier SupplierNumber="9008786" CorporateGroupID="2000" Name1="PATRICIA GOULD" City="MUDGEE" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008785" CorporateGroupID="2000" Name1="KAREN DRUCE" City="Warrimoo" VATRegistrationNumber="42091590585" />
  <Supplier SupplierNumber="9008784" CorporateGroupID="2000" Name1="SOLAR SOLUTIONS" City="MUDGEE" VATRegistrationNumber="65164304054" />
  <Supplier SupplierNumber="9008783" CorporateGroupID="2000" Name1="ST GEORGES CHURCH" City="Hurstville" VATRegistrationNumber="79115673239" />
  <Supplier SupplierNumber="9008782" CorporateGroupID="2000" Name1="SHARED VALUE" City="PORT MELBOURNE" VATRegistrationNumber="32168537004" />
  <Supplier SupplierNumber="9008781" CorporateGroupID="2000" Name1="ALLEN KIRK" City="STUARTS POINT" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008775" CorporateGroupID="2000" Name1="NATIONAL REMOVALS" City="ASHMORE" VATRegistrationNumber="19748626605" />
  <Supplier SupplierNumber="9008774" CorporateGroupID="2000" Name1="INLAND CONTAINER" City="DUBBO" VATRegistrationNumber="78001638777" />
  <Supplier SupplierNumber="9008814" CorporateGroupID="2000" Name1="CTARS PTY LTD" City="SYDNEY" VATRegistrationNumber="59603816593" />
  <Supplier SupplierNumber="9008813" CorporateGroupID="2000" Name1="JENNY &amp; MARSH" City="LAMBTON" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008812" CorporateGroupID="2000" Name1="ACDMA CENTRE" City="Canley Vale" VATRegistrationNumber="58002378731" />
  <Supplier SupplierNumber="9008999" CorporateGroupID="2000" Name1="CENTRAL HOTEL" City="Sydney" VATRegistrationNumber="47196051022" />
  <Supplier SupplierNumber="9008969" CorporateGroupID="2000" Name1="ALES LICHNER" City="BANGOR" VATRegistrationNumber="" />
</Suppliers>

Here is the powershell code that I have so far but it is currently skipping lines containing an empty VATRegistrationNumber which I need to keep.

[xml]$myfile = [xml](Get-content "input.xml")
$nodesToRemove = $myfile.suppliers.childnodes | Group-Object VATRegistrationNumber |
    Foreach-Object { $_.Group | Select-Object -Skip 1 }
    
$nodesToRemove | Foreach-Object {
    $myfile.suppliers.RemoveChild($_)
}
#save XML
$myfile.Save("result.xml")

CodePudding user response:

To stop also deleting nodes with empty VATRegistrationNumber attributes, you can do as below:

# use full, absolute paths for these
$inputFile  = 'X:\somewhere\input.xml'
$outputFile = 'X:\somewhere\result.xml'
# load the xml file. This way, you are ensured to get the file encoding correct
$xmlDoc = [System.Xml.XmlDocument]::new()
$xmlDoc.Load($inputFile)

# group nodes on the not empty VATRegistrationNumber attribute and store only those that have duplicates
$nodes = $xmlDoc.suppliers.childnodes | Where-Object { $_.VATRegistrationNumber -match '\S' } | 
         Group-Object VATRegistrationNumber | Where-Object { $_.Count -gt 1 }
# loop over the childnodes
foreach ($group in $nodes) {
    $group.Group | Select-Object -Skip 1 | ForEach-Object {
        [void]$_.ParentNode.RemoveChild($_)
    }
}
# finally save the new xml
$xmlDoc.Save($outputFile)
  • Related