Home > Software engineering >  Merge xml files and delete duplicates when certain line field value already exists
Merge xml files and delete duplicates when certain line field value already exists

Time:06-06

I have two XML files of the same structure, each with over 100,000 entries and would like to merge them. File1 shall be my source file and I only would like to add entries from File2 when the value of field VATRegistrationNumber does not already exists in File1. Lines with an empty VATRegistrationNumber still need to be added to the output file. I'm after a resourceful solution that does not use a lot of memory as I would need to run the merger on a daily basis in a Windows task scheduler. So I was thinking either XSLT, powershell or batch.

File1:

  <Supplier SupplierNumber="300008755" CorporateGroupID="1000" Name1="PETER GOW" City="LISMORE" VATRegistrationNumber="60185579027" />
  <Supplier SupplierNumber="300008754" CorporateGroupID="1000" Name1="DANIEL PETRUSHNKO" City="SYDNEY" VATRegistrationNumber="52216736429" />
  <Supplier SupplierNumber="300008753" CorporateGroupID="1000" Name1="PEOPLE2PEOPLE" City="SYDNEY" VATRegistrationNumber="" />
  <Supplier SupplierNumber="300008752" CorporateGroupID="1000" Name1="PEARL CLEANING" City="WOONONA" VATRegistrationNumber="29169648828" />
  <Supplier SupplierNumber="300008751" CorporateGroupID="1000" Name1="PATTO'S PAINTLESS" City="TAMWORTH" VATRegistrationNumber="30482156828" />
  <Supplier SupplierNumber="300008750" CorporateGroupID="1000" Name1="PATRICK ST CLINIC" City="PENGUIN" VATRegistrationNumber="53220182389"  />
  <Supplier SupplierNumber="300008749" CorporateGroupID="1000" Name1="MITCHELL PAISH" City="SYDNEY" VATRegistrationNumber="42091590585" />
  <Supplier SupplierNumber="300008748" CorporateGroupID="1000" Name1="PACNET INTERNET" City="MELBOURNE" VATRegistrationNumber="69085213690" />
  <Supplier SupplierNumber="300008747" CorporateGroupID="1000" Name1="NICHOLAS J OWENS" City="SYDNEY" VATRegistrationNumber="32168537004" />
  <Supplier SupplierNumber="300008746" CorporateGroupID="1000" Name1="JOHN O'SULLIVAN" City="SYDNEY" VATRegistrationNumber="51136629180" />
  <Supplier SupplierNumber="300008745" CorporateGroupID="1000" Name1="OSTAR ELECTRICAL" City="BANGALOW" VATRegistrationNumber="47196051022" />
  <Supplier SupplierNumber="300008756" CorporateGroupID="1000" Name1="PKL COURIERS" City="PALM BEACH" VATRegistrationNumber="90690407812" />

File2:

  <Supplier SupplierNumber="9008788" CorporateGroupID="2000" Name1="MICHAEL JOHN MCMANUS" City="COBAR" VATRegistrationNumber="51136629180" />
  <Supplier SupplierNumber="9008787" CorporateGroupID="2000" Name1="MOREE PROPERTY" City="MOREE" VATRegistrationNumber="52216736429" />
  <Supplier SupplierNumber="9008786" CorporateGroupID="2000" Name1="PATRICIA GOULD" City="MUDGEE" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008785" CorporateGroupID="2000" Name1="KAREN DRUCE" City="Warrimoo" VATRegistrationNumber="42091590585" />
  <Supplier SupplierNumber="9008784" CorporateGroupID="2000" Name1="SOLAR SOLUTIONS" City="MUDGEE" VATRegistrationNumber="65164304054" />
  <Supplier SupplierNumber="9008783" CorporateGroupID="2000" Name1="ST GEORGES CHURCH" City="Hurstville" VATRegistrationNumber="79115673239" />
  <Supplier SupplierNumber="9008782" CorporateGroupID="2000" Name1="SHARED VALUE" City="PORT MELBOURNE" VATRegistrationNumber="32168537004" />
  <Supplier SupplierNumber="9008781" CorporateGroupID="2000" Name1="ALLEN KIRK" City="STUARTS POINT" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008775" CorporateGroupID="2000" Name1="NATIONAL REMOVALS" City="ASHMORE" VATRegistrationNumber="19748626605" />
  <Supplier SupplierNumber="9008774" CorporateGroupID="2000" Name1="INLAND CONTAINER" City="DUBBO" VATRegistrationNumber="78001638777" />
  <Supplier SupplierNumber="9008814" CorporateGroupID="2000" Name1="CTARS PTY LTD" City="SYDNEY" VATRegistrationNumber="59603816593" />
  <Supplier SupplierNumber="9008813" CorporateGroupID="2000" Name1="JENNY &amp; MARSH" City="LAMBTON" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008812" CorporateGroupID="2000" Name1="ACDMA CENTRE" City="Canley Vale" VATRegistrationNumber="58002378731" />
  <Supplier SupplierNumber="9008999" CorporateGroupID="2000" Name1="CENTRAL HOTEL" City="Sydney" VATRegistrationNumber="47196051022" />
  <Supplier SupplierNumber="9008969" CorporateGroupID="2000" Name1="ALES LICHNER" City="BANGOR" VATRegistrationNumber="" />

Expected Output file:

  <Supplier SupplierNumber="300008755" CorporateGroupID="1000" Name1="PETER GOW" City="LISMORE" VATRegistrationNumber="60185579027" />
  <Supplier SupplierNumber="300008754" CorporateGroupID="1000" Name1="DANIEL PETRUSHNKO" City="SYDNEY" VATRegistrationNumber="52216736429" />
  <Supplier SupplierNumber="300008753" CorporateGroupID="1000" Name1="PEOPLE2PEOPLE" City="SYDNEY" VATRegistrationNumber="" />
  <Supplier SupplierNumber="300008752" CorporateGroupID="1000" Name1="PEARL CLEANING" City="WOONONA" VATRegistrationNumber="29169648828" />
  <Supplier SupplierNumber="300008751" CorporateGroupID="1000" Name1="PATTO'S PAINTLESS" City="TAMWORTH" VATRegistrationNumber="30482156828" />
  <Supplier SupplierNumber="300008750" CorporateGroupID="1000" Name1="PATRICK ST CLINIC" City="PENGUIN" VATRegistrationNumber="53220182389"  />
  <Supplier SupplierNumber="300008749" CorporateGroupID="1000" Name1="MITCHELL PAISH" City="SYDNEY" VATRegistrationNumber="42091590585" />
  <Supplier SupplierNumber="300008748" CorporateGroupID="1000" Name1="PACNET INTERNET" City="MELBOURNE" VATRegistrationNumber="69085213690" />
  <Supplier SupplierNumber="300008747" CorporateGroupID="1000" Name1="NICHOLAS J OWENS" City="SYDNEY" VATRegistrationNumber="32168537004" />
  <Supplier SupplierNumber="300008746" CorporateGroupID="1000" Name1="JOHN O'SULLIVAN" City="SYDNEY" VATRegistrationNumber="51136629180" />
  <Supplier SupplierNumber="300008745" CorporateGroupID="1000" Name1="OSTAR ELECTRICAL" City="BANGALOW" VATRegistrationNumber="47196051022" />
  <Supplier SupplierNumber="300008756" CorporateGroupID="1000" Name1="PKL COURIERS" City="PALM BEACH" VATRegistrationNumber="90690407812" />
  <Supplier SupplierNumber="9008786" CorporateGroupID="2000" Name1="PATRICIA GOULD" City="MUDGEE" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008784" CorporateGroupID="2000" Name1="SOLAR SOLUTIONS" City="MUDGEE" VATRegistrationNumber="65164304054" />
  <Supplier SupplierNumber="9008783" CorporateGroupID="2000" Name1="ST GEORGES CHURCH" City="Hurstville" VATRegistrationNumber="79115673239" />
  <Supplier SupplierNumber="9008781" CorporateGroupID="2000" Name1="ALLEN KIRK" City="STUARTS POINT" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008775" CorporateGroupID="2000" Name1="NATIONAL REMOVALS" City="ASHMORE" VATRegistrationNumber="19748626605" />
  <Supplier SupplierNumber="9008774" CorporateGroupID="2000" Name1="INLAND CONTAINER" City="DUBBO" VATRegistrationNumber="78001638777" />
  <Supplier SupplierNumber="9008814" CorporateGroupID="2000" Name1="CTARS PTY LTD" City="SYDNEY" VATRegistrationNumber="59603816593" />
  <Supplier SupplierNumber="9008813" CorporateGroupID="2000" Name1="JENNY &amp; MARSH" City="LAMBTON" VATRegistrationNumber="" />
  <Supplier SupplierNumber="9008812" CorporateGroupID="2000" Name1="ACDMA CENTRE" City="Canley Vale" VATRegistrationNumber="58002378731" />
  <Supplier SupplierNumber="9008969" CorporateGroupID="2000" Name1="ALES LICHNER" City="BANGOR" VATRegistrationNumber="" />

Happy for any help I can get. Thanks

CodePudding user response:

@jana I have some similar situations, with rather large files, and the solution I came up with is to create two arrays of the VATRegistration numbers, Then you can quickly loop through the second array, testing for each VAT# in the first array, and import the Supplier element if not found. You minimize XML work by doing your decision making with simple arrays. The issue becomes, do you have duplicates in the files now, and what do you do with elements with a blank VAT#, which I see does occur.

CodePudding user response:

Let the XSL transformation declare a key and two variables for the two files:

<xsl:key name="vat" match="Supplier[@VATRegistrationNumber!='']" use="@VATRegistrationNumber"/>
<xsl:variable name="file1" select="document('file1')"/>
<xsl:variable name="file2" select="document('file2')"/>

The condition in the match ensures that a Supplier with empty VATRegistrationNumber has no key.

Then the desired Suppliers can be output with the following statements:

<xsl:copy-of select="$file1//Supplier"/>
<xsl:for-each select="$file2//Supplier">
  <!-- Do not output this Supplier if its key exists in $file1. -->
  <xsl:if test="$file1[not(key('vat',current()/@VATRegistrationNumber))]">
    <xsl:copy-of select="."/>
  </xsl:if>
</xsl:for-each>

The key condition is enclosed in $file1[...] in order to switch the context node to file 1, because that is the context in which the key is determined.

  • Related