How to select only one xml tag from the duplicates in the whole file regardless of parents

Time:11-23

I want to keep only the first occurrence of codes.head with md.mnem="ht1" and the first occurrence with md.mnem="ht1c" in the whole file, regardless of the parent element. My XML file looks like this:

<printArtifactGroup>
    <!--Pubtags   : [ANIP , AN , ANIP, AN]Sourcetags: [21, 21-A1]-->
    <bov ID="I2C37E8404E1711DF8062B84BC6F3033A" legacy.identifier="000321783">
        <placeholder ID="I2C3836604E1711DF8062B84BC6F3033A" md.mnem="vols">
            <placeholder.text>0390 V. 0390 Ch. 75, Arts. 42-end (2008)</placeholder.text>
        </placeholder>
        <head.block ID="I2C385D704E1711DF8062B84BC6F3033A">
            <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
                <head.info>
                    <headtext>
                        <ital>Wests pso1_1</ital>
                    </headtext>
                </head.info>
            </codes.head>
            <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
                <head.info>
                    <headtext> pso1_2</headtext>
                </head.info>
            </codes.head>
            <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
                <placeholder.text>UL</placeholder.text>
            </placeholder>
        </head.block>
        <head.block ID="I2C38D2A24E1711DF8062B84BC6F3033A">
            <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
                <head.info>
                    <label.name>CHAPTER</label.name>
                    <label.designator>75 pso1_4</label.designator>
                </head.info>
            </codes.head>
            <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
                <head.info>
                    <label.name>CHAPTER duplicate</label.name>
                    <label.designator>75 pso1_5</label.designator>
                </head.info>
            </codes.head>
            <codes.head ID="I2C38F9B14E1711DF8062B84BC6F3033A" md.mnem="hg2c">
                <head.info>
                    <headtext> pso1_6</headtext>
                </head.info>
            </codes.head>
            <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
                <placeholder.text>UL</placeholder.text>
            </placeholder>
        </head.block>
    </bov>
    <grade.content legacy.identifier="018840438" ID="I2C3158904E1711DFAB97E78B3969CA63">
        <head.block ID="I2C31CDC04E1711DFAB97E78B3969CA63">
            <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
                <head.info>
                    <headtext>
                        <ital>pso1</ital>
                    </headtext>
                </head.info>
            </codes.head>
            <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
                <head.info>
                    <headtext>pso2</headtext>
                </head.info>
            </codes.head>
            <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="srnl">
                <head.info>
                    <headtext>pso 4</headtext>
                </head.info>
            </codes.head>
        </head.block>
    </grade.content>
</printArtifactGroup>

My XSLT script is:

<xsl:template match="codes.head">
    <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
</xsl:template>
<xsl:template
        match="codes.head[@md.mnem[starts-with(.,'ht1')]][position() > 2]"/>

The output I'm getting:

<?xml version="1.0" encoding="UTF-8"?>
<printArtifactGroup><!--Pubtags   : [ANIP , AN , ANIP, AN]Sourcetags: [21, 21-A1]-->
   <bov ID="I2C37E8404E1711DF8062B84BC6F3033A" legacy.identifier="000321783">
      <placeholder ID="I2C3836604E1711DF8062B84BC6F3033A" md.mnem="vols">
         <placeholder.text>0390 V. 0390 Ch. 75, Arts. 42-end (2008)</placeholder.text>
      </placeholder>
      <head.block ID="I2C385D704E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
            <head.info>
               <headtext>
                  <ital>Wests pso1_1</ital>
               </headtext>
            </head.info>
         </codes.head>
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
            <head.info>
               <headtext> pso1_2</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
      <head.block ID="I2C38D2A24E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER</label.name>
               <label.designator>75 pso1_4</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER duplicate</label.name>
               <label.designator>75 pso1_5</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B14E1711DF8062B84BC6F3033A" md.mnem="hg2c">
            <head.info>
               <headtext> pso1_6</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
   </bov>
   <grade.content legacy.identifier="018840438" ID="I2C3158904E1711DFAB97E78B3969CA63">
      <head.block ID="I2C31CDC04E1711DFAB97E78B3969CA63">
         <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
            <head.info>
               <headtext>
                  <ital>pso1</ital>
               </headtext>
            </head.info>
         </codes.head>
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
            <head.info>
               <headtext>pso2</headtext>
            </head.info>
         </codes.head>
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="srnl">
            <head.info>
               <headtext>pso 4</headtext>
            </head.info>
         </codes.head>
      </head.block>
   </grade.content>
</printArtifactGroup>

This keeps the first ht1 and ht1c in every head.block rather than only the first in the whole file. What is the correct way to keep only the first occurrence in the whole file?

Desired output:

<?xml version="1.0" encoding="UTF-8"?>
<printArtifactGroup><!--Pubtags   : [ANIP , AN , ANIP, AN]Sourcetags: [21, 21-A1]-->
   <bov ID="I2C37E8404E1711DF8062B84BC6F3033A" legacy.identifier="000321783">
      <placeholder ID="I2C3836604E1711DF8062B84BC6F3033A" md.mnem="vols">
         <placeholder.text>0390 V. 0390 Ch. 75, Arts. 42-end (2008)</placeholder.text>
      </placeholder>
      <head.block ID="I2C385D704E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
            <head.info>
               <headtext>
                  <ital>Wests pso1_1</ital>
               </headtext>
            </head.info>
         </codes.head>
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
            <head.info>
               <headtext> pso1_2</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
      <head.block ID="I2C38D2A24E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER</label.name>
               <label.designator>75 pso1_4</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER duplicate</label.name>
               <label.designator>75 pso1_5</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B14E1711DF8062B84BC6F3033A" md.mnem="hg2c">
            <head.info>
               <headtext> pso1_6</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
   </bov>
   <grade.content legacy.identifier="018840438" ID="I2C3158904E1711DFAB97E78B3969CA63">
      <head.block ID="I2C31CDC04E1711DFAB97E78B3969CA63">
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="srnl">
            <head.info>
               <headtext>pso 4</headtext>
            </head.info>
         </codes.head>
      </head.block>
   </grade.content>
</printArtifactGroup>

CodePudding user response:

This should do it for you in XSLT 3.0, which gives the shorter stylesheet.

You can use the same approach in XSLT 2.0. The difference is that XSLT 2.0 has no xsl:mode declaration with on-no-match, so you need to supply the identity template yourself.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="3.0"
  exclude-result-prefixes="#all">

  <xsl:mode on-no-match="shallow-copy" />
  
  <xsl:output method="xml" indent="yes" />
  
  <xsl:variable name="mdMnems" as="xs:string*" select="('ht1', 'ht1c')" />
  
  <xsl:variable name="nodesToKeep" as="element(codes.head)*" 
                select="for $mdMnem in $mdMnems
                        return ((//codes.head[@md.mnem eq $mdMnem ])[1])" />
                          
  <xsl:template match="codes.head[@md.mnem = $mdMnems]">
    <xsl:if test="some $node in $nodesToKeep satisfies $node is .">
      <xsl:next-match />
    </xsl:if>
  </xsl:template>
  
</xsl:stylesheet>

I'm getting this output (in @martin-honnen's excellent xslt3fiddle):

<?xml version="1.0" encoding="UTF-8"?>
<printArtifactGroup>
    <!--Pubtags   : [ANIP , AN , ANIP, AN]Sourcetags: [21, 21-A1]-->
    <bov ID="I2C37E8404E1711DF8062B84BC6F3033A" legacy.identifier="000321783">
      <placeholder ID="I2C3836604E1711DF8062B84BC6F3033A" md.mnem="vols">
         <placeholder.text>0390 V. 0390 Ch. 75, Arts. 42-end (2008)</placeholder.text>
      </placeholder>
      <head.block ID="I2C385D704E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C385D714E1711DF8062B84BC6F3033A" md.mnem="ht1">
            <head.info>
               <headtext>
                  <ital>Wests pso1_1</ital>
               </headtext>
            </head.info>
         </codes.head>
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="ht1c">
            <head.info>
               <headtext> pso1_2</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
      <head.block ID="I2C38D2A24E1711DF8062B84BC6F3033A">
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER</label.name>
               <label.designator>75 pso1_4</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B04E1711DF8062B84BC6F3033A" md.mnem="hg2">
            <head.info>
               <label.name>CHAPTER duplicate</label.name>
               <label.designator>75 pso1_5</label.designator>
            </head.info>
         </codes.head>
         <codes.head ID="I2C38F9B14E1711DF8062B84BC6F3033A" md.mnem="hg2c">
            <head.info>
               <headtext> pso1_6</headtext>
            </head.info>
         </codes.head>
         <placeholder ID="I2C3920C14E1711DF8062B84BC6F3033A" md.mnem="angen">
            <placeholder.text>UL</placeholder.text>
         </placeholder>
      </head.block>
   </bov>
   <grade.content legacy.identifier="018840438" ID="I2C3158904E1711DFAB97E78B3969CA63">
      <head.block ID="I2C31CDC04E1711DFAB97E78B3969CA63">
         <codes.head ID="I2C385D724E1711DF8062B84BC6F3033A" md.mnem="srnl">
            <head.info>
               <headtext>pso 4</headtext>
            </head.info>
         </codes.head>
      </head.block>
   </grade.content>
</printArtifactGroup>
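
For XSLT 2.0, the variant described at the start of this answer might look like the sketch below. It keeps the same two global variables and the same xsl:next-match logic, and replaces the xsl:mode declaration with an explicit identity template. The comments are added for illustration, and the stylesheet is a sketch that has not been run against every XSLT 2.0 processor.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  version="2.0"
  exclude-result-prefixes="#all">

  <xsl:output method="xml" indent="yes" />

  <!-- md.mnem values for which only the first element in the file is kept -->
  <xsl:variable name="mdMnems" as="xs:string*" select="('ht1', 'ht1c')" />

  <!-- First codes.head in document order for each listed md.mnem value -->
  <xsl:variable name="nodesToKeep" as="element(codes.head)*"
                select="for $mdMnem in $mdMnems
                        return (//codes.head[@md.mnem eq $mdMnem])[1]" />

  <!-- Identity template: stands in for xsl:mode on-no-match="shallow-copy" -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Copy a listed codes.head only if it is one of the nodes to keep;
       xsl:next-match falls through to the identity template above -->
  <xsl:template match="codes.head[@md.mnem = $mdMnems]">
    <xsl:if test="some $node in $nodesToKeep satisfies $node is .">
      <xsl:next-match />
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The `is` operator compares node identity rather than content, which is why a later codes.head with the same ID and md.mnem attributes is still dropped.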

CodePudding user response:

Here is my suggestion using an accumulator (XSLT 3 feature) spelled out:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="#all"
    version="3.0">
  
  <xsl:param name="codes" as="xs:string*" select="'ht1', 'ht1c'"/>
  
  <xsl:accumulator name="element-counter" as="map(xs:string, xs:integer)" initial-value="map:merge($codes ! map { . : 0 })">
    <xsl:accumulator-rule 
      match="codes.head[@md.mnem = $codes]"
      select="map:put($value, string(@md.mnem), $value(string(@md.mnem)) + 1)"/>
  </xsl:accumulator>

  <xsl:mode on-no-match="shallow-copy" use-accumulators="element-counter"/>

  <xsl:template match="codes.head[@md.mnem = $codes][accumulator-before('element-counter')(string(@md.mnem)) gt 1]"/>
  
</xsl:stylesheet>

This would even work with Saxon EE and streaming if you add streamable="yes" to the xsl:mode declaration and to the xsl:accumulator declaration.
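
As a sketch of that streaming variant, the stylesheet would carry streamable="yes" on both declarations and otherwise stay the same. This is untested and assumes a processor with streaming support such as Saxon EE; the comments are added for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:map="http://www.w3.org/2005/xpath-functions/map"
    exclude-result-prefixes="#all"
    version="3.0">

  <xsl:param name="codes" as="xs:string*" select="'ht1', 'ht1c'"/>

  <!-- Counts, per md.mnem value, how many matching codes.head elements
       have been seen so far in document order -->
  <xsl:accumulator name="element-counter" as="map(xs:string, xs:integer)"
      initial-value="map:merge($codes ! map { . : 0 })"
      streamable="yes">
    <xsl:accumulator-rule
      match="codes.head[@md.mnem = $codes]"
      select="map:put($value, string(@md.mnem), $value(string(@md.mnem)) + 1)"/>
  </xsl:accumulator>

  <xsl:mode on-no-match="shallow-copy" use-accumulators="element-counter"
      streamable="yes"/>

  <!-- Suppress every listed codes.head after the first one for its md.mnem -->
  <xsl:template match="codes.head[@md.mnem = $codes][accumulator-before('element-counter')(string(@md.mnem)) gt 1]"/>

</xsl:stylesheet>
```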
