XSLT Efficiently identify repeated nodes within a given range

Time:03-03

I'm working on some manuscript transcriptions in XML-TEI, and I'm using XSLT to transform it into a .tex document. My input document is made of tei:w tokens that represent each word of the text. MWE:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
    schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
   <teiHeader>
      ...
   </teiHeader>
   <text>
      <body>
         <p><w>Lorem</w>
            <w>ipsum</w>
            <w>dolor</w>
            <w>sit</w>
            <w>amet</w>
            <pc>,</pc>
            <w>consectetur</w>
            <w>adipiscing</w>
            <w>elit</w>
            <pc>,</pc>
            <w>sed</w>
            <w>do</w>
            <w>eiusmod</w>
            <w>tempor</w>
            <w>incididunt</w>
            <w>ut</w>
            <w>labore</w>
            <w>et</w>
            <w>dolore</w>
            <w>magna</w>
            <w>aliqua</w>
            <pc>;</pc>
            <w>ut</w>
            <w>enim</w>
            <w>ad</w>
            <w>minim</w>
            <w>veniam</w>
         </p>
      </body>
   </text>
</TEI>

I need to identify the words that are repeated within a certain range, say 10 words, so that LaTeX can disambiguate them in the edition (with a command called \sameword from the reledmac package). For example, in the above MWE, I want both occurrences of ut to be tagged with this command.

I think I've found a way to do this; my question is more about how to improve my code. With small documents, the template below seems to work just fine, but my corpus is made of 300,000 tokens and the transformation is taking far too long: the engine evaluates the left and right contexts for each word...

 <xsl:template match="tei:w">
        <xsl:variable name="current_position" select="count(preceding::tei:w)"/>
        <xsl:variable name="same_word_before"
            select="preceding::tei:w[count(preceding::tei:w) > ($current_position - 10)][not(count(preceding::tei:w) > $current_position)]/text() = text()"/>
        <xsl:variable name="same_word_after"
            select="following::tei:w[($current_position + 10) > count(preceding::tei:w)][count(preceding::tei:w) > $current_position]/text() = text()"/>
        ...
        <xsl:choose>
            <xsl:when test="$same_word_before or $same_word_after">
                <xsl:text>\sameword{</xsl:text>
                <xsl:apply-templates/>
                <xsl:text>}</xsl:text>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates/>
            </xsl:otherwise>
        </xsl:choose>
        ...
    </xsl:template>

Is there a simpler and/or more efficient way to do this? One solution I'm thinking of is to use Python, but I would prefer to stick with XSLT for this task.

Edit: I'm using XSLT 2.0.

Answer:

This is not much different from what you did, but it is still quite fast:

  <xsl:template match="tei:w">
    <xsl:variable name="preceding"  as="xs:string*" select="preceding-sibling::tei:w[position() lt 11]/text()" />
    <xsl:variable name="following"  as="xs:string*" select="following-sibling::tei:w[position() lt 11]/text()" />
    <xsl:choose>
      <xsl:when test="text()=($preceding,$following)">
        <xsl:text>\sameword{</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

I tested it on 2000 p elements of 50 words each, and it took 0.3 seconds.
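If the range needs to vary, the window size can be lifted into a stylesheet parameter. The following is only a minimal sketch under that assumption: the parameter name window and the text output method are hypothetical additions, not part of the template above, and the tei and xs prefixes have to be bound as shown. Note that the sibling axes only look inside the current parent element, so a repetition that crosses a paragraph boundary is not detected, unlike with the preceding:: and following:: axes used in the question.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Assumption: the result is a plain-text .tex file -->
  <xsl:output method="text"/>

  <!-- Hypothetical parameter: number of neighbouring words inspected on each side -->
  <xsl:param name="window" as="xs:integer" select="10"/>

  <xsl:template match="tei:w">
    <!-- On a reverse axis position() counts backwards from the current node,
         so both variables hold the nearest $window sibling words -->
    <xsl:variable name="preceding" as="xs:string*"
        select="preceding-sibling::tei:w[position() le $window]/text()"/>
    <xsl:variable name="following" as="xs:string*"
        select="following-sibling::tei:w[position() le $window]/text()"/>
    <xsl:choose>
      <!-- True if the current word equals any word in either window -->
      <xsl:when test="text() = ($preceding, $following)">
        <xsl:text>\sameword{</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

With the default value of 10 this should behave like the template above; a different range can be supplied when the transformation is invoked, in whatever way your processor passes stylesheet parameters.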

Since XSLT 2.0 there are built-in data types, which describe the kind of data a variable, parameter, or function result can hold.

  • For example, <xsl:variable name="preceding" as="xs:string*"/> means the variable can contain zero or more strings.

  • Or <xsl:variable name="firstNextSibling" as="element()?"/> means the variable can contain zero or one element.
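The two declarations can be tried out in a small stylesheet. This is only an illustrative sketch: the select expressions and the report line are assumptions made for the demonstration, and the variable names are reused purely as examples.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="text"/>

  <xsl:template match="tei:w">
    <!-- Zero or more strings: empty for the first word of a paragraph -->
    <xsl:variable name="preceding" as="xs:string*"
        select="preceding-sibling::tei:w/text()"/>
    <!-- Zero or one element: empty for the last child element of a paragraph -->
    <xsl:variable name="firstNextSibling" as="element()?"
        select="following-sibling::*[1]"/>
    <!-- Report the word, the number of preceding words, and the next element's name -->
    <xsl:value-of select="concat(., ': ', count($preceding),
        ' preceding word(s), next element: ',
        local-name($firstNextSibling), '&#10;')"/>
  </xsl:template>

</xsl:stylesheet>
```

If the select expression of firstNextSibling returned more than one element (for instance following-sibling::* without the [1] predicate), the processor would report a type error instead of silently using the first item, which is the main practical benefit of declaring the type.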

<xsl:when test="text()=($preceding,$following)">

This test attribute on xsl:when is true when the value of the current text() node occurs anywhere in the combined $preceding and $following string sequences.
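The = operator here is an XPath general comparison, which succeeds as soon as any pair of items from the two sides is equal. A minimal sketch with hard-coded strings (the sample values are assumptions chosen for illustration, not taken from a real run) makes the behaviour visible:

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:variable name="preceding" as="xs:string*" select="('labore', 'et')"/>
    <xsl:variable name="following" as="xs:string*" select="('ut', 'enim')"/>
    <xsl:variable name="none" as="xs:string*" select="()"/>

    <!-- true: 'ut' equals at least one item of the combined sequence -->
    <xsl:value-of select="'ut' = ($preceding, $following)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- false: no item of the combined sequence equals 'dolor' -->
    <xsl:value-of select="'dolor' = ($preceding, $following)"/>
    <xsl:text>&#10;</xsl:text>

    <!-- false: a comparison against an empty sequence never succeeds -->
    <xsl:value-of select="'ut' = $none"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The last case is why the first and last words of a paragraph need no special handling: their empty windows simply contribute nothing to the comparison.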
