I'm working on manuscript transcriptions in XML-TEI, and I'm using XSLT to transform them into a .tex document. My input document is made of tei:w tokens, one for each word of the text. MWE:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml"
schematypens="http://purl.oclc.org/dsdl/schematron"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
...
</teiHeader>
<text>
<body>
<p><w>Lorem</w>
<w>ipsum</w>
<w>dolor</w>
<w>sit</w>
<w>amet</w>
<pc>,</pc>
<w>consectetur</w>
<w>adipiscing</w>
<w>elit</w>
<pc>,</pc>
<w>sed</w>
<w>do</w>
<w>eiusmod</w>
<w>tempor</w>
<w>incididunt</w>
<w>ut</w>
<w>labore</w>
<w>et</w>
<w>dolore</w>
<w>magna</w>
<w>aliqua</w>
<pc>;</pc>
<w>ut</w>
<w>enim</w>
<w>ad</w>
<w>minim</w>
<w>veniam</w>
</p>
</body>
</text>
</TEI>
I need to identify the words that are repeated within a certain range, say 10 words, so that LaTeX can disambiguate them in the edition (with the \sameword command from the reledmac package). For example, in the MWE above, I want both occurrences of ut to be tagged with this command.
I think I've found a way to do this; my question is more about how to improve my code. With small documents, the template below seems to work just fine, but my corpus is made of 300,000 tokens and the transformation takes far too long: the engine evaluates the left and right contexts for every single word...
<xsl:template match="tei:w">
<xsl:variable name="current_position" select="count(preceding::tei:w)"/>
<xsl:variable name="same_word_before"
select="preceding::tei:w[($current_position - 10) &lt; count(preceding::tei:w)][not(count(preceding::tei:w) > $current_position)]/text() = text()"/>
<xsl:variable name="same_word_after"
select="following::tei:w[($current_position + 10) > count(preceding::tei:w)][count(preceding::tei:w) > $current_position]/text() = text()"/>
...
<xsl:choose>
<xsl:when test="$same_word_before or $same_word_after">
<xsl:text>\sameword{</xsl:text>
<xsl:apply-templates/>
<xsl:text>}</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates/>
</xsl:otherwise>
</xsl:choose>
...
</xsl:template>
Is there a simpler and/or more efficient way to do this? One solution I'm thinking of is to use Python, but I would prefer to stick with XSLT for this task.
Edit: I'm using XSLT 2.0.
Answer:
Not much different from what you did, but still quite fast:
<xsl:template match="tei:w">
<xsl:variable name="preceding" as="xs:string*" select="preceding-sibling::tei:w[position() lt 11]/text()" />
<xsl:variable name="following" as="xs:string*" select="following-sibling::tei:w[position() lt 11]/text()" />
<xsl:choose>
<xsl:when test="text()=($preceding,$following)">
<xsl:text>\sameword{</xsl:text>
<xsl:apply-templates/>
<xsl:text>}</xsl:text>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
I tested it with 2000 p elements of 50 words each, and it took 0.3 seconds.
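One thing to keep in mind: the sibling axes above only look inside the current p, so a repetition that straddles a paragraph break is not detected. If that matters for the edition, a possible variant is sketched below. It has not been benchmarked on a 300,000-token corpus and may well be slower than the sibling version. It assumes that every tei:w in the document belongs to the running text, and it compares string() values instead of text() nodes in case a word contains child elements. The $window parameter is a hypothetical name introduced for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs tei">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: number of words inspected on each side. -->
  <xsl:param name="window" as="xs:integer" select="10"/>

  <!-- Keep the header text out of the output. -->
  <xsl:template match="tei:teiHeader"/>

  <xsl:template match="tei:w">
    <!-- preceding:: is a reverse axis, so position() counts backwards
         from the current word: this selects the nearest $window words,
         regardless of which p they sit in. -->
    <xsl:variable name="before" as="xs:string*"
        select="preceding::tei:w[position() le $window]/string()"/>
    <!-- following:: is a forward axis: the next $window words. -->
    <xsl:variable name="after" as="xs:string*"
        select="following::tei:w[position() le $window]/string()"/>
    <xsl:choose>
      <!-- True if the current word equals any word in either window. -->
      <xsl:when test="string() = ($before, $after)">
        <xsl:text>\sameword{</xsl:text>
        <xsl:apply-templates/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Run against the MWE from the question, both ut tokens should be wrapped, since they are seven words apart; tei:pc elements are not counted as words in either version.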
Since XSLT 2.0 we have built-in data types. They describe the kind of data that a variable or parameter can hold, or that a function returns.
For example,
<xsl:variable name="preceding" as="xs:string*"/>
means the variable can contain zero or more strings. Or
<xsl:variable name="firstNextSibling" as="element()?"/>
means the variable can contain zero or one element.
<xsl:when test="text()=($preceding,$following)">
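This @test attribute on xsl:when is true when the value of the current text() node occurs somewhere in the combined $preceding and $following string sequences: in XPath 2.0, = is a general comparison, which succeeds as soon as any pair of items from the two sides is equal.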
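To see that comparison in isolation, here is a minimal sketch that needs no source document. The template name main and the variable names are made up for the illustration; with Saxon, for instance, it can be started through the -it:main option.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point; no input XML is required. -->
  <xsl:template name="main">
    <!-- Stand-in for the combined ($preceding, $following) sequence. -->
    <xsl:variable name="window" as="xs:string*"
        select="('labore', 'et', 'dolore')"/>
    <!-- A window with no words at all, e.g. a p with a single w. -->
    <xsl:variable name="empty" as="xs:string*" select="()"/>

    <!-- Prints true: at least one item of $window equals 'et'. -->
    <xsl:value-of select="'et' = $window"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Prints false: no item of $window equals 'ut'. -->
    <xsl:value-of select="'ut' = $window"/>
    <xsl:text>&#10;</xsl:text>

    <!-- Prints false: nothing can match in an empty sequence. -->
    <xsl:value-of select="'ut' = $empty"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The last case is why the template needs no special handling for the first or last word of a paragraph: an empty window simply never matches.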