I'm working on manuscript transcriptions in XML-TEI, and I'm using XSLT to transform them into a .tex document. My input document is made of tei:w tokens, one for each word of the text. MWE:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng" type="application/xml" schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
I need to identify the words that are repeated within a certain range, say 10 words, so that LaTeX can disambiguate them in the edition (with the \sameword command from the reledmac package). For example, in the above MWE, I want both occurrences of ut to be tagged with this command.
I think I've found a way to do this; my question is more about how to improve my code. With small documents, the template below seems to work just fine, but my corpus is made of 300,000 tokens and the transformation takes far too long: the engine evaluates the left and right contexts for each word...
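<xsl:template match="tei:w">
<!-- Zero-based index of the current word among all tei:w in the document -->
<xsl:variable name="current_position" select="count(preceding::tei:w)"/>
<!-- True if an earlier word inside the window has the same text -->
<xsl:variable name="same_word_before"
select="preceding::tei:w[count(preceding::tei:w) > ($current_position - 10)][not(count(preceding::tei:w) > $current_position)]/text() = text()"/>
<!-- True if a later word inside the window has the same text -->
<xsl:variable name="same_word_after"
select="following::tei:w[($current_position + 10) > count(preceding::tei:w)][count(preceding::tei:w) > $current_position]/text() = text()"/>
<xsl:choose>
<xsl:when test="$same_word_before or $same_word_after">
<!-- … -->
</xsl:when>
</xsl:choose>
</xsl:template>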
Is there a simpler and/or more efficient way to do this? One solution I'm thinking of is to use Python, but I would prefer to stick with XSLT for this task.
Edit: I'm using XSLT 2.0.
CodePudding user response:
Not much different from what you did, but still quite fast:
<xsl:template match="tei:w">
<!-- Text of the ten nearest tei:w siblings on each side -->
<xsl:variable name="preceding" as="xs:string*" select="preceding-sibling::tei:w[position() lt 11]/text()" />
<xsl:variable name="following" as="xs:string*" select="following-sibling::tei:w[position() lt 11]/text()" />
<xsl:choose>
<xsl:when test="text()=($preceding,$following)">
<!-- … -->
</xsl:when>
</xsl:choose>
</xsl:template>
I tested it with 2000 p elements of 50 words each, and it took 0.3 seconds.
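For reference, here is a minimal sketch of how the template might sit in a complete stylesheet. The xsl:otherwise branch, the \sameword{...} output form, the text output method and the namespace prefix bindings are assumptions added for illustration; they are not part of the answer as posted. Note also that the sibling axes only look at tei:w elements with the same parent, whereas the preceding:: and following:: axes in the question cross element boundaries.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="xs tei">

  <!-- Assumption: the result is plain text destined for a .tex file -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="tei:w">
    <!-- Text of the ten nearest tei:w siblings before the current word -->
    <xsl:variable name="preceding" as="xs:string*"
        select="preceding-sibling::tei:w[position() lt 11]/text()"/>
    <!-- Text of the ten nearest tei:w siblings after the current word -->
    <xsl:variable name="following" as="xs:string*"
        select="following-sibling::tei:w[position() lt 11]/text()"/>
    <xsl:choose>
      <!-- True if any string in the window equals the current word -->
      <xsl:when test="text() = ($preceding, $following)">
        <!-- Assumption: wrap the repeated word in \sameword{...} -->
        <xsl:text>\sameword{</xsl:text>
        <xsl:value-of select="."/>
        <xsl:text>}</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <!-- Assumption: other words are copied through unchanged -->
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Because position() counts outward from the current node on both sibling axes, the predicate position() lt 11 keeps the ten closest words on each side.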
Since XSLT 2.0 we have built-in data types. They describe the kind of data a variable, parameter or function result may hold.
<xsl:variable name="preceding" as="xs:string*"/>
means the variable can contain zero or more strings. Or
<xsl:variable name="firstNextSibling" as="element()?"/>
means the variable can contain zero or one element.
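As a small illustration of these occurrence indicators, the sketch below declares a few typed variables. The template name main and all variable names are hypothetical; running it assumes an XSLT 2.0 processor that can start at a named template (Saxon's -it option, for instance).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <!-- No indicator: exactly one integer -->
    <xsl:variable name="window" as="xs:integer" select="10"/>
    <!-- "*": zero or more strings -->
    <xsl:variable name="words" as="xs:string*" select="('ut', 'in', 'ut')"/>
    <!-- "+": one or more strings; an empty sequence would be a type error -->
    <xsl:variable name="nonEmpty" as="xs:string+" select="$words"/>
    <!-- "?": zero or one string -->
    <xsl:variable name="first" as="xs:string?" select="$words[1]"/>
    <!-- Items of the sequence are written separated by single spaces -->
    <xsl:value-of select="count($nonEmpty), $first, $window"/>
  </xsl:template>

</xsl:stylesheet>
```

Declaring the type lets the processor report a mismatch at the declaration instead of producing a silently wrong result later on.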
<xsl:when test="text()=($preceding,$following)">
This test attribute of the xsl:when is true when the value of the current text() node occurs in the combined $preceding and $following string sequences.
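The behaviour relied on here is that of the XPath 2.0 general comparison operator: = is true as soon as one item on the left equals one item on the right. A minimal sketch, with hypothetical variable names and a hypothetical named template main as the entry point:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <xsl:template name="main">
    <xsl:variable name="context" as="xs:string*" select="('in', 'ut', 'et')"/>
    <xsl:variable name="empty" as="xs:string*" select="()"/>
    <!-- true: 'ut' equals at least one item of $context -->
    <xsl:value-of select="'ut' = $context"/>
    <xsl:text>&#10;</xsl:text>
    <!-- false: no item of $context equals 'non' -->
    <xsl:value-of select="'non' = $context"/>
    <xsl:text>&#10;</xsl:text>
    <!-- false: nothing can match in an empty sequence -->
    <xsl:value-of select="'ut' = $empty"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The last case matters for the first or only word of a parent element: when one of the two windows is empty, the comparison is decided by the other window alone and raises no error.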