The following XSLT takes hours to run, whereas most XSLT that I run takes seconds or minutes. What am I doing wrong? The goal is to convert Word XHTML into a flat file for import into a dictionary program called FLEx; this stylesheet is just one step in identifying the pieces of the dictionary. The input XHTML file is 52K, and the conversion takes 27 steps. The first 8 steps use Saxon and XSLT, and each of them takes more than 3 hours to run. The last XSLT step flattens the file so that it is no longer XML. The remaining steps run on that text file in CC, a very efficient string-replacement tool that predates AWK and Perl, and they take only seconds.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE stylesheet [
<!ENTITY cr "
">
<!ENTITY tab "	">
<!ENTITY nbsp " ">
]><xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs"
xmlns="http://www.w3.org/1999/xhtml"
xmlns:html="http://www.w3.org/1999/xhtml"
version="3.0">
<xsl:strip-space elements="*"/>
<xsl:output encoding="UTF-8" indent="yes"/>
<xsl:template match="html:span[@class='source']" priority='4'>
<xsl:element name="span">
<xsl:attribute name="class">source</xsl:attribute>
<xsl:value-of select="normalize-space(.)"/>
</xsl:element>
</xsl:template>
<xsl:template match="html">
<html>
<xsl:apply-templates/>
</html>
</xsl:template>
<xsl:template match="html:i" priority="1">
<xsl:element name="span">
<xsl:attribute name="class">2.3 italic</xsl:attribute>
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
<xsl:template match="html:b">
<xsl:choose>
<xsl:when test="contains(.,'Derivation')">
<xsl:element name="span">
<xsl:attribute name="class">2.3dd derivation</xsl:attribute>
</xsl:element>
</xsl:when>
<xsl:when test="ancestor::*[contains(@class,'1.4 lx')]">
<xsl:apply-templates/>
</xsl:when>
<xsl:when test="ends-with(.,'-')">
<xsl:element name="span">
<xsl:attribute name="class">2.3a variant-none</xsl:attribute>
<xsl:value-of select="."/>
</xsl:element></xsl:when>
<xsl:when test="preceding::*[1]=preceding::html:br[1] and not(contains(.,'Forms')) and not(starts-with(following::text()[1],'(')[1])">
<xsl:element name="span">
<xsl:attribute name="class">2.3 variant-space</xsl:attribute>
<xsl:apply-templates/>
<!-- <xsl:value-of select="."/> joins that we don't want-->
</xsl:element></xsl:when>
<xsl:when test="preceding::*[1]=preceding::html:br[1] and not(contains(.,'Forms'))" >
<xsl:element name="span">
<xsl:attribute name="class">2.3 variant-none</xsl:attribute>
<xsl:value-of select="."/>
</xsl:element></xsl:when>
<xsl:otherwise>
<xsl:apply-templates/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="html:br">
<xsl:choose>
<xsl:when test="starts-with(following::text()[1],' (')">
<xsl:call-template name="makeLineBreak"/>
<xsl:element name="span">
<xsl:attribute name="class">2.3 text-(</xsl:attribute>
</xsl:element>
<xsl:apply-templates/>
</xsl:when>
<xsl:when test="following::*[1]=following::html:span[@class='MsoHyperlink'][following::*[1]=following::html:b[1]]">
<xsl:call-template name="makeLineBreak"/>
<xsl:copy-of select="."/>
</xsl:when>
<xsl:when test="following::*[1]=following::html:span[@class='MsoHyperlink']">
<xsl:call-template name="makeLineBreak"/>
<xsl:copy-of select="."/>
</xsl:when>
<xsl:when test="following::*[1]=following::html:span[@class='Arial'][1]">
<!-- 2.3 -->
</xsl:when>
<xsl:when test="starts-with(following::text()[1],'variant of')">
<!-- 2.3 variant of-->
</xsl:when>
<xsl:when test="following::*[1]=following::html[b][1] and contains(following::html:b[1],'Forms')">
<xsl:call-template name="makeLineBreak"/>
<xsl:copy-of select="."/>
</xsl:when>
<xsl:when test="preceding::text()[1]=')'">
<!-- 2.3 -->
<xsl:call-template name="makeLineBreak"/>
<xsl:element name="span">
<xsl:attribute name="class">2.3b definition</xsl:attribute>
<xsl:call-template name="processBold"/>
</xsl:element>
</xsl:when>
<xsl:when test="starts-with(following::text()[1],'(')">
<!-- 2.3 -->
<xsl:call-template name="makeLineBreak"/>
<xsl:element name="span">
<xsl:attribute name="class">2.3 gid</xsl:attribute>
<xsl:call-template name="processBold"/>
</xsl:element>
</xsl:when>
<xsl:when test="starts-with(following::text()[1],'(')">
<!-- 2.3 -->
<xsl:call-template name="makeLineBreak"/>
<xsl:element name="span">
<xsl:attribute name="class">2.3 gid</xsl:attribute>
<xsl:call-template name="processBold"/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
<!-- 2.3 -->
<xsl:call-template name="makeLineBreak"/>
<xsl:element name="span">
<xsl:attribute name="class">2.3 definition</xsl:attribute>
<xsl:call-template name="processBold"/>
</xsl:element>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="processBold">
<xsl:choose>
<xsl:when test="self::html:b and preceding::*[1]=preceding::span[1][@class='vernacular']">
2.3 vernacular bold
<xsl:text> </xsl:text>
</xsl:when>
<xsl:when test="self::html:b">
2.3 bold
<xsl:apply-templates select="."/>
<xsl:text> </xsl:text>
</xsl:when>
<xsl:when test="self::html:span[@class='Arial']">
<xsl:element name="span">
<xsl:attribute name="class">2.3a definition</xsl:attribute>
<xsl:value-of select="."/>
</xsl:element>
</xsl:when>
<xsl:otherwise>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template match="html:span[@lang][parent::html:b]" priority="1">
<xsl:choose>
<xsl:when test="preceding::*[1]=preceding::html:br[1]">
<xsl:element name="span">
<xsl:attribute name="class">2.3va variant-none</xsl:attribute>
<xsl:apply-templates/>
</xsl:element>
</xsl:when>
<xsl:when test=".='/'">
<xsl:value-of select="."/>
</xsl:when>
<xsl:when test="contains(.,'-')">
<xsl:text> </xsl:text>
<xsl:apply-templates/>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates/>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<xsl:template name="makeLineBreak">
<xsl:text>
</xsl:text>
</xsl:template>
<!-- identity transform -->
<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="*|@*|text()|processing-instruction()|comment()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
CodePudding user response:
You have quite a few expressions like this one:
test="following::*[1]=following::html:span[@class='MsoHyperlink'][following::*[1]=following::html:b[1]]"
In principle, evaluating the following:: (or preceding::) axis takes time proportional to the size of the document. However, if the step is followed by the predicate [1], the search stops on hitting the first following node, which makes it run in constant time (i.e. time independent of document size). Three of your calls to following:: in this expression fall under that rule; the fourth (following::html:span[@class='MsoHyperlink']) does not. So this particular test is going to take time proportional to document size. You're evaluating this test once for every br element, so the number of times you evaluate it is presumably proportional to document size; this makes the overall cost O(n^2).
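If the intent of that test is "the element immediately after this br is a MsoHyperlink span", one way to avoid the unbounded scan is to fetch the first following element once and then test what kind of node it is. The sketch below rests on that assumption about the intent and is not a drop-in replacement for the whole br template; the variable name next is made up for illustration, and the otherwise branch simply copies the br.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html"
                version="3.0">

  <!-- Copy every node that has no more specific template (XSLT 3.0). -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="html:br">
    <!-- Hypothetical variable: the first element after this br in document order.
         The [1] lets the processor stop at the first hit. -->
    <xsl:variable name="next" select="following::*[1]"/>
    <xsl:choose>
      <!-- The next element is a MsoHyperlink span that is itself
           directly followed by a b element. -->
      <xsl:when test="$next[self::html:span][@class = 'MsoHyperlink']
                           [following::*[1][self::html:b]]">
        <xsl:text>&#10;</xsl:text>
        <xsl:copy-of select="."/>
      </xsl:when>
      <!-- The next element is a MsoHyperlink span. -->
      <xsl:when test="$next[self::html:span][@class = 'MsoHyperlink']">
        <xsl:text>&#10;</xsl:text>
        <xsl:copy-of select="."/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:copy-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Note that this is not strictly equivalent to the original: the original "=" compares string values and can succeed against any later MsoHyperlink span with the same text, whereas this version only asks whether the very next element is such a span.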
Very often, people use preceding:: and following:: where preceding-sibling:: and following-sibling:: would be more appropriate. I've no idea if that's the case here.
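As an illustration only, and only if a br and the b that comes after it really are children of the same parent in the Word output (which I have not checked), the "b directly after a br" case could be expressed with a sibling axis. The class value is taken from the question; the rest is a sketch.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html"
                version="3.0">

  <!-- Copy every node that has no more specific template (XSLT 3.0). -->
  <xsl:mode on-no-match="shallow-copy"/>

  <!-- A b whose nearest preceding sibling element is a br.
       Assumption: br and b share a parent; if they do not,
       this pattern never matches and the b is copied unchanged. -->
  <xsl:template match="html:b[preceding-sibling::*[1][self::html:br]]">
    <span class="2.3 variant-none">
      <xsl:value-of select="."/>
    </span>
  </xsl:template>

</xsl:stylesheet>
```

A sibling axis only looks at the children of one parent, so even without the [1] its cost is bounded by the number of siblings, not by the size of the document.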
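I suspect that in most of these expressions you are using "=" where you should be using "is". The "=" operator compares the string values of its operands, so an "=" test on an element with a large subtree is very expensive (at least proportional to the size of the tree being compared). The "is" operator only checks whether both operands are the same node.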
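A minimal sketch of the "is" form, using the "b directly after a br" test from the question. It assumes that the test is meant to ask whether the nearest preceding element is the nearest preceding br, and it shows only that one branch.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html"
                version="3.0">

  <!-- Copy every node that has no more specific template (XSLT 3.0). -->
  <xsl:mode on-no-match="shallow-copy"/>

  <xsl:template match="html:b">
    <xsl:choose>
      <!-- Node identity test: true only if both sides select the same node.
           Each side ends in [1], so each yields at most one node. -->
      <xsl:when test="preceding::*[1] is preceding::html:br[1]">
        <span class="2.3 variant-none">
          <xsl:value-of select="."/>
        </span>
      </xsl:when>
      <xsl:otherwise>
        <xsl:apply-templates/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Each operand of "is" must be a single node or empty. If either side is empty the test is false, and if a side can select several nodes (for example following::html:span[@class='MsoHyperlink'] without a [1]) the comparison raises a type error, so such steps need a [1] or a different formulation.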
You could start by staring at the code looking for obvious inefficiencies like these, or you could start with performance measurement and analysis of the results. When faced with large amounts of code, especially if it's unfamiliar code, the second approach is usually more productive. Start by getting the -TP:profile.html output to see if it identifies obvious hot-spots. Also, of course, get the timings for each of your 27 steps and decide which of them to focus on.