Stream a large XML file with XSLT in Saxon EE 10.6

I have to import large XML files (>5 GB) into Solr. I want to transform each XML file first with Saxon EE 10.6 and a streaming XSLT stylesheet. I have read that this should be possible with Saxon EE 10.6, but I get the following error:

Error on line 20 column 34 of mytest.xsl: XTSE3430 Template rule is not streamable

There is more than one consuming operand: {<field {(attr{name=...}, ...)}/>} on line 21, and {xsl:apply-templates} on line 27
The result of the template rule can contain streamed nodes

I am not familiar with streaming XSLT or Saxon. How do I make my XSLT streamable so that it outputs the Solr add-document XML I need?

I have a fiddle here with a simplified version of my xml and the xslt I use: https://xsltfiddle.liberty-development.net/asoTKU

It works well for smaller XML files (<1 GB).

CodePudding user response：

Assuming your Properties and Category elements are "small" enough to be buffered, I guess you could use:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">

  <xsl:output method="xml" encoding="utf-8" indent="yes" />
  
  <xsl:strip-space elements="*"/>
  
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>
  
  <xsl:mode name="grounded"/>
  
  <xsl:template match="Properties | Category">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>
  
  <xsl:template match="Category" mode="grounded">
    <field name="Category">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Properties" mode="grounded">
    <field name="Properties">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Category/*" mode="grounded">
    <field name="CAT_{local-name()}_s">{.}</field>
  </xsl:template>

  <xsl:template match="Property" mode="grounded">
    <field name="{key}_s">{value}</field>
  </xsl:template>

  <xsl:template match="Item/*[not(self::Category | self::Properties)]">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

  <xsl:template match='/Items'>
    <add>
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <xsl:template match="Item">
    <xsl:variable name="pos" select="position()"/>
    <doc>
      <xsl:apply-templates>
        <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
      </xsl:apply-templates>
    </doc>
  </xsl:template>

</xsl:stylesheet>

But your code (doing <xsl:apply-templates select="Property"/> in <xsl:template match="Property">) suggests that Property elements can perhaps be nested recursively. With arbitrary nesting, that could cause memory problems if the code attempts, as done above, to buffer the first Property it encounters in memory using copy-of().

Your sample XML, however, doesn't have any nested Property elements.

Part of the xsl:fork strategy I commented on is used in:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">

  <xsl:output method="xml" encoding="utf-8" indent="yes" />
  
  <xsl:strip-space elements="*"/>
  
  <xsl:mode streamable="yes"/>
  
  <xsl:mode name="text" streamable="yes"/>
  
  <xsl:mode name="grounded"/>
  
  <xsl:template match="Category">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>
  
  <xsl:template match="Properties">
    <xsl:fork>
      <xsl:sequence>
        <field name="Properties">
          <xsl:apply-templates mode="text"/>
        </field>
      </xsl:sequence>
      <xsl:sequence>
        <xsl:apply-templates/>
      </xsl:sequence>
    </xsl:fork>
  </xsl:template>
  
  <xsl:template match="Category" mode="grounded">
    <field name="Category">{.}</field>
    <xsl:apply-templates mode="#current"/>
  </xsl:template>
  
  <xsl:template match="Category/*" mode="grounded">
    <field name="CAT_{local-name()}_s">{.}</field>
  </xsl:template>
  
  <xsl:template match="Property">
    <xsl:apply-templates select="copy-of()" mode="grounded"/>
  </xsl:template>

  <xsl:template match="Property" mode="grounded">
    <field name="{key}_s">{value}</field>
  </xsl:template>

  <xsl:template match="Item/*[not(self::Category | self::Properties)]">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

  <xsl:template match='/Items'>
    <add>
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <xsl:template match="Item">
    <xsl:variable name="pos" select="position()"/>
    <doc>
      <xsl:apply-templates>
        <xsl:with-param name="pos"><xsl:value-of select="$pos"/></xsl:with-param>
      </xsl:apply-templates>
    </doc>
  </xsl:template>

</xsl:stylesheet>

That avoids explicitly constructing "a tree" for each Properties element, but I have no idea what strategies Saxon applies to make sure both branches of the xsl:fork have access to the child or descendant contents.

CodePudding user response：

The rules for XSLT 3.0 streaming are incredibly complicated, and it doesn't help that there are few tutorial introductions. One extremely useful resource is Abel Braaksma's talk at XML Prague 2014: there's a transcript and a link to the YouTube recording at https://www.xfront.com/Transcript-of-Abel-Braaksma-talk-on-XSLT-Streaming-at-XML-Prague-2014.pdf

The most important rule to remember is: a template rule can only make one downward selection (it only gets one chance to scan the descendant tree). That's the rule you've broken when you wrote:

<xsl:template match="node()">
   <xsl:element name="field">
      <xsl:attribute name="name">
        <xsl:value-of select="local-name()"/>
      </xsl:attribute>
      <xsl:value-of select="."/>
   </xsl:element>
   <xsl:apply-templates select="*"/>
</xsl:template>

Actually, that code could be simplified to

<xsl:template match="node()">
   <field name="{local-name()}">{.}</field>
   <xsl:apply-templates select="*"/>
</xsl:template>

But this wouldn't affect the streamability: you're processing the descendants of the matched node twice, once to get the string value (.), and once to apply-templates to the children.

Now, it looks to me as if this template rule is only being used to process "leaf elements", that is, elements that have a text node child but no child elements. If that's the case, then the <xsl:apply-templates select="*"/> never selects anything: it's redundant and it can be removed, which makes the rule streamable.

There's another error message you're getting, which is that the template rule can return streamed nodes. The reason it's not permitted to return streamed nodes is a bit more subtle; it basically makes it impossible for the processor to do the data flow analysis to prove whether or not streaming is feasible. But it's again the <xsl:apply-templates select="*"/> that's the cause of the problem and getting rid of it fixes things.
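As a minimal sketch of the rule with the redundant instruction removed, assuming (as above) that it only ever matches elements without child elements, and that the enclosing stylesheet declares expand-text="yes" and a streamable mode:

```xml
<!-- Fragment only: assumed to sit inside an xsl:stylesheet that declares
     expand-text="yes" and <xsl:mode streamable="yes"/>. -->
<xsl:template match="node()">
   <!-- The only downward selection left is reading the string value (.) -->
   <field name="{local-name()}">{.}</field>
</xsl:template>
```

With a single consuming operand, and a result that is a newly constructed field element rather than a streamed node, neither of the two reported complaints should apply to this rule any more.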

Your next problem is with the template rule for Property elements. You've written this as

   <xsl:template match="Property">
        <xsl:element name="field">
            <xsl:attribute name="name">
               <xsl:value-of select="key"/>_s</xsl:attribute>
            <xsl:value-of select="value"/>
        </xsl:element>
        <xsl:apply-templates select="Property"/>
    </xsl:template>

and it simplifies to:

<xsl:template match="Property">
    <field name="{key}_s">{value}</field>
    <xsl:apply-templates select="Property"/>
</xsl:template>

This is making three downward selections: child::key, child::value, and child::Property. In your data sample, no Property element has a child called Property, so perhaps the <xsl:apply-templates/> is again redundant. For key and value one useful trick is to read them into a map:

<xsl:template match="Property">
    <xsl:variable name="pair" as="map(*)">
      <xsl:map>
        <xsl:map-entry key="'key'" select="string(key)"/>
        <xsl:map-entry key="'value'" select="string(value)"/>
      </xsl:map>
    </xsl:variable>
    <field name="{$pair?key}_s">{$pair?value}</field>
</xsl:template>

The reason this works is that xsl:map (like xsl:fork) is an exception to the "one downward selection" rule - the map can be built up in a single pass of the input. By calling string(), we're careful not to put any streamed nodes into the map, so the data we need later has been captured in the map and we don't ever need to go back to the streamed input document to read it a second time.
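Putting these pieces together, a complete stylesheet might look roughly like the untested sketch below. It assumes the element names used in the first answer (Items, Item, Properties, Property with key and value children), assumes that Property elements are not nested, and, for brevity, treats Category like any other child of Item instead of emitting separate CAT_ fields.

```xml
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" expand-text="yes">

  <xsl:output method="xml" encoding="utf-8" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <!-- shallow-skip: unmatched elements (here: Properties) just pass their children on -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <xsl:template match="/Items">
    <add>
      <!-- one downward selection -->
      <xsl:apply-templates select="Item"/>
    </add>
  </xsl:template>

  <xsl:template match="Item">
    <doc>
      <!-- one downward selection -->
      <xsl:apply-templates/>
    </doc>
  </xsl:template>

  <!-- Children of Item other than Properties: only the string value is read -->
  <xsl:template match="Item/*[not(self::Properties)]">
    <field name="{local-name()}">{.}</field>
  </xsl:template>

  <!-- key and value are captured as strings in a map during a single pass -->
  <xsl:template match="Property">
    <xsl:variable name="pair" as="map(*)">
      <xsl:map>
        <xsl:map-entry key="'key'" select="string(key)"/>
        <xsl:map-entry key="'value'" select="string(value)"/>
      </xsl:map>
    </xsl:variable>
    <field name="{$pair?key}_s">{$pair?value}</field>
  </xsl:template>

</xsl:stylesheet>
```

Each template rule here makes at most one downward selection, with the map construction being the one permitted exception, and every rule returns newly constructed elements rather than streamed nodes.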

I hope this gives you a feel for the way forward. Streaming in XSLT is not for the faint-hearted, but if you've got >5Gb input documents then you don't have many options open.