I need to evaluate multiple XPath expressions (or possibly XQuery, as I have some freedom to change the design here) over a large number of huge XML documents, potentially gigabytes in size. If the files were small, I could easily evaluate the expressions one by one against a DOM tree. If there were only one expression, I might be able to evaluate it in streaming mode. However, I have found no solution for efficiently evaluating multiple expressions in streaming mode, i.e. without making multiple passes over each document.
There are research papers. XTREAM looks pretty good, but though the paper was written in 2005 I can find no implementation. This is even older, but still I can find no implementation.
Is there a library (ideally in Java and ideally open source) that can do this?
CodePudding user response:
With XSLT 3.0 streaming (which in practice means Saxon-EE [my company's product], since EXSELT seems to have gone off-air), you can use xsl:fork to evaluate multiple streaming XPath expressions in a single pass over the input, for example:
<!-- The {...} text value templates need expand-text="yes" in scope -->
<xsl:source-document href="input.xml" streamable="yes">
  <xsl:fork>
    <xsl:sequence>
      <xsl:result-document href="out1.xml">
        <out1>{count(//a)}</out1>
      </xsl:result-document>
    </xsl:sequence>
    <xsl:sequence>
      <xsl:result-document href="out2.xml">
        <out2>{count(//b)}</out2>
      </xsl:result-document>
    </xsl:sequence>
  </xsl:fork>
</xsl:source-document>
To run this over multiple source documents, you can use <xsl:for-each select="collection(....)"/> and, with Saxon-EE, add saxon:threads="n" to process multiple inputs in parallel.
Sorry, this isn't open source - this isn't the kind of technology that you can implement in a spare weekend.
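For expressions as trivial as the two counts above, the single-pass idea can be illustrated with nothing but the JDK's StAX parser. This is only a rough sketch under that assumption: the class name SinglePassCounts and the method countElements are made up for illustration, it matches on local name only (ignoring namespaces), and it is nowhere near a general streaming XPath engine.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper: evaluates several count(//name) style queries in one pass.
public class SinglePassCounts {

    // Reads the stream once and counts elements whose local name is in `names`.
    // Memory use depends on the number of names, not on the document size.
    public static Map<String, Long> countElements(Reader xml, Set<String> names)
            throws XMLStreamException {
        Map<String, Long> counts = new TreeMap<>(); // sorted for stable output
        for (String name : names) {
            counts.put(name, 0L);
        }
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    // Only names registered above are updated; others are skipped.
                    counts.computeIfPresent(reader.getLocalName(), (k, v) -> v + 1);
                }
            }
        } finally {
            reader.close();
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<root><a/><b><a/></b><c><b/><a/></c></root>";
        Map<String, Long> counts = countElements(new StringReader(xml), Set.of("a", "b"));
        System.out.println(counts); // prints {a=3, b=2}
    }
}
```

Anything beyond simple descendant matching (predicates, sibling axes, joins) needs per-expression state machines and buffering rules, which is where the real difficulty lies.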