Split a large XML file into multiple parts using Java

Time:01-20

I have an XML file whose tags I want to manipulate with the Java DOM API, but the file is 25 GB, so the parser cannot load it into memory and fails with this error:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

    public Frwiki() {
        filePath = "D:\\compressed\\frwiki-latest-pages-articles.xml";
    }

    public void deletingTag() throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        Document doc = factory.newDocumentBuilder().parse(filePath);
        NodeList nodes = doc.getElementsByTagName("*");

        for (int j = 0; j < 3; j++) {
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                if (!node.getNodeName().equals("id") && !node.getNodeName().equals("title")
                        && !node.getNodeName().equals("text") && !node.getNodeName().equals("mediawiki")
                        && !node.getNodeName().equals("revision") && !node.getNodeName().equals("page"))
                    node.getParentNode().removeChild(node);
            }
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.transform(new DOMSource(doc), new StreamResult(filePath));
    }

CodePudding user response:

You can split a large file into smaller files using XSLT 3.0 streaming, like this:

<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
    <xsl:template name="xsl:initial-template">
      <xsl:source-document streamable="yes" href="frwiki-latest-pages-articles.xml">
        <xsl:for-each-group ....>
           <xsl:result-document href="......">
              <part><xsl:copy-of select="current-group()"/></part>
           </xsl:result-document>
        </xsl:for-each-group>
      </xsl:source-document>
    </xsl:template>
    
</xsl:transform>

The "..." parts depend on how you want to split the document and name the result files.

Although XSLT 3.0 streaming is a W3C specification, the only implementation available at the moment is my company's Saxon-EE processor.

CodePudding user response:

Split the large XML file into smaller chunks and process them separately.
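
One way to do the split in plain Java without loading the whole document is the StAX API (javax.xml.stream) that ships with the JDK. The following is only a minimal sketch under some assumptions: the class name XmlSplitter, the method split, the part-N.xml file names and the <part> wrapper element (borrowed from the XSLT answer above) are made up for illustration, and the record element is assumed to be "page" as in the question's code. It copies events as-is, so namespace declarations that appear only on the original root element are not carried over into the parts.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: streams the input once and writes groups of record
// elements into separate files, each wrapped in a <part> root element.
final class XmlSplitter {

    public static List<Path> split(Path input, Path outDir, String recordName, int perPart)
            throws IOException, XMLStreamException {
        XMLInputFactory inFactory = XMLInputFactory.newInstance();
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        List<Path> parts = new ArrayList<>();
        Files.createDirectories(outDir);

        try (InputStream in = Files.newInputStream(input)) {
            XMLEventReader reader = inFactory.createXMLEventReader(in);
            XMLEventWriter writer = null;
            OutputStream out = null;
            int depth = 0;  // element nesting inside the current record; 0 means outside any record
            int count = 0;  // records already written to the current part

            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (depth == 0) {
                    boolean startsRecord = event.isStartElement()
                            && event.asStartElement().getName().getLocalPart().equals(recordName);
                    if (!startsRecord) {
                        continue;  // skip everything that is not inside a record
                    }
                    if (writer == null) {  // no part is open, so start the next file
                        Path part = outDir.resolve("part-" + (parts.size() + 1) + ".xml");
                        out = Files.newOutputStream(part);
                        writer = outFactory.createXMLEventWriter(out, "UTF-8");
                        writer.add(events.createStartDocument("UTF-8", "1.0"));
                        writer.add(events.createStartElement("", "", "part"));
                        parts.add(part);
                    }
                }
                writer.add(event);  // copy the event unchanged
                if (event.isStartElement()) {
                    depth++;
                } else if (event.isEndElement() && --depth == 0 && ++count == perPart) {
                    closePart(writer, out, events);  // the part is full
                    writer = null;
                    count = 0;
                }
            }
            if (writer != null) {
                closePart(writer, out, events);  // last, possibly shorter, part
            }
            reader.close();
        }
        return parts;
    }

    private static void closePart(XMLEventWriter writer, OutputStream out, XMLEventFactory events)
            throws IOException, XMLStreamException {
        writer.add(events.createEndElement("", "", "part"));
        writer.add(events.createEndDocument());
        writer.close();  // flushes the writer; the underlying stream is closed separately
        out.close();
    }
}
```

A call such as XmlSplitter.split(Paths.get("D:\\compressed\\frwiki-latest-pages-articles.xml"), Paths.get("D:\\compressed\\parts"), "page", 10000) would then write files of up to 10,000 pages each; the output directory and the chunk size are arbitrary choices here. Each part should be small enough to load with the DOM code from the question, and memory use during the split stays roughly constant because only one event is held at a time.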
