Saxon out of memory when processing OpenStreetMap notes file from Planet

Time:12-06

I am trying to process the OpenStreetMap notes file from Planet, which contains the whole history of notes (more than 3 million of them) in a single huge XML file: https://planet.openstreetmap.org/notes/

The XML file is a bit over 1 GB, and I can only process it with Saxon HE on large machines with more than 6 GB of RAM; otherwise, Java throws an out-of-memory error.

The command I am running is this:

java -Xmx6000m -cp saxon-he-11.4.jar net.sf.saxon.Transform \
   -s:"planet-notes-latest.osn.xml" -xsl:"notes-csv.xslt" -o:"planet-notes.csv"

But it requires 6 GB of RAM, which is a lot. How can I configure Saxon to use memory more efficiently from the command line? Ideally, I need to run this on a Raspberry Pi 4. Alternatively, what other tool could I use to process this file, given its simple structure?

The whole code is at: https://github.com/OSMLatam/OSM-Notes-profile

The XSLT file is:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" />
<xsl:template match="/">
 <xsl:for-each select="osm-notes/note"><xsl:value-of select="@id"/>,<xsl:value-of select="@lat"/>,<xsl:value-of select="@lon"/>,"<xsl:value-of select="@created_at"/>",<xsl:choose><xsl:when test="@closed_at != ''">"<xsl:value-of select="@closed_at"/>","close"
</xsl:when><xsl:otherwise>,"open"<xsl:text>
</xsl:text></xsl:otherwise></xsl:choose>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

CodePudding user response:

Adding a simple xsl:strip-space declaration, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:strip-space elements="*"/>
<xsl:output method="text" />
<xsl:template match="/">
 <xsl:for-each select="osm-notes/note"><xsl:value-of select="@id"/>,<xsl:value-of select="@lat"/>,<xsl:value-of select="@lon"/>,"<xsl:value-of select="@created_at"/>",<xsl:choose><xsl:when test="@closed_at != ''">"<xsl:value-of select="@closed_at"/>","close"
</xsl:when><xsl:otherwise>,"open"<xsl:text>
</xsl:text></xsl:otherwise></xsl:choose>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

might help build the tree with less memory; on my machine, Saxon HE 11.4 reports "Memory used: 4967Mb" and "Execution time: 19.101996s (19101.996ms)".

Now compare that to Saxon EE 11.4 with streaming:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode streamable="yes"/>
<xsl:strip-space elements="*"/>
<xsl:output method="text" />
<xsl:template match="/">
 <xsl:for-each select="osm-notes/note"><xsl:value-of select="@id"/>,<xsl:value-of select="@lat"/>,<xsl:value-of select="@lon"/>,"<xsl:value-of select="@created_at"/>",<xsl:choose><xsl:when test="@closed_at != ''">"<xsl:value-of select="@closed_at"/>","close"
</xsl:when><xsl:otherwise>,"open"<xsl:text>
</xsl:text></xsl:otherwise></xsl:choose>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

and memory use drops to "Memory used: 196Mb", with a shorter run time as well: "Execution time: 16.3387564s (16338.7564ms)".

Using xsl:iterate and xsl:value-of with a separator attribute seems to reduce the memory footprint with streaming even more ("Memory used: 111Mb"):

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="#all">

<xsl:mode streamable="yes"/>

<xsl:output method="text"/>

<xsl:template match="/">
  <xsl:iterate select="osm-notes/note">
    <xsl:value-of 
      select="@id, 
              @lat, 
              @lon, 
              '&quot;' || @created_at || '&quot;', 
              if (@closed_at != '') 
              then ('&quot;' || @closed_at || '&quot;', '&quot;close&quot;') 
              else '&quot;open&quot;'"
      separator=","/>
    <xsl:text>&#10;</xsl:text>
  </xsl:iterate>
</xsl:template>

</xsl:stylesheet>

Your second stylesheet, the one that extracts the note comments, converted to XSLT 3.0 and made streamable, is:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:mode streamable="yes"/>
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/">
 <xsl:for-each select="osm-notes/note">
 <xsl:variable name="note_id"><xsl:value-of select="@id"/></xsl:variable>
  <xsl:for-each select="comment">
<xsl:choose> <xsl:when test="@uid != ''"> <xsl:copy-of select="$note_id" />,'<xsl:value-of select="@action" />','<xsl:value-of select="@timestamp"/>',<xsl:value-of select="@uid"/>,'<xsl:value-of select="replace(@user,'''','''''')"/>'<xsl:text>
</xsl:text></xsl:when><xsl:otherwise>
<xsl:copy-of select="$note_id" />,'<xsl:value-of select="@action" />','<xsl:value-of select="@timestamp"/>',,<xsl:text>
</xsl:text></xsl:otherwise> </xsl:choose>
  </xsl:for-each>
 </xsl:for-each>
</xsl:template>
</xsl:stylesheet>

and it consumes only "Memory used: 218Mb" with Saxon EE that way.
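Streaming is a Saxon EE feature, so if a commercial license is not an option, the same idea can be sketched outside XSLT. The following is a minimal sketch, not a tested replacement for the stylesheet: it uses Python's standard xml.etree.ElementTree.iterparse to read one note at a time and discard it after writing its CSV line, so memory use should not grow with the file size. The function name notes_to_csv is my own, and the attribute names are assumed to be exactly those used in the first stylesheet.

```python
import xml.etree.ElementTree as ET


def notes_to_csv(source, out):
    """Stream <note> elements from `source` and write one CSV line per note.

    `source` is a file name or a binary file object; `out` is a text file
    object. Returns the number of notes written.
    """
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "note":
            continue
        attrs = elem.attrib
        note_id = attrs.get("id", "")
        lat = attrs.get("lat", "")
        lon = attrs.get("lon", "")
        created = attrs.get("created_at", "")
        closed = attrs.get("closed_at", "")
        if closed:
            tail = '"' + closed + '","close"'
        else:
            # Same layout as the stylesheet: empty closed_at field, then "open"
            tail = ',"open"'
        out.write(note_id + "," + lat + "," + lon + ',"' + created + '",' + tail + "\n")
        # Drop the finished note (and its comments) so the tree stays small
        root.clear()
        count += 1
    return count
```

With the file names from the question, it could be called as notes_to_csv("planet-notes-latest.osn.xml", open("planet-notes.csv", "w", encoding="utf-8")). I have not measured its memory use or run time on the full Planet file.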

CodePudding user response:

Streaming is probably the right approach here, as Martin suggests. Another, more brute-force option might be to pre-process the huge XML document, breaking it into a large number of much smaller documents, and then selectively process those. Whether that's practical, of course, depends on what you want to do with the notes and whether or not you can limit yourself to only the ones you need.
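The splitting idea could look roughly like the sketch below, which again relies on Python's standard xml.etree.ElementTree.iterparse so that the splitter itself does not need to load the whole file. The function name split_notes, the chunk file naming, and the default chunk size are all hypothetical choices for illustration; each chunk is wrapped in its own osm-notes root so the original stylesheet can be run on it unchanged.

```python
import os
import xml.etree.ElementTree as ET


def split_notes(source, out_dir, notes_per_file=100_000):
    """Split a large notes file into smaller files of `notes_per_file` notes.

    Returns the list of chunk file paths, in the order they were written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    chunk = None      # currently open chunk file, if any
    in_chunk = 0      # notes written to the current chunk
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != "note":
            continue
        if chunk is None:
            path = os.path.join(out_dir, "notes-%05d.xml" % len(paths))
            paths.append(path)
            chunk = open(path, "w", encoding="utf-8")
            chunk.write("<osm-notes>\n")
        # Serialize the complete note, including its comment children
        chunk.write(ET.tostring(elem, encoding="unicode"))
        in_chunk += 1
        root.clear()  # free the note we just wrote
        if in_chunk == notes_per_file:
            chunk.write("</osm-notes>\n")
            chunk.close()
            chunk, in_chunk = None, 0
    if chunk is not None:
        chunk.write("</osm-notes>\n")
        chunk.close()
    return paths
```

Each resulting file can then be passed to Saxon HE with a much smaller -Xmx setting, at the cost of running the transformation once per chunk and concatenating the CSV outputs afterwards.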
