I'm trying to extract posts from the Stack Overflow data dumps. These are composed of large .xml files, notably Posts.xml, which is about 90 GB.
Unfortunately, the following code uses up to 400 GB of RAM, then the usage suddenly drops to 200 GB, but the process seems to hang and never returns.
with open("Posts.xml", "r") as f:
posts = BeautifulSoup(f, features="html.parser") # hangs
for x in posts.findAll("row"):
if post["posttypeid"] == "1": # if it's a question
print(x["body"])
I tried different parsers, like lxml, but this does not seem to help. Since all I do afterwards is enumerate over posts.findAll("row") to parse the content of each row, I was wondering whether there is a way to do this without loading the entire file into memory.
CodePudding user response:
You can use a SAX parser for this, like so:
from xml import sax
from xml.sax.handler import ContentHandler

class PostDescriber(ContentHandler):
    def startElement(self, name, attrs):
        if name == 'row' and attrs.get('PostTypeId') == '1':
            print(attrs.get('Body'))

sax.parse('Posts.xml', PostDescriber())
A SAX parser is a streaming parser; it will call into your handler describing the things it sees (like beginnings and ends of elements) as it parses the file. Since it doesn't have to remember the structure of the file, it doesn't use a lot of memory.
If you want to do something more interesting with the individual question bodies than just printing them, each one is small enough that you could pass it to BeautifulSoup on its own.
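For example, a minimal sketch of that idea (the QuestionTextPrinter name and the choice to print the plain text of each body instead of the raw HTML are just illustrative assumptions):

from xml import sax
from xml.sax.handler import ContentHandler
from bs4 import BeautifulSoup

class QuestionTextPrinter(ContentHandler):
    def startElement(self, name, attrs):
        if name == 'row' and attrs.get('PostTypeId') == '1':
            # Each Body attribute is a small HTML fragment, so parsing it
            # on its own is cheap compared to parsing the whole dump.
            body_html = attrs.get('Body', '')
            print(BeautifulSoup(body_html, features='html.parser').get_text())

sax.parse('Posts.xml', QuestionTextPrinter())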
The xml.sax package is part of the Python standard library; it is described in the documentation, and there are more methods on ContentHandler you can override if needed. For XML documents from untrusted sources, the docs recommend using the defusedxml.sax package instead.
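Assuming defusedxml is installed (e.g. pip install defusedxml), its sax module is meant as a drop-in replacement for xml.sax, so the call above would become something like this (a sketch, not tested against the real dump):

from defusedxml import sax as defused_sax

# Same handler as before; defusedxml adds protection against
# entity-expansion and external-entity attacks.
defused_sax.parse('Posts.xml', PostDescriber())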