How to stop SAX parsing?

Time:09-06

I am using a SAX parser (xml.sax) and it works the way I want it to. However, I am parsing quite a large file (which is why I use SAX) and I would like to stop parsing at some point (e.g. when I have reached a certain limit, or when I have found a certain piece of data).

import xml.sax

class ProductHandler(xml.sax.ContentHandler):
  def startElement(self, tag, attrs):
    pass  # [.. process start ..]

  def endElement(self, tag):
    pass  # [.. process end ..]

  def characters(self, content):
    pass  # [.. process characters ..]

product_handler = ProductHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(product_handler)
parser.parse(xmlfile)

How do I do that? Is there a particular value I can return from one of the handler methods? I checked the documentation, but couldn't find anything along these lines.

Answer:

I thought I would flesh out that comment a bit.

Using this example data, if we want to find a <description> that contains the word "sourdough", maybe we would write something like this, raising a custom exception from the handler and catching it around the call to parser.parse():

import xml.sax


class IAmAllDone(Exception):
    pass


class ProductHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        super().__init__()

        self.description = None
        self.name = None
        self.tree = []

    def startElement(self, name, attrs):
        self.tree.append(name)

    def endElement(self, name):
        # Remove the element that just closed (the most recently opened one).
        self.tree.pop()

    def characters(self, content):
        if self.tree[-1] == "name" and content.strip():
            self.name = content
            print("name:", content)
        elif self.tree[-1] == "description" and "sourdough" in content:
            self.description = content
            raise IAmAllDone()


product_handler = ProductHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(product_handler)
try:
    parser.parse("data.xml")
except IAmAllDone:
    pass

if product_handler.description is not None:
    print("found description:", product_handler.description)

The above will output:

name: Belgian Waffles
name: Strawberry Belgian Waffles
name: Berry-Berry Belgian Waffles
name: French Toast
found description: Thick slices made from our homemade sourdough bread

As you can see, we stop the SAX parsing before reading the final "Homestyle Breakfast" item.
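One caveat with the handler above: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so a word like "sourdough" could in principle be split across two chunks. Here is a rough sketch of a variant that buffers the text until the element closes, and that also covers the "stop after a certain limit" case from the question. The names StopParsing, LimitedHandler, find_description and max_items are made up for this illustration, and it assumes each item is wrapped in a <food> element as in the example data.

```python
import xml.sax


class StopParsing(Exception):
    """Hypothetical signal raised by the handler to abort parsing early."""


class LimitedHandler(xml.sax.ContentHandler):
    def __init__(self, needle, max_items):
        super().__init__()
        self.needle = needle        # text to look for in <description>
        self.max_items = max_items  # give up after this many <food> items
        self.items_seen = 0
        self.match = None
        self._buffer = None         # list of chunks while inside <description>

    def startElement(self, name, attrs):
        if name == "description":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so only collect it here.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "description":
            text = "".join(self._buffer).strip()
            self._buffer = None
            if self.needle in text:
                self.match = text
                raise StopParsing()
        elif name == "food":
            self.items_seen += 1
            if self.items_seen >= self.max_items:
                raise StopParsing()


def find_description(xml_bytes, needle, max_items=1000):
    """Return (matching description or None, number of completed items)."""
    handler = LimitedHandler(needle, max_items)
    try:
        xml.sax.parseString(xml_bytes, handler)
    except StopParsing:
        pass  # expected: the handler asked us to stop
    return handler.match, handler.items_seen
```

Calling find_description(data, "sourdough") returns the first matching description together with the number of items that were fully read before it. For a file on disk, the same handler can be passed to parser.parse() inside the same try/except as in the answer above.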
