I want to print the text between a particular tag in an XML file using SAX.
However, some of the text output consists of spaces or a newline character.
Is there a way to just pick out the actual strings? What am I doing wrong?
See code extract and XML document below.
(I get the same effect with both Python 2 and Python 3.)
#!/usr/bin/env python3
import xml.sax

class MyHandler(xml.sax.ContentHandler):
    def startElement(self, name, attrs):
        self.tag = name

    def characters(self, content):
        if self.tag == "artist":
            print('[%s]' % content)

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    Handler = MyHandler()
    parser.setContentHandler(Handler)  # overriding default ContentHandler
    parser.parse("songs.xml")
<?xml version="1.0"?>
<genre catalogue="Pop">
  <song title="No Tears Left to Cry">
    <artist>Ariana Grande</artist>
    <year>2018</year>
    <album>Sweetener</album>
  </song>
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
    <album>Reputation</album>
  </song>
  <song title="Mrs. Potato Head">
    <artist>Melanie Martinez</artist>
    <year>2015</year>
    <album>Cry Baby</album>
  </song>
</genre>
Answer 1:
If you want to use SAX then you need a solid understanding of the XML specification. The white space you are seeing is the text that sits between elements: it occurs before the first child tag, between child tags and after the final child tag. Without a DTD, the parser cannot know whether an element has element-only content or 'mixed content' (text interleaved with child elements), so it reports this white space as ordinary character data. Most XML processors will report SAX events for it. Some have a flag for suppressing it (because many applications are only interested in text-only content or element-only content).
Solutions include:
a) Stop using SAX. DOM would be a lot more straightforward.
b) Add code to detect the startElement and endElement events for the tag(s) that you're interested in. Ignore character events unless you're inside one of your 'interesting' tags.
c) Use XSLT to turn your XML document into whatever form you require (see "How to transform an XML file using XSLT in Python?").
My choice would always be c) because XSLT is a superpower, and it makes this type of task very simple.
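For option a), here is a minimal sketch using the standard library's xml.dom.minidom. The function name artist_names and the inline SONGS_XML string (a shortened copy of the question's document) are only illustrative; in practice you would probably call xml.dom.minidom.parse("songs.xml") instead.

```python
import xml.dom.minidom

# Shortened copy of the question's document, inlined so the sketch runs on its own.
SONGS_XML = """<?xml version="1.0"?>
<genre catalogue="Pop">
  <song title="No Tears Left to Cry">
    <artist>Ariana Grande</artist>
    <year>2018</year>
  </song>
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
  </song>
</genre>
"""


def artist_names(xml_text):
    """Return the text of every <artist> element (hypothetical helper)."""
    doc = xml.dom.minidom.parseString(xml_text)
    names = []
    for node in doc.getElementsByTagName("artist"):
        # Join all text children of this element, then trim stray white space.
        text = "".join(
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        names.append(text.strip())
    return names


if __name__ == "__main__":
    for name in artist_names(SONGS_XML):
        print("[%s]" % name)
```

Because the lookup targets the <artist> elements directly, the white space between sibling elements never shows up in the result.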
Answer 2:
The value of self.tag is set to "artist" when the <artist> start tag is encountered, and it does not change until startElement() is called for the <year> start tag. Between those elements is some uninteresting whitespace for which SAX events are also reported by the parser.
One way to get around this is to add an endElement() method to MyHandler that sets self.tag to something else.
    def endElement(self, name):
        self.tag = "whatever"
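Putting this together, a complete handler might look like the sketch below. It assumes Python 3, and the names ArtistHandler, artists_from_string and SAMPLE are made up for illustration. It also buffers the text, because a SAX parser is allowed to deliver one element's text in several characters() calls.

```python
import xml.sax

# Small sample in the shape of the question's songs.xml.
SAMPLE = """<genre catalogue="Pop">
  <song title="Delicate">
    <artist>Taylor Swift</artist>
    <year>2018</year>
  </song>
  <song title="Mrs. Potato Head">
    <artist>Melanie Martinez</artist>
    <year>2015</year>
  </song>
</genre>
"""


class ArtistHandler(xml.sax.ContentHandler):
    """Collects the text of <artist> elements only (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.in_artist = False  # True only between <artist> and </artist>
        self.buffer = []        # text chunks of the current <artist>
        self.artists = []       # finished artist names

    def startElement(self, name, attrs):
        if name == "artist":
            self.in_artist = True
            self.buffer = []

    def characters(self, content):
        # White space between elements arrives here too; skip it
        # unless we are currently inside an <artist> element.
        if self.in_artist:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "artist":
            self.artists.append("".join(self.buffer).strip())
            self.in_artist = False


def artists_from_string(xml_text):
    """Parse an XML string and return the artist names (hypothetical helper)."""
    handler = ArtistHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.artists


if __name__ == "__main__":
    for artist in artists_from_string(SAMPLE):
        print("[%s]" % artist)
```

To read from a file as in the question, the same handler can be passed to xml.sax.parse("songs.xml", handler).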