I am parsing an ARXML file in Python with the xml.etree.ElementTree library. It reads everything except standalone closing tags. I need to detect closing tags because comments may be scattered throughout the file, and those comments have to be copied to a converted file at exactly the same positions. So I need to know when a closing tag has been encountered (and which comment, if any, follows it) in order to tell which node each comment is inside.
This is a good example of what I am parsing:
<item>
<name>
</name> <!-- Name module ends here -->
</item> <!-- Item1 ends here -->
I read that it is possible to check whether something is a closing tag by seeing if node.text is None. If it is, then it is a closing tag. However, this only works for self-closing tags in this format: <item name="Pizza" />.
It does not work for plain closing tags such as </item> or </a>.
Is there a workaround or method to read these closing tags as well? So far, I am using ElementTree and iterating through the root of the document using for child in root.iter().
CodePudding user response:
By the time the DOM has been built, the closing tags are not present. They are serialization artifacts only and not part of the DOM.
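By default, ElementTree discards comments when parsing, even though you can create comment nodes via the API and they will be serialized. Since Python 3.8, however, you can keep them: pass a TreeBuilder created with insert_comments=True as the target of an XMLParser. The comments then show up in the tree as child nodes whose tag is the xml.etree.ElementTree.Comment function, so each one's parent tells you which element it sits inside.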
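A minimal sketch of keeping comments in the tree, assuming Python 3.8 or later. The sample document wraps the question's snippet in a hypothetical <items> root so that both comments are inside an element, and the helper name comments_with_parents is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's snippet wrapped in an <items> root.
SAMPLE = """<items>
<item>
<name>
</name> <!-- Name module ends here -->
</item> <!-- Item1 ends here -->
</items>"""


def comments_with_parents(xml_text):
    """Return (parent_tag, comment_text) pairs for every comment in the tree."""
    # insert_comments=True makes the builder keep comments as tree nodes.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(xml_text, parser=parser)

    found = []
    for parent in root.iter():
        for child in parent:
            # Comment nodes have the ET.Comment function as their tag.
            if child.tag is ET.Comment:
                found.append((parent.tag, child.text.strip()))
    return found


if __name__ == "__main__":
    for parent_tag, text in comments_with_parents(SAMPLE):
        print(f"<{parent_tag}> contains comment: {text}")
```

ElementTree nodes have no parent pointer, which is why the sketch walks each element's direct children instead of asking a comment for its parent.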
If you need to see the start and end tags themselves, the option is SAX (event-based) parsing, where you get a callback for every event, including start and end tags. This is a bit more complex because it's not always intuitive what constitutes an "event". For example, text nodes may be presented as multiple separate events, which you have to accumulate yourself. Python has the xml.sax module for this. Note that a plain ContentHandler does not receive comments; SAX reports them through a separate lexical handler.
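As a sketch of the event-based idea, here is a version that uses ElementTree's own iterparse rather than the xml.sax module named above. It is a swapped-in, lighter alternative, and it assumes Python 3.8 or later, where iterparse accepts the "comment" event. The function name locate_comments and the <items> wrapper in the sample are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: the question's snippet wrapped in an <items> root.
SAMPLE = """<items>
<item>
<name>
</name> <!-- Name module ends here -->
</item> <!-- Item1 ends here -->
</items>"""


def locate_comments(xml_text):
    """Return (path_of_enclosing_element, comment_text) in document order."""
    open_tags = []   # tags that have started but not yet ended
    located = []
    source = io.BytesIO(xml_text.encode("utf-8"))

    for event, node in ET.iterparse(source, events=("start", "end", "comment")):
        if event == "start":
            open_tags.append(node.tag)
        elif event == "end":
            # This is the point where the closing tag was read.
            open_tags.pop()
        else:
            # "comment": whatever is still open encloses this comment.
            located.append(("/".join(open_tags), node.text.strip()))
    return located


if __name__ == "__main__":
    for path, text in locate_comments(SAMPLE):
        print(f"{path}: {text}")
```

Because the "end" event fires when a closing tag is read, a comment that arrives right after it is attributed to the element that is still open, which is the position information the question asks for.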